org.apache.spark.mllib.clustering
Class LDA

Object
  extended by org.apache.spark.mllib.clustering.LDA
All Implemented Interfaces:
Logging

public class LDA
extends Object
implements Logging

:: Experimental ::

Latent Dirichlet Allocation (LDA), a topic model designed for text documents.

Terminology:
- "word" = "term": an element of the vocabulary
- "token": instance of a term appearing in a document
- "topic": multinomial distribution over words representing some concept
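The distinction between terms and tokens can be illustrated with a small, self-contained sketch. The class `TermsAndTokens` and its methods are hypothetical helpers written for this illustration and are not part of Spark.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Spark.
class TermsAndTokens {
    /** Number of tokens: every occurrence of a term in the document counts. */
    static int countTokens(String[] tokens) {
        return tokens.length;
    }

    /** Number of distinct terms (vocabulary elements) among the tokens. */
    static int countTerms(String[] tokens) {
        Set<String> terms = new HashSet<>(Arrays.asList(tokens));
        return terms.size();
    }

    public static void main(String[] args) {
        // "the" appears twice: two tokens, but a single term.
        String[] doc = {"the", "cat", "saw", "the", "dog"};
        System.out.println("tokens = " + countTokens(doc));
        System.out.println("terms  = " + countTerms(doc));
    }
}
```

Here the document contributes five tokens but only four terms, because both occurrences of "the" are instances of the same term.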

References:
- Original LDA paper (journal version): Blei, Ng, and Jordan. "Latent Dirichlet Allocation." JMLR, 2003.

See Also:
Latent Dirichlet allocation (Wikipedia): http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation

Constructor Summary
LDA()
           
 
Method Summary
 double getAlpha()
          Alias for getDocConcentration()
 double getBeta()
          Alias for getTopicConcentration()
 int getCheckpointInterval()
          Period (in iterations) between checkpoints.
 double getDocConcentration()
          Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").
 int getK()
          Number of topics to infer.
 int getMaxIterations()
          Maximum number of iterations for learning.
 LDAOptimizer getOptimizer()
          :: DeveloperApi :: LDAOptimizer used to perform the actual calculation
 long getSeed()
          Random seed
 double getTopicConcentration()
          Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms.
 LDAModel run(JavaPairRDD<Long,Vector> documents)
          Java-friendly version of run()
 LDAModel run(RDD<scala.Tuple2<Object,Vector>> documents)
          Learn an LDA model using the given dataset.
 LDA setAlpha(double alpha)
          Alias for setDocConcentration()
 LDA setBeta(double beta)
          Alias for setTopicConcentration()
 LDA setCheckpointInterval(int checkpointInterval)
          Period (in iterations) between checkpoints (default = 10).
 LDA setDocConcentration(double docConcentration)
          Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").
 LDA setK(int k)
          Number of topics to infer.
 LDA setMaxIterations(int maxIterations)
          Maximum number of iterations for learning.
 LDA setOptimizer(LDAOptimizer optimizer)
          :: DeveloperApi :: LDAOptimizer used to perform the actual calculation (default = EMLDAOptimizer)
 LDA setOptimizer(String optimizerName)
          Set the LDAOptimizer used to perform the actual calculation by algorithm name.
 LDA setSeed(long seed)
          Random seed
 LDA setTopicConcentration(double topicConcentration)
          Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms.
 
Methods inherited from class Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface org.apache.spark.Logging
initializeIfNecessary, initializeLogging, isTraceEnabled, log_, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarning
 

Constructor Detail

LDA

public LDA()
Method Detail

getK

public int getK()
Number of topics to infer, i.e., the number of soft cluster centers.

Returns:
the number of topics to infer

setK

public LDA setK(int k)
Number of topics to infer, i.e., the number of soft cluster centers. (default = 10)

Parameters:
k - number of topics to infer
Returns:
this LDA instance, so that setter calls can be chained

getDocConcentration

public double getDocConcentration()
Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").

This is the parameter to a symmetric Dirichlet distribution.

Returns:
the concentration parameter for the prior on documents' distributions over topics

setDocConcentration

public LDA setDocConcentration(double docConcentration)
Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").

This is the parameter to a symmetric Dirichlet distribution, where larger values mean more smoothing (more regularization).

If set to -1, then docConcentration is set automatically. (default = -1 = automatic)

Optimizer-specific parameter settings:
- EM
  - Value should be > 1.0
  - default = (50 / k) + 1, where 50/k is common in LDA libraries and +1 follows Asuncion et al. (2009), who recommend a +1 adjustment for EM.
- Online
  - Value should be >= 0
  - default = (1.0 / k), following the implementation from https://github.com/Blei-Lab/onlineldavb.
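The rules above can be sketched as a small resolver. This is an illustration of the documented defaults and value ranges only, not Spark's own code: the class `DocConcentrationSketch`, the method `resolve`, and the requirement that k be positive are assumptions made for the example.

```java
// Hypothetical helper mirroring the documented rules; not Spark's implementation.
class DocConcentrationSketch {
    /**
     * Returns the effective docConcentration for the given optimizer name
     * ("em" or "online"), treating -1 as "choose automatically".
     */
    static double resolve(String optimizer, double docConcentration, int k) {
        if (k <= 0) {
            // Assumption of this sketch: the defaults divide by k, so k must be positive.
            throw new IllegalArgumentException("k must be positive, got " + k);
        }
        boolean automatic = docConcentration == -1.0;
        switch (optimizer) {
            case "em": {
                // Documented EM default: (50 / k) + 1.
                double alpha = automatic ? (50.0 / k) + 1.0 : docConcentration;
                if (alpha <= 1.0) {
                    throw new IllegalArgumentException("EM expects a value > 1.0, got " + alpha);
                }
                return alpha;
            }
            case "online": {
                // Documented online default: 1.0 / k.
                double alpha = automatic ? 1.0 / k : docConcentration;
                if (alpha < 0.0) {
                    throw new IllegalArgumentException("Online expects a value >= 0, got " + alpha);
                }
                return alpha;
            }
            default:
                throw new IllegalArgumentException("Unknown optimizer: " + optimizer);
        }
    }

    public static void main(String[] args) {
        System.out.println("em, k=10:     " + resolve("em", -1.0, 10));
        System.out.println("online, k=10: " + resolve("online", -1.0, 10));
    }
}
```

With k = 10, the automatic setting resolves to 50/10 + 1 for EM and to 1/10 for the online optimizer; an explicit value is passed through after the range check.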

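Parameters:
docConcentration - concentration parameter for the prior on documents' distributions over topics, or -1 to select the optimizer-specific default
Returns:
this LDA instance, so that setter calls can be chained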

getAlpha

public double getAlpha()
Alias for getDocConcentration()


setAlpha

public LDA setAlpha(double alpha)
Alias for setDocConcentration()


getTopicConcentration

public double getTopicConcentration()
Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms.

This is the parameter to a symmetric Dirichlet distribution.

Note: The topics' distributions over terms are called "beta" in the original LDA paper by Blei et al., but are called "phi" in many later papers such as Asuncion et al., 2009.

Returns:
the concentration parameter for the prior on topics' distributions over terms

setTopicConcentration

public LDA setTopicConcentration(double topicConcentration)
Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms.

This is the parameter to a symmetric Dirichlet distribution.

Note: The topics' distributions over terms are called "beta" in the original LDA paper by Blei et al., but are called "phi" in many later papers such as Asuncion et al., 2009.

If set to -1, then topicConcentration is set automatically. (default = -1 = automatic)

Optimizer-specific parameter settings:
- EM
  - Value should be > 1.0
  - default = 0.1 + 1, where 0.1 gives a small amount of smoothing and +1 follows Asuncion et al. (2009), who recommend a +1 adjustment for EM.
- Online
  - Value should be >= 0
  - default = (1.0 / k), following the implementation from https://github.com/Blei-Lab/onlineldavb.
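To give a feel for what a symmetric Dirichlet concentration does, the sketch below resolves the documented defaults and then applies a textbook smoothed estimate, (count + beta) / (total + V * beta), to a toy vector of term counts. The class and method names are hypothetical, and the estimate is used only to illustrate smoothing; Spark's EM and online optimizers apply their own update rules.

```java
// Hypothetical helper for illustration only; not Spark's implementation.
class TopicSmoothingSketch {
    /** Documented defaults: 0.1 + 1 for "em", 1.0 / k for "online"; -1 means automatic. */
    static double resolveTopicConcentration(String optimizer, double value, int k) {
        if (value != -1.0) {
            return value;
        }
        if ("em".equals(optimizer)) {
            return 0.1 + 1.0;
        }
        if ("online".equals(optimizer)) {
            return 1.0 / k;
        }
        throw new IllegalArgumentException("Unknown optimizer: " + optimizer);
    }

    /**
     * Smoothed term probabilities under a symmetric Dirichlet prior with
     * parameter beta: (count + beta) / (total + V * beta), V = vocabulary size.
     */
    static double[] smoothedTermProbabilities(double[] termCounts, double beta) {
        double total = 0.0;
        for (double count : termCounts) {
            total += count;
        }
        double denominator = total + beta * termCounts.length;
        if (denominator <= 0.0) {
            throw new IllegalArgumentException("Need positive counts or a positive beta");
        }
        double[] probabilities = new double[termCounts.length];
        for (int i = 0; i < termCounts.length; i++) {
            probabilities[i] = (termCounts[i] + beta) / denominator;
        }
        return probabilities;
    }

    public static void main(String[] args) {
        double[] counts = {3.0, 1.0, 0.0};
        // A larger beta pulls the estimate further toward the uniform distribution.
        for (double beta : new double[] {0.1, 1.0, 10.0}) {
            System.out.println("beta=" + beta + " -> "
                    + java.util.Arrays.toString(smoothedTermProbabilities(counts, beta)));
        }
    }
}
```

A term that never occurs still receives nonzero probability, and that probability grows as the concentration increases, which is the sense in which larger values mean more smoothing.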

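Parameters:
topicConcentration - concentration parameter for the prior on topics' distributions over terms, or -1 to select the optimizer-specific default
Returns:
this LDA instance, so that setter calls can be chained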

getBeta

public double getBeta()
Alias for getTopicConcentration()


setBeta

public LDA setBeta(double beta)
Alias for setTopicConcentration()


getMaxIterations

public int getMaxIterations()
Maximum number of iterations for learning.

Returns:
the maximum number of iterations

setMaxIterations

public LDA setMaxIterations(int maxIterations)
Maximum number of iterations for learning. (default = 20)

Parameters:
maxIterations - maximum number of iterations for learning
Returns:
this LDA instance, so that setter calls can be chained

getSeed

public long getSeed()
Random seed


setSeed

public LDA setSeed(long seed)
Random seed


getCheckpointInterval

public int getCheckpointInterval()
Period (in iterations) between checkpoints.

Returns:
the number of iterations between checkpoints

setCheckpointInterval

public LDA setCheckpointInterval(int checkpointInterval)
Period (in iterations) between checkpoints (default = 10). Checkpointing helps with recovery (when nodes fail). It also helps with eliminating temporary shuffle files on disk, which can be important when LDA is run for many iterations. If the checkpoint directory is not set in SparkContext, this setting is ignored.

Parameters:
checkpointInterval - number of iterations between checkpoints
Returns:
this LDA instance, so that setter calls can be chained
See Also:
SparkContext.setCheckpointDir(java.lang.String)

getOptimizer

public LDAOptimizer getOptimizer()
:: DeveloperApi ::

LDAOptimizer used to perform the actual calculation

Returns:
the LDAOptimizer currently configured

setOptimizer

public LDA setOptimizer(LDAOptimizer optimizer)
:: DeveloperApi ::

LDAOptimizer used to perform the actual calculation (default = EMLDAOptimizer)

Parameters:
optimizer - LDAOptimizer to use for the calculation
Returns:
this LDA instance, so that setter calls can be chained

setOptimizer

public LDA setOptimizer(String optimizerName)
Set the LDAOptimizer used to perform the actual calculation by algorithm name. Currently "em" and "online" are supported.

Parameters:
optimizerName - name of the algorithm, either "em" or "online"
Returns:
this LDA instance, so that setter calls can be chained

run

public LDAModel run(RDD<scala.Tuple2<Object,Vector>> documents)
Learn an LDA model using the given dataset.

Parameters:
documents - RDD of documents, which are term (word) count vectors paired with IDs. The term count vectors are "bags of words" with a fixed-size vocabulary (where the vocabulary size is the length of the vector). Document IDs must be unique and >= 0.
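As a rough illustration of the expected input shape, the following sketch turns tokenized documents into fixed-length term count arrays and checks the document ID constraints. It uses plain Java arrays so that it runs without Spark; in a Spark program each array would typically be wrapped with `Vectors.dense` and paired with its ID. The class `BagOfWordsSketch`, its methods, the vocabulary layout (term mapped to an index in 0..size-1), and the choice to skip out-of-vocabulary tokens are all assumptions of this example.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Spark.
class BagOfWordsSketch {
    /**
     * Builds a term count vector whose length equals the vocabulary size.
     * Assumes the vocabulary maps each term to a distinct index in 0..size-1.
     */
    static double[] termCounts(List<String> tokens, Map<String, Integer> vocabulary) {
        double[] counts = new double[vocabulary.size()];
        for (String token : tokens) {
            Integer index = vocabulary.get(token);
            if (index != null) {
                counts[index] += 1.0;
            }
            // Tokens outside the vocabulary are skipped (a choice made for this sketch).
        }
        return counts;
    }

    /** Checks the documented constraints: IDs must be unique and >= 0. */
    static void checkDocumentIds(long[] ids) {
        Set<Long> seen = new HashSet<>();
        for (long id : ids) {
            if (id < 0) {
                throw new IllegalArgumentException("Document ID must be >= 0, got " + id);
            }
            if (!seen.add(id)) {
                throw new IllegalArgumentException("Duplicate document ID: " + id);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> vocabulary = new java.util.HashMap<>();
        vocabulary.put("cat", 0);
        vocabulary.put("dog", 1);
        vocabulary.put("the", 2);
        List<String> tokens = java.util.Arrays.asList("the", "cat", "saw", "the", "dog");
        checkDocumentIds(new long[] {0L, 1L, 2L});
        System.out.println(java.util.Arrays.toString(termCounts(tokens, vocabulary)));
    }
}
```

Every document's vector has the same length, the vocabulary size, which matches the fixed-size vocabulary requirement described for the `documents` parameter.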
Returns:
Inferred LDA model

run

public LDAModel run(JavaPairRDD<Long,Vector> documents)
Java-friendly version of run()