Class LDA
- All Implemented Interfaces:
org.apache.spark.internal.Logging
Terminology:
- "word" = "term": an element of the vocabulary
- "token": instance of a term appearing in a document
- "topic": multinomial distribution over words representing some concept
References:
- Original LDA paper (journal version): Blei, Ng, and Jordan. "Latent Dirichlet Allocation." JMLR, 2003.
Nested Class Summary
Nested classes/interfaces inherited from interface org.apache.spark.internal.Logging
org.apache.spark.internal.Logging.LogStringContext, org.apache.spark.internal.Logging.SparkShellLoggingFilter
-
Constructor Summary
-
Method Summary
| Modifier and Type | Method | Description |
|---|---|---|
| double | getAlpha() | Alias for getDocConcentration() |
| Vector | getAsymmetricAlpha() | Alias for getAsymmetricDocConcentration() |
| Vector | getAsymmetricDocConcentration() | Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta"). |
| double | getBeta() | Alias for getTopicConcentration() |
| int | getCheckpointInterval() | Period (in iterations) between checkpoints. |
| double | getDocConcentration() | Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta"). |
| int | getK() | Number of topics to infer, i.e., the number of soft cluster centers. |
| int | getMaxIterations() | Maximum number of iterations allowed. |
| LDAOptimizer | getOptimizer() | LDAOptimizer used to perform the actual calculation |
| long | getSeed() | Random seed for cluster initialization. |
| double | getTopicConcentration() | Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms. |
| LDAModel | run(JavaPairRDD<Long, Vector> documents) | Java-friendly version of run() |
| LDAModel | run(RDD<scala.Tuple2<Object, Vector>> documents) | Learn an LDA model using the given dataset. |
| LDA | setAlpha(double alpha) | Alias for setDocConcentration() |
| LDA | setAlpha(Vector alpha) | Alias for setDocConcentration() |
| LDA | setBeta(double beta) | Alias for setTopicConcentration() |
| LDA | setCheckpointInterval(int checkpointInterval) | Sets the checkpoint interval (greater than or equal to 1), or disables checkpointing (-1). |
| LDA | setDocConcentration(double docConcentration) | Replicates a Double docConcentration to create a symmetric prior. |
| LDA | setDocConcentration(Vector docConcentration) | Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta"). |
| LDA | setK(int k) | Set the number of topics to infer, i.e., the number of soft cluster centers. |
| LDA | setMaxIterations(int maxIterations) | Set the maximum number of iterations allowed. |
| LDA | setOptimizer(String optimizerName) | Set the LDAOptimizer used to perform the actual calculation by algorithm name. |
| LDA | setOptimizer(LDAOptimizer optimizer) | LDAOptimizer used to perform the actual calculation (default = EMLDAOptimizer) |
| LDA | setSeed(long seed) | Set the random seed for cluster initialization. |
| LDA | setTopicConcentration(double topicConcentration) | Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms. |

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.apache.spark.internal.Logging
initializeForcefully, initializeLogIfNecessary, initializeLogIfNecessary, initializeLogIfNecessary$default$2, isTraceEnabled, log, logDebug, logDebug, logDebug, logDebug, logError, logError, logError, logError, logInfo, logInfo, logInfo, logInfo, logName, LogStringContext, logTrace, logTrace, logTrace, logTrace, logWarning, logWarning, logWarning, logWarning, org$apache$spark$internal$Logging$$log_, org$apache$spark$internal$Logging$$log__$eq, withLogContext
-
Constructor Details
-
LDA
public LDA()
Constructs an LDA instance with default parameters.
-
-
Method Details
-
getK
public int getK()
Number of topics to infer, i.e., the number of soft cluster centers.
- Returns:
- (undocumented)
-
setK
public LDA setK(int k)
Set the number of topics to infer, i.e., the number of soft cluster centers. (default = 10)
- Parameters:
k - (undocumented)
- Returns:
- (undocumented)
-
getAsymmetricDocConcentration
public Vector getAsymmetricDocConcentration()
Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").
This is the parameter to a Dirichlet distribution.
- Returns:
- (undocumented)
-
getDocConcentration
public double getDocConcentration()
Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").
This method assumes the Dirichlet distribution is symmetric and can be described by a single Double parameter. It should fail if docConcentration is asymmetric.
- Returns:
- (undocumented)
-
setDocConcentration
public LDA setDocConcentration(Vector docConcentration)
Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").
This is the parameter to a Dirichlet distribution, where larger values mean more smoothing (more regularization).
If set to a singleton vector Vector(-1), then docConcentration is set automatically. If set to a singleton vector Vector(t) where t != -1, then t is replicated to a vector of length k during LDAOptimizer.initialize(). Otherwise, the docConcentration vector must be length k. (default = Vector(-1) = automatic)
Optimizer-specific parameter settings:
- EM
  - Currently only supports symmetric distributions, so all values in the vector should be the same.
  - Values should be greater than 1.0
  - default = uniformly (50 / k) + 1, where 50/k is common in LDA libraries and +1 follows from Asuncion et al. (2009), who recommend a +1 adjustment for EM.
- Online
  - Values should be greater than or equal to 0
  - default = uniformly (1.0 / k), following the implementation from here.
- Parameters:
docConcentration - (undocumented)
- Returns:
- (undocumented)
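The resolution rules above can be sketched in plain Java with no Spark dependency. This is only an illustration of the documented behavior, not Spark's implementation: the class `DocConcentrationSketch`, the method `resolve`, and the `Optimizer` enum are hypothetical names, and a `double[]` stands in for `Vector`.

```java
import java.util.Arrays;

/** Hypothetical sketch of the docConcentration rules described above; not Spark code. */
class DocConcentrationSketch {
    enum Optimizer { EM, ONLINE }

    /**
     * Expands a user-supplied docConcentration to a vector of length k.
     * A singleton {-1} selects the documented default; any other singleton is replicated.
     */
    static double[] resolve(double[] docConcentration, int k, Optimizer optimizer) {
        if (docConcentration.length == 1) {
            double t = docConcentration[0];
            if (t == -1.0) {
                // Documented defaults: (50 / k) + 1 for EM, 1.0 / k for Online.
                t = (optimizer == Optimizer.EM) ? (50.0 / k) + 1.0 : 1.0 / k;
            }
            double[] replicated = new double[k];
            Arrays.fill(replicated, t);
            return replicated;
        }
        if (docConcentration.length != k) {
            throw new IllegalArgumentException(
                "docConcentration must have length 1 or k = " + k);
        }
        return docConcentration.clone();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(resolve(new double[] {-1.0}, 10, Optimizer.EM)));
        System.out.println(Arrays.toString(resolve(new double[] {0.5}, 4, Optimizer.ONLINE)));
    }
}
```

With k = 10 and the EM optimizer, the automatic setting yields ten entries of (50 / 10) + 1; an explicit singleton such as 0.5 is simply copied k times.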
-
setDocConcentration
public LDA setDocConcentration(double docConcentration)
Replicates a Double docConcentration to create a symmetric prior.
- Parameters:
docConcentration - (undocumented)
- Returns:
- (undocumented)
-
getAsymmetricAlpha
public Vector getAsymmetricAlpha()
Alias for getAsymmetricDocConcentration()
- Returns:
- (undocumented)
-
getAlpha
public double getAlpha()
Alias for getDocConcentration()
- Returns:
- (undocumented)
-
setAlpha
public LDA setAlpha(Vector alpha)
Alias for setDocConcentration()
- Parameters:
alpha - (undocumented)
- Returns:
- (undocumented)
-
setAlpha
public LDA setAlpha(double alpha)
Alias for setDocConcentration()
- Parameters:
alpha - (undocumented)
- Returns:
- (undocumented)
-
getTopicConcentration
public double getTopicConcentration()
Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms.
This is the parameter to a symmetric Dirichlet distribution.
- Returns:
- (undocumented)
- Note:
- The topics' distributions over terms are called "beta" in the original LDA paper by Blei et al., but are called "phi" in many later papers such as Asuncion et al., 2009.
-
setTopicConcentration
public LDA setTopicConcentration(double topicConcentration)
Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms.
This is the parameter to a symmetric Dirichlet distribution.
If set to -1, then topicConcentration is set automatically. (default = -1 = automatic)
Optimizer-specific parameter settings:
- EM
  - Value should be greater than 1.0
  - default = 0.1 + 1, where 0.1 gives a small amount of smoothing and +1 follows Asuncion et al. (2009), who recommend a +1 adjustment for EM.
- Online
  - Value should be greater than or equal to 0
  - default = (1.0 / k), following the implementation from here.
- Parameters:
topicConcentration - (undocumented)
- Returns:
- (undocumented)
- Note:
- The topics' distributions over terms are called "beta" in the original LDA paper by Blei et al., but are called "phi" in many later papers such as Asuncion et al., 2009.
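The topicConcentration rules can be illustrated the same way in plain Java. This is a sketch under stated assumptions, not Spark's code: `TopicConcentrationSketch` and its two methods are hypothetical names, and rejecting out-of-range values with an exception is an assumption of the sketch, since the text above only states which values "should" be used.

```java
/** Hypothetical sketch of the topicConcentration rules described above; not Spark code. */
class TopicConcentrationSketch {
    /** Effective value for the EM optimizer: -1 means automatic, otherwise the value must exceed 1.0. */
    static double resolveForEm(double topicConcentration) {
        if (topicConcentration == -1.0) {
            return 0.1 + 1.0; // documented EM default
        }
        if (topicConcentration <= 1.0) {
            // Assumption: the sketch fails fast on values outside the documented range.
            throw new IllegalArgumentException(
                "EM expects topicConcentration > 1.0, got " + topicConcentration);
        }
        return topicConcentration;
    }

    /** Effective value for the Online optimizer: -1 means automatic, otherwise the value must be >= 0. */
    static double resolveForOnline(double topicConcentration, int k) {
        if (topicConcentration == -1.0) {
            return 1.0 / k; // documented Online default
        }
        if (topicConcentration < 0.0) {
            throw new IllegalArgumentException(
                "Online expects topicConcentration >= 0, got " + topicConcentration);
        }
        return topicConcentration;
    }
}
```

The two optimizers accept different ranges, so a value such as 0.5 is acceptable for Online but falls outside the documented range for EM.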
-
getBeta
public double getBeta()
Alias for getTopicConcentration()
- Returns:
- (undocumented)
-
setBeta
public LDA setBeta(double beta)
Alias for setTopicConcentration()
- Parameters:
beta - (undocumented)
- Returns:
- (undocumented)
-
getMaxIterations
public int getMaxIterations()
Maximum number of iterations allowed.
- Returns:
- (undocumented)
-
setMaxIterations
public LDA setMaxIterations(int maxIterations)
Set the maximum number of iterations allowed. (default = 20)
- Parameters:
maxIterations - (undocumented)
- Returns:
- (undocumented)
-
getSeed
public long getSeed()
Random seed for cluster initialization.
- Returns:
- (undocumented)
-
setSeed
public LDA setSeed(long seed)
Set the random seed for cluster initialization.
- Parameters:
seed - (undocumented)
- Returns:
- (undocumented)
-
getCheckpointInterval
public int getCheckpointInterval()
Period (in iterations) between checkpoints.
- Returns:
- (undocumented)
-
setCheckpointInterval
public LDA setCheckpointInterval(int checkpointInterval)
Sets the checkpoint interval (greater than or equal to 1), or disables checkpointing (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations. Checkpointing helps with recovery (when nodes fail). It also helps with eliminating temporary shuffle files on disk, which can be important when LDA is run for many iterations. If the checkpoint directory is not set in SparkContext, this setting is ignored. (default = 10)
- Parameters:
checkpointInterval - (undocumented)
- Returns:
- (undocumented)
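To make the interval semantics concrete, the following plain-Java sketch lists the iterations at which a checkpoint would occur. It is an illustration only: `CheckpointScheduleSketch` and `checkpointIterations` are hypothetical names, and the 1-based iteration numbering is an assumption rather than a statement about Spark's internal counter.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the checkpoint schedule described above; not Spark code. */
class CheckpointScheduleSketch {
    /** Returns the (1-based, assumed) iterations at which a checkpoint would be taken. */
    static List<Integer> checkpointIterations(
            int maxIterations, int checkpointInterval, boolean checkpointDirSet) {
        List<Integer> result = new ArrayList<>();
        // -1 disables checkpointing; without a checkpoint directory the setting is ignored.
        if (checkpointInterval == -1 || !checkpointDirSet) {
            return result;
        }
        if (checkpointInterval < 1) {
            throw new IllegalArgumentException(
                "checkpointInterval must be -1 or >= 1, got " + checkpointInterval);
        }
        for (int iteration = 1; iteration <= maxIterations; iteration++) {
            if (iteration % checkpointInterval == 0) {
                result.add(iteration);
            }
        }
        return result;
    }
}
```

For a run of 25 iterations with an interval of 10, the sketch reports checkpoints at iterations 10 and 20, and none at all when the checkpoint directory is unset.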
getOptimizer
public LDAOptimizer getOptimizer()
LDAOptimizer used to perform the actual calculation
- Returns:
- (undocumented)
-
setOptimizer
public LDA setOptimizer(LDAOptimizer optimizer)
LDAOptimizer used to perform the actual calculation (default = EMLDAOptimizer)
- Parameters:
optimizer - (undocumented)
- Returns:
- (undocumented)
-
setOptimizer
public LDA setOptimizer(String optimizerName)
Set the LDAOptimizer used to perform the actual calculation by algorithm name. Currently "em" and "online" are supported.
- Parameters:
optimizerName - (undocumented)
- Returns:
- (undocumented)
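Selecting an optimizer by name amounts to a small lookup, sketched below in plain Java. The names `OptimizerNameSketch`, `OptimizerKind`, and `parse` are hypothetical, the enum merely stands in for the real optimizer classes, and the case-insensitive matching and the exception for unknown names are assumptions of this sketch.

```java
import java.util.Locale;

/** Hypothetical sketch of name-based optimizer selection; not Spark code. */
class OptimizerNameSketch {
    /** Stand-in for the two supported optimizer implementations. */
    enum OptimizerKind { EM, ONLINE }

    static OptimizerKind parse(String optimizerName) {
        // Assumption: names are matched case-insensitively.
        switch (optimizerName.toLowerCase(Locale.ROOT)) {
            case "em":
                return OptimizerKind.EM;
            case "online":
                return OptimizerKind.ONLINE;
            default:
                throw new IllegalArgumentException(
                    "Only \"em\" and \"online\" are supported, got: " + optimizerName);
        }
    }
}
```

Passing an optimizer instance instead of a name remains the way to supply optimizer-specific configuration; the string form only chooses between the two supported algorithms.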
-
run
public LDAModel run(RDD<scala.Tuple2<Object,Vector>> documents)
Learn an LDA model using the given dataset.
- Parameters:
documents - RDD of documents, which are term (word) count vectors paired with IDs. The term count vectors are "bags of words" with a fixed-size vocabulary (where the vocabulary size is the length of the vector). Document IDs must be unique and greater than or equal to 0.
- Returns:
- Inferred LDA model
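The input format can be illustrated without Spark: each document becomes a count vector whose length equals the vocabulary size. The sketch below is a hypothetical helper, not part of the API: `TermCountSketch`, `termCounts`, and `idsAreValid` are invented names, a `double[]` stands in for `Vector`, and ignoring tokens outside the vocabulary is an assumption of the sketch.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of preparing bag-of-words input; not Spark code. */
class TermCountSketch {
    /** Builds a term count vector of length vocabulary.size() from a document's tokens. */
    static double[] termCounts(List<String> tokens, List<String> vocabulary) {
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < vocabulary.size(); i++) {
            index.putIfAbsent(vocabulary.get(i), i);
        }
        double[] counts = new double[vocabulary.size()];
        for (String token : tokens) {
            Integer position = index.get(token);
            if (position != null) {
                counts[position] += 1.0;
            }
            // Assumption: tokens that are not in the vocabulary are skipped.
        }
        return counts;
    }

    /** Checks the documented ID constraints: unique and greater than or equal to 0. */
    static boolean idsAreValid(long[] ids) {
        Set<Long> seen = new HashSet<>();
        for (long id : ids) {
            if (id < 0 || !seen.add(id)) {
                return false;
            }
        }
        return true;
    }
}
```

In an actual job, each resulting array would be wrapped in a `Vector` (for example with `Vectors.dense`) and paired with its document ID before being passed to `run`.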
-
run
public LDAModel run(JavaPairRDD<Long,Vector> documents)
Java-friendly version of run()
- Parameters:
documents - (undocumented)
- Returns:
- (undocumented)
-