Class LDA
- All Implemented Interfaces:
- org.apache.spark.internal.Logging
Terminology:
- "word" = "term": an element of the vocabulary
- "token": instance of a term appearing in a document
- "topic": multinomial distribution over words representing some concept

References:
- Original LDA paper (journal version): Blei, Ng, and Jordan. "Latent Dirichlet Allocation." JMLR, 2003.
Nested Class Summary
Nested classes/interfaces inherited from interface org.apache.spark.internal.Logging
org.apache.spark.internal.Logging.LogStringContext, org.apache.spark.internal.Logging.SparkShellLoggingFilter
Constructor Summary
- LDA(): Constructs an LDA instance with default parameters.
Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| double | getAlpha() | Alias for getDocConcentration() |
| Vector | getAsymmetricAlpha() | Alias for getAsymmetricDocConcentration() |
| Vector | getAsymmetricDocConcentration() | Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta"). |
| double | getBeta() | Alias for getTopicConcentration() |
| int | getCheckpointInterval() | Period (in iterations) between checkpoints. |
| double | getDocConcentration() | Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta"). |
| int | getK() | Number of topics to infer, i.e., the number of soft cluster centers. |
| int | getMaxIterations() | Maximum number of iterations allowed. |
| LDAOptimizer | getOptimizer() | LDAOptimizer used to perform the actual calculation |
| long | getSeed() | Random seed for cluster initialization. |
| double | getTopicConcentration() | Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms. |
| LDAModel | run(JavaPairRDD<Long, Vector> documents) | Java-friendly version of run() |
| LDAModel | run(RDD<scala.Tuple2<Object, Vector>> documents) | Learn an LDA model using the given dataset. |
| LDA | setAlpha(double alpha) | Alias for setDocConcentration() |
| LDA | setAlpha(Vector alpha) | Alias for setDocConcentration() |
| LDA | setBeta(double beta) | Alias for setTopicConcentration() |
| LDA | setCheckpointInterval(int checkpointInterval) | Sets the checkpoint interval (greater than or equal to 1), or disables checkpointing (-1). |
| LDA | setDocConcentration(double docConcentration) | Replicates a Double docConcentration to create a symmetric prior. |
| LDA | setDocConcentration(Vector docConcentration) | Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta"). |
| LDA | setK(int k) | Set the number of topics to infer, i.e., the number of soft cluster centers. |
| LDA | setMaxIterations(int maxIterations) | Set the maximum number of iterations allowed. |
| LDA | setOptimizer(String optimizerName) | Set the LDAOptimizer used to perform the actual calculation by algorithm name. |
| LDA | setOptimizer(LDAOptimizer optimizer) | LDAOptimizer used to perform the actual calculation (default = EMLDAOptimizer) |
| LDA | setSeed(long seed) | Set the random seed for cluster initialization. |
| LDA | setTopicConcentration(double topicConcentration) | Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms. |

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.apache.spark.internal.Logging
initializeForcefully, initializeLogIfNecessary, initializeLogIfNecessary, initializeLogIfNecessary$default$2, isTraceEnabled, log, logBasedOnLevel, logDebug, logDebug, logDebug, logDebug, logError, logError, logError, logError, logInfo, logInfo, logInfo, logInfo, logName, LogStringContext, logTrace, logTrace, logTrace, logTrace, logWarning, logWarning, logWarning, logWarning, MDC, org$apache$spark$internal$Logging$$log_, org$apache$spark$internal$Logging$$log__$eq, withLogContext
Constructor Details
LDA
public LDA()
Constructs an LDA instance with default parameters.
 
Method Details
getK
public int getK()
Number of topics to infer, i.e., the number of soft cluster centers.
- Returns:
- (undocumented)
 
setK
public LDA setK(int k)
Set the number of topics to infer, i.e., the number of soft cluster centers. (default = 10)
- Parameters:
- k - (undocumented)
- Returns:
- (undocumented)
 
getAsymmetricDocConcentration
public Vector getAsymmetricDocConcentration()
Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").
This is the parameter to a Dirichlet distribution.
- Returns:
- (undocumented)
 
getDocConcentration
public double getDocConcentration()
Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").
This method assumes the Dirichlet distribution is symmetric and can be described by a single Double parameter. It should fail if docConcentration is asymmetric.
- Returns:
- (undocumented)
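The symmetric-prior assumption can be illustrated with a small, Spark-free sketch. The class `SymmetricPriorSketch` and its method `symmetricValue` are hypothetical names, not part of the Spark API; a plain `double[]` stands in for the concentration vector, and the exception types are chosen for illustration only.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Spark: a single double can describe the
// prior only when every entry of the concentration vector is the same.
class SymmetricPriorSketch {

    // Returns the shared value of a symmetric concentration vector and
    // fails when the entries differ (assumed failure mode, for illustration).
    static double symmetricValue(double[] docConcentration) {
        if (docConcentration.length == 0) {
            throw new IllegalArgumentException("docConcentration must not be empty");
        }
        double first = docConcentration[0];
        for (double value : docConcentration) {
            if (value != first) {
                throw new IllegalStateException(
                    "docConcentration is asymmetric: " + Arrays.toString(docConcentration));
            }
        }
        return first;
    }

    public static void main(String[] args) {
        System.out.println(symmetricValue(new double[] {0.5, 0.5, 0.5}));
    }
}
```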
 
setDocConcentration
public LDA setDocConcentration(Vector docConcentration)
Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").
This is the parameter to a Dirichlet distribution, where larger values mean more smoothing (more regularization).
If set to a singleton vector Vector(-1), then docConcentration is set automatically. If set to a singleton vector Vector(t) where t != -1, then t is replicated to a vector of length k during LDAOptimizer.initialize(). Otherwise, the docConcentration vector must be length k. (default = Vector(-1) = automatic)
Optimizer-specific parameter settings:
- EM
  - Currently only supports symmetric distributions, so all values in the vector should be the same.
  - Values should be greater than 1.0
  - default = uniformly (50 / k) + 1, where 50/k is common in LDA libraries and +1 follows from Asuncion et al. (2009), who recommend a +1 adjustment for EM.
- Online
  - Values should be greater than or equal to 0
  - default = uniformly (1.0 / k), following the implementation from here.
- Parameters:
- docConcentration - (undocumented)
- Returns:
- (undocumented)
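The resolution rules for docConcentration described above can be sketched without Spark. This is an illustration under stated assumptions, not the optimizer code: `DocConcentrationSketch`, `resolve`, and `defaultValue` are hypothetical names, a `double[]` stands in for Vector, and the optimizer is identified by the strings "em" and "online".

```java
import java.util.Arrays;

// Hypothetical sketch, not Spark code: expands a docConcentration setting to
// a length-k array following the documented rules.
class DocConcentrationSketch {

    static double[] resolve(double[] docConcentration, int k, String optimizer) {
        if (docConcentration.length == 1) {
            double t = docConcentration[0];
            if (t == -1.0) {
                // Vector(-1): choose the optimizer-specific default.
                t = defaultValue(k, optimizer);
            }
            // Vector(t): replicate t to a vector of length k.
            double[] expanded = new double[k];
            Arrays.fill(expanded, t);
            return expanded;
        }
        if (docConcentration.length != k) {
            throw new IllegalArgumentException(
                "docConcentration must have length 1 or k = " + k);
        }
        return docConcentration.clone();
    }

    // Documented defaults: EM uses (50 / k) + 1, Online uses 1.0 / k.
    static double defaultValue(int k, String optimizer) {
        switch (optimizer) {
            case "em":
                return 50.0 / k + 1.0;
            case "online":
                return 1.0 / k;
            default:
                throw new IllegalArgumentException("unknown optimizer: " + optimizer);
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(resolve(new double[] {-1.0}, 5, "em")));
        System.out.println(Arrays.toString(resolve(new double[] {-1.0}, 4, "online")));
    }
}
```

The sketch does not enforce the per-optimizer value ranges (greater than 1.0 for EM, non-negative for Online); those checks are left out to keep it short.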
 
setDocConcentration
public LDA setDocConcentration(double docConcentration)
Replicates a Double docConcentration to create a symmetric prior.
- Parameters:
- docConcentration - (undocumented)
- Returns:
- (undocumented)
 
getAsymmetricAlpha
public Vector getAsymmetricAlpha()
Alias for getAsymmetricDocConcentration()
- Returns:
- (undocumented)
 
getAlpha
public double getAlpha()
Alias for getDocConcentration()
- Returns:
- (undocumented)
 
setAlpha
public LDA setAlpha(Vector alpha)
Alias for setDocConcentration()
- Parameters:
- alpha - (undocumented)
- Returns:
- (undocumented)
 
setAlpha
public LDA setAlpha(double alpha)
Alias for setDocConcentration()
- Parameters:
- alpha - (undocumented)
- Returns:
- (undocumented)
 
getTopicConcentration
public double getTopicConcentration()
Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms.
This is the parameter to a symmetric Dirichlet distribution.
- Returns:
- (undocumented)
- Note:
- The topics' distributions over terms are called "beta" in the original LDA paper by Blei et al., but are called "phi" in many later papers such as Asuncion et al., 2009.
 
setTopicConcentration
public LDA setTopicConcentration(double topicConcentration)
Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms.
This is the parameter to a symmetric Dirichlet distribution.
- Parameters:
- topicConcentration - (undocumented)
- Returns:
- (undocumented)
- Note:
- The topics' distributions over terms are called "beta" in the original LDA paper by Blei et al., but are called "phi" in many later papers such as Asuncion et al., 2009.
If set to -1, then topicConcentration is set automatically. (default = -1 = automatic)
Optimizer-specific parameter settings:
- EM
  - Value should be greater than 1.0
  - default = 0.1 + 1, where 0.1 gives a small amount of smoothing and +1 follows Asuncion et al. (2009), who recommend a +1 adjustment for EM.
- Online
  - Value should be greater than or equal to 0
  - default = (1.0 / k), following the implementation from here.
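As a rough illustration of the automatic setting described above, the following Spark-free sketch resolves a topicConcentration value. `TopicConcentrationSketch` and `resolve` are hypothetical names, and identifying the optimizer by the strings "em" and "online" is an assumption made for the example.

```java
// Hypothetical sketch, not Spark code: resolves topicConcentration, where -1
// means "automatic" and selects the documented optimizer-specific default.
class TopicConcentrationSketch {

    static double resolve(double topicConcentration, int k, String optimizer) {
        if (topicConcentration != -1.0) {
            // An explicit value is used as given.
            return topicConcentration;
        }
        switch (optimizer) {
            case "em":
                // 0.1 for a small amount of smoothing, +1 adjustment for EM.
                return 0.1 + 1.0;
            case "online":
                return 1.0 / k;
            default:
                throw new IllegalArgumentException("unknown optimizer: " + optimizer);
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve(-1.0, 10, "online"));
        System.out.println(resolve(2.5, 10, "em"));
    }
}
```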
 
getBeta
public double getBeta()
Alias for getTopicConcentration()
- Returns:
- (undocumented)
 
setBeta
public LDA setBeta(double beta)
Alias for setTopicConcentration()
- Parameters:
- beta - (undocumented)
- Returns:
- (undocumented)
 
getMaxIterations
public int getMaxIterations()
Maximum number of iterations allowed.
- Returns:
- (undocumented)
 
setMaxIterations
public LDA setMaxIterations(int maxIterations)
Set the maximum number of iterations allowed. (default = 20)
- Parameters:
- maxIterations - (undocumented)
- Returns:
- (undocumented)
 
getSeed
public long getSeed()
Random seed for cluster initialization.
- Returns:
- (undocumented)
 
setSeed
public LDA setSeed(long seed)
Set the random seed for cluster initialization.
- Parameters:
- seed - (undocumented)
- Returns:
- (undocumented)
 
getCheckpointInterval
public int getCheckpointInterval()
Period (in iterations) between checkpoints.
- Returns:
- (undocumented)
 
setCheckpointInterval
public LDA setCheckpointInterval(int checkpointInterval)
Sets the checkpoint interval (greater than or equal to 1), or disables checkpointing (-1). For example, 10 means that the cache will get checkpointed every 10 iterations. Checkpointing helps with recovery (when nodes fail). It also helps with eliminating temporary shuffle files on disk, which can be important when LDA is run for many iterations. If the checkpoint directory is not set in SparkContext, this setting is ignored. (default = 10)
- Parameters:
- checkpointInterval - (undocumented)
- Returns:
- (undocumented)
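The checkpoint-interval semantics can be sketched as a simple predicate. This is a minimal illustration, not Spark's implementation: `CheckpointIntervalSketch` and `shouldCheckpoint` are hypothetical names, and counting iterations from 1 is an assumption.

```java
// Hypothetical sketch, not Spark code: decides whether a given iteration
// would trigger a checkpoint under the documented rules.
class CheckpointIntervalSketch {

    // iteration is assumed to be 1-based.
    static boolean shouldCheckpoint(int iteration, int checkpointInterval,
                                    boolean checkpointDirSet) {
        if (checkpointInterval != -1 && checkpointInterval < 1) {
            throw new IllegalArgumentException(
                "checkpointInterval must be -1 or >= 1, got " + checkpointInterval);
        }
        // -1 disables checkpointing; without a checkpoint directory the
        // setting is ignored.
        if (checkpointInterval == -1 || !checkpointDirSet) {
            return false;
        }
        return iteration % checkpointInterval == 0;
    }

    public static void main(String[] args) {
        for (int iteration = 1; iteration <= 20; iteration++) {
            if (shouldCheckpoint(iteration, 10, true)) {
                System.out.println("checkpoint at iteration " + iteration);
            }
        }
    }
}
```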
 
getOptimizer
public LDAOptimizer getOptimizer()
LDAOptimizer used to perform the actual calculation
- Returns:
- (undocumented)
 
setOptimizer
public LDA setOptimizer(LDAOptimizer optimizer)
LDAOptimizer used to perform the actual calculation (default = EMLDAOptimizer)
- Parameters:
- optimizer - (undocumented)
- Returns:
- (undocumented)
 
setOptimizer
public LDA setOptimizer(String optimizerName)
Set the LDAOptimizer used to perform the actual calculation by algorithm name. Currently "em" and "online" are supported.
- Parameters:
- optimizerName - (undocumented)
- Returns:
- (undocumented)
 
run
public LDAModel run(RDD<scala.Tuple2<Object,Vector>> documents)
Learn an LDA model using the given dataset.
- Parameters:
- documents - RDD of documents, which are term (word) count vectors paired with IDs. The term count vectors are "bags of words" with a fixed-size vocabulary (where the vocabulary size is the length of the vector). Document IDs must be unique and greater than or equal to 0.
- Returns:
- Inferred LDA model
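To make the input format concrete, here is a Spark-free sketch of how one document's tokens could be turned into a term count vector over a fixed vocabulary. `TermCountSketch` and `termCounts` are hypothetical names; a `double[]` stands in for Vector, the pairing with a document ID is omitted, and ignoring tokens outside the vocabulary is an assumption of this example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not Spark code: builds a "bag of words" count vector.
class TermCountSketch {

    // The position of a term in the vocabulary is its index in the vector,
    // so the vector length equals the vocabulary size.
    static double[] termCounts(List<String> vocabulary, List<String> tokens) {
        double[] counts = new double[vocabulary.size()];
        for (String token : tokens) {
            int index = vocabulary.indexOf(token);
            if (index >= 0) {
                counts[index] += 1.0; // one more token of this term
            }
            // Tokens that are not in the vocabulary are skipped (assumption).
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> vocabulary = Arrays.asList("spark", "topic", "model");
        List<String> tokens = Arrays.asList("topic", "model", "topic", "spark", "topic");
        // prints [1.0, 3.0, 1.0]
        System.out.println(Arrays.toString(termCounts(vocabulary, tokens)));
    }
}
```

In the example, "topic" is one term of the vocabulary but appears as three tokens in the document, which is why its count is 3.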
 
run
public LDAModel run(JavaPairRDD<Long, Vector> documents)
Java-friendly version of run()
- Parameters:
- documents - (undocumented)
- Returns:
- (undocumented)
 
 