public class LDA
extends Object
implements org.apache.spark.internal.Logging
Terminology:
- "word" = "term": an element of the vocabulary
- "token": instance of a term appearing in a document
- "topic": multinomial distribution over words representing some concept
References:
- Original LDA paper (journal version): Blei, Ng, and Jordan. "Latent Dirichlet Allocation." JMLR, 2003.
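The distinction between terms and tokens can be illustrated with a minimal plain-Java sketch that turns the tokens of one document into a term count vector over a fixed vocabulary. The class name TermCountSketch and the helper termCounts are hypothetical and not part of Spark. Skipping tokens that are outside the vocabulary is an assumption made for this illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of the Spark API.
final class TermCountSketch {

    /** Counts how many tokens of each vocabulary term occur in one document. */
    static double[] termCounts(List<String> vocabulary, List<String> tokens) {
        double[] counts = new double[vocabulary.size()];
        for (String token : tokens) {
            int termIndex = vocabulary.indexOf(token);
            if (termIndex >= 0) {
                counts[termIndex] += 1.0; // one more token of this term
            }
            // Assumption: tokens that are not in the vocabulary are ignored.
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> vocabulary = Arrays.asList("spark", "topic", "model");
        List<String> tokens = Arrays.asList("topic", "model", "topic", "the");
        // Four tokens, three vocabulary terms; prints [0.0, 2.0, 1.0]
        System.out.println(Arrays.toString(termCounts(vocabulary, tokens)));
    }
}
```

The length of the resulting array equals the vocabulary size, which matches the "bag of words" input that run() expects for each document.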
Constructor and Description |
---|
LDA(): Constructs an LDA instance with default parameters. |
Modifier and Type | Method and Description |
---|---|
double | getAlpha(): Alias for getDocConcentration |
Vector | getAsymmetricAlpha(): Alias for getAsymmetricDocConcentration |
Vector | getAsymmetricDocConcentration(): Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta"). |
double | getBeta(): Alias for getTopicConcentration |
int | getCheckpointInterval(): Period (in iterations) between checkpoints. |
double | getDocConcentration(): Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta"). |
int | getK(): Number of topics to infer, i.e., the number of soft cluster centers. |
int | getMaxIterations(): Maximum number of iterations allowed. |
LDAOptimizer | getOptimizer(): LDAOptimizer used to perform the actual calculation |
long | getSeed(): Random seed for cluster initialization. |
double | getTopicConcentration(): Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms. |
LDAModel | run(JavaPairRDD<Long,Vector> documents): Java-friendly version of run() |
LDAModel | run(RDD<scala.Tuple2<Object,Vector>> documents): Learn an LDA model using the given dataset. |
LDA | setAlpha(double alpha): Alias for setDocConcentration() |
LDA | setAlpha(Vector alpha): Alias for setDocConcentration() |
LDA | setBeta(double beta): Alias for setTopicConcentration() |
LDA | setCheckpointInterval(int checkpointInterval): Sets the checkpoint interval (greater than or equal to 1), or disables checkpointing (-1). |
LDA | setDocConcentration(double docConcentration): Replicates a Double docConcentration to create a symmetric prior. |
LDA | setDocConcentration(Vector docConcentration): Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta"). |
LDA | setK(int k): Set the number of topics to infer, i.e., the number of soft cluster centers. |
LDA | setMaxIterations(int maxIterations): Set the maximum number of iterations allowed. |
LDA | setOptimizer(LDAOptimizer optimizer): LDAOptimizer used to perform the actual calculation (default = EMLDAOptimizer) |
LDA | setOptimizer(String optimizerName): Set the LDAOptimizer used to perform the actual calculation by algorithm name. |
LDA | setSeed(long seed): Set the random seed for cluster initialization. |
LDA | setTopicConcentration(double topicConcentration): Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms. |
Methods inherited from class Object: equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.apache.spark.internal.Logging: $init$, initializeForcefully, initializeLogIfNecessary, initializeLogIfNecessary, initializeLogIfNecessary$default$2, initLock, isTraceEnabled, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarning, org$apache$spark$internal$Logging$$log__$eq, org$apache$spark$internal$Logging$$log_, uninitialize
public int getK()
public LDA setK(int k)
Parameters: k - (undocumented)
public Vector getAsymmetricDocConcentration()
This is the parameter to a Dirichlet distribution.
public double getDocConcentration()
This method assumes the Dirichlet distribution is symmetric and can be described by a single Double parameter. It should fail if docConcentration is asymmetric.
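The symmetric-prior assumption can be sketched in plain Java as follows. This is not Spark's implementation: the class SymmetricAlphaSketch, the helper symmetricValue, and the choice of exception types are assumptions made for illustration, with a plain double[] standing in for Vector.

```java
// Hypothetical helper for illustration only; not part of the Spark API.
final class SymmetricAlphaSketch {

    /**
     * Returns the single value that describes a symmetric Dirichlet prior,
     * and fails if the entries of the vector are not all equal.
     */
    static double symmetricValue(double[] docConcentration) {
        if (docConcentration.length == 0) {
            throw new IllegalArgumentException("docConcentration must not be empty");
        }
        double first = docConcentration[0];
        for (double value : docConcentration) {
            if (value != first) {
                // Assumption: an asymmetric prior is reported with an exception.
                throw new IllegalStateException("docConcentration is asymmetric");
            }
        }
        return first;
    }
}
```

When the prior may be asymmetric, reading the full vector through getAsymmetricDocConcentration() avoids the failure case.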
public LDA setDocConcentration(Vector docConcentration)
This is the parameter to a Dirichlet distribution, where larger values mean more smoothing (more regularization).
If set to a singleton vector Vector(-1), then docConcentration is set automatically. If set to singleton vector Vector(t) where t != -1, then t is replicated to a vector of length k during LDAOptimizer.initialize(). Otherwise, the docConcentration vector must be length k. (default = Vector(-1) = automatic)
Optimizer-specific parameter settings:
- EM
  - Currently only supports symmetric distributions, so all values in the vector should be the same.
  - Values should be greater than 1.0
  - default = uniformly (50 / k) + 1, where 50/k is common in LDA libraries and +1 follows from Asuncion et al. (2009), who recommend a +1 adjustment for EM.
- Online
  - Values should be greater than or equal to 0
  - default = uniformly (1.0 / k), following the implementation from here.
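The resolution rules above (automatic default, replication of a singleton, or a full length-k vector) can be sketched in plain Java. The class DocConcentrationSketch, the helper resolve, and the boolean em flag are hypothetical stand-ins, with a double[] in place of Vector. The sketch does not reproduce LDAOptimizer.initialize() and does not check the per-optimizer value ranges.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of the Spark API.
final class DocConcentrationSketch {

    /**
     * Expands a user-supplied docConcentration into a vector of length k.
     * em = true selects the EM default, em = false the Online default.
     */
    static double[] resolve(double[] docConcentration, int k, boolean em) {
        if (docConcentration.length == 1) {
            double t = docConcentration[0];
            double value;
            if (t == -1.0) {
                // Automatic: (50 / k) + 1 for EM, 1.0 / k for Online.
                value = em ? (50.0 / k) + 1.0 : 1.0 / k;
            } else {
                value = t; // singleton Vector(t) is replicated as-is
            }
            double[] expanded = new double[k];
            Arrays.fill(expanded, value);
            return expanded;
        }
        if (docConcentration.length != k) {
            throw new IllegalArgumentException(
                "docConcentration must have length 1 or k = " + k);
        }
        return docConcentration.clone();
    }

    public static void main(String[] args) {
        // EM default with k = 10: every entry is (50 / 10) + 1 = 6.0
        System.out.println(Arrays.toString(resolve(new double[] {-1.0}, 10, true)));
    }
}
```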
Parameters: docConcentration - (undocumented)
public LDA setDocConcentration(double docConcentration)
Replicates a Double docConcentration to create a symmetric prior.
Parameters: docConcentration - (undocumented)
public Vector getAsymmetricAlpha()
Alias for getAsymmetricDocConcentration
public double getAlpha()
Alias for getDocConcentration
public LDA setAlpha(Vector alpha)
Alias for setDocConcentration()
Parameters: alpha - (undocumented)
public LDA setAlpha(double alpha)
Alias for setDocConcentration()
Parameters: alpha - (undocumented)
public double getTopicConcentration()
This is the parameter to a symmetric Dirichlet distribution.
public LDA setTopicConcentration(double topicConcentration)
This is the parameter to a symmetric Dirichlet distribution.
If set to -1, then topicConcentration is set automatically. (default = -1 = automatic)
Optimizer-specific parameter settings:
- EM
  - Value should be greater than 1.0
  - default = 0.1 + 1, where 0.1 gives a small amount of smoothing and +1 follows Asuncion et al. (2009), who recommend a +1 adjustment for EM.
- Online
  - Value should be greater than or equal to 0
  - default = (1.0 / k), following the implementation from here.
Parameters: topicConcentration - (undocumented)
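The automatic setting and the per-optimizer ranges for topicConcentration can be sketched in plain Java. The class TopicConcentrationSketch, the helper resolve, and the boolean em flag are hypothetical. Rejecting out-of-range values with an exception is an assumption of this sketch, since the text only says what the values should be.

```java
// Hypothetical helper for illustration only; not part of the Spark API.
final class TopicConcentrationSketch {

    /**
     * Resolves topicConcentration for k topics.
     * em = true selects the EM rules, em = false the Online rules.
     */
    static double resolve(double topicConcentration, int k, boolean em) {
        if (topicConcentration == -1.0) {
            // Automatic: 0.1 + 1 for EM, 1.0 / k for Online.
            return em ? 0.1 + 1.0 : 1.0 / k;
        }
        if (em && topicConcentration <= 1.0) {
            // Assumption: values outside the stated range are rejected.
            throw new IllegalArgumentException("EM expects topicConcentration > 1.0");
        }
        if (!em && topicConcentration < 0.0) {
            throw new IllegalArgumentException("Online expects topicConcentration >= 0");
        }
        return topicConcentration;
    }
}
```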
public double getBeta()
Alias for getTopicConcentration
public LDA setBeta(double beta)
Alias for setTopicConcentration()
Parameters: beta - (undocumented)
public int getMaxIterations()
public LDA setMaxIterations(int maxIterations)
Parameters: maxIterations - (undocumented)
public long getSeed()
public LDA setSeed(long seed)
Parameters: seed - (undocumented)
public int getCheckpointInterval()
public LDA setCheckpointInterval(int checkpointInterval)
If the checkpoint directory is not set in SparkContext, this setting is ignored. (default = 10)
Parameters: checkpointInterval - (undocumented)
See Also: SparkContext.setCheckpointDir(java.lang.String)
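The checkpoint interval semantics (a period of at least 1 iteration, -1 to disable, ignored when no checkpoint directory is set) can be sketched in plain Java. The class CheckpointIntervalSketch and both helpers are hypothetical. Checkpointing on every iteration that is a positive multiple of the interval is an assumption of this sketch, not a statement about Spark's internal scheduling.

```java
// Hypothetical helper for illustration only; not part of the Spark API.
final class CheckpointIntervalSketch {

    /** An interval is valid if it is -1 (disabled) or at least 1. */
    static boolean isValid(int checkpointInterval) {
        return checkpointInterval == -1 || checkpointInterval >= 1;
    }

    /**
     * Decides whether a checkpoint is due after the given iteration.
     * checkpointDirSet models whether a checkpoint directory was configured.
     */
    static boolean shouldCheckpoint(int iteration, int checkpointInterval,
                                    boolean checkpointDirSet) {
        if (!isValid(checkpointInterval)) {
            throw new IllegalArgumentException(
                "checkpointInterval must be -1 or >= 1, got " + checkpointInterval);
        }
        if (!checkpointDirSet || checkpointInterval == -1) {
            return false; // setting ignored or checkpointing disabled
        }
        // Assumption: a checkpoint is due on positive multiples of the interval.
        return iteration > 0 && iteration % checkpointInterval == 0;
    }
}
```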
public LDAOptimizer getOptimizer()
public LDA setOptimizer(LDAOptimizer optimizer)
Parameters: optimizer - (undocumented)
public LDA setOptimizer(String optimizerName)
Parameters: optimizerName - (undocumented)
public LDAModel run(RDD<scala.Tuple2<Object,Vector>> documents)
Parameters: documents - RDD of documents, which are term (word) count vectors paired with IDs. The term count vectors are "bags of words" with a fixed-size vocabulary (where the vocabulary size is the length of the vector). Document IDs must be unique and greater than or equal to 0.
public LDAModel run(JavaPairRDD<Long,Vector> documents)
Java-friendly version of run()
Parameters: documents - (undocumented)
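The input requirements for run() (unique, non-negative document IDs and term count vectors of one fixed length) can be checked before building the RDD. The following plain-Java sketch uses parallel arrays in place of an RDD of (ID, Vector) pairs. The class DocumentInputSketch and the helper validate are hypothetical and not part of Spark.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration only; not part of the Spark API.
final class DocumentInputSketch {

    /**
     * Checks that IDs are unique and >= 0, and that all term count vectors
     * share one length (the vocabulary size). Throws on the first violation.
     */
    static void validate(long[] ids, double[][] termCounts) {
        if (ids.length != termCounts.length) {
            throw new IllegalArgumentException("each document needs one ID and one vector");
        }
        Set<Long> seenIds = new HashSet<>();
        int vocabularySize = termCounts.length == 0 ? 0 : termCounts[0].length;
        for (int i = 0; i < ids.length; i++) {
            if (ids[i] < 0) {
                throw new IllegalArgumentException("negative document ID: " + ids[i]);
            }
            if (!seenIds.add(ids[i])) {
                throw new IllegalArgumentException("duplicate document ID: " + ids[i]);
            }
            if (termCounts[i].length != vocabularySize) {
                throw new IllegalArgumentException(
                    "document " + ids[i] + " has a vector of the wrong length");
            }
        }
    }
}
```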