public class LDA extends Object implements Logging
Latent Dirichlet Allocation (LDA), a topic model designed for text documents.
Terminology:
- "word" = "term": an element of the vocabulary
- "token": an instance of a term appearing in a document
- "topic": a multinomial distribution over words representing some concept
Currently, the underlying implementation uses Expectation-Maximization (EM), implemented according to the Asuncion et al. (2009) paper referenced below.
References:
- Original LDA paper (journal version): Blei, Ng, and Jordan. "Latent Dirichlet Allocation." JMLR, 2003.
  - This class implements their "smoothed" LDA model.
- Paper which clearly explains several algorithms, including EM: Asuncion, Welling, Smyth, and Teh. "On Smoothing and Inference for Topic Models." UAI, 2009.

See also: Latent Dirichlet allocation (Wikipedia), http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
Modifier and Type | Class and Description |
---|---|
static class | LDA.EMOptimizer: Optimizer for the EM algorithm, which stores the data + parameter graph, plus algorithm parameters. |
Constructor and Description |
---|
LDA() |
Modifier and Type | Method and Description |
---|---|
double | getAlpha(): Alias for getDocConcentration(). |
double | getBeta(): Alias for getTopicConcentration(). |
int | getCheckpointInterval(): Period (in iterations) between checkpoints. |
double | getDocConcentration(): Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta"). |
int | getK(): Number of topics to infer. |
int | getMaxIterations(): Maximum number of iterations for learning. |
long | getSeed(): Random seed. |
double | getTopicConcentration(): Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms. |
static int | index2term(long termIndex) |
static boolean | isDocumentVertex(scala.Tuple2<Object,?> v) |
static boolean | isTermVertex(scala.Tuple2<Object,?> v) |
DistributedLDAModel | run(JavaPairRDD<Long,Vector> documents): Java-friendly version of run(). |
DistributedLDAModel | run(RDD<scala.Tuple2<Object,Vector>> documents): Learn an LDA model using the given dataset. |
LDA | setAlpha(double alpha): Alias for setDocConcentration(). |
LDA | setBeta(double beta): Alias for setTopicConcentration(). |
LDA | setCheckpointInterval(int checkpointInterval): Period (in iterations) between checkpoints (default = 10). |
LDA | setDocConcentration(double docConcentration): Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta"). |
LDA | setK(int k): Number of topics to infer. |
LDA | setMaxIterations(int maxIterations): Maximum number of iterations for learning. |
LDA | setSeed(long seed): Random seed. |
LDA | setTopicConcentration(double topicConcentration): Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms. |
static long | term2index(int term): Term vertex IDs are {-1, -2, ..., -vocabSize}. |
Methods inherited from class Object: equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface Logging: initializeIfNecessary, initializeLogging, isTraceEnabled, log_, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarning
public static long term2index(int term)
public static int index2term(long termIndex)
public static boolean isDocumentVertex(scala.Tuple2<Object,?> v)
public static boolean isTermVertex(scala.Tuple2<Object,?> v)
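The vertex ID scheme behind these four static helpers can be sketched as follows. This is a minimal illustration and not the library's source: it assumes term numbers are 0-based (so that they map onto the documented term vertex IDs {-1, -2, ..., -vocabSize}) and that document vertices are exactly the non-negative IDs, which matches the requirement that document IDs be >= 0. The class name `VertexIdSketch` and the methods `isDocumentId` and `isTermId` are hypothetical and operate on raw IDs rather than on `scala.Tuple2` vertices.

```java
// Hypothetical sketch of the vertex ID scheme; not Spark's actual code.
final class VertexIdSketch {

    /** Maps a 0-based term number to a negative vertex ID: 0 -> -1, 1 -> -2, ... */
    static long term2index(int term) {
        return -(1L + term);
    }

    /** Inverse of term2index: -1 -> 0, -2 -> 1, ... */
    static int index2term(long termIndex) {
        return (int) (-(1L + termIndex));
    }

    /** Assumption: document vertices use the non-negative IDs. */
    static boolean isDocumentId(long id) {
        return id >= 0;
    }

    /** Assumption: term vertices use the negative IDs. */
    static boolean isTermId(long id) {
        return id < 0;
    }

    public static void main(String[] args) {
        int vocabSize = 3;
        for (int term = 0; term < vocabSize; term++) {
            long id = term2index(term);
            System.out.println("term " + term + " -> vertex " + id
                    + " -> term " + index2term(id));
        }
    }
}
```

Keeping the two ID ranges disjoint lets documents and terms share one vertex set without collisions.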
public int getK()
public LDA setK(int k)
public double getDocConcentration()
This is the parameter to a symmetric Dirichlet distribution.
public LDA setDocConcentration(double docConcentration)
This is the parameter to a symmetric Dirichlet distribution.
This value should be > 1.0, where larger values mean more smoothing (more regularization). If set to -1, then docConcentration is set automatically. (default = -1 = automatic)
Automatic setting of parameter:
- For EM: default = (50 / k) + 1.
  - The 50/k is common in LDA libraries.
  - The +1 follows Asuncion et al. (2009), who recommend a +1 adjustment for EM.
Note: The restriction > 1.0 may be relaxed in the future (allowing sparse solutions), but values in (0,1) are not yet supported.
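As a worked illustration of the automatic setting, the sketch below resolves a requested docConcentration into an effective value. It assumes that -1 is the only "automatic" sentinel and that any other value must be strictly greater than 1.0, as the text above states. The class and method names are hypothetical and not part of the LDA API.

```java
// Hypothetical helper illustrating the documented default; not Spark's actual code.
final class DocConcentrationSketch {

    /**
     * Returns the effective alpha for k topics.
     * -1 selects the documented EM default (50 / k) + 1.
     */
    static double resolveDocConcentration(double docConcentration, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive, but was " + k);
        }
        if (docConcentration == -1.0) {
            return 50.0 / k + 1.0;
        }
        if (docConcentration <= 1.0) {
            throw new IllegalArgumentException(
                "docConcentration must be > 1.0 (or -1 for automatic), but was " + docConcentration);
        }
        return docConcentration;
    }

    public static void main(String[] args) {
        System.out.println(resolveDocConcentration(-1.0, 10)); // prints 6.0
        System.out.println(resolveDocConcentration(2.5, 10));  // prints 2.5
    }
}
```

With k = 10 topics the automatic value is 50 / 10 + 1 = 6.0; with k = 50 it drops to 2.0, so the default smoothing shrinks as the number of topics grows.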
public double getAlpha()
Alias for getDocConcentration().
public LDA setAlpha(double alpha)
Alias for setDocConcentration().
public double getTopicConcentration()
This is the parameter to a symmetric Dirichlet distribution.
Note: The topics' distributions over terms are called "beta" in the original LDA paper by Blei et al., but are called "phi" in many later papers such as Asuncion et al., 2009.
public LDA setTopicConcentration(double topicConcentration)
This is the parameter to a symmetric Dirichlet distribution.
Note: The topics' distributions over terms are called "beta" in the original LDA paper by Blei et al., but are called "phi" in many later papers such as Asuncion et al., 2009.
This value should be > 1.0. If set to -1, then topicConcentration is set automatically. (default = -1 = automatic)
Automatic setting of parameter:
- For EM: default = 0.1 + 1.
  - The 0.1 gives a small amount of smoothing.
  - The +1 follows Asuncion et al. (2009), who recommend a +1 adjustment for EM.
Note: The restriction > 1.0 may be relaxed in the future (allowing sparse solutions), but values in (0,1) are not yet supported.
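One way to see why the +1 adjustment and the > 1.0 restriction go together is the MAP-style estimate described by Asuncion et al. (2009), in which each term's probability within a topic is proportional to its count plus (eta - 1). The sketch below applies that formula to a single topic. It is an illustration under that assumption, not a description of this class's internals, and the names `TopicSmoothingSketch` and `mapTopicDistribution` are hypothetical.

```java
// Hypothetical illustration of MAP smoothing with a symmetric Dirichlet prior.
final class TopicSmoothingSketch {

    /**
     * Computes phi[w] = (count[w] + eta - 1) / (total + W * (eta - 1)),
     * where W is the vocabulary size and total is the sum of the counts.
     * Requires eta > 1 so that every term keeps a positive probability.
     */
    static double[] mapTopicDistribution(double[] termCounts, double eta) {
        if (eta <= 1.0) {
            throw new IllegalArgumentException("eta must be > 1.0, but was " + eta);
        }
        double total = 0.0;
        for (double c : termCounts) {
            total += c;
        }
        double pseudoCount = eta - 1.0;
        double denominator = total + termCounts.length * pseudoCount;
        double[] phi = new double[termCounts.length];
        for (int w = 0; w < termCounts.length; w++) {
            phi[w] = (termCounts[w] + pseudoCount) / denominator;
        }
        return phi;
    }

    public static void main(String[] args) {
        // With the documented default eta = 0.1 + 1, each term gets a 0.1 pseudo-count.
        double[] phi = mapTopicDistribution(new double[] {3.0, 1.0, 0.0}, 0.1 + 1.0);
        for (double p : phi) {
            System.out.println(p);
        }
    }
}
```

Under this reading, a concentration of 1.1 behaves like a pseudo-count of 0.1 per term, and a value at or below 1.0 would give an unseen term zero or negative weight.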
public double getBeta()
Alias for getTopicConcentration().
public LDA setBeta(double beta)
Alias for setTopicConcentration().
public int getMaxIterations()
public LDA setMaxIterations(int maxIterations)
public long getSeed()
public LDA setSeed(long seed)
public int getCheckpointInterval()
public LDA setCheckpointInterval(int checkpointInterval)
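Period (in iterations) between checkpoints (default = 10). If the checkpoint directory is not set in SparkContext, this setting is ignored.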
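To make "period (in iterations)" concrete, the sketch below lists which iterations would trigger a checkpoint. It rests on two labeled assumptions: iterations are counted from 1, and a checkpoint is taken whenever the iteration number is a multiple of the interval. The exact schedule used by the optimizer is not stated here, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical schedule illustration; the real optimizer may count iterations differently.
final class CheckpointScheduleSketch {

    /** Returns the 1-based iterations at which a checkpoint would be taken. */
    static List<Integer> checkpointIterations(
            int maxIterations, int checkpointInterval, boolean checkpointDirSet) {
        List<Integer> iterations = new ArrayList<>();
        if (!checkpointDirSet || checkpointInterval <= 0) {
            return iterations; // no checkpoint directory: the interval has no effect
        }
        for (int iteration = 1; iteration <= maxIterations; iteration++) {
            if (iteration % checkpointInterval == 0) {
                iterations.add(iteration);
            }
        }
        return iterations;
    }

    public static void main(String[] args) {
        System.out.println(checkpointIterations(25, 10, true));  // prints [10, 20]
        System.out.println(checkpointIterations(25, 10, false)); // prints []
    }
}
```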
public DistributedLDAModel run(RDD<scala.Tuple2<Object,Vector>> documents)
Learn an LDA model using the given dataset.
Parameters:
documents - RDD of documents, which are term (word) count vectors paired with IDs. The term count vectors are "bags of words" with a fixed-size vocabulary (where the vocabulary size is the length of the vector). Document IDs must be unique and >= 0.
public DistributedLDAModel run(JavaPairRDD<Long,Vector> documents)
Java-friendly version of run().
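The "bag of words" input format expected by run() can be illustrated without Spark. The sketch below turns a tokenized document into a term count array whose length equals the vocabulary size. It is a plain-Java illustration only: the resulting array would still need to be wrapped in an MLlib Vector and paired with a unique, non-negative document ID before being passed to run(). The class name `TermCountSketch` and the choice to drop out-of-vocabulary tokens are assumptions made for this example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical preprocessing sketch; not part of the LDA API.
final class TermCountSketch {

    /**
     * Counts how many tokens of each vocabulary term occur in a document.
     * Index i of the result holds the count for vocabulary.get(i).
     * Assumption: tokens outside the vocabulary are ignored.
     */
    static double[] termCounts(List<String> vocabulary, List<String> tokens) {
        double[] counts = new double[vocabulary.size()];
        for (String token : tokens) {
            int termIndex = vocabulary.indexOf(token);
            if (termIndex >= 0) {
                counts[termIndex] += 1.0;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> vocabulary = Arrays.asList("apple", "banana", "cherry");
        List<String> tokens = Arrays.asList("banana", "apple", "banana", "durian");
        // Four tokens, two distinct in-vocabulary terms.
        System.out.println(Arrays.toString(termCounts(vocabulary, tokens))); // prints [1.0, 2.0, 0.0]
    }
}
```

The example also shows the term/token distinction from the terminology above: "banana" is one term but contributes two tokens to the document.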