LDA (Spark 1.3.1 JavaDoc)

Object
- org.apache.spark.mllib.clustering.LDA

All Implemented Interfaces:

Logging
```
public class LDA
extends Object
implements Logging
```
:: Experimental ::
Latent Dirichlet Allocation (LDA), a topic model designed for text documents.
Terminology:
- "word" = "term": an element of the vocabulary
- "token": an instance of a term appearing in a document
- "topic": a multinomial distribution over words representing some concept
Currently, the underlying implementation uses Expectation-Maximization (EM), implemented according to the Asuncion et al. (2009) paper referenced below.
References:
- Original LDA paper (journal version): Blei, Ng, and Jordan. "Latent Dirichlet Allocation." JMLR, 2003.
  - This class implements their "smoothed" LDA model.
- Paper which clearly explains several algorithms, including EM: Asuncion, Welling, Smyth, and Teh. "On Smoothing and Inference for Topic Models." UAI, 2009.
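
For orientation, the "smoothed" LDA model can be written as the generative process below. This is a sketch in standard notation rather than text from the Spark source. The symbols are assumptions chosen to match this page: K is the number of topics (`setK`), D the number of documents, N_d the number of tokens in document d, alpha the document concentration, eta the topic concentration, theta a document's distribution over topics, and phi a topic's distribution over terms. Each Multinomial below denotes a single draw.

```latex
\begin{align*}
\phi_k &\sim \mathrm{Dirichlet}(\eta), & k &= 1, \dots, K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1, \dots, D \\
z_{d,i} \mid \theta_d &\sim \mathrm{Multinomial}(\theta_d), & i &= 1, \dots, N_d \\
w_{d,i} \mid z_{d,i}, \phi &\sim \mathrm{Multinomial}(\phi_{z_{d,i}}), & i &= 1, \dots, N_d
\end{align*}
```

Both Dirichlet priors are symmetric, so alpha and eta are single scalars, which is why the corresponding setters take a `double`.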

See Also:
Latent Dirichlet allocation (Wikipedia): http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation

Nested Class Summary

Nested Classes
Modifier and Type	Class and Description
`static class`	`LDA.EMOptimizer` Optimizer for EM algorithm which stores data + parameter graph, plus algorithm parameters.

Constructor Summary

Constructors
Constructor and Description
`LDA()`

Method Summary

Methods
Modifier and Type	Method and Description
`double`	`getAlpha()` Alias for `getDocConcentration()`
`double`	`getBeta()` Alias for `getTopicConcentration()`
`int`	`getCheckpointInterval()` Period (in iterations) between checkpoints.
`double`	`getDocConcentration()` Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").
`int`	`getK()` Number of topics to infer.
`int`	`getMaxIterations()` Maximum number of iterations for learning.
`long`	`getSeed()` Random seed
`double`	`getTopicConcentration()` Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms.
`static int`	`index2term(long termIndex)`
`static boolean`	`isDocumentVertex(scala.Tuple2<Object,?> v)`
`static boolean`	`isTermVertex(scala.Tuple2<Object,?> v)`
`DistributedLDAModel`	`run(JavaPairRDD<Long,Vector> documents)` Java-friendly version of `run()`
`DistributedLDAModel`	`run(RDD<scala.Tuple2<Object,Vector>> documents)` Learn an LDA model using the given dataset.
`LDA`	`setAlpha(double alpha)` Alias for `setDocConcentration()`
`LDA`	`setBeta(double beta)` Alias for `setTopicConcentration()`
`LDA`	`setCheckpointInterval(int checkpointInterval)` Period (in iterations) between checkpoints (default = 10).
`LDA`	`setDocConcentration(double docConcentration)` Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").
`LDA`	`setK(int k)` Number of topics to infer.
`LDA`	`setMaxIterations(int maxIterations)` Maximum number of iterations for learning.
`LDA`	`setSeed(long seed)` Random seed
`LDA`	`setTopicConcentration(double topicConcentration)` Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms.
`static long`	`term2index(int term)` Term vertex IDs are {-1, -2, ..., -vocabSize}

Methods inherited from class Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.apache.spark.Logging
initializeIfNecessary, initializeLogging, isTraceEnabled, log_, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarning

- Constructor Detail
  - LDA
```
public LDA()
```
- Method Detail
  - term2index
```
public static long term2index(int term)
```
    Term vertex IDs are {-1, -2, ..., -vocabSize}
  - index2term
```
public static int index2term(long termIndex)
```
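    The two methods above convert between term indices and term vertex IDs. The following is a minimal sketch of one mapping that produces the documented ID range {-1, -2, ..., -vocabSize}. It assumes term indices are 0-based and is an illustration, not a copy of the Spark source. The class name `TermVertexIds` and the `vocabSize` value are hypothetical.

```java
// Illustrative sketch only; assumes 0-based term indices.
final class TermVertexIds {
    private TermVertexIds() {}

    /** Maps a 0-based term index to a negative vertex ID: 0 -> -1, 1 -> -2, ... */
    public static long term2index(int term) {
        return -(1L + term);
    }

    /** Inverse mapping: -1 -> 0, -2 -> 1, ... */
    public static int index2term(long termIndex) {
        return (int) (-(1L + termIndex));
    }

    public static void main(String[] args) {
        int vocabSize = 5; // hypothetical vocabulary size
        for (int term = 0; term < vocabSize; term++) {
            long id = term2index(term);
            System.out.println("term " + term + " <-> vertex ID " + id
                    + " (round trip: " + index2term(id) + ")");
        }
    }
}
```

    Keeping term IDs strictly negative leaves the non-negative range free for document IDs, so both kinds of vertex can share one ID space.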
  - isDocumentVertex
```
public static boolean isDocumentVertex(scala.Tuple2<Object,?> v)
```
  - isTermVertex
```
public static boolean isTermVertex(scala.Tuple2<Object,?> v)
```
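    Neither predicate is described on this page. Given that term vertex IDs are negative and document IDs must be >= 0, a plausible reading is that the sign of the vertex ID distinguishes the two kinds. The sketch below works under that assumption and takes a raw `long` ID instead of a `scala.Tuple2`. The class and method names are hypothetical.

```java
// Illustrative sketch only; assumes the vertex kind is determined by the sign of its ID.
final class VertexKinds {
    private VertexKinds() {}

    /** Document IDs are documented as unique and >= 0. */
    public static boolean isDocumentVertexId(long vertexId) {
        return vertexId >= 0;
    }

    /** Term vertex IDs are documented as -1, -2, ..., -vocabSize. */
    public static boolean isTermVertexId(long vertexId) {
        return vertexId < 0;
    }

    public static void main(String[] args) {
        long[] sampleIds = {0L, 7L, -1L, -42L}; // hypothetical vertex IDs
        for (long id : sampleIds) {
            String kind = isDocumentVertexId(id) ? "document" : "term";
            System.out.println("vertex " + id + " is a " + kind + " vertex");
        }
    }
}
```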
  - getK
```
public int getK()
```
    Number of topics to infer, i.e., the number of soft cluster centers.
  - setK
```
public LDA setK(int k)
```
    Number of topics to infer, i.e., the number of soft cluster centers. (default = 10)
  - getDocConcentration
```
public double getDocConcentration()
```
    Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").
    This is the parameter to a symmetric Dirichlet distribution.
  - setDocConcentration
```
public LDA setDocConcentration(double docConcentration)
```
    Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").
    This is the parameter to a symmetric Dirichlet distribution.
    This value should be > 1.0, where larger values mean more smoothing (more regularization). If set to -1, then docConcentration is set automatically. (default = -1 = automatic)
    Automatic setting of parameter:
    - For EM: default = (50 / k) + 1.
      - The 50/k is common in LDA libraries.
      - The +1 follows Asuncion et al. (2009), who recommend a +1 adjustment for EM.
    Note: The restriction > 1.0 may be relaxed in the future (allowing sparse solutions), but values in (0,1) are not yet supported.
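    To make the automatic rule concrete, the sketch below computes the value that the documented formula (50 / k) + 1 yields when docConcentration is left at -1. It only mirrors the arithmetic described above. The class name `DocConcentrationDefault` and the method `resolve` are hypothetical and are not part of the Spark API.

```java
// Illustrative sketch only; reproduces the documented formula, not the Spark implementation.
final class DocConcentrationDefault {
    private DocConcentrationDefault() {}

    /** Returns the explicit value, or (50 / k) + 1 when the value is -1 (automatic). */
    public static double resolve(double docConcentration, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive, but was " + k);
        }
        if (docConcentration == -1.0) {
            return 50.0 / k + 1.0; // 50/k heuristic plus the +1 adjustment for EM
        }
        return docConcentration;
    }

    public static void main(String[] args) {
        int[] topicCounts = {10, 50, 100}; // hypothetical values of k
        for (int k : topicCounts) {
            System.out.println("k = " + k + " -> automatic docConcentration = " + resolve(-1.0, k));
        }
    }
}
```

    Because the 50/k term shrinks as k grows, the automatic value approaches the lower bound of 1.0 for large topic counts, which means less smoothing.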
  - getAlpha
```
public double getAlpha()
```
    Alias for getDocConcentration()
  - setAlpha
```
public LDA setAlpha(double alpha)
```
    Alias for setDocConcentration()
  - getTopicConcentration
```
public double getTopicConcentration()
```
    Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms.
    This is the parameter to a symmetric Dirichlet distribution.
    Note: The topics' distributions over terms are called "beta" in the original LDA paper by Blei et al., but are called "phi" in many later papers such as Asuncion et al., 2009.
  - setTopicConcentration
```
public LDA setTopicConcentration(double topicConcentration)
```
    Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms.
    This is the parameter to a symmetric Dirichlet distribution.
    Note: The topics' distributions over terms are called "beta" in the original LDA paper by Blei et al., but are called "phi" in many later papers such as Asuncion et al., 2009.
    This value should be > 1.0. If set to -1, then topicConcentration is set automatically. (default = -1 = automatic)
    Automatic setting of parameter:
    - For EM: default = 0.1 + 1.
      - The 0.1 gives a small amount of smoothing.
      - The +1 follows Asuncion et al. (2009), who recommend a +1 adjustment for EM.
    Note: The restriction > 1.0 may be relaxed in the future (allowing sparse solutions), but values in (0,1) are not yet supported.
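    The note above restricts explicit values to > 1.0 and states that values in (0,1) are not yet supported. The sketch below shows how a caller might check a value before passing it to `setTopicConcentration`, under that reading of the note. The class name, the `AUTO` constant, and the choice of exception are hypothetical. How Spark itself reports an invalid value is not stated on this page.

```java
// Illustrative sketch only; a caller-side check based on the documented restriction.
final class TopicConcentrationCheck {
    private TopicConcentrationCheck() {}

    /** Sentinel documented as "set automatically". */
    public static final double AUTO = -1.0;

    /** Returns the effective value: 0.1 + 1 for AUTO, otherwise the validated input. */
    public static double resolve(double topicConcentration) {
        if (topicConcentration == AUTO) {
            return 0.1 + 1.0; // small smoothing term plus the +1 adjustment for EM
        }
        if (topicConcentration > 1.0) {
            return topicConcentration;
        }
        throw new IllegalArgumentException(
                "topicConcentration must be > 1.0 (or -1 for automatic), but was "
                        + topicConcentration);
    }

    public static void main(String[] args) {
        System.out.println("automatic -> " + resolve(AUTO));
        System.out.println("explicit 1.5 -> " + resolve(1.5));
    }
}
```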
  - getBeta
```
public double getBeta()
```
    Alias for getTopicConcentration()
  - setBeta
```
public LDA setBeta(double beta)
```
    Alias for setTopicConcentration()
  - getMaxIterations
```
public int getMaxIterations()
```
    Maximum number of iterations for learning.
  - setMaxIterations
```
public LDA setMaxIterations(int maxIterations)
```
    Maximum number of iterations for learning. (default = 20)
  - getSeed
```
public long getSeed()
```
    Random seed
  - setSeed
```
public LDA setSeed(long seed)
```
    Random seed
  - getCheckpointInterval
```
public int getCheckpointInterval()
```
    Period (in iterations) between checkpoints.
  - setCheckpointInterval
```
public LDA setCheckpointInterval(int checkpointInterval)
```
    Period (in iterations) between checkpoints (default = 10). Checkpointing helps with recovery (when nodes fail). It also helps with eliminating temporary shuffle files on disk, which can be important when LDA is run for many iterations. If the checkpoint directory is not set in SparkContext, this setting is ignored.
    
    See Also:
    SparkContext.setCheckpointDir(java.lang.String)
  - run
```
public DistributedLDAModel run(RDD<scala.Tuple2<Object,Vector>> documents)
```
    Learn an LDA model using the given dataset.
    
    Parameters:
    documents - RDD of documents, which are term (word) count vectors paired with IDs. The term count vectors are "bags of words" with a fixed-size vocabulary (where the vocabulary size is the length of the vector). Document IDs must be unique and >= 0.
    
    Returns:
    Inferred LDA model
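
    The `documents` parameter expects each document as a term count vector whose length equals the vocabulary size. The sketch below shows one way such a "bag of words" array could be built from tokens in plain Java. The class name `BagOfWords`, the sample vocabulary, and the decision to skip out-of-vocabulary tokens are assumptions made for illustration. The resulting array could then be wrapped in an MLlib vector, for example with `Vectors.dense`, and paired with a unique non-negative document ID.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; builds a fixed-length term count array for one document.
final class BagOfWords {
    private BagOfWords() {}

    /** Counts how often each vocabulary term occurs among the tokens. */
    public static double[] countVector(List<String> vocabulary, List<String> tokens) {
        // Map each term to its position in the fixed-size vocabulary.
        Map<String, Integer> termIndex = new HashMap<>();
        for (int i = 0; i < vocabulary.size(); i++) {
            termIndex.put(vocabulary.get(i), i);
        }
        double[] counts = new double[vocabulary.size()];
        for (String token : tokens) {
            Integer idx = termIndex.get(token);
            if (idx != null) {
                counts[idx] += 1.0; // tokens outside the vocabulary are skipped
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> vocabulary = Arrays.asList("spark", "topic", "model"); // hypothetical
        List<String> tokens = Arrays.asList("topic", "model", "topic", "lda");
        System.out.println(Arrays.toString(countVector(vocabulary, tokens)));
        // prints [0.0, 2.0, 1.0]
    }
}
```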
  - run
```
public DistributedLDAModel run(JavaPairRDD<Long,Vector> documents)
```
    Java-friendly version of run()
