class pyspark.mllib.clustering.LDA[source]

Train a Latent Dirichlet Allocation (LDA) model.

New in version 1.5.0.


train(rdd[, k, maxIterations, …])

Train an LDA model.

Methods Documentation

classmethod train(rdd, k=10, maxIterations=20, docConcentration=-1.0, topicConcentration=-1.0, seed=None, checkpointInterval=10, optimizer='em')[source]

Train an LDA model.

New in version 1.5.0.


Parameters

rdd : pyspark.RDD

RDD of documents, which are tuples of document IDs and term (word) count vectors. The term count vectors are “bags of words” with a fixed-size vocabulary (where the vocabulary size is the length of the vector). Document IDs must be unique and >= 0.

k : int, optional

Number of topics to infer, i.e., the number of soft cluster centers. (default: 10)

maxIterations : int, optional

Maximum number of iterations allowed. (default: 20)

docConcentration : float, optional

Concentration parameter (commonly named “alpha”) for the prior placed on documents’ distributions over topics (“theta”). (default: -1.0)

topicConcentration : float, optional

Concentration parameter (commonly named “beta” or “eta”) for the prior placed on topics’ distributions over terms. (default: -1.0)

seed : int, optional

Random seed for cluster initialization. Set to None to generate a seed based on system time. (default: None)

checkpointInterval : int, optional

Period (in iterations) between checkpoints. (default: 10)

optimizer : str, optional

LDAOptimizer used to perform the actual calculation. Currently “em” and “online” are supported. (default: “em”)