Class DistributedLDAModel
Object
org.apache.spark.mllib.clustering.LDAModel
org.apache.spark.mllib.clustering.DistributedLDAModel
- All Implemented Interfaces:
Saveable
Distributed LDA model.
This model stores the inferred topics, the full training dataset, and the topic distributions.
-
Method Summary
Modifier and Type, Method, Description
- scala.Tuple2<int[],double[]>[] describeTopics(int maxTermsPerTopic): Return the topics described by weighted terms.
- docConcentration(): Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").
- javaTopicDistributions(): Java-friendly version of topicDistributions().
- javaTopTopicsPerDocument(int k): Java-friendly version of topTopicsPerDocument(int).
- int k(): Number of topics.
- static DistributedLDAModel load(SparkContext sc, String path)
- double logLikelihood()
- double logPrior()
- void save(SparkContext sc, String path): Save this model to the given path.
- toLocal(): Convert model to a local model.
- scala.Tuple2<long[],double[]>[] topDocumentsPerTopic(int maxDocumentsPerTopic): Return the top documents for each topic.
- double topicConcentration(): Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms.
- topicDistributions(): For each document in the training set, return the distribution over topics for that document ("theta_doc").
- topicsMatrix(): Inferred topics, where each topic is represented by a distribution over terms.
- topTopicsPerDocument(int k): For each document, return the top k weighted topics for that document and their weights.
- int vocabSize(): Vocabulary size (number of terms in the vocabulary).
Methods inherited from class org.apache.spark.mllib.clustering.LDAModel
describeTopics
-
Method Details
-
load
-
k
public int k()
Description copied from class: LDAModel
Number of topics.
-
vocabSize
public int vocabSize()
Description copied from class: LDAModel
Vocabulary size (number of terms in the vocabulary).
-
docConcentration
Description copied from class: LDAModel
Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta"). This is the parameter to a Dirichlet distribution.
- Specified by: docConcentration in class LDAModel
- Returns:
- (undocumented)
-
topicConcentration
public double topicConcentration()
Description copied from class: LDAModel
Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms. This is the parameter to a symmetric Dirichlet distribution.
- Specified by: topicConcentration in class LDAModel
- Returns:
- (undocumented)
-
toLocal
Convert model to a local model. The local model stores the inferred topics but not the topic distributions for training documents.
- Returns:
- (undocumented)
-
topicsMatrix
Description copied from class: LDAModel
Inferred topics, where each topic is represented by a distribution over terms. This is a matrix of size vocabSize x k, where each column is a topic. No guarantees are given about the ordering of the topics.
- Specified by: topicsMatrix in class LDAModel
- Returns:
- (undocumented)
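To make the vocabSize x k layout concrete, here is a minimal sketch that ranks the terms of one topic by weight. It does not call Spark: it assumes the matrix values have already been copied into a plain column-major double[] (so the entry for a given term and topic sits at topic * vocabSize + term). The class name TopicsMatrixSketch, the method topTerms, and the sample weights are hypothetical and used only for illustration.

```java
import java.util.Arrays;
import java.util.Comparator;

class TopicsMatrixSketch {
    /**
     * Returns the indices of the highest-weighted terms for one topic.
     * Assumption: values is column-major, so column `topic` occupies
     * values[topic * vocabSize] .. values[topic * vocabSize + vocabSize - 1].
     */
    public static int[] topTerms(double[] values, int vocabSize, int topic, int maxTerms) {
        Integer[] terms = new Integer[vocabSize];
        for (int t = 0; t < vocabSize; t++) {
            terms[t] = t;
        }
        final int offset = topic * vocabSize;
        // Sort term indices by decreasing weight within the chosen column.
        Arrays.sort(terms, Comparator.comparingDouble((Integer t) -> -values[offset + t]));
        int n = Math.min(maxTerms, vocabSize);
        int[] top = new int[n];
        for (int i = 0; i < n; i++) {
            top[i] = terms[i];
        }
        return top;
    }

    public static void main(String[] args) {
        // Hypothetical matrix with 3 terms and 2 topics, stored column by column.
        double[] values = {
            0.1, 0.7, 0.2,  // topic 0
            0.5, 0.2, 0.3   // topic 1
        };
        System.out.println(Arrays.toString(topTerms(values, 3, 0, 2)));
        System.out.println(Arrays.toString(topTerms(values, 3, 1, 2)));
    }
}
```

Because no ordering of topics is guaranteed, a column index identifies a topic only within a single model instance.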
-
describeTopics
public scala.Tuple2<int[],double[]>[] describeTopics(int maxTermsPerTopic)
Description copied from class: LDAModel
Return the topics described by weighted terms.
- Specified by: describeTopics in class LDAModel
- Parameters:
maxTermsPerTopic - Maximum number of terms to collect for each topic.
- Returns:
- Array over topics. Each topic is represented as a pair of matching arrays: (term indices, term weights in topic). Each topic's terms are sorted in order of decreasing weight.
-
topDocumentsPerTopic
public scala.Tuple2<long[],double[]>[] topDocumentsPerTopic(int maxDocumentsPerTopic)
Return the top documents for each topic.
- Parameters:
maxDocumentsPerTopic - Maximum number of documents to collect for each topic.
- Returns:
- Array over topics. Each element is represented as a pair of matching arrays: (IDs for the documents, weights of the topic in these documents). For each topic, documents are sorted in order of decreasing topic weight.
-
topicAssignments
-
javaTopicAssignments
-
logLikelihood
public double logLikelihood() -
logPrior
public double logPrior() -
topicDistributions
For each document in the training set, return the distribution over topics for that document ("theta_doc").
- Returns:
- RDD of (document ID, topic distribution) pairs
-
javaTopicDistributions
Java-friendly version of topicDistributions()
- Returns:
- (undocumented)
-
topTopicsPerDocument
For each document, return the top k weighted topics for that document and their weights.
- Parameters:
k - (undocumented)
- Returns:
- RDD of (doc ID, topic indices, topic weights)
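The per-document selection described above can be sketched in plain Java, without Spark. Given one document's topic distribution as a double[], the sketch returns matching arrays of topic indices and weights, ordered by decreasing weight. The names TopTopicsSketch, TopTopics, and topK are hypothetical stand-ins for the tuple Spark returns, and Spark's own handling of ties may differ.

```java
import java.util.Arrays;
import java.util.Comparator;

class TopTopicsSketch {
    /** Matching arrays of topic indices and their weights (illustrative stand-in type). */
    static final class TopTopics {
        final int[] indices;
        final double[] weights;

        TopTopics(int[] indices, double[] weights) {
            this.indices = indices;
            this.weights = weights;
        }
    }

    /** Selects the k highest-weighted topics from one document's topic distribution. */
    public static TopTopics topK(double[] theta, int k) {
        Integer[] order = new Integer[theta.length];
        for (int i = 0; i < theta.length; i++) {
            order[i] = i;
        }
        // Order topic indices by decreasing weight.
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> -theta[i]));
        int n = Math.min(k, theta.length);
        int[] indices = new int[n];
        double[] weights = new double[n];
        for (int i = 0; i < n; i++) {
            indices[i] = order[i];
            weights[i] = theta[order[i]];
        }
        return new TopTopics(indices, weights);
    }

    public static void main(String[] args) {
        // Hypothetical distribution over 4 topics for a single document.
        double[] theta = {0.05, 0.6, 0.1, 0.25};
        TopTopics top = topK(theta, 2);
        System.out.println(Arrays.toString(top.indices) + " " + Arrays.toString(top.weights));
    }
}
```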
-
javaTopTopicsPerDocument
Java-friendly version of topTopicsPerDocument(int)
- Parameters:
k - (undocumented)
- Returns:
- (undocumented)
-
save
Description copied from interface: Saveable
Save this model to the given path. This saves:
- human-readable (JSON) model metadata to path/metadata/
- Parquet formatted data to path/data/
The model may be loaded using Loader.load.
- Parameters:
sc - Spark context used to save model data.
path - Path specifying the directory in which to save this model. If the directory already exists, this method throws an exception.
-