spark.lda {SparkR}  R Documentation 
spark.lda fits a Latent Dirichlet Allocation model on a SparkDataFrame. Users can call summary to get a summary of the fitted LDA model, spark.posterior to compute posterior probabilities on new data, spark.perplexity to compute log perplexity on new data, and write.ml/read.ml to save/load fitted models.
spark.lda(data, ...)

spark.posterior(object, newData)

spark.perplexity(object, data)

## S4 method for signature 'SparkDataFrame'
spark.lda(data, features = "features", k = 10, maxIter = 20,
  optimizer = c("online", "em"), subsamplingRate = 0.05,
  topicConcentration = -1, docConcentration = -1,
  customizedStopWords = "", maxVocabSize = bitwShiftL(1, 18))

## S4 method for signature 'LDAModel'
summary(object, maxTermsPerTopic)

## S4 method for signature 'LDAModel,SparkDataFrame'
spark.perplexity(object, data)

## S4 method for signature 'LDAModel,SparkDataFrame'
spark.posterior(object, newData)

## S4 method for signature 'LDAModel,character'
write.ml(object, path, overwrite = FALSE)
data 
A SparkDataFrame for training. 
... 
additional argument(s) passed to the method. 
object 
A Latent Dirichlet Allocation model fitted by spark.lda.
newData 
A SparkDataFrame for testing. 
features 
Features column name. Either a libSVM-format column or a character-format column is valid.
k 
Number of topics. 
maxIter 
Maximum iterations. 
optimizer 
Optimizer to train an LDA model, "online" or "em", default is "online". 
subsamplingRate 
(For online optimizer) Fraction of the corpus to be sampled and used in each iteration of mini-batch gradient descent, in range (0, 1].
topicConcentration 
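concentration parameter (commonly named beta or eta) for the prior placed on topic distributions over terms. Use summary to retrieve the effective topicConcentration. Only 1-size numeric is accepted.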
docConcentration 
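concentration parameter (commonly named alpha) for the prior placed on documents' distributions over topics (theta). Use summary to retrieve the effective docConcentration. Only 1-size or k-size numeric is accepted.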
customizedStopWords 
stopwords that need to be removed from the given corpus. Ignore the parameter if a libSVM-format column is used as the features column.
maxVocabSize 
maximum vocabulary size, default 1 << 18 
maxTermsPerTopic 
Maximum number of terms to collect for each topic. Default value of 10. 
path 
The directory where the model is saved. 
overwrite 
Whether to overwrite if the output path already exists. The default is FALSE, which means an exception is thrown if the output path exists.
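The constraints stated for these arguments can be checked locally before a model is fitted. The sketch below is plain R and never touches Spark. check_lda_args is a hypothetical helper, not part of SparkR, and it encodes only the rules given in the argument list above (the two optimizer choices, the (0, 1] range for subsamplingRate, and the bitwShiftL(1, 18) default for maxVocabSize).

```r
# Hypothetical helper (not part of SparkR): validates a few spark.lda
# arguments locally, using only the constraints stated in the argument list.
check_lda_args <- function(k = 10, maxIter = 20,
                           optimizer = c("online", "em"),
                           subsamplingRate = 0.05,
                           maxVocabSize = bitwShiftL(1, 18)) {
  # match.arg picks the first choice ("online") when optimizer is not supplied
  optimizer <- match.arg(optimizer)
  stopifnot(
    is.numeric(k), length(k) == 1, k >= 1,
    is.numeric(maxIter), length(maxIter) == 1, maxIter >= 1,
    is.numeric(subsamplingRate), length(subsamplingRate) == 1,
    subsamplingRate > 0, subsamplingRate <= 1,   # range (0, 1]
    is.numeric(maxVocabSize), length(maxVocabSize) == 1, maxVocabSize >= 1
  )
  list(k = k, maxIter = maxIter, optimizer = optimizer,
       subsamplingRate = subsamplingRate, maxVocabSize = maxVocabSize)
}

checked <- check_lda_args(k = 5, optimizer = "em")
checked$maxVocabSize   # 1 << 18, i.e. 262144
```

A value outside the documented range, such as subsamplingRate = 0, makes the helper stop with an error instead of returning the list.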
spark.lda returns a fitted Latent Dirichlet Allocation model.
summary returns summary information of the fitted model, which is a list. The list includes

docConcentration
concentration parameter commonly named alpha for the prior placed on documents' distributions over topics (theta)

topicConcentration
concentration parameter commonly named beta or eta for the prior placed on topic distributions over terms

logLikelihood
log likelihood of the entire corpus

logPerplexity
log perplexity

isDistributed
TRUE for distributed model while FALSE for local model

vocabSize
number of terms in the corpus

topics
top 10 terms and their weights of all topics

vocabulary
whole terms of the training corpus, NULL if libsvm format file used as training set

trainingLogLikelihood
Log likelihood of the observed tokens in the training set, given the current parameter estimates: log P(docs | topics, topic distributions for docs, Dirichlet hyperparameters). It is only available for a distributed LDA model (i.e., optimizer = "em").

logPrior
Log probability of the current parameter estimate: log P(topics, topic distributions for docs | Dirichlet hyperparameters). It is only available for a distributed LDA model (i.e., optimizer = "em").
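To make the relationship between the log likelihood and log perplexity entries more concrete, the plain-R sketch below assumes the common convention that log perplexity is the negative log likelihood divided by the total number of tokens. That convention is an assumption here, not something this page states, and the token counts and log likelihood are made-up numbers.

```r
# Made-up per-document token counts and a made-up corpus log likelihood.
token_counts <- c(120, 80, 200)
log_likelihood <- -2400

# Assumed convention: log perplexity = -(log likelihood) / (total tokens).
log_perplexity <- -log_likelihood / sum(token_counts)
log_perplexity        # 6

# Exponentiating gives the perplexity itself (natural-log scale assumed).
perplexity <- exp(log_perplexity)
```

Under this convention, a lower log perplexity corresponds to a higher per-token likelihood.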
spark.perplexity returns the log perplexity of the given SparkDataFrame, or the log perplexity of the training data if the "data" argument is missing.
spark.posterior returns a SparkDataFrame containing the posterior probability vectors in a column named "topicDistribution".
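Once posterior vectors are available in local R, a typical next step is to pick the most probable topic for each document. The sketch below is plain R on a hand-written matrix. It assumes the "topicDistribution" vectors have already been collected and converted to a numeric matrix with one row per document, a step not shown here, and the values are illustrative only.

```r
# Illustrative posterior matrix: 3 documents (rows) x 3 topics (columns).
topic_dist <- rbind(
  c(0.70, 0.20, 0.10),
  c(0.10, 0.15, 0.75),
  c(0.30, 0.60, 0.10)
)

# Each row is a probability vector, so it should sum to 1.
stopifnot(all(abs(rowSums(topic_dist) - 1) < 1e-8))

# Index of the most probable topic for each document.
dominant_topic <- apply(topic_dist, 1, which.max)
dominant_topic   # 1 3 2
```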
spark.lda since 2.1.0
summary(LDAModel) since 2.1.0
spark.perplexity(LDAModel) since 2.1.0
spark.posterior(LDAModel) since 2.1.0
write.ml(LDAModel, character) since 2.1.0
topicmodels: https://cran.r-project.org/package=topicmodels
## Not run:
##D text <- read.df("data/mllib/sample_lda_libsvm_data.txt", source = "libsvm")
##D model <- spark.lda(data = text, optimizer = "em")
##D
##D # get a summary of the model
##D summary(model)
##D
##D # compute posterior probabilities
##D posterior <- spark.posterior(model, text)
##D showDF(posterior)
##D
##D # compute perplexity
##D perplexity <- spark.perplexity(model, text)
##D
##D # save and load the model
##D path <- "path/to/model"
##D write.ml(model, path)
##D savedModel <- read.ml(path)
##D summary(savedModel)
## End(Not run)