Latent Dirichlet Allocation
spark.lda fits a Latent Dirichlet Allocation model on a SparkDataFrame. Users can call summary to get a summary of the fitted LDA model, spark.posterior to compute posterior probabilities on new data, spark.perplexity to compute log perplexity on new data, and write.ml/read.ml to save/load fitted models.
Usage
spark.lda(data, ...)
spark.posterior(object, newData)
spark.perplexity(object, data)
# S4 method for class 'SparkDataFrame'
spark.lda(
  data,
  features = "features",
  k = 10,
  maxIter = 20,
  optimizer = c("online", "em"),
  subsamplingRate = 0.05,
  topicConcentration = -1,
  docConcentration = -1,
  customizedStopWords = "",
  maxVocabSize = bitwShiftL(1, 18)
)
# S4 method for class 'LDAModel'
summary(object, maxTermsPerTopic)
# S4 method for class 'LDAModel,SparkDataFrame'
spark.perplexity(object, data)
# S4 method for class 'LDAModel,SparkDataFrame'
spark.posterior(object, newData)
# S4 method for class 'LDAModel,character'
write.ml(object, path, overwrite = FALSE)
Arguments
- data
A SparkDataFrame for training.
- ...
additional argument(s) passed to the method.
- object
A Latent Dirichlet Allocation model fitted by spark.lda.
- newData
A SparkDataFrame for testing.
- features
Features column name. Either a libSVM-format column or a character-format column is valid.
- k
Number of topics.
- maxIter
Maximum iterations.
- optimizer
Optimizer to train an LDA model, "online" or "em", default is "online".
- subsamplingRate
(For online optimizer) Fraction of the corpus to be sampled and used in each iteration of mini-batch gradient descent, in range (0, 1].
- topicConcentration
concentration parameter (commonly named beta or eta) for the prior placed on topic distributions over terms, default -1 to set automatically on the Spark side. Use summary to retrieve the effective topicConcentration. Only 1-size numeric is accepted.
- docConcentration
concentration parameter (commonly named alpha) for the prior placed on document distributions over topics (theta), default -1 to set automatically on the Spark side. Use summary to retrieve the effective docConcentration. Only 1-size or k-size numeric is accepted.
- customizedStopWords
stop words to remove from the given corpus. This parameter is ignored if a libSVM-format column is used as the features column.
- maxVocabSize
maximum vocabulary size, default 1 << 18 (262144).
- maxTermsPerTopic
Maximum number of terms to collect for each topic. Default value of 10.
- path
The directory where the model is saved.
- overwrite
Whether to overwrite if the output path already exists. Default is FALSE, which means an exception is thrown if the output path exists.
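The hyperparameter arguments above can be assembled in plain R before any Spark call is made. The following is a minimal sketch, not a recommended configuration: the value of k, the 0.1 concentration values, and the names max_vocab, doc_conc, and lda_args are illustrative assumptions. The Spark-dependent part is wrapped in if (FALSE), in the same way as the examples on this page, because it needs SparkR and an active session.

```r
# Number of topics; a k-size docConcentration must have exactly this length.
k <- 5

# Default vocabulary cap from the usage section: 1 << 18.
max_vocab <- bitwShiftL(1, 18)

# One alpha per topic (k-size numeric). The 0.1 values are illustrative only.
doc_conc <- rep(0.1, k)

# Collect the arguments in a list so they can be inspected before fitting.
lda_args <- list(
  k = k,
  optimizer = "online",
  subsamplingRate = 0.05,
  docConcentration = doc_conc,
  topicConcentration = -1,  # -1 leaves the choice to the Spark side
  maxVocabSize = max_vocab
)

if (FALSE) { # needs SparkR and an active Spark session
  library(SparkR)
  sparkR.session()
  text <- read.df("data/mllib/sample_lda_libsvm_data.txt", source = "libsvm")
  model <- do.call(spark.lda, c(list(data = text), lda_args))
}
```

Keeping the arguments in a list makes it easy to check that the docConcentration length matches k before a job is submitted.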
Value
spark.lda returns a fitted Latent Dirichlet Allocation model.
summary returns summary information of the fitted model, which is a list. The list includes:
docConcentration
concentration parameter commonly named alpha for the prior placed on document distributions over topics (theta)
topicConcentration
concentration parameter commonly named beta or eta for the prior placed on topic distributions over terms
logLikelihood
log likelihood of the entire corpus
logPerplexity
log perplexity
isDistributed
TRUE for distributed model while FALSE for local model
vocabSize
number of terms in the corpus
topics
top terms (10 by default, controlled by maxTermsPerTopic) and their weights for all topics
vocabulary
all terms of the training corpus; NULL if a libSVM-format file is used as the training set
trainingLogLikelihood
Log likelihood of the observed tokens in the training set, given the current parameter estimates: log P(docs | topics, topic distributions for docs, Dirichlet hyperparameters). Available only for a distributed LDA model (i.e., optimizer = "em").
logPrior
Log probability of the current parameter estimate: log P(topics, topic distributions for docs | Dirichlet hyperparameters). Available only for a distributed LDA model (i.e., optimizer = "em").
spark.perplexity returns the log perplexity of the given SparkDataFrame, or the log perplexity of the training data if the argument "data" is missing.
spark.posterior returns a SparkDataFrame containing posterior probability vectors in a column named "topicDistribution".
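Because summary returns an ordinary R list, its fields can be read with the usual list accessors. The sketch below shows one way to pull out the two fields that are documented as applying only to distributed models. The helper em_only_fields and the mock list are hypothetical and are not part of SparkR; the numbers in the mock are placeholders, not real output.

```r
# Hypothetical helper: return the EM-only fields of a summary() list,
# or NULL when the model is not distributed.
em_only_fields <- function(stats) {
  if (!isTRUE(stats$isDistributed)) {
    return(NULL)  # these fields are documented only for optimizer = "em"
  }
  stats[c("trainingLogLikelihood", "logPrior")]
}

# Mock list standing in for summary(model); values are placeholders.
mock_stats <- list(
  isDistributed = TRUE,
  trainingLogLikelihood = -1234.5,
  logPrior = -67.8,
  vocabSize = 11
)

em_fields <- em_only_fields(mock_stats)
names(em_fields)

if (FALSE) { # needs a model fitted by spark.lda
  stats <- summary(model, maxTermsPerTopic = 5)
  em_only_fields(stats)
}
```

Checking isDistributed first avoids interpreting these two fields for a model trained with the "online" optimizer.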
Note
spark.lda since 2.1.0
summary(LDAModel) since 2.1.0
spark.perplexity(LDAModel) since 2.1.0
spark.posterior(LDAModel) since 2.1.0
write.ml(LDAModel, character) since 2.1.0
See also
topicmodels: https://cran.r-project.org/package=topicmodels
Examples
if (FALSE) { # \dontrun{
text <- read.df("data/mllib/sample_lda_libsvm_data.txt", source = "libsvm")
model <- spark.lda(data = text, optimizer = "em")
# get a summary of the model
summary(model)
# compute posterior probabilities
posterior <- spark.posterior(model, text)
showDF(posterior)
# compute perplexity
perplexity <- spark.perplexity(model, text)
# save and load the model
path <- "path/to/model"
write.ml(model, path)
savedModel <- read.ml(path)
summary(savedModel)
} # }
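The example above uses a libSVM-format column. As a complementary sketch, the code below prepares a character-format features column and a custom stop word list, which the arguments section describes as the case where customizedStopWords applies. The corpus text, the stop words, and the names docs and stop_words are illustrative assumptions; the Spark calls are again guarded by if (FALSE) because they need an active session.

```r
# Small in-memory corpus; the column name matches the default features = "features".
docs <- data.frame(
  features = c(
    "spark runs distributed jobs on a cluster",
    "topic models group words that occur together",
    "the cluster schedules jobs for spark"
  ),
  stringsAsFactors = FALSE
)

# Illustrative stop words to drop before fitting.
stop_words <- c("a", "the", "on", "for", "that")

if (FALSE) { # needs SparkR and an active Spark session
  library(SparkR)
  sparkR.session()
  corpus <- createDataFrame(docs)
  model <- spark.lda(corpus, k = 2, customizedStopWords = stop_words)
  # With a character-format column, the vocabulary field is expected to be non-NULL.
  summary(model)$vocabulary
}
```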