spark.lda {SparkR}    R Documentation

Latent Dirichlet Allocation

Description

spark.lda fits a Latent Dirichlet Allocation model on a SparkDataFrame. Users can call summary to get a summary of the fitted LDA model, spark.posterior to compute posterior probabilities on new data, spark.perplexity to compute log perplexity on new data, and write.ml/read.ml to save/load fitted models.

Usage

spark.lda(data, ...)

spark.posterior(object, newData)

spark.perplexity(object, data)

## S4 method for signature 'LDAModel,SparkDataFrame'
spark.posterior(object, newData)

## S4 method for signature 'LDAModel'
summary(object, maxTermsPerTopic)

## S4 method for signature 'LDAModel,SparkDataFrame'
spark.perplexity(object, data)

## S4 method for signature 'LDAModel,character'
write.ml(object, path, overwrite = FALSE)

## S4 method for signature 'SparkDataFrame'
spark.lda(data, features = "features", k = 10,
  maxIter = 20, optimizer = c("online", "em"), subsamplingRate = 0.05,
  topicConcentration = -1, docConcentration = -1,
  customizedStopWords = "", maxVocabSize = bitwShiftL(1, 18))

Arguments

data

A SparkDataFrame for training.

...

additional argument(s) passed to the method.

object

A Latent Dirichlet Allocation model fitted by spark.lda.

newData

A SparkDataFrame for testing.

maxTermsPerTopic

Maximum number of terms to collect for each topic. The default value is 10.

path

The directory where the model is saved.

overwrite

Whether to overwrite the output path if it already exists. The default is FALSE, which means an exception is thrown if the output path already exists.

features

The name of the features column. Either a libSVM-format column or a character-format (raw text) column is valid.

k

Number of topics.

maxIter

The maximum number of iterations.

optimizer

The optimizer used to train the LDA model, either "online" or "em"; the default is "online".

subsamplingRate

(For online optimizer) Fraction of the corpus to be sampled and used in each iteration of mini-batch gradient descent, in range (0, 1].

topicConcentration

Concentration parameter (commonly named beta or eta) for the prior placed on topic distributions over terms. The default of -1 lets Spark set it automatically; use summary to retrieve the effective topicConcentration. Only a numeric of length 1 is accepted.

docConcentration

Concentration parameter (commonly named alpha) for the prior placed on document distributions over topics (theta). The default of -1 lets Spark set it automatically; use summary to retrieve the effective docConcentration. Only a numeric of length 1 or length k is accepted.

customizedStopWords

Stop words to be removed from the given corpus. This parameter is ignored if a libSVM-format column is used as the features column.

maxVocabSize

Maximum vocabulary size; the default is bitwShiftL(1, 18), i.e. 2^18 = 262144.
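
The tuning arguments above can be sketched in a single call. This is illustrative only: "corpus" is a hypothetical SparkDataFrame whose "text" column holds raw documents in character format, and the parameter values are placeholders, not recommendations.

```r
# Sketch under assumptions: "corpus" and its "text" column are hypothetical.
model <- spark.lda(
  data = corpus,
  features = "text",                  # character-format column, so stop words apply
  k = 15,                             # number of topics
  maxIter = 50,
  optimizer = "online",
  subsamplingRate = 0.1,              # used by the online optimizer only
  docConcentration = 1.1,             # alpha; a numeric of length 1 (or length k)
  topicConcentration = 1.1,           # beta/eta; a numeric of length 1
  customizedStopWords = c("the", "and", "of"),
  maxVocabSize = bitwShiftL(1, 16)
)
```

Leaving docConcentration and topicConcentration at -1 lets Spark choose them automatically; the values actually used can be read back with summary.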

Value

spark.posterior returns a SparkDataFrame containing a column of posterior probability vectors named "topicDistribution".
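
As a hedged sketch, that column can be inspected like any other SparkDataFrame column; "model" and "newDocs" below are assumptions (a model fitted by spark.lda and a SparkDataFrame with the same features column as the training data).

```r
# Sketch: "model" and "newDocs" are assumed, not defined in this page.
posterior <- spark.posterior(model, newDocs)
head(select(posterior, "topicDistribution"))
```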

summary returns summary information of the fitted model as a list. The list includes:

docConcentration

concentration parameter (commonly named alpha) for the prior placed on document distributions over topics (theta)

topicConcentration

concentration parameter (commonly named beta or eta) for the prior placed on topic distributions over terms

logLikelihood

log likelihood of the entire corpus

logPerplexity

log perplexity

isDistributed

TRUE for a distributed model, FALSE for a local model

vocabSize

number of distinct terms in the corpus

topics

top terms (up to maxTermsPerTopic, 10 by default) and their weights for each topic

vocabulary

all terms of the training corpus; NULL if a libSVM-format column was used as the features column
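
For example, the fields of the summary list can be read directly. A sketch, assuming "model" is an LDAModel fitted by spark.lda:

```r
# Sketch: "model" is an assumed LDAModel fitted by spark.lda.
s <- summary(model, maxTermsPerTopic = 5)
s$logLikelihood      # log likelihood of the training corpus
s$logPerplexity      # log perplexity on the training data
s$topics             # top terms and their weights per topic
s$vocabSize          # number of distinct terms
```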

spark.perplexity returns the log perplexity of the given SparkDataFrame, or the log perplexity of the training data if the "data" argument is missing.
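
Both forms can be sketched as follows; "model" and "heldOut" are assumptions (a fitted LDAModel and a held-out SparkDataFrame):

```r
# Sketch: both objects below are assumptions, not defined in this page.
spark.perplexity(model, heldOut)   # log perplexity of the held-out data
spark.perplexity(model)            # log perplexity of the training data
```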

spark.lda returns a fitted Latent Dirichlet Allocation model.

Note

spark.posterior(LDAModel) since 2.1.0

summary(LDAModel) since 2.1.0

spark.perplexity(LDAModel) since 2.1.0

write.ml(LDAModel, character) since 2.1.0

spark.lda since 2.1.0

See Also

read.ml

topicmodels: https://cran.r-project.org/package=topicmodels

Examples

## Not run: 
##D # nolint start
##D # An example "path/to/file" can be
##D # paste0(Sys.getenv("SPARK_HOME"), "/data/mllib/sample_lda_libsvm_data.txt")
##D # nolint end
##D text <- read.df("path/to/file", source = "libsvm")
##D model <- spark.lda(data = text, optimizer = "em")
##D 
##D # get a summary of the model
##D summary(model)
##D 
##D # compute posterior probabilities
##D posterior <- spark.posterior(model, text)
##D showDF(posterior)
##D 
##D # compute perplexity
##D perplexity <- spark.perplexity(model, text)
##D 
##D # save and load the model
##D path <- "path/to/model"
##D write.ml(model, path)
##D savedModel <- read.ml(path)
##D summary(savedModel)
## End(Not run)

[Package SparkR version 2.1.1 Index]