Latent Dirichlet Allocation

spark.lda fits a Latent Dirichlet Allocation (LDA) model on a SparkDataFrame. Users can call summary to get a summary of the fitted LDA model, spark.posterior to compute posterior probabilities on new data, spark.perplexity to compute log perplexity on new data, and write.ml/read.ml to save/load fitted models.

Usage

spark.lda(data, ...)

spark.posterior(object, newData)

spark.perplexity(object, data)

# S4 method for class 'SparkDataFrame'
spark.lda(
  data,
  features = "features",
  k = 10,
  maxIter = 20,
  optimizer = c("online", "em"),
  subsamplingRate = 0.05,
  topicConcentration = -1,
  docConcentration = -1,
  customizedStopWords = "",
  maxVocabSize = bitwShiftL(1, 18)
)

# S4 method for class 'LDAModel'
summary(object, maxTermsPerTopic)

# S4 method for class 'LDAModel,SparkDataFrame'
spark.perplexity(object, data)

# S4 method for class 'LDAModel,SparkDataFrame'
spark.posterior(object, newData)

# S4 method for class 'LDAModel,character'
write.ml(object, path, overwrite = FALSE)

Arguments

data: A SparkDataFrame for training.
...: additional argument(s) passed to the method.
object: A Latent Dirichlet Allocation model fitted by spark.lda.
newData: A SparkDataFrame for testing.
features: Features column name. The column may be in either libSVM format or character format.
k: Number of topics.
maxIter: Maximum iterations.
optimizer: Optimizer to train an LDA model, "online" or "em", default is "online".
subsamplingRate: (For online optimizer) Fraction of the corpus to be sampled and used in each iteration of mini-batch gradient descent, in range (0, 1].
topicConcentration: Concentration parameter (commonly named beta or eta) for the prior placed on topic distributions over terms. The default of -1 lets Spark set it automatically. Use summary to retrieve the effective topicConcentration. Only a numeric of length 1 is accepted.
docConcentration: Concentration parameter (commonly named alpha) for the prior placed on document distributions over topics (theta). The default of -1 lets Spark set it automatically. Use summary to retrieve the effective docConcentration. Only a numeric of length 1 or length k is accepted.
customizedStopWords: Stop words to remove from the given corpus. This parameter is ignored if a libSVM-format column is used as the features column.
maxVocabSize: Maximum vocabulary size. Default is 1 << 18 (262144).
maxTermsPerTopic: Maximum number of terms to collect for each topic. Default is 10.
path: The directory where the model is saved.
overwrite: Whether to overwrite the output path if it already exists. Default is FALSE, which means an exception is thrown if the output path exists.
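The arguments above can be combined as in the following minimal sketch, which trains on a character-format column rather than a libSVM-format one. The toy corpus, the stop-word list, and the choice of subsamplingRate = 1 (so that every document is used in each iteration of such a tiny corpus) are illustrative assumptions, not recommended settings. It also assumes a Spark installation that sparkR.session() can start locally.

```r
library(SparkR)
sparkR.session()

# Hypothetical toy corpus: one document per row in a character column.
docs <- createDataFrame(data.frame(
  features = c("the cat sat on the mat",
               "the dog chased the cat",
               "stocks fell as the markets closed"),
  stringsAsFactors = FALSE
))

# customizedStopWords applies here because the features column is
# character-format; it would be ignored for a libSVM-format column.
model <- spark.lda(
  docs,
  features = "features",
  k = 2,
  maxIter = 10,
  optimizer = "online",
  subsamplingRate = 1,          # must lie in (0, 1]
  customizedStopWords = c("the", "on", "as")
)

# The effective concentration parameters chosen on the Spark side
# (both were left at -1) can be read back from the summary.
stats <- summary(model)
print(stats$docConcentration)
print(stats$topicConcentration)
print(stats$vocabulary)         # non-NULL for character-format input
```

Leaving docConcentration and topicConcentration at -1 and reading the effective values back from summary is usually a reasonable starting point before tuning them by hand.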

Value

spark.lda returns a fitted Latent Dirichlet Allocation model.

summary returns summary information of the fitted model, which is a list. The list includes

docConcentration: concentration parameter (commonly named alpha) for the prior placed on document distributions over topics (theta)
topicConcentration: concentration parameter (commonly named beta or eta) for the prior placed on topic distributions over terms
logLikelihood: log likelihood of the entire corpus
logPerplexity: log perplexity
isDistributed: TRUE for distributed model while FALSE for local model
vocabSize: number of terms in the corpus
topics: the top terms of each topic and their weights (up to maxTermsPerTopic terms per topic, 10 by default)
vocabulary: all terms of the training corpus, or NULL if a libSVM-format file was used as the training set
trainingLogLikelihood: log likelihood of the observed tokens in the training set, given the current parameter estimates: log P(docs | topics, topic distributions for docs, Dirichlet hyperparameters). Available only for a distributed LDA model (i.e., optimizer = "em").
logPrior: log probability of the current parameter estimate: log P(topics, topic distributions for docs | Dirichlet hyperparameters). Available only for a distributed LDA model (i.e., optimizer = "em").

spark.perplexity returns the log perplexity of the given SparkDataFrame, or the log perplexity of the training data if the argument "data" is missing.

spark.posterior returns a SparkDataFrame containing the posterior probability vectors in a column named "topicDistribution".
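As a rough illustration of the return values described above, the sketch below fits a model with the "em" optimizer and reads individual fields from the summary list. The data path is the one used in the Examples section and is assumed to be resolvable from the working directory; k = 5 and maxTermsPerTopic = 3 are arbitrary choices made for this sketch.

```r
library(SparkR)
sparkR.session()

text <- read.df("data/mllib/sample_lda_libsvm_data.txt", source = "libsvm")
model <- spark.lda(data = text, k = 5, optimizer = "em")

# summary returns a plain R list; fields are accessed with `$`.
stats <- summary(model, maxTermsPerTopic = 3)
print(stats$isDistributed)          # TRUE for a model trained with "em"
print(stats$vocabSize)
print(stats$vocabulary)             # NULL, since the input is libSVM-format
print(head(stats$topics))           # top terms and weights per topic

# These two entries are only available for a distributed ("em") model.
print(stats$trainingLogLikelihood)
print(stats$logPrior)

# Posterior topic mixtures land in the "topicDistribution" column.
posterior <- spark.posterior(model, text)
print(head(select(posterior, "topicDistribution")))

# Log perplexity of a given SparkDataFrame.
perplexity <- spark.perplexity(model, text)
print(perplexity)
```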

Note

spark.lda since 2.1.0

summary(LDAModel) since 2.1.0

spark.perplexity(LDAModel) since 2.1.0

spark.posterior(LDAModel) since 2.1.0

write.ml(LDAModel, character) since 2.1.0

Examples

if (FALSE) { # \dontrun{
text <- read.df("data/mllib/sample_lda_libsvm_data.txt", source = "libsvm")
model <- spark.lda(data = text, optimizer = "em")

# get a summary of the model
summary(model)

# compute posterior probabilities
posterior <- spark.posterior(model, text)
showDF(posterior)

# compute perplexity
perplexity <- spark.perplexity(model, text)

# save and load the model
path <- "path/to/model"
write.ml(model, path)
savedModel <- read.ml(path)
summary(savedModel)
} # }
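The overwrite argument of write.ml is not exercised in the example above. The following sketch shows one way it might be used. It assumes Spark is running in local mode, so that a directory under R's tempdir() is a valid model path, and that the sample data file is resolvable from the working directory.

```r
library(SparkR)
sparkR.session()

text <- read.df("data/mllib/sample_lda_libsvm_data.txt", source = "libsvm")
model <- spark.lda(data = text, optimizer = "online")

# Assumption: local-mode Spark, so a local temporary directory works as path.
path <- file.path(tempdir(), "lda-model")

# First save; with the default overwrite = FALSE this throws an exception
# if the path already exists.
write.ml(model, path)

# Saving to the same path again requires overwrite = TRUE.
write.ml(model, path, overwrite = TRUE)

# Load the model back and compare one summary field with the original.
restored <- read.ml(path)
print(summary(restored)$vocabSize == summary(model)$vocabSize)
```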
