Class LocalLDAModel

All Implemented Interfaces: Serializable, Saveable

public class LocalLDAModel extends LDAModel implements Serializable
Local LDA model. This model stores only the inferred topics.

param: topics Inferred topics (vocabSize x k matrix).

    • load

      public static LocalLDAModel load(SparkContext sc, String path)
    • topics

      public Matrix topics()
    • docConcentration

      public Vector docConcentration()
      Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").

      This is the parameter to a Dirichlet distribution.

    • topicConcentration

      public double topicConcentration()
      Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms.

      This is the parameter to a symmetric Dirichlet distribution.

    • k

      public int k()
      Number of topics
    • vocabSize

      public int vocabSize()
      Vocabulary size (number of terms in the vocabulary).
    • topicsMatrix

      public Matrix topicsMatrix()
      Inferred topics, where each topic is represented by a distribution over terms. This is a matrix of size vocabSize x k, where each column is a topic. No guarantees are given about the ordering of the topics.
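      The column layout can be sketched in plain Java. The sketch assumes the matrix values have been copied into a flat column-major array (the order in which MLlib's Matrix.toArray() returns them), so that topic j occupies the vocabSize entries starting at offset j * vocabSize. TopicColumnSketch and topicColumn are hypothetical names, not part of Spark.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Spark: reads one topic (column) out of a
// vocabSize x k matrix stored as a flat column-major array.
final class TopicColumnSketch {

    // Returns the term weights of the given topic as an array of length vocabSize.
    static double[] topicColumn(double[] columnMajor, int vocabSize, int k, int topic) {
        if (columnMajor.length != vocabSize * k) {
            throw new IllegalArgumentException("expected vocabSize * k values");
        }
        if (topic < 0 || topic >= k) {
            throw new IllegalArgumentException("topic index out of range");
        }
        // In column-major order, column `topic` starts at offset topic * vocabSize.
        return Arrays.copyOfRange(columnMajor, topic * vocabSize, (topic + 1) * vocabSize);
    }

    public static void main(String[] args) {
        // vocabSize = 3, k = 2: the first three values are topic 0, the next three topic 1.
        double[] values = {0.5, 0.3, 0.2, 0.1, 0.1, 0.8};
        System.out.println(Arrays.toString(topicColumn(values, 3, 2, 1))); // prints [0.1, 0.1, 0.8]
    }
}
```

      Because no ordering of topics is guaranteed, the index j identifies a topic only within a single model instance.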
    • describeTopics

      public scala.Tuple2<int[],double[]>[] describeTopics(int maxTermsPerTopic)
      Return the topics described by weighted terms.

      maxTermsPerTopic - Maximum number of terms to collect for each topic.
      Returns: Array over topics. Each topic is represented as a pair of matching arrays: (term indices, term weights in topic). Each topic's terms are sorted in order of decreasing weight.
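      The shape of the result for one topic can be illustrated with a minimal plain-Java sketch. It ranks the raw weights of a single topic and keeps the top maxTermsPerTopic entries; it does not try to reproduce any normalization Spark may apply to the weights. DescribeTopicsSketch, topTermIndices and weightsFor are hypothetical names, not part of Spark.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical illustration, not Spark's implementation: ranks the terms of one
// topic by weight and keeps the top maxTermsPerTopic of them.
final class DescribeTopicsSketch {

    // Returns the indices of the highest-weighted terms, in order of decreasing weight.
    static int[] topTermIndices(double[] topicWeights, int maxTermsPerTopic) {
        Integer[] order = new Integer[topicWeights.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Sort term indices by weight, largest first; the sort is stable for ties.
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> -topicWeights[i]));
        int n = Math.min(Math.max(maxTermsPerTopic, 0), order.length);
        int[] top = new int[n];
        for (int i = 0; i < n; i++) {
            top[i] = order[i];
        }
        return top;
    }

    // Looks up the weights matching the given term indices, giving the second
    // array of the (term indices, term weights) pair.
    static double[] weightsFor(double[] topicWeights, int[] termIndices) {
        double[] weights = new double[termIndices.length];
        for (int i = 0; i < termIndices.length; i++) {
            weights[i] = topicWeights[termIndices[i]];
        }
        return weights;
    }

    public static void main(String[] args) {
        double[] topic = {0.1, 0.6, 0.05, 0.25};
        int[] terms = topTermIndices(topic, 2);
        System.out.println(Arrays.toString(terms));                    // prints [1, 3]
        System.out.println(Arrays.toString(weightsFor(topic, terms))); // prints [0.6, 0.25]
    }
}
```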
    • getSeed

      public long getSeed()
      Random seed for cluster initialization.
    • setSeed

      public LocalLDAModel setSeed(long seed)
      Set the random seed for cluster initialization.
      seed - Random seed to use.
    • save

      public void save(SparkContext sc, String path)
      Save this model to the given path.

      This saves:
      - human-readable (JSON) model metadata to path/metadata/
      - Parquet formatted data to path/data/

      The model may be loaded using Loader.load; for this class, that is LocalLDAModel.load(sc, path).

      sc - Spark context used to save model data.
      path - Path specifying the directory in which to save this model. If the directory already exists, this method throws an exception.
    • logLikelihood

      public double logLikelihood(RDD<scala.Tuple2<Object,Vector>> documents)
      Calculates a lower bound on the log likelihood of the entire corpus.

      See Equation (16) in the original Online LDA paper (Hoffman et al., 2010).

      documents - test corpus to use for calculating log likelihood
      Returns: variational lower bound on the log likelihood of the entire corpus
    • logLikelihood

      public double logLikelihood(JavaPairRDD<Long,Vector> documents)
      Java-friendly version of logLikelihood.
      documents - test corpus to use for calculating log likelihood
    • logPerplexity

      public double logPerplexity(RDD<scala.Tuple2<Object,Vector>> documents)
      Calculates an upper bound on perplexity (lower is better). See Equation (16) in the original Online LDA paper (Hoffman et al., 2010).

      documents - test corpus to use for calculating perplexity
      Returns: variational upper bound on log perplexity per token
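      The relationship between this per-token bound and the corpus-level bound from logLikelihood can be sketched as follows. The sketch assumes that log perplexity is the negated log-likelihood bound divided by the total token count of the corpus (the sum of all term counts). LogPerplexitySketch and its methods are hypothetical names using plain Java arrays, not Spark types.

```java
// Hypothetical illustration, not Spark code: converts a corpus-level
// log-likelihood bound into a per-token log perplexity bound.
final class LogPerplexitySketch {

    // Total token count: the sum of every term count in every document.
    static double corpusTokenCount(double[][] termCountsPerDocument) {
        double total = 0.0;
        for (double[] document : termCountsPerDocument) {
            for (double count : document) {
                total += count;
            }
        }
        return total;
    }

    // Assumed relationship: logPerplexity = -logLikelihoodBound / tokenCount.
    static double logPerplexity(double logLikelihoodBound, double[][] termCountsPerDocument) {
        double tokens = corpusTokenCount(termCountsPerDocument);
        if (tokens <= 0.0) {
            throw new IllegalArgumentException("corpus contains no tokens");
        }
        return -logLikelihoodBound / tokens;
    }

    public static void main(String[] args) {
        // Two documents over a three-term vocabulary, 8 tokens in total.
        double[][] corpus = {{1.0, 2.0, 0.0}, {0.0, 1.0, 4.0}};
        // A made-up bound of -24.0, as logLikelihood(documents) might return.
        System.out.println(logPerplexity(-24.0, corpus)); // prints 3.0
    }
}
```

      Under this assumption, a higher (less negative) log-likelihood bound on the same corpus yields a lower log perplexity.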
    • logPerplexity

      public double logPerplexity(JavaPairRDD<Long,Vector> documents)
      Java-friendly version of logPerplexity.
      documents - test corpus to use for calculating perplexity
    • topicDistributions

      public RDD<scala.Tuple2<Object,Vector>> topicDistributions(RDD<scala.Tuple2<Object,Vector>> documents)
      Predicts the topic mixture distribution for each document (often called "theta" in the literature). Returns a vector of zeros for an empty document.

      This uses a variational approximation following Hoffman et al. (2010), where the approximate distribution is called "gamma." Technically, this method returns this approximation "gamma" for each document.

      documents - documents to predict topic mixture distributions for
      Returns: an RDD of (document ID, topic mixture distribution for document)
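      A minimal plain-Java sketch of the last step follows. It assumes the variational parameter "gamma" for a document has already been inferred, and that the reported mixture is gamma scaled to sum to 1; the inference itself is not shown. TopicMixtureSketch and toMixture are hypothetical names, not part of Spark.

```java
import java.util.Arrays;

// Hypothetical illustration, not Spark code: turns an already-inferred gamma
// vector into a topic mixture, with a vector of zeros for an empty document.
final class TopicMixtureSketch {

    // gamma has one entry per topic; termCounts has one entry per vocabulary term.
    static double[] toMixture(double[] gamma, double[] termCounts) {
        boolean empty = true;
        for (double count : termCounts) {
            if (count != 0.0) {
                empty = false;
                break;
            }
        }
        double[] mixture = new double[gamma.length]; // initialized to zeros
        if (empty) {
            return mixture; // empty document: vector of zeros
        }
        double sum = 0.0;
        for (double g : gamma) {
            sum += g;
        }
        // Assumed normalization: scale gamma so that the entries sum to 1.
        for (int i = 0; i < gamma.length; i++) {
            mixture[i] = gamma[i] / sum;
        }
        return mixture;
    }

    public static void main(String[] args) {
        double[] gamma = {2.0, 6.0};
        System.out.println(Arrays.toString(toMixture(gamma, new double[]{1.0, 0.0, 3.0}))); // prints [0.25, 0.75]
        System.out.println(Arrays.toString(toMixture(gamma, new double[]{0.0, 0.0, 0.0}))); // prints [0.0, 0.0]
    }
}
```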
    • topicDistribution

      public Vector topicDistribution(Vector document)
      Predicts the topic mixture distribution for a document (often called "theta" in the literature). Returns a vector of zeros for an empty document.

      Note: this method is intended for quick queries on a single document. For a batch of documents, use topicDistributions() instead to avoid overhead.

      document - document to predict the topic mixture distribution for
      Returns: topic mixture distribution for the document
    • topicDistributions

      public JavaPairRDD<Long,Vector> topicDistributions(JavaPairRDD<Long,Vector> documents)
      Java-friendly version of topicDistributions.
      documents - documents to predict topic mixture distributions for