Object
org.apache.spark.mllib.clustering.LDA
All Implemented Interfaces:
org.apache.spark.internal.Logging

public class LDA extends Object implements org.apache.spark.internal.Logging
Latent Dirichlet Allocation (LDA), a topic model designed for text documents.

Terminology:
- "word" = "term": an element of the vocabulary
- "token": instance of a term appearing in a document
- "topic": multinomial distribution over words representing some concept
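To make the term/token distinction concrete, here is a minimal plain-Java sketch. It is not part of Spark; the class `TermTokenDemo` and its method are hypothetical names chosen for illustration. It counts the tokens of each term in one tokenized document.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Spark: illustrates "term" vs. "token".
final class TermTokenDemo {

    /** Counts how many tokens of each term appear in a tokenized document. */
    static Map<String, Integer> termCounts(String[] tokens) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] doc = {"spark", "runs", "lda", "on", "spark"};
        Map<String, Integer> counts = termCounts(doc);
        System.out.println("tokens = " + doc.length);    // 5 tokens
        System.out.println("terms  = " + counts.size()); // 4 distinct terms
        System.out.println(counts);
    }
}
```

The document contains five tokens but only four terms, because "spark" occurs twice.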

References:
- Original LDA paper (journal version): Blei, Ng, and Jordan. "Latent Dirichlet Allocation." JMLR, 2003.

  • Nested Class Summary

    Nested classes/interfaces inherited from interface org.apache.spark.internal.Logging

    org.apache.spark.internal.Logging.LogStringContext, org.apache.spark.internal.Logging.SparkShellLoggingFilter
  • Constructor Summary

    Constructors
    Constructor
    Description
    LDA()
    Constructs an LDA instance with default parameters.
  • Method Summary

    Modifier and Type
    Method
    Description
    double
    getAlpha()
    Alias for getDocConcentration()
    Vector
    getAsymmetricAlpha()
    Alias for getAsymmetricDocConcentration()
    Vector
    getAsymmetricDocConcentration()
    Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").
    double
    getBeta()
    Alias for getTopicConcentration()
    int
    getCheckpointInterval()
    Period (in iterations) between checkpoints.
    double
    getDocConcentration()
    Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").
    int
    getK()
    Number of topics to infer, i.e., the number of soft cluster centers.
    int
    getMaxIterations()
    Maximum number of iterations allowed.
    LDAOptimizer
    getOptimizer()
    LDAOptimizer used to perform the actual calculation
    long
    getSeed()
    Random seed for cluster initialization.
    double
    getTopicConcentration()
    Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms.
    LDAModel
    run(JavaPairRDD<Long,Vector> documents)
    Java-friendly version of run()
    LDAModel
    run(RDD<scala.Tuple2<Object,Vector>> documents)
    Learn an LDA model using the given dataset.
    LDA
    setAlpha(double alpha)
    Alias for setDocConcentration()
    LDA
    setAlpha(Vector alpha)
    Alias for setDocConcentration()
    LDA
    setBeta(double beta)
    Alias for setTopicConcentration()
    LDA
    setCheckpointInterval(int checkpointInterval)
    Sets the checkpoint interval (greater than or equal to 1), or disables checkpointing (-1).
    LDA
    setDocConcentration(double docConcentration)
    Replicates a Double docConcentration to create a symmetric prior.
    LDA
    setDocConcentration(Vector docConcentration)
    Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").
    LDA
    setK(int k)
    Set the number of topics to infer, i.e., the number of soft cluster centers.
    LDA
    setMaxIterations(int maxIterations)
    Set the maximum number of iterations allowed.
    LDA
    setOptimizer(String optimizerName)
    Set the LDAOptimizer used to perform the actual calculation by algorithm name.
    LDA
    setOptimizer(LDAOptimizer optimizer)
    LDAOptimizer used to perform the actual calculation (default = EMLDAOptimizer)
    LDA
    setSeed(long seed)
    Set the random seed for cluster initialization.
    LDA
    setTopicConcentration(double topicConcentration)
    Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms.

    Methods inherited from class java.lang.Object

    equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

    Methods inherited from interface org.apache.spark.internal.Logging

    initializeForcefully, initializeLogIfNecessary, initializeLogIfNecessary, initializeLogIfNecessary$default$2, isTraceEnabled, log, logDebug, logDebug, logDebug, logDebug, logError, logError, logError, logError, logInfo, logInfo, logInfo, logInfo, logName, LogStringContext, logTrace, logTrace, logTrace, logTrace, logWarning, logWarning, logWarning, logWarning, org$apache$spark$internal$Logging$$log_, org$apache$spark$internal$Logging$$log__$eq, withLogContext
  • Constructor Details

    • LDA

      public LDA()
      Constructs an LDA instance with default parameters.
  • Method Details

    • getK

      public int getK()
      Number of topics to infer, i.e., the number of soft cluster centers.
      Returns:
      the number of topics to infer
    • setK

      public LDA setK(int k)
      Set the number of topics to infer, i.e., the number of soft cluster centers. (default = 10)
      Parameters:
      k - the number of topics to infer
      Returns:
      this LDA instance, so that calls can be chained
    • getAsymmetricDocConcentration

      public Vector getAsymmetricDocConcentration()
      Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").

      This is the parameter to a Dirichlet distribution.

      Returns:
      the docConcentration vector
    • getDocConcentration

      public double getDocConcentration()
      Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").

      This method assumes the Dirichlet distribution is symmetric and can be described by a single Double parameter. It should fail if docConcentration is asymmetric.

      Returns:
      the single docConcentration value shared by all topics
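The symmetric-prior assumption behind getDocConcentration() can be sketched in plain Java as follows. This is an illustration only, not Spark's implementation; the class name `SymmetricAlpha` and the choice of `IllegalArgumentException` are assumptions.

```java
// Hypothetical sketch, not Spark's implementation.
final class SymmetricAlpha {

    /**
     * Returns the single value shared by every entry of alpha.
     * Throws if the entries differ, i.e. if the prior is asymmetric.
     * The exception type used here is an assumption.
     */
    static double symmetricValue(double[] alpha) {
        if (alpha.length == 0) {
            throw new IllegalArgumentException("alpha must not be empty");
        }
        double first = alpha[0];
        for (double a : alpha) {
            if (a != first) {
                throw new IllegalArgumentException("docConcentration is asymmetric");
            }
        }
        return first;
    }

    public static void main(String[] args) {
        System.out.println(symmetricValue(new double[] {0.5, 0.5, 0.5}));
    }
}
```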
    • setDocConcentration

      public LDA setDocConcentration(Vector docConcentration)
      Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").

      This is the parameter to a Dirichlet distribution, where larger values mean more smoothing (more regularization).

      If set to a singleton vector Vector(-1), then docConcentration is set automatically. If set to singleton vector Vector(t) where t != -1, then t is replicated to a vector of length k during LDAOptimizer.initialize(). Otherwise, the docConcentration vector must be length k. (default = Vector(-1) = automatic)

      Optimizer-specific parameter settings:
      - EM
        - Currently only supports symmetric distributions, so all values in the vector should be the same.
        - Values should be greater than 1.0.
        - default = uniformly (50 / k) + 1, where 50/k is common in LDA libraries and +1 follows from Asuncion et al. (2009), who recommend a +1 adjustment for EM.
      - Online
        - Values should be greater than or equal to 0.
        - default = uniformly (1.0 / k), following the implementation from here.

      Parameters:
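      docConcentration - the Dirichlet parameter vector: Vector(-1) for automatic, a singleton Vector(t) to be replicated, or a vector of length k
      Returns:
      this LDA instance, so that calls can be chained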
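The rules for setDocConcentration(Vector) (automatic default, singleton replication, or an explicit length-k vector) can be illustrated with a small plain-Java sketch. It is not Spark's code: the class `DocConcentrationRules`, the use of `double[]` in place of Vector, and the string optimizer names are assumptions made for illustration, and the per-optimizer range checks are omitted.

```java
import java.util.Arrays;

// Hypothetical sketch of the documented rules; not Spark's implementation.
final class DocConcentrationRules {

    /** Expands a user-supplied docConcentration to a vector of length k. */
    static double[] resolve(double[] alpha, int k, String optimizer) {
        double[] out = new double[k];
        if (alpha.length == 1 && alpha[0] == -1.0) {
            // Automatic: use the documented per-optimizer default.
            double value;
            switch (optimizer) {
                case "em":     value = (50.0 / k) + 1.0; break;
                case "online": value = 1.0 / k;          break;
                default: throw new IllegalArgumentException("unknown optimizer: " + optimizer);
            }
            Arrays.fill(out, value);
        } else if (alpha.length == 1) {
            // Singleton Vector(t): replicate t to length k.
            Arrays.fill(out, alpha[0]);
        } else if (alpha.length == k) {
            // Explicit vector: must already have length k.
            System.arraycopy(alpha, 0, out, 0, k);
        } else {
            throw new IllegalArgumentException(
                "docConcentration must have length 1 or k=" + k + ", got " + alpha.length);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(resolve(new double[] {-1.0}, 10, "em")));
        System.out.println(Arrays.toString(resolve(new double[] {-1.0}, 4, "online")));
    }
}
```

With k = 10, the automatic EM default works out to (50 / 10) + 1 = 6.0 for every topic; with k = 4, the automatic Online default is 1.0 / 4 = 0.25.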
    • setDocConcentration

      public LDA setDocConcentration(double docConcentration)
      Replicates a Double docConcentration to create a symmetric prior.
      Parameters:
      docConcentration - the value replicated across all topics to form the symmetric prior
      Returns:
      this LDA instance, so that calls can be chained
    • getAsymmetricAlpha

      public Vector getAsymmetricAlpha()
      Alias for getAsymmetricDocConcentration()
      Returns:
      the docConcentration vector
    • getAlpha

      public double getAlpha()
      Alias for getDocConcentration()
      Returns:
      the single docConcentration value shared by all topics
    • setAlpha

      public LDA setAlpha(Vector alpha)
      Alias for setDocConcentration()
      Parameters:
      alpha - the docConcentration vector; see setDocConcentration(Vector)
      Returns:
      this LDA instance, so that calls can be chained
    • setAlpha

      public LDA setAlpha(double alpha)
      Alias for setDocConcentration()
      Parameters:
      alpha - the docConcentration value; see setDocConcentration(double)
      Returns:
      this LDA instance, so that calls can be chained
    • getTopicConcentration

      public double getTopicConcentration()
      Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms.

      This is the parameter to a symmetric Dirichlet distribution.

      Returns:
      the topicConcentration value
      Note:
      The topics' distributions over terms are called "beta" in the original LDA paper by Blei et al., but are called "phi" in many later papers such as Asuncion et al., 2009.
    • setTopicConcentration

      public LDA setTopicConcentration(double topicConcentration)
      Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms.

      This is the parameter to a symmetric Dirichlet distribution.

      If set to -1, then topicConcentration is set automatically. (default = -1 = automatic)

      Optimizer-specific parameter settings:
      - EM
        - Value should be greater than 1.0.
        - default = 0.1 + 1, where 0.1 gives a small amount of smoothing and +1 follows Asuncion et al. (2009), who recommend a +1 adjustment for EM.
      - Online
        - Value should be greater than or equal to 0.
        - default = (1.0 / k), following the implementation from here.

      Parameters:
      topicConcentration - the parameter of the symmetric Dirichlet prior on topics' distributions over terms, or -1 to set it automatically
      Returns:
      this LDA instance, so that calls can be chained
      Note:
      The topics' distributions over terms are called "beta" in the original LDA paper by Blei et al., but are called "phi" in many later papers such as Asuncion et al., 2009.
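The automatic topicConcentration defaults described for setTopicConcentration() amount to a small piece of arithmetic. The plain-Java sketch below illustrates it; the class `TopicConcentrationRules` and the string optimizer names are hypothetical, and this is not Spark's implementation.

```java
// Hypothetical sketch of the documented defaults; not Spark's implementation.
final class TopicConcentrationRules {

    /** Returns beta unchanged unless it is -1, in which case the documented default is used. */
    static double resolve(double beta, int k, String optimizer) {
        if (beta != -1.0) {
            return beta; // explicit value supplied by the caller
        }
        switch (optimizer) {
            case "em":     return 0.1 + 1.0; // small smoothing plus the +1 adjustment for EM
            case "online": return 1.0 / k;
            default: throw new IllegalArgumentException("unknown optimizer: " + optimizer);
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve(-1.0, 10, "em"));
        System.out.println(resolve(-1.0, 10, "online"));
    }
}
```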

    • getBeta

      public double getBeta()
      Alias for getTopicConcentration()
      Returns:
      the topicConcentration value
    • setBeta

      public LDA setBeta(double beta)
      Alias for setTopicConcentration()
      Parameters:
      beta - the topicConcentration value; see setTopicConcentration(double)
      Returns:
      this LDA instance, so that calls can be chained
    • getMaxIterations

      public int getMaxIterations()
      Maximum number of iterations allowed.
      Returns:
      the maximum number of iterations allowed
    • setMaxIterations

      public LDA setMaxIterations(int maxIterations)
      Set the maximum number of iterations allowed. (default = 20)
      Parameters:
      maxIterations - the maximum number of iterations allowed
      Returns:
      this LDA instance, so that calls can be chained
    • getSeed

      public long getSeed()
      Random seed for cluster initialization.
      Returns:
      the random seed for cluster initialization
    • setSeed

      public LDA setSeed(long seed)
      Set the random seed for cluster initialization.
      Parameters:
      seed - the random seed for cluster initialization
      Returns:
      this LDA instance, so that calls can be chained
    • getCheckpointInterval

      public int getCheckpointInterval()
      Period (in iterations) between checkpoints.
      Returns:
      the period (in iterations) between checkpoints
    • setCheckpointInterval

      public LDA setCheckpointInterval(int checkpointInterval)
      Sets the checkpoint interval (greater than or equal to 1), or disables checkpointing (-1). For example, 10 means that the cache will be checkpointed every 10 iterations. Checkpointing helps with recovery when nodes fail. It also helps eliminate temporary shuffle files on disk, which can be important when LDA is run for many iterations. If the checkpoint directory is not set in SparkContext, this setting is ignored. (default = 10)

      Parameters:
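      checkpointInterval - the number of iterations between checkpoints (greater than or equal to 1), or -1 to disable checkpointing
      Returns:
      this LDA instance, so that calls can be chained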
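The "every N iterations, or -1 to disable" rule for the checkpoint interval can be sketched in plain Java. The class `CheckpointSchedule` is hypothetical, and the exact iteration at which Spark triggers a checkpoint is an assumption; the sketch only illustrates the documented semantics.

```java
// Hypothetical sketch of the checkpoint-interval semantics; not Spark's implementation.
final class CheckpointSchedule {

    /** Accepts intervals >= 1, or -1 to disable checkpointing. */
    static void validate(int checkpointInterval) {
        if (checkpointInterval != -1 && checkpointInterval < 1) {
            throw new IllegalArgumentException(
                "checkpointInterval must be >= 1 or -1, got " + checkpointInterval);
        }
    }

    /** Whether a checkpoint is due after the given 1-based iteration (assumed timing). */
    static boolean isDue(int iteration, int checkpointInterval) {
        validate(checkpointInterval);
        return checkpointInterval != -1 && iteration % checkpointInterval == 0;
    }

    public static void main(String[] args) {
        for (int iteration = 1; iteration <= 20; iteration++) {
            if (isDue(iteration, 10)) {
                System.out.println("checkpoint after iteration " + iteration);
            }
        }
    }
}
```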
    • getOptimizer

      public LDAOptimizer getOptimizer()
      LDAOptimizer used to perform the actual calculation
      Returns:
      the LDAOptimizer used to perform the actual calculation
    • setOptimizer

      public LDA setOptimizer(LDAOptimizer optimizer)
      Set the LDAOptimizer used to perform the actual calculation. (default = EMLDAOptimizer)
      Parameters:
      optimizer - the LDAOptimizer to use
      Returns:
      this LDA instance, so that calls can be chained
    • setOptimizer

      public LDA setOptimizer(String optimizerName)
      Set the LDAOptimizer used to perform the actual calculation by algorithm name. Currently "em" and "online" are supported.
      Parameters:
      optimizerName - the algorithm name, either "em" or "online"
      Returns:
      this LDA instance, so that calls can be chained
    • run

      public LDAModel run(RDD<scala.Tuple2<Object,Vector>> documents)
      Learn an LDA model using the given dataset.

      Parameters:
      documents - RDD of documents, which are term (word) count vectors paired with IDs. The term count vectors are "bags of words" with a fixed-size vocabulary (where the vocabulary size is the length of the vector). Document IDs must be unique and greater than or equal to 0.
      Returns:
      Inferred LDA model
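The input format expected by run() (fixed-size term count vectors whose length equals the vocabulary size) can be illustrated without Spark. The plain-Java sketch below builds such vectors from tokenized documents; the class `BagOfWords` and its methods are hypothetical, and wrapping the arrays as Spark vectors paired with unique, non-negative document IDs is left out.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Spark: builds "bag of words" count vectors.
final class BagOfWords {

    /** Assigns each distinct term an index, in first-seen order. */
    static Map<String, Integer> buildVocabulary(List<String[]> docs) {
        Map<String, Integer> vocab = new LinkedHashMap<>();
        for (String[] doc : docs) {
            for (String token : doc) {
                vocab.putIfAbsent(token, vocab.size());
            }
        }
        return vocab;
    }

    /** Converts one tokenized document to a count vector of length vocab.size(). */
    static double[] toCountVector(String[] tokens, Map<String, Integer> vocab) {
        double[] counts = new double[vocab.size()];
        for (String token : tokens) {
            Integer index = vocab.get(token);
            if (index != null) {
                counts[index] += 1.0; // one more token of this term
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> docs = Arrays.asList(
            new String[] {"a", "b", "a"},
            new String[] {"b", "c"});
        Map<String, Integer> vocab = buildVocabulary(docs);
        for (int id = 0; id < docs.size(); id++) {
            // The list position serves as a unique, non-negative document ID.
            System.out.println(id + " -> " + Arrays.toString(toCountVector(docs.get(id), vocab)));
        }
    }
}
```

Every vector has the same length (the vocabulary size), regardless of how many tokens the document contains.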
    • run

      public LDAModel run(JavaPairRDD<Long,Vector> documents)
      Java-friendly version of run()
      Parameters:
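      documents - JavaPairRDD of documents, which are term (word) count vectors paired with IDs; the same requirements apply as for run(RDD)
      Returns:
      Inferred LDA model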