org.apache.spark.mllib.clustering

LDA

class LDA extends Logging

:: Experimental ::

Latent Dirichlet Allocation (LDA), a topic model designed for text documents.

Terminology:

References:

Annotations
@Experimental()
See also

Latent Dirichlet allocation (Wikipedia)

Linear Supertypes
Logging, AnyRef, Any

Instance Constructors

  1. new LDA()

Value Members

  1. final def !=(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  2. final def !=(arg0: Any): Boolean

    Definition Classes
    Any
  3. final def ##(): Int

    Definition Classes
    AnyRef → Any
  4. final def ==(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  5. final def ==(arg0: Any): Boolean

    Definition Classes
    Any
  6. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  7. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  8. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  9. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  10. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  11. def getAlpha: Double

    Alias for getDocConcentration

  12. def getBeta: Double

    Alias for getTopicConcentration

  13. def getCheckpointInterval: Int

    Period (in iterations) between checkpoints.

  14. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  15. def getDocConcentration: Double

    Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").

    This is the parameter to a symmetric Dirichlet distribution.

  16. def getK: Int

    Number of topics to infer. I.e., the number of soft cluster centers.

  17. def getMaxIterations: Int

    Maximum number of iterations for learning.

  18. def getOptimizer: LDAOptimizer

    :: DeveloperApi ::

    LDAOptimizer used to perform the actual calculation

    Annotations
    @DeveloperApi()
  19. def getSeed: Long

    Random seed

  20. def getTopicConcentration: Double

    Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms.

    This is the parameter to a symmetric Dirichlet distribution.

    Note: The topics' distributions over terms are called "beta" in the original LDA paper by Blei et al., but are called "phi" in many later papers such as Asuncion et al., 2009.

  21. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  22. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  23. def isTraceEnabled(): Boolean

    Attributes
    protected
    Definition Classes
    Logging
  24. def log: Logger

    Attributes
    protected
    Definition Classes
    Logging
  25. def logDebug(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  26. def logDebug(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  27. def logError(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  28. def logError(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  29. def logInfo(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  30. def logInfo(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  31. def logName: String

    Attributes
    protected
    Definition Classes
    Logging
  32. def logTrace(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  33. def logTrace(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  34. def logWarning(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  35. def logWarning(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  36. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  37. final def notify(): Unit

    Definition Classes
    AnyRef
  38. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  39. def run(documents: JavaPairRDD[Long, Vector]): LDAModel

    Java-friendly version of run()

  40. def run(documents: RDD[(Long, Vector)]): LDAModel

    Learn an LDA model using the given dataset.

    documents

    RDD of documents, which are term (word) count vectors paired with IDs. The term count vectors are "bags of words" with a fixed-size vocabulary (where the vocabulary size is the length of the vector). Document IDs must be unique and >= 0.

    returns

    Inferred LDA model
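
    A minimal sketch of preparing input for run() and training a model. It assumes a local SparkContext and a hypothetical four-document corpus over a five-term vocabulary; the master URL, application name, term counts, and seed are illustrative choices, not part of the API.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Assumption: a local two-thread master, used only for illustration.
val sc = new SparkContext(
  new SparkConf().setMaster("local[2]").setAppName("LDARunExample"))

// Hypothetical term-count vectors; every vector has length 5 (the vocabulary size).
val termCounts: Seq[Vector] = Seq(
  Vectors.dense(2.0, 1.0, 0.0, 0.0, 1.0),
  Vectors.dense(3.0, 2.0, 0.0, 1.0, 0.0),
  Vectors.dense(0.0, 0.0, 4.0, 2.0, 1.0),
  Vectors.dense(0.0, 1.0, 3.0, 3.0, 0.0)
)

// Pair each vector with a unique, non-negative Long document ID.
val documents: RDD[(Long, Vector)] =
  sc.parallelize(termCounts.zipWithIndex.map { case (v, i) => (i.toLong, v) })

// Train with explicit k, iteration limit, and seed; other settings keep their defaults.
val model = new LDA().setK(2).setMaxIterations(10).setSeed(42L).run(documents)

// topicsMatrix has one row per vocabulary term and one column per topic.
val topics = model.topicsMatrix
val learnedK = model.k
val learnedVocabSize = model.vocabSize
println(s"Learned $learnedK topics over $learnedVocabSize terms")

sc.stop()
```

    The document IDs here come from zipWithIndex, which is one simple way to satisfy the uniqueness and non-negativity requirement.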

  41. def setAlpha(alpha: Double): LDA.this.type

    Alias for setDocConcentration()

  42. def setBeta(beta: Double): LDA.this.type

    Alias for setTopicConcentration()

  43. def setCheckpointInterval(checkpointInterval: Int): LDA.this.type

    Period (in iterations) between checkpoints (default = 10). Checkpointing helps with recovery (when nodes fail). It also helps with eliminating temporary shuffle files on disk, which can be important when LDA is run for many iterations. If the checkpoint directory is not set in org.apache.spark.SparkContext, this setting is ignored.

    See also

    org.apache.spark.SparkContext#setCheckpointDir
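
    Because the interval is ignored when no checkpoint directory is set, the two settings are usually configured together. The sketch below assumes a local SparkContext and a temporary directory created only for illustration; on a cluster the directory would typically be on a shared file system.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.LDA

// Assumption: a local two-thread master, used only for illustration.
val sc = new SparkContext(
  new SparkConf().setMaster("local[2]").setAppName("LDACheckpointExample"))

// Hypothetical checkpoint location; any directory writable by Spark would do.
val checkpointDir = Files.createTempDirectory("lda-checkpoints").toString
sc.setCheckpointDir(checkpointDir)

// Checkpoint every 5 iterations instead of the default 10.
val lda = new LDA().setMaxIterations(100).setCheckpointInterval(5)
val interval = lda.getCheckpointInterval
val iterations = lda.getMaxIterations
println(s"Checkpointing every $interval of $iterations iterations")

sc.stop()
```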

  44. def setDocConcentration(docConcentration: Double): LDA.this.type

    Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").

    This is the parameter to a symmetric Dirichlet distribution, where larger values mean more smoothing (more regularization).

    If set to -1, then docConcentration is set automatically. (default = -1 = automatic)

    Optimizer-specific parameter settings:

    • EM
      • Value should be > 1.0
      • default = (50 / k) + 1, where 50/k is common in LDA libraries and +1 follows Asuncion et al. (2009), who recommend a +1 adjustment for EM.
    • Online
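
    The documented EM default can be worked through with plain arithmetic. The helper below is hypothetical and not part of the LDA API; it only mirrors the rule stated above, in which -1 means "automatic" and resolves to (50 / k) + 1.

```scala
// Hypothetical helper mirroring the documented EM default; not a Spark function.
object EmDocConcentrationDefault {
  def resolve(docConcentration: Double, k: Int): Double = {
    require(k > 0, "k must be positive")
    // -1 requests the automatic value (50 / k) + 1; anything else is used as given.
    if (docConcentration == -1.0) 50.0 / k + 1.0 else docConcentration
  }
}

// With the default k = 10, the automatic value is 50 / 10 + 1.
val automatic = EmDocConcentrationDefault.resolve(-1.0, k = 10)
// An explicitly chosen value passes through unchanged.
val explicit = EmDocConcentrationDefault.resolve(2.5, k = 10)
println(s"automatic = $automatic, explicit = $explicit")
```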
  45. def setK(k: Int): LDA.this.type

    Number of topics to infer. I.e., the number of soft cluster centers. (default = 10)

  46. def setMaxIterations(maxIterations: Int): LDA.this.type

    Maximum number of iterations for learning. (default = 20)

  47. def setOptimizer(optimizerName: String): LDA.this.type

    Set the LDAOptimizer used to perform the actual calculation by algorithm name. Currently "em" and "online" are supported.
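
    A short sketch of selecting the optimizer by name. It assumes that the "online" name maps to the OnlineLDAOptimizer class in the same package, and it inspects the result through getOptimizer, which is marked as a developer API.

```scala
import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}

// Select the online optimizer by its algorithm name.
val lda = new LDA().setK(5).setOptimizer("online")

// Assumption: "online" resolves to an OnlineLDAOptimizer instance.
val usesOnline = lda.getOptimizer.isInstanceOf[OnlineLDAOptimizer]
println(s"Online optimizer selected: $usesOnline")
```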

  48. def setOptimizer(optimizer: LDAOptimizer): LDA.this.type

    :: DeveloperApi ::

    LDAOptimizer used to perform the actual calculation (default = EMLDAOptimizer)

    Annotations
    @DeveloperApi()
  49. def setSeed(seed: Long): LDA.this.type

    Random seed

  50. def setTopicConcentration(topicConcentration: Double): LDA.this.type

    Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms.

    This is the parameter to a symmetric Dirichlet distribution.

    Note: The topics' distributions over terms are called "beta" in the original LDA paper by Blei et al., but are called "phi" in many later papers such as Asuncion et al., 2009.

    If set to -1, then topicConcentration is set automatically. (default = -1 = automatic)

    Optimizer-specific parameter settings:

    • EM
      • Value should be > 1.0
      • default = 0.1 + 1, where 0.1 gives a small amount of smoothing and +1 follows Asuncion et al. (2009), who recommend a +1 adjustment for EM.
    • Online
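
    The following sketch sets the topic concentration explicitly and reads it back through the documented aliases. The values 1.1 and 1.5 are illustrative only, chosen to satisfy the EM constraint (> 1.0) listed above.

```scala
import org.apache.spark.mllib.clustering.LDA

// Set the prior explicitly instead of relying on the automatic (-1) default.
val lda = new LDA().setK(5).setTopicConcentration(1.1)

// getBeta is an alias for getTopicConcentration, so both return the same value.
val aliasesAgree = lda.getBeta == lda.getTopicConcentration

// setBeta is an alias for setTopicConcentration and overwrites the earlier value.
lda.setBeta(1.5)
val updated = lda.getTopicConcentration
println(s"aliases agree: $aliasesAgree, updated value: $updated")
```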
  51. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  52. def toString(): String

    Definition Classes
    AnyRef → Any
  53. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  54. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  55. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
