Class GaussianMixture

Object
org.apache.spark.mllib.clustering.GaussianMixture
All Implemented Interfaces:
Serializable

public class GaussianMixture extends Object implements Serializable
This class performs expectation maximization for multivariate Gaussian Mixture Models (GMMs). A GMM represents a composite distribution of independent Gaussian distributions, each with an associated "mixing" weight that specifies its contribution to the composite.
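
Written out, the composite distribution described above has the standard mixture form. The symbols w_j, mu_j and Sigma_j (mixing weight, mean and covariance of the j-th Gaussian) are notation introduced here for illustration; they are not names used by this class.

```latex
p(x) = \sum_{j=1}^{k} w_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j),
\qquad w_j \ge 0,
\qquad \sum_{j=1}^{k} w_j = 1
```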

Given a set of sample points, this class will maximize the log-likelihood for a mixture of k Gaussians, iterating until the log-likelihood changes by less than convergenceTol, or until it has reached the max number of iterations. While this process is generally guaranteed to converge, it is not guaranteed to find a global optimum.
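
The stopping rule described above can be sketched with a toy one-dimensional EM loop in plain Java. This is not Spark's implementation: the class EmStoppingRuleSketch, its fit method, the equal starting weights, the unit starting variances and the variance floor are all assumptions made for illustration. Only the two exit conditions (log-likelihood change below convergenceTol, or maxIterations reached) come from the description.

```java
import java.util.Arrays;

/** Toy 1-D EM loop that illustrates the stopping rule; not Spark's implementation. */
final class EmStoppingRuleSketch {

    /** Fitted parameters plus the number of EM iterations actually performed. */
    static final class Fit {
        final double[] weights;
        final double[] means;
        final double[] variances;
        final int iterations;

        Fit(double[] weights, double[] means, double[] variances, int iterations) {
            this.weights = weights;
            this.means = means;
            this.variances = variances;
            this.iterations = iterations;
        }
    }

    /** Density of a univariate normal distribution. */
    static double normalPdf(double x, double mean, double variance) {
        double diff = x - mean;
        return Math.exp(-diff * diff / (2.0 * variance)) / Math.sqrt(2.0 * Math.PI * variance);
    }

    /**
     * Fits a 1-D mixture with one component per initial mean. Assumes every point has
     * non-negligible density under at least one component (there is no underflow guard).
     */
    static Fit fit(double[] xs, double[] initMeans, double convergenceTol, int maxIterations) {
        int n = xs.length;
        int k = initMeans.length;
        double[] w = new double[k];
        double[] mu = initMeans.clone();
        double[] sigma2 = new double[k];
        Arrays.fill(w, 1.0 / k);   // assumed start: equal mixing weights
        Arrays.fill(sigma2, 1.0);  // assumed start: unit variances

        double previousLl = Double.NEGATIVE_INFINITY;
        int iterations = 0;
        while (iterations < maxIterations) {
            // E-step: responsibilities and log-likelihood under the current parameters.
            double[][] resp = new double[n][k];
            double ll = 0.0;
            for (int i = 0; i < n; i++) {
                double total = 0.0;
                for (int j = 0; j < k; j++) {
                    resp[i][j] = w[j] * normalPdf(xs[i], mu[j], sigma2[j]);
                    total += resp[i][j];
                }
                for (int j = 0; j < k; j++) {
                    resp[i][j] /= total;
                }
                ll += Math.log(total);
            }
            // M-step: re-estimate the weight, mean and variance of each component.
            for (int j = 0; j < k; j++) {
                double nj = 0.0, sum = 0.0, sq = 0.0;
                for (int i = 0; i < n; i++) {
                    nj += resp[i][j];
                    sum += resp[i][j] * xs[i];
                }
                mu[j] = sum / nj;
                for (int i = 0; i < n; i++) {
                    double d = xs[i] - mu[j];
                    sq += resp[i][j] * d * d;
                }
                sigma2[j] = Math.max(sq / nj, 1e-6); // floor keeps the variance positive
                w[j] = nj / n;
            }
            iterations++;
            // Stop once the log-likelihood changes by less than convergenceTol.
            boolean converged = Math.abs(ll - previousLl) < convergenceTol;
            previousLl = ll;
            if (converged) {
                break;
            }
        }
        return new Fit(w, mu, sigma2, iterations);
    }
}
```

With well-separated data the loop exits on the tolerance check after a few iterations; with maxIterations set to 1 it exits on the iteration cap regardless of the tolerance.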

Parameters:
k - Number of independent Gaussians in the mixture model.
convergenceTol - Maximum change in log-likelihood at which convergence is considered to have occurred.
maxIterations - Maximum number of iterations allowed.

Note:
This algorithm is limited in the number of features it can handle, because it must store a covariance matrix whose size is quadratic in the number of features. Even when the number of features stays within this limit, the algorithm may perform poorly on high-dimensional data, for two reasons: (a) high-dimensional data is difficult to cluster at all (based on statistical/theoretical arguments), and (b) Gaussian distributions suffer from numerical issues in high dimensions.
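
To make the quadratic growth concrete, the following sketch estimates the memory taken by the covariance matrices alone. It assumes dense storage of 8-byte doubles and one d-by-d matrix per Gaussian; the class and method names are hypothetical, and the actual storage layout used by the library may differ.

```java
/** Back-of-the-envelope estimate of covariance storage; assumes dense 8-byte doubles. */
final class CovarianceFootprintSketch {

    /** Bytes needed for k dense d-by-d covariance matrices. */
    static long covarianceBytes(int k, int d) {
        // Cast first so the product is computed in 64-bit arithmetic.
        return (long) k * d * d * Double.BYTES;
    }

    public static void main(String[] args) {
        for (int d : new int[] {10, 100, 1000, 10000}) {
            System.out.printf("k=2, d=%d: %d bytes%n", d, covarianceBytes(2, d));
        }
    }
}
```

Doubling the number of features quadruples the estimate, which is why the feature count, rather than k, dominates the cost.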
  • Constructor Details

    • GaussianMixture

      public GaussianMixture()
      Constructs a default instance. The default parameters are {k: 2, convergenceTol: 0.01, maxIterations: 100, seed: random}.
  • Method Details

    • shouldDistributeGaussians

      public static boolean shouldDistributeGaussians(int k, int d)
      Heuristic for deciding whether to distribute the computation of the MultivariateGaussians: approximately when d is greater than 25, except when k is very small.
      Parameters:
      k - Number of Gaussians in the mixture model
      d - Number of features
      Returns:
      true if the computation should be distributed, false otherwise
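
      One formula that matches this description is sketched below. The expression ((k - 1) / k) * d > 25 is an assumption chosen to fit the wording ("d greater than 25, except when k is very small"); this page does not state the exact rule, and the class name is hypothetical.

```java
/** Sketch of a heuristic consistent with the description; the formula is an assumption. */
final class DistributeHeuristicSketch {

    /** Returns true when the scaled feature count exceeds 25. */
    static boolean shouldDistribute(int k, int d) {
        // (k - 1) / k is 0 for k = 1 and approaches 1 as k grows,
        // so small k raises the effective threshold on d.
        return ((k - 1.0) / k) * d > 25;
    }

    public static void main(String[] args) {
        System.out.println(shouldDistribute(2, 40));
        System.out.println(shouldDistribute(10, 30));
    }
}
```

      Under this assumed formula, k = 2 needs more than 50 features before the computation is distributed, while k = 10 needs only 28.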
    • setInitialModel

      public GaussianMixture setInitialModel(GaussianMixtureModel model)
      Set the initial GMM starting point, bypassing the random initialization. You must call setK() before calling this method, and model.k must equal this.k; otherwise an IllegalArgumentException is thrown.
      Parameters:
      model - The GMM to use as the starting point
      Returns:
      This GaussianMixture instance, to allow chained calls
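
      The ordering requirement can be illustrated with a stand-alone sketch. ToyMixture and ToyModel are hypothetical stand-ins, not the Spark classes; the sketch only mirrors the documented contract that the model's k must match the configured k, with a mismatch raising IllegalArgumentException.

```java
/** Stand-alone illustration of the k-must-match precondition; not the Spark classes. */
final class InitialModelCheckSketch {

    /** Hypothetical stand-in for a fitted model; only the component count matters here. */
    static final class ToyModel {
        final int k;

        ToyModel(int k) {
            this.k = k;
        }
    }

    /** Hypothetical estimator with fluent setters. */
    static final class ToyMixture {
        private int k = 2;              // same default as documented for GaussianMixture
        private ToyModel initialModel;  // null means "use random initialization"

        ToyMixture setK(int k) {
            this.k = k;
            return this;
        }

        ToyMixture setInitialModel(ToyModel model) {
            // The check uses the k configured so far, which is why setK must come first.
            if (model.k != this.k) {
                throw new IllegalArgumentException(
                    "Mismatched component count: model.k=" + model.k + ", k=" + this.k);
            }
            this.initialModel = model;
            return this;
        }

        boolean hasInitialModel() {
            return initialModel != null;
        }
    }
}
```

      Calling setK(3) and then setInitialModel with a 3-component model succeeds; calling setInitialModel with that model while k is still the default 2 throws.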
    • getInitialModel

      public scala.Option<GaussianMixtureModel> getInitialModel()
      Return the user-supplied initial GMM, if one was supplied.
      Returns:
      An Option holding the initial model, or an empty Option if none was set
    • setK

      public GaussianMixture setK(int k)
      Set the number of Gaussians in the mixture model. Default: 2
      Parameters:
      k - Number of Gaussians in the mixture model
      Returns:
      This GaussianMixture instance, to allow chained calls
    • getK

      public int getK()
      Return the number of Gaussians in the mixture model.
      Returns:
      The number of Gaussians in the mixture model
    • setMaxIterations

      public GaussianMixture setMaxIterations(int maxIterations)
      Set the maximum number of iterations allowed. Default: 100
      Parameters:
      maxIterations - Maximum number of iterations allowed
      Returns:
      This GaussianMixture instance, to allow chained calls
    • getMaxIterations

      public int getMaxIterations()
      Return the maximum number of iterations allowed.
      Returns:
      The maximum number of iterations allowed
    • setConvergenceTol

      public GaussianMixture setConvergenceTol(double convergenceTol)
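      Set the largest change in log-likelihood at which convergence is considered to have occurred. Default: 0.01
      Parameters:
      convergenceTol - Largest change in log-likelihood at which convergence is considered to have occurred
      Returns:
      This GaussianMixture instance, to allow chained calls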
    • getConvergenceTol

      public double getConvergenceTol()
      Return the largest change in log-likelihood at which convergence is considered to have occurred.
      Returns:
      The convergence tolerance
    • setSeed

      public GaussianMixture setSeed(long seed)
      Set the random seed. Default: random
      Parameters:
      seed - The random seed
      Returns:
      This GaussianMixture instance, to allow chained calls
    • getSeed

      public long getSeed()
      Return the random seed.
      Returns:
      The random seed
    • run

      public GaussianMixtureModel run(RDD<Vector> data)
      Perform expectation maximization on the given sample points.
      Parameters:
      data - The sample points, one Vector per point
      Returns:
      The fitted GaussianMixtureModel
    • run

      public GaussianMixtureModel run(JavaRDD<Vector> data)
      Java-friendly version of run().
      Parameters:
      data - The sample points, one Vector per point
      Returns:
      The fitted GaussianMixtureModel