org.apache.spark.mllib.clustering
Class GaussianMixture

Object
  extended by org.apache.spark.mllib.clustering.GaussianMixture
All Implemented Interfaces:
java.io.Serializable

public class GaussianMixture
extends Object
implements scala.Serializable

:: Experimental ::

This class performs expectation maximization for multivariate Gaussian Mixture Models (GMMs). A GMM represents a composite distribution of independent Gaussian distributions, each with an associated "mixing" weight that specifies its contribution to the composite.
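For reference, the composite distribution described above is conventionally written as follows. The symbols (w_i for the mixing weights, mu_i and Sigma_i for the mean and covariance of the i-th Gaussian) are notation assumed here, not names taken from this page.

```latex
p(x) = \sum_{i=1}^{k} w_i \, \mathcal{N}(x \mid \mu_i, \Sigma_i),
\qquad w_i \ge 0, \qquad \sum_{i=1}^{k} w_i = 1
```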

Given a set of sample points, this class will maximize the log-likelihood for a mixture of k Gaussians, iterating until the log-likelihood changes by less than convergenceTol or the maximum number of iterations is reached. While this process is generally guaranteed to converge, it is not guaranteed to find a global optimum.
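The stopping rule above can be sketched in plain Java for the one-dimensional case. This is an illustrative sketch only, not Spark's implementation: the class EmSketch, its Fit holder, the variance floor, and the deterministic quantile-based initialization (used here instead of random initialization so the result is reproducible) are all assumptions made for the example.

```java
import java.util.Arrays;

/** Hypothetical stand-alone sketch of EM for a 1-D Gaussian mixture; not Spark's code. */
final class EmSketch {

    /** Fitted parameters plus the number of iterations actually performed. */
    static final class Fit {
        final double[] weights;
        final double[] means;
        final double[] variances;
        final int iterations;

        Fit(double[] weights, double[] means, double[] variances, int iterations) {
            this.weights = weights;
            this.means = means;
            this.variances = variances;
            this.iterations = iterations;
        }
    }

    /** Density of a univariate Gaussian with the given mean and variance. */
    static double pdf(double x, double mean, double variance) {
        double d = x - mean;
        return Math.exp(-d * d / (2.0 * variance)) / Math.sqrt(2.0 * Math.PI * variance);
    }

    static Fit fit(double[] x, int k, double convergenceTol, int maxIterations) {
        int n = x.length;
        double[] sorted = x.clone();
        Arrays.sort(sorted);
        double mean = Arrays.stream(x).average().orElse(0.0);
        double totalVar = Arrays.stream(x).map(v -> (v - mean) * (v - mean)).sum() / n;

        // Assumed initialization: equal weights, means at evenly spaced quantiles,
        // and every variance set to the overall variance of the data.
        double[] w = new double[k];
        double[] mu = new double[k];
        double[] sigma2 = new double[k];
        for (int j = 0; j < k; j++) {
            w[j] = 1.0 / k;
            mu[j] = sorted[(int) ((j + 0.5) / k * n)];
            sigma2[j] = totalVar;
        }

        double[][] resp = new double[n][k];
        double previousLl = Double.NEGATIVE_INFINITY;
        int iterations = 0;
        while (iterations < maxIterations) {
            // E-step: responsibilities and the log-likelihood under current parameters.
            double ll = 0.0;
            for (int i = 0; i < n; i++) {
                double total = 0.0;
                for (int j = 0; j < k; j++) {
                    resp[i][j] = w[j] * pdf(x[i], mu[j], sigma2[j]);
                    total += resp[i][j];
                }
                for (int j = 0; j < k; j++) {
                    resp[i][j] /= total;
                }
                ll += Math.log(total);
            }
            // M-step: re-estimate weight, mean, and variance of each component.
            for (int j = 0; j < k; j++) {
                double nj = 0.0;
                double sum = 0.0;
                for (int i = 0; i < n; i++) {
                    nj += resp[i][j];
                    sum += resp[i][j] * x[i];
                }
                mu[j] = sum / nj;
                double sq = 0.0;
                for (int i = 0; i < n; i++) {
                    sq += resp[i][j] * (x[i] - mu[j]) * (x[i] - mu[j]);
                }
                sigma2[j] = Math.max(sq / nj, 1e-6); // floor avoids a zero variance
                w[j] = nj / n;
            }
            iterations++;
            // Stop once the log-likelihood changes by less than convergenceTol.
            if (Math.abs(ll - previousLl) < convergenceTol) {
                break;
            }
            previousLl = ll;
        }
        return new Fit(w, mu, sigma2, iterations);
    }

    public static void main(String[] args) {
        double[] x = {-0.5, 0.0, 0.5, -0.2, 0.2, 9.5, 10.0, 10.5, 9.8, 10.2};
        Fit fit = fit(x, 2, 0.01, 100);
        System.out.println("iterations=" + fit.iterations
            + " means=" + Arrays.toString(fit.means)
            + " weights=" + Arrays.toString(fit.weights));
    }
}
```

The loop has two exits: the tolerance check and the iteration cap. A different starting point can lead to a different local optimum, which is why the result is not guaranteed to be a global one.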

Note: For high-dimensional data (with many features), this algorithm may perform poorly. This is because (a) high-dimensional data is difficult to cluster at all (based on statistical/theoretical arguments) and (b) Gaussian distributions are prone to numerical issues in high dimensions.
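One numerical issue of the kind mentioned in (b) is floating-point underflow of the density itself. The sketch below is an illustration under stated assumptions, not a description of Spark's internals: it assumes a standard normal with identity covariance and evaluates the density at the mean, where it is largest. The class name GaussianUnderflowSketch is hypothetical.

```java
/** Hypothetical demo: a Gaussian density underflows in double precision as dimension grows. */
final class GaussianUnderflowSketch {

    /** Log-density of N(0, I_d) at its mean, which equals -(d / 2) * ln(2 * pi). */
    static double logDensityAtMean(int dimensions) {
        return -0.5 * dimensions * Math.log(2.0 * Math.PI);
    }

    public static void main(String[] args) {
        for (int d : new int[] {2, 100, 1000}) {
            double logDensity = logDensityAtMean(d);
            // Exponentiating recovers the density, which may underflow to 0.0.
            double density = Math.exp(logDensity);
            System.out.println("d=" + d + " logDensity=" + logDensity + " density=" + density);
        }
    }
}
```

With 1000 dimensions the log-density is still an ordinary finite number, but the density rounds to zero, so any computation that divides by a sum of such densities breaks down. Working in log space is a common way around this.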

param: k The number of independent Gaussians in the mixture model.
param: convergenceTol The maximum change in log-likelihood at which convergence is considered to have occurred.
param: maxIterations The maximum number of iterations to perform.

See Also:
Serialized Form

Constructor Summary
GaussianMixture()
          Constructs a default instance.
 
Method Summary
 double getConvergenceTol()
          Return the largest change in log-likelihood at which convergence is considered to have occurred.
 scala.Option<GaussianMixtureModel> getInitialModel()
          Return the user supplied initial GMM, if supplied
 int getK()
          Return the number of Gaussians in the mixture model
 int getMaxIterations()
          Return the maximum number of iterations to run
 long getSeed()
          Return the random seed
 GaussianMixtureModel run(JavaRDD<Vector> data)
          Java-friendly version of run()
 GaussianMixtureModel run(RDD<Vector> data)
          Perform expectation maximization
 GaussianMixture setConvergenceTol(double convergenceTol)
          Set the largest change in log-likelihood at which convergence is considered to have occurred.
 GaussianMixture setInitialModel(GaussianMixtureModel model)
          Set the initial GMM starting point, bypassing the random initialization.
 GaussianMixture setK(int k)
          Set the number of Gaussians in the mixture model.
 GaussianMixture setMaxIterations(int maxIterations)
          Set the maximum number of iterations to run.
 GaussianMixture setSeed(long seed)
          Set the random seed
 
Methods inherited from class Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

GaussianMixture

public GaussianMixture()
Constructs a default instance. The default parameters are {k: 2, convergenceTol: 0.01, maxIterations: 100, seed: random}.

Method Detail

setInitialModel

public GaussianMixture setInitialModel(GaussianMixtureModel model)
Set the initial GMM starting point, bypassing the random initialization. You must call setK() before calling this method, and the condition (model.k == this.k) must hold; otherwise an IllegalArgumentException is thrown.
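The call-order contract can be illustrated with stand-in classes. EstimatorStub and ModelStub below are hypothetical and exist only for this sketch; they mirror the documented precondition (model.k == this.k) and are not Spark's implementation.

```java
/** Hypothetical stand-ins illustrating the documented setK / setInitialModel contract. */
final class InitialModelContractSketch {

    /** Stand-in for a fitted model; only the component count matters here. */
    static final class ModelStub {
        final int k;

        ModelStub(int k) {
            this.k = k;
        }
    }

    /** Stand-in for the estimator's fluent configuration. */
    static final class EstimatorStub {
        private int k = 2;              // default stated for the no-argument constructor
        private ModelStub initialModel; // null means "use random initialization"

        EstimatorStub setK(int k) {
            this.k = k;
            return this;
        }

        EstimatorStub setInitialModel(ModelStub model) {
            // Documented precondition: the model's k must equal the configured k.
            if (model.k != this.k) {
                throw new IllegalArgumentException(
                    "Mismatched component count: model.k=" + model.k + ", k=" + this.k);
            }
            this.initialModel = model;
            return this;
        }
    }

    /** Returns true if configuring with the given counts is rejected. */
    static boolean isRejected(int k, int modelK) {
        try {
            new EstimatorStub().setK(k).setInitialModel(new ModelStub(modelK));
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("k=3, model.k=3 rejected: " + isRejected(3, 3));
        System.out.println("k=3, model.k=2 rejected: " + isRejected(3, 2));
    }
}
```

Because the check compares against the currently configured k, setting k first is what makes a model with a non-default component count acceptable.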

Parameters:
model - the GMM to use as the starting point; its k must equal this instance's k
Returns:
this GaussianMixture instance

getInitialModel

public scala.Option<GaussianMixtureModel> getInitialModel()
Return the user supplied initial GMM, if supplied


setK

public GaussianMixture setK(int k)
Set the number of Gaussians in the mixture model. Default: 2


getK

public int getK()
Return the number of Gaussians in the mixture model


setMaxIterations

public GaussianMixture setMaxIterations(int maxIterations)
Set the maximum number of iterations to run. Default: 100


getMaxIterations

public int getMaxIterations()
Return the maximum number of iterations to run


setConvergenceTol

public GaussianMixture setConvergenceTol(double convergenceTol)
Set the largest change in log-likelihood at which convergence is considered to have occurred.

Parameters:
convergenceTol - the largest change in log-likelihood at which convergence is considered to have occurred
Returns:
this GaussianMixture instance

getConvergenceTol

public double getConvergenceTol()
Return the largest change in log-likelihood at which convergence is considered to have occurred.

Returns:
the convergence tolerance

setSeed

public GaussianMixture setSeed(long seed)
Set the random seed


getSeed

public long getSeed()
Return the random seed


run

public GaussianMixtureModel run(RDD<Vector> data)
Perform expectation maximization on the given data and return the fitted model.


run

public GaussianMixtureModel run(JavaRDD<Vector> data)
Java-friendly version of run(): accepts a JavaRDD<Vector> instead of an RDD<Vector>.