org.apache.spark.mllib.clustering
Class KMeans

Object
  extended by org.apache.spark.mllib.clustering.KMeans
All Implemented Interfaces:
java.io.Serializable, Logging

public class KMeans
extends Object
implements scala.Serializable, Logging

K-means clustering with support for multiple parallel runs and a k-means++-like initialization mode (the k-means|| algorithm by Bahmani et al.). When multiple concurrent runs are requested, they are executed together with joint passes over the data for efficiency.

This is an iterative algorithm that will make multiple passes over the data, so any RDDs given to it should be cached by the user.
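
To illustrate why caching matters, here is a minimal plain-Java sketch of one k-means (Lloyd) iteration. It is not Spark's implementation; the class LloydIterationSketch and its methods are hypothetical names chosen for this example, and points are plain double arrays rather than RDD elements. The thing to notice is that every iteration scans the whole data set again.

```java
import java.util.Arrays;

class LloydIterationSketch {
    /** Squared Euclidean distance between two points of equal dimension. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Index of the center closest to the given point. */
    static int nearest(double[] point, double[][] centers) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (squaredDistance(point, centers[c]) < squaredDistance(point, centers[best])) {
                best = c;
            }
        }
        return best;
    }

    /** One full pass over the data: assign every point, then average each cluster. */
    static double[][] step(double[][] points, double[][] centers) {
        int k = centers.length;
        int dim = centers[0].length;
        double[][] sums = new double[k][dim];
        int[] counts = new int[k];
        for (double[] p : points) {
            int c = nearest(p, centers);
            counts[c]++;
            for (int i = 0; i < dim; i++) {
                sums[c][i] += p[i];
            }
        }
        double[][] updated = new double[k][];
        for (int c = 0; c < k; c++) {
            if (counts[c] == 0) {
                updated[c] = centers[c].clone(); // an empty cluster keeps its old center
            } else {
                updated[c] = new double[dim];
                for (int i = 0; i < dim; i++) {
                    updated[c][i] = sums[c][i] / counts[c];
                }
            }
        }
        return updated;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 2}, {10, 0}, {10, 2}};
        double[][] centers = {{1, 1}, {9, 1}};
        for (int iter = 0; iter < 3; iter++) {
            centers = step(points, centers); // each iteration rescans all points
        }
        System.out.println(Arrays.deepToString(centers));
    }
}
```

With an uncached RDD, each of those rescans would recompute the input from its lineage, which is why the class documentation asks the caller to cache the data first.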

See Also:
Serialized Form

Constructor Summary
KMeans()
          Constructs a KMeans instance with default parameters: {k: 2, maxIterations: 20, runs: 1, initializationMode: "k-means||", initializationSteps: 5, epsilon: 1e-4, seed: random}.
 
Method Summary
 double getEpsilon()
          The distance threshold within which centers are considered to have converged.
 String getInitializationMode()
          The initialization algorithm.
 int getInitializationSteps()
          Number of steps for the k-means|| initialization mode.
 int getK()
          Number of clusters to create (k).
 int getMaxIterations()
          Maximum number of iterations to run.
 int getRuns()
          :: Experimental :: Number of runs of the algorithm to execute in parallel.
 long getSeed()
          The random seed for cluster initialization.
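static String K_MEANS_PARALLEL()
          Name of the k-means|| initialization mode ("k-means||").
static String RANDOM()
          Name of the random initialization mode ("random").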
 KMeansModel run(RDD<Vector> data)
          Train a K-means model on the given set of points; data should be cached for high performance, because this is an iterative algorithm.
 KMeans setEpsilon(double epsilon)
          Set the distance threshold within which centers are considered to have converged.
 KMeans setInitializationMode(String initializationMode)
          Set the initialization algorithm.
 KMeans setInitializationSteps(int initializationSteps)
          Set the number of steps for the k-means|| initialization mode.
 KMeans setK(int k)
          Set the number of clusters to create (k).
 KMeans setMaxIterations(int maxIterations)
          Set maximum number of iterations to run.
 KMeans setRuns(int runs)
          :: Experimental :: Set the number of runs of the algorithm to execute in parallel.
 KMeans setSeed(long seed)
          Set the random seed for cluster initialization.
static KMeansModel train(RDD<Vector> data, int k, int maxIterations)
          Trains a k-means model using the specified parameters and default values for the rest.
static KMeansModel train(RDD<Vector> data, int k, int maxIterations, int runs)
          Trains a k-means model using the specified parameters and default values for the rest.
static KMeansModel train(RDD<Vector> data, int k, int maxIterations, int runs, String initializationMode)
          Trains a k-means model using the given set of parameters.
static KMeansModel train(RDD<Vector> data, int k, int maxIterations, int runs, String initializationMode, long seed)
          Trains a k-means model using the given set of parameters.
 
Methods inherited from class Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface org.apache.spark.Logging
initializeIfNecessary, initializeLogging, isTraceEnabled, log_, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarning
 

Constructor Detail

KMeans

public KMeans()
Constructs a KMeans instance with default parameters: {k: 2, maxIterations: 20, runs: 1, initializationMode: "k-means||", initializationSteps: 5, epsilon: 1e-4, seed: random}.

Method Detail

RANDOM

public static String RANDOM()

K_MEANS_PARALLEL

public static String K_MEANS_PARALLEL()

train

public static KMeansModel train(RDD<Vector> data,
                                int k,
                                int maxIterations,
                                int runs,
                                String initializationMode,
                                long seed)
Trains a k-means model using the given set of parameters.

Parameters:
data - training points stored as RDD[Vector]
k - number of clusters
maxIterations - max number of iterations
runs - number of parallel runs, defaults to 1. The best model is returned.
initializationMode - initialization mode, either "random" or "k-means||" (default).
seed - random seed value for cluster initialization
Returns:
the trained KMeansModel

train

public static KMeansModel train(RDD<Vector> data,
                                int k,
                                int maxIterations,
                                int runs,
                                String initializationMode)
Trains a k-means model using the given set of parameters.

Parameters:
data - training points stored as RDD[Vector]
k - number of clusters
maxIterations - max number of iterations
runs - number of parallel runs, defaults to 1. The best model is returned.
initializationMode - initialization mode, either "random" or "k-means||" (default).
Returns:
the trained KMeansModel

train

public static KMeansModel train(RDD<Vector> data,
                                int k,
                                int maxIterations)
Trains a k-means model using the specified parameters and default values for the rest.

Parameters:
data - training points stored as RDD[Vector]
k - number of clusters
maxIterations - max number of iterations
Returns:
the trained KMeansModel

train

public static KMeansModel train(RDD<Vector> data,
                                int k,
                                int maxIterations,
                                int runs)
Trains a k-means model using the specified parameters and default values for the rest.

Parameters:
data - training points stored as RDD[Vector]
k - number of clusters
maxIterations - max number of iterations
runs - number of parallel runs. The best model is returned.
Returns:
the trained KMeansModel

getK

public int getK()
Number of clusters to create (k).

Returns:
the number of clusters (k)

setK

public KMeans setK(int k)
Set the number of clusters to create (k). Default: 2.


getMaxIterations

public int getMaxIterations()
Maximum number of iterations to run.

Returns:
the maximum number of iterations

setMaxIterations

public KMeans setMaxIterations(int maxIterations)
Set maximum number of iterations to run. Default: 20.


getInitializationMode

public String getInitializationMode()
The initialization algorithm. This can be either "random" or "k-means||".

Returns:
the initialization mode, either "random" or "k-means||"

setInitializationMode

public KMeans setInitializationMode(String initializationMode)
Set the initialization algorithm. This can be either "random" to choose random points as initial cluster centers, or "k-means||" to use a parallel variant of k-means++ (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: k-means||.

Parameters:
initializationMode - the initialization algorithm, either "random" or "k-means||"
Returns:
this KMeans instance, so that calls can be chained
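
The two seeding ideas can be sketched in plain Java as follows. This is only an illustration under stated assumptions: it shows uniform random seeding and the sequential k-means++ procedure, not the parallel k-means|| variant that this class uses, and the names InitializationSketch, randomCenters and plusPlusCenters are hypothetical. Both methods assume k is no larger than the number of points.

```java
import java.util.Arrays;
import java.util.Random;

class InitializationSketch {
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** "random"-style seeding: pick k distinct points uniformly at random. */
    static double[][] randomCenters(double[][] points, int k, long seed) {
        Random rng = new Random(seed);
        int[] idx = new int[points.length];
        for (int i = 0; i < idx.length; i++) {
            idx[i] = i;
        }
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            // Partial Fisher-Yates shuffle: swap a random remaining index into slot c.
            int j = c + rng.nextInt(idx.length - c);
            int tmp = idx[c];
            idx[c] = idx[j];
            idx[j] = tmp;
            centers[c] = points[idx[c]].clone();
        }
        return centers;
    }

    /** Sequential k-means++ seeding: each later center is drawn with probability proportional to D^2. */
    static double[][] plusPlusCenters(double[][] points, int k, long seed) {
        Random rng = new Random(seed);
        double[][] centers = new double[k][];
        centers[0] = points[rng.nextInt(points.length)].clone();
        double[] dist = new double[points.length]; // squared distance to the nearest chosen center
        Arrays.fill(dist, Double.POSITIVE_INFINITY);
        for (int c = 1; c < k; c++) {
            double total = 0.0;
            for (int i = 0; i < points.length; i++) {
                dist[i] = Math.min(dist[i], squaredDistance(points[i], centers[c - 1]));
                total += dist[i];
            }
            double target = rng.nextDouble() * total;
            int chosen = points.length - 1; // fallback if every remaining distance is zero
            double cumulative = 0.0;
            for (int i = 0; i < points.length; i++) {
                cumulative += dist[i];
                if (cumulative > target) {
                    chosen = i;
                    break;
                }
            }
            centers[c] = points[chosen].clone();
        }
        return centers;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        System.out.println(Arrays.deepToString(randomCenters(points, 2, 42L)));
        System.out.println(Arrays.deepToString(plusPlusCenters(points, 2, 42L)));
    }
}
```

Passing the same seed twice yields the same initial centers in this sketch, which mirrors the purpose of the seed parameter for cluster initialization.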

getRuns

public int getRuns()
:: Experimental :: Number of runs of the algorithm to execute in parallel.

Returns:
the number of parallel runs

setRuns

public KMeans setRuns(int runs)
:: Experimental :: Set the number of runs of the algorithm to execute in parallel. We initialize the algorithm this many times with random starting conditions (configured by the initialization mode), then return the best clustering found over any run. Default: 1.

Parameters:
runs - the number of runs of the algorithm to execute in parallel
Returns:
this KMeans instance, so that calls can be chained
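
Selecting the best clustering across runs can be sketched as below. The sketch assumes that "best" means the lowest sum of squared distances from each point to its nearest center, which this page does not spell out; BestRunSketch, cost and bestRun are hypothetical names, and the candidate center sets stand in for the results of separate runs.

```java
class BestRunSketch {
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Sum over all points of the squared distance to the nearest center. */
    static double cost(double[][] points, double[][] centers) {
        double total = 0.0;
        for (double[] p : points) {
            double best = Double.POSITIVE_INFINITY;
            for (double[] c : centers) {
                best = Math.min(best, squaredDistance(p, c));
            }
            total += best;
        }
        return total;
    }

    /** Index of the candidate center set with the lowest cost. */
    static int bestRun(double[][] points, double[][][] candidates) {
        int best = 0;
        double bestCost = cost(points, candidates[0]);
        for (int r = 1; r < candidates.length; r++) {
            double c = cost(points, candidates[r]);
            if (c < bestCost) {
                bestCost = c;
                best = r;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 2}, {10, 0}, {10, 2}};
        double[][][] candidates = {
            {{5, 0}, {5, 2}},   // run 0: centers split the data poorly (cost 100)
            {{0, 1}, {10, 1}}   // run 1: centers sit in the two groups (cost 4)
        };
        System.out.println(bestRun(points, candidates)); // prints 1
    }
}
```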

getInitializationSteps

public int getInitializationSteps()
Number of steps for the k-means|| initialization mode.

Returns:
the number of initialization steps

setInitializationSteps

public KMeans setInitializationSteps(int initializationSteps)
Set the number of steps for the k-means|| initialization mode. This is an advanced setting; the default of 5 is almost always enough.

Parameters:
initializationSteps - the number of steps for the k-means|| initialization mode
Returns:
this KMeans instance, so that calls can be chained

getEpsilon

public double getEpsilon()
The distance threshold within which centers are considered to have converged.

Returns:
the convergence distance threshold

setEpsilon

public KMeans setEpsilon(double epsilon)
Set the distance threshold within which centers are considered to have converged. If all centers move less than this Euclidean distance in an iteration, that run stops iterating.

Parameters:
epsilon - the convergence distance threshold
Returns:
this KMeans instance, so that calls can be chained
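
The stopping rule described for setEpsilon can be sketched in plain Java as follows. This is an illustration of the rule as stated here, not Spark's code; ConvergenceSketch and hasConverged are hypothetical names, and centers are plain double arrays.

```java
class ConvergenceSketch {
    /** True when every center moved less than epsilon (Euclidean distance) in the last iteration. */
    static boolean hasConverged(double[][] oldCenters, double[][] newCenters, double epsilon) {
        for (int c = 0; c < oldCenters.length; c++) {
            double squaredMove = 0.0;
            for (int i = 0; i < oldCenters[c].length; i++) {
                double d = newCenters[c][i] - oldCenters[c][i];
                squaredMove += d * d;
            }
            // Comparing squared values avoids one square root per center.
            if (squaredMove >= epsilon * epsilon) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        double[][] before = {{0, 0}, {1, 1}};
        double[][] smallMove = {{0, 0.00005}, {1, 1}};
        double[][] largeMove = {{0, 0.001}, {1, 1}};
        System.out.println(hasConverged(before, smallMove, 1e-4)); // prints true
        System.out.println(hasConverged(before, largeMove, 1e-4)); // prints false
    }
}
```

A smaller epsilon demands tighter convergence and may use more iterations, up to the limit set by setMaxIterations.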

getSeed

public long getSeed()
The random seed for cluster initialization.

Returns:
the random seed

setSeed

public KMeans setSeed(long seed)
Set the random seed for cluster initialization.


run

public KMeansModel run(RDD<Vector> data)
Trains a k-means model on the given set of points. Because the algorithm is iterative, data should be cached for good performance.

Parameters:
data - training points stored as RDD[Vector]
Returns:
the trained KMeansModel