org.apache.spark.mllib.clustering.KMeans

All Implemented Interfaces: Serializable, org.apache.spark.internal.Logging

public class KMeans extends Object implements Serializable, org.apache.spark.internal.Logging

K-means clustering with a k-means++-like initialization mode (the k-means|| algorithm by Bahmani et al.).

This is an iterative algorithm that will make multiple passes over the data, so any RDDs given to it should be cached by the user.

See Also:

Serialized Form

Nested Class Summary

Nested classes/interfaces inherited from interface org.apache.spark.internal.Logging
org.apache.spark.internal.Logging.LogStringContext, org.apache.spark.internal.Logging.SparkShellLoggingFilter
Constructor Summary

Constructors

Constructor

Description

KMeans()

Constructs a KMeans instance with default parameters: {k: 2, maxIterations: 20, initializationMode: "k-means||", initializationSteps: 2, epsilon: 1e-4, seed: random, distanceMeasure: "euclidean"}.
Method Summary

Modifier and Type

Method

Description

String

getDistanceMeasure()

The distance measure used by the algorithm.

double

getEpsilon()

The distance threshold within which we consider centers to have converged.

String

getInitializationMode()

The initialization algorithm.

int

getInitializationSteps()

Number of steps for the k-means|| initialization mode.

int

getK()

Number of clusters to create (k).

int

getMaxIterations()

Maximum number of iterations allowed.

long

getSeed()

The random seed for cluster initialization.

static String

K_MEANS_PARALLEL()

static String

RANDOM()

KMeansModel

run(RDD<Vector> data)

Train a K-means model on the given set of points; data should be cached for high performance, because this is an iterative algorithm.

KMeans

setDistanceMeasure(String distanceMeasure)

Set the distance measure used by the algorithm.

KMeans

setEpsilon(double epsilon)

Set the distance threshold within which we consider centers to have converged.

KMeans

setInitializationMode(String initializationMode)

Set the initialization algorithm.

KMeans

setInitializationSteps(int initializationSteps)

Set the number of steps for the k-means|| initialization mode.

KMeans

setInitialModel(KMeansModel model)

Set the initial starting point, bypassing the random or k-means|| initialization. The condition model.k == this.k must be met; otherwise an IllegalArgumentException is thrown.

KMeans

setK(int k)

Set the number of clusters to create (k).

KMeans

setMaxIterations(int maxIterations)

Set the maximum number of iterations allowed.

KMeans

setSeed(long seed)

Set the random seed for cluster initialization.

static KMeansModel

train(RDD<Vector> data, int k, int maxIterations)

Trains a k-means model using the specified parameters and default values for the unspecified ones.

static KMeansModel

train(RDD<Vector> data, int k, int maxIterations, String initializationMode)

Trains a k-means model using the given set of parameters.

static KMeansModel

train(RDD<Vector> data, int k, int maxIterations, String initializationMode, long seed)

Trains a k-means model using the given set of parameters.

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.apache.spark.internal.Logging
initializeForcefully, initializeLogIfNecessary, initializeLogIfNecessary, initializeLogIfNecessary$default$2, isTraceEnabled, log, logDebug, logDebug, logDebug, logDebug, logError, logError, logError, logError, logInfo, logInfo, logInfo, logInfo, logName, LogStringContext, logTrace, logTrace, logTrace, logTrace, logWarning, logWarning, logWarning, logWarning, org$apache$spark$internal$Logging$$log_, org$apache$spark$internal$Logging$$log__$eq, withLogContext

Constructor Details
- KMeans
  
  public KMeans()
  
  Constructs a KMeans instance with default parameters: {k: 2, maxIterations: 20, initializationMode: "k-means||", initializationSteps: 2, epsilon: 1e-4, seed: random, distanceMeasure: "euclidean"}.
Method Details
- RANDOM
  
  public static String RANDOM()
- K_MEANS_PARALLEL
  
  public static String K_MEANS_PARALLEL()
- train
  
  public static KMeansModel train(RDD<Vector> data, int k, int maxIterations, String initializationMode, long seed)
  
  Trains a k-means model using the given set of parameters.
  
  Parameters:
  
  data - Training points as an RDD of Vector types.
  
  k - Number of clusters to create.
  
  maxIterations - Maximum number of iterations allowed.
  
  initializationMode - The initialization algorithm. This can either be "random" or "k-means||". (default: "k-means||")
  
  seed - Random seed for cluster initialization. Default is to generate seed based on system time.
  
  Returns:
  
  (undocumented)
- train
  
  public static KMeansModel train(RDD<Vector> data, int k, int maxIterations, String initializationMode)
  
  Trains a k-means model using the given set of parameters.
  
  Parameters:
  
  data - Training points as an RDD of Vector types.
  
  k - Number of clusters to create.
  
  maxIterations - Maximum number of iterations allowed.
  
  initializationMode - The initialization algorithm. This can either be "random" or "k-means||". (default: "k-means||")
  
  Returns:
  
  (undocumented)
- train
  
  public static KMeansModel train(RDD<Vector> data, int k, int maxIterations)
  
  Trains a k-means model using the specified parameters and default values for the unspecified ones.
  
  Parameters:
  
  data - Training points as an RDD of Vector types.
  
  k - Number of clusters to create.
  
  maxIterations - Maximum number of iterations allowed.
  
  Returns:
  
  (undocumented)
- getK
  
  public int getK()
  
  Number of clusters to create (k).
  
  Returns:
  
  (undocumented)
  
  Note:
  
  It is possible for fewer than k clusters to be returned, for example, if there are fewer than k distinct points to cluster.
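The note above can be illustrated with a small standalone sketch. It does not use Spark: the class `DistinctPointBound` and the method `maxClusters` are hypothetical names introduced here, and points are plain `double[]` arrays rather than `Vector` instances. Under those assumptions, it computes the largest number of non-empty clusters a data set can support, which is the smaller of k and the number of distinct points.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration only; not part of the Spark API.
final class DistinctPointBound {
    // Returns an upper bound on the number of non-empty clusters:
    // at most k, and at most the number of distinct input points.
    static int maxClusters(double[][] points, int k) {
        Set<String> distinct = new HashSet<>();
        for (double[] p : points) {
            // The string form of the coordinates serves as a simple identity key.
            distinct.add(Arrays.toString(p));
        }
        return Math.min(k, distinct.size());
    }
}
```

For example, three input points of which two are identical can yield at most two clusters, even when k is 5.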
- setK
  
  public KMeans setK(int k)
  
  Set the number of clusters to create (k).
  
  Parameters:
  
  k - (undocumented)
  
  Returns:
  
  (undocumented)
  
  Note:
  
  It is possible for fewer than k clusters to be returned, for example, if there are fewer than k distinct points to cluster. Default: 2.
- getMaxIterations
  
  public int getMaxIterations()
  
  Maximum number of iterations allowed.
  
  Returns:
  
  (undocumented)
- setMaxIterations
  
  public KMeans setMaxIterations(int maxIterations)
  
  Set the maximum number of iterations allowed. Default: 20.
  
  Parameters:
  
  maxIterations - (undocumented)
  
  Returns:
  
  (undocumented)
- getInitializationMode
  
  public String getInitializationMode()
  
  The initialization algorithm. This can be either "random" or "k-means||".
  
  Returns:
  
  (undocumented)
- setInitializationMode
  
  public KMeans setInitializationMode(String initializationMode)
  
  Set the initialization algorithm. This can be either "random" to choose random points as initial cluster centers, or "k-means||" to use a parallel variant of k-means++ (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: k-means||.
  
  Parameters:
  
  initializationMode - (undocumented)
  
  Returns:
  
  (undocumented)
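As background for the "k-means||" mode, here is a rough sketch of the sequential k-means++ seeding idea that k-means|| parallelizes. This is not Spark's implementation and does not reproduce the k-means|| oversampling steps. The class `KMeansPlusPlusSketch` and its methods are hypothetical names, and the sketch assumes a non-empty in-memory array of points with k >= 1.

```java
import java.util.Arrays;
import java.util.Random;

// Illustrative sequential k-means++ seeding; not the k-means|| algorithm itself.
final class KMeansPlusPlusSketch {
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Picks the first center uniformly at random, then picks each later center
    // with probability proportional to its squared distance from the nearest
    // center chosen so far.
    static double[][] chooseCenters(double[][] points, int k, long seed) {
        Random rng = new Random(seed);
        double[][] centers = new double[k][];
        centers[0] = points[rng.nextInt(points.length)].clone();

        double[] minSq = new double[points.length];
        Arrays.fill(minSq, Double.POSITIVE_INFINITY);

        for (int c = 1; c < k; c++) {
            // Update each point's distance to its nearest chosen center.
            double total = 0.0;
            for (int i = 0; i < points.length; i++) {
                minSq[i] = Math.min(minSq[i], squaredDistance(points[i], centers[c - 1]));
                total += minSq[i];
            }
            // Weighted sampling: walk the cumulative weights up to a random target.
            double target = rng.nextDouble() * total;
            int chosen = points.length - 1; // fallback when all weights are zero
            double cumulative = 0.0;
            for (int i = 0; i < points.length; i++) {
                cumulative += minSq[i];
                if (cumulative > target) {
                    chosen = i;
                    break;
                }
            }
            centers[c] = points[chosen].clone();
        }
        return centers;
    }
}
```

Points that are already centers have weight zero, so they are not picked again while any other distinct point remains. The "random" mode, by contrast, ignores distances when choosing initial centers.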
- getInitializationSteps
  
  public int getInitializationSteps()
  
  Number of steps for the k-means|| initialization mode.
  
  Returns:
  
  (undocumented)
- setInitializationSteps
  
  public KMeans setInitializationSteps(int initializationSteps)
  
  Set the number of steps for the k-means|| initialization mode. This is an advanced setting -- the default of 2 is almost always enough. Default: 2.
  
  Parameters:
  
  initializationSteps - (undocumented)
  
  Returns:
  
  (undocumented)
- getEpsilon
  
  public double getEpsilon()
  
  The distance threshold within which we consider centers to have converged.
  
  Returns:
  
  (undocumented)
- setEpsilon
  
  public KMeans setEpsilon(double epsilon)
  
  Set the distance threshold within which we consider centers to have converged. If all centers move less than this Euclidean distance, the run stops iterating.
  
  Parameters:
  
  epsilon - (undocumented)
  
  Returns:
  
  (undocumented)
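The convergence rule described for epsilon can be sketched in plain Java as follows. This is a minimal illustration of the stated criterion, not Spark's internal code. The class `ConvergenceSketch` and the method `hasConverged` are hypothetical names, and centers are represented as `double[]` arrays.

```java
// Illustrative check of the epsilon criterion; not part of the Spark API.
final class ConvergenceSketch {
    // Returns true only if every center moved less than epsilon
    // (in Euclidean distance) between two consecutive iterations.
    static boolean hasConverged(double[][] oldCenters, double[][] newCenters, double epsilon) {
        for (int c = 0; c < oldCenters.length; c++) {
            double sumSq = 0.0;
            for (int i = 0; i < oldCenters[c].length; i++) {
                double d = newCenters[c][i] - oldCenters[c][i];
                sumSq += d * d;
            }
            if (Math.sqrt(sumSq) >= epsilon) {
                return false; // at least one center is still moving
            }
        }
        return true;
    }
}
```

With the default epsilon of 1e-4, a center that moves by 5e-5 counts as converged, while one that moves by 1e-3 does not.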
- getSeed
  
  public long getSeed()
  
  The random seed for cluster initialization.
  
  Returns:
  
  (undocumented)
- setSeed
  
  public KMeans setSeed(long seed)
  
  Set the random seed for cluster initialization.
  
  Parameters:
  
  seed - (undocumented)
  
  Returns:
  
  (undocumented)
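To show why fixing the seed matters, the sketch below draws k distinct point indices from a seeded generator, loosely mirroring what a "random" initialization needs to do. It is a standalone illustration rather than Spark's sampling code. The class `SeededRandomInit` and the method `sampleIndices` are hypothetical names, and the sketch assumes k is no larger than the number of points.

```java
import java.util.Arrays;
import java.util.Random;

// Illustrative seeded sampling; not part of the Spark API.
final class SeededRandomInit {
    // Draws k distinct indices from 0..n-1 with a partial Fisher-Yates shuffle.
    // The same (n, k, seed) always yields the same indices.
    static int[] sampleIndices(int n, int k, long seed) {
        Random rng = new Random(seed);
        int[] idx = new int[n];
        for (int i = 0; i < n; i++) {
            idx[i] = i;
        }
        for (int i = 0; i < k; i++) {
            int j = i + rng.nextInt(n - i); // pick from the not-yet-fixed tail
            int tmp = idx[i];
            idx[i] = idx[j];
            idx[j] = tmp;
        }
        return Arrays.copyOf(idx, k);
    }
}
```

Calling `sampleIndices(10, 3, 42L)` twice returns the same three indices in this sketch, which is the kind of repeatability an explicit seed is meant to provide.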
- getDistanceMeasure
  
  public String getDistanceMeasure()
  
  The distance measure used by the algorithm.
  
  Returns:
  
  (undocumented)
- setDistanceMeasure
  
  public KMeans setDistanceMeasure(String distanceMeasure)
  
  Set the distance measure used by the algorithm.
  
  Parameters:
  
  distanceMeasure - (undocumented)
  
  Returns:
  
  (undocumented)
- setInitialModel
  
  public KMeans setInitialModel(KMeansModel model)
  
  Set the initial starting point, bypassing the random or k-means|| initialization. The condition model.k == this.k must be met; otherwise an IllegalArgumentException is thrown.
  
  Parameters:
  
  model - (undocumented)
  
  Returns:
  
  (undocumented)
- run
  
  public KMeansModel run(RDD<Vector> data)
  
  Train a K-means model on the given set of points; data should be cached for high performance, because this is an iterative algorithm.
  
  Parameters:
  
  data - (undocumented)
  
  Returns:
  
  (undocumented)
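To make the "iterative algorithm" remark concrete, the following single-machine sketch runs Lloyd-style iterations over an in-memory array. Each iteration reads every point once, which is why the distributed version benefits from cached input. This is an illustration under simplifying assumptions (Euclidean distance, caller-supplied initial centers, empty clusters keep their previous center), not Spark's implementation. `LloydSketch` and `run` are hypothetical names.

```java
// Illustrative single-machine Lloyd iterations; not part of the Spark API.
final class LloydSketch {
    static double[][] run(double[][] points, double[][] initialCenters,
                          int maxIterations, double epsilon) {
        int k = initialCenters.length;
        int dim = points[0].length;
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = initialCenters[c].clone();
        }

        for (int iter = 0; iter < maxIterations; iter++) {
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];

            // One full pass over the data: assign each point to its nearest center.
            for (double[] p : points) {
                int best = 0;
                double bestSq = Double.POSITIVE_INFINITY;
                for (int c = 0; c < k; c++) {
                    double sq = 0.0;
                    for (int i = 0; i < dim; i++) {
                        double d = p[i] - centers[c][i];
                        sq += d * d;
                    }
                    if (sq < bestSq) {
                        bestSq = sq;
                        best = c;
                    }
                }
                counts[best]++;
                for (int i = 0; i < dim; i++) {
                    sums[best][i] += p[i];
                }
            }

            // Move each non-empty cluster's center to the mean of its points.
            boolean converged = true;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // assumption: an empty cluster keeps its old center
                }
                double movedSq = 0.0;
                for (int i = 0; i < dim; i++) {
                    double updated = sums[c][i] / counts[c];
                    double d = updated - centers[c][i];
                    movedSq += d * d;
                    centers[c][i] = updated;
                }
                if (movedSq >= epsilon * epsilon) {
                    converged = false;
                }
            }
            if (converged) {
                break; // every center moved less than epsilon
            }
        }
        return centers;
    }
}
```

Starting from centers (1, 1) and (9, 1) on the points (0, 0), (0, 2), (10, 0), (10, 2), the sketch settles on the centers (0, 1) and (10, 1) after two passes over the data.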
