Class KMeans

Object
org.apache.spark.mllib.clustering.KMeans
All Implemented Interfaces:
Serializable, org.apache.spark.internal.Logging

public class KMeans extends Object implements Serializable, org.apache.spark.internal.Logging
K-means clustering with a k-means++-like initialization mode (the k-means|| algorithm by Bahmani et al.).

This is an iterative algorithm that will make multiple passes over the data, so any RDDs given to it should be cached by the user.
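To illustrate why caching matters, the following is a minimal plain-Java sketch of a single k-means pass (assign each point to its nearest center, then recompute each center as the mean of its points). It is not Spark's implementation; the class `LloydStepSketch` and its methods are hypothetical names, and in-memory arrays stand in for an RDD. The point is that every iteration scans the full data set once, so an uncached RDD would be recomputed on each pass.

```java
final class LloydStepSketch {
    /** Squared Euclidean distance between two points of equal dimension. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** One pass over all points: assign to the nearest center, then average. */
    static double[][] step(double[][] points, double[][] centers) {
        int k = centers.length;
        int dim = centers[0].length;
        double[][] sums = new double[k][dim];
        int[] counts = new int[k];

        // Full scan of the data; this is the part repeated on every iteration.
        for (double[] p : points) {
            int best = 0;
            double bestDist = Double.POSITIVE_INFINITY;
            for (int c = 0; c < k; c++) {
                double d = squaredDistance(p, centers[c]);
                if (d < bestDist) {
                    bestDist = d;
                    best = c;
                }
            }
            counts[best]++;
            for (int j = 0; j < dim; j++) {
                sums[best][j] += p[j];
            }
        }

        // New center = mean of assigned points; an empty cluster keeps its old center.
        double[][] updated = new double[k][];
        for (int c = 0; c < k; c++) {
            if (counts[c] == 0) {
                updated[c] = centers[c].clone();
            } else {
                updated[c] = new double[dim];
                for (int j = 0; j < dim; j++) {
                    updated[c][j] = sums[c][j] / counts[c];
                }
            }
        }
        return updated;
    }
}
```

Calling `step` repeatedly until the centers stop moving (or a maximum iteration count is reached) gives the overall loop; each call touches every point again.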

  • Nested Class Summary

    Nested classes/interfaces inherited from interface org.apache.spark.internal.Logging

    org.apache.spark.internal.Logging.LogStringContext, org.apache.spark.internal.Logging.SparkShellLoggingFilter
  • Constructor Summary

    Constructors
    Constructor
    Description
    KMeans()
    Constructs a KMeans instance with default parameters: {k: 2, maxIterations: 20, initializationMode: "k-means||", initializationSteps: 2, epsilon: 1e-4, seed: random, distanceMeasure: "euclidean"}.
  • Method Summary

    Modifier and Type
    Method
    Description
    String
    getDistanceMeasure()
    The distance measure used by the algorithm.
    double
    getEpsilon()
    The distance threshold within which centers are considered to have converged.
    String
    getInitializationMode()
    The initialization algorithm.
    int
    getInitializationSteps()
    Number of steps for the k-means|| initialization mode.
    int
    getK()
    Number of clusters to create (k).
    int
    getMaxIterations()
    Maximum number of iterations allowed.
    long
    getSeed()
    The random seed for cluster initialization.
    static String
    K_MEANS_PARALLEL()
    (undocumented)
    static String
    RANDOM()
    (undocumented)
    KMeansModel
    run(RDD<Vector> data)
    Trains a k-means model on the given set of points. Because the algorithm is iterative, the data should be cached for good performance.
    KMeans
    setDistanceMeasure(String distanceMeasure)
    Set the distance measure used by the algorithm.
    KMeans
    setEpsilon(double epsilon)
    Set the distance threshold within which centers are considered to have converged.
    KMeans
    setInitializationMode(String initializationMode)
    Set the initialization algorithm.
    KMeans
    setInitializationSteps(int initializationSteps)
    Set the number of steps for the k-means|| initialization mode.
    KMeans
    setInitialModel(KMeansModel model)
    Set the initial starting point, bypassing the random or k-means|| initialization. The condition model.k == this.k must be met; otherwise an IllegalArgumentException is thrown.
    KMeans
    setK(int k)
    Set the number of clusters to create (k).
    KMeans
    setMaxIterations(int maxIterations)
    Set the maximum number of iterations allowed.
    KMeans
    setSeed(long seed)
    Set the random seed for cluster initialization.
    static KMeansModel
    train(RDD<Vector> data, int k, int maxIterations)
    Trains a k-means model using the specified parameters and default values for all others.
    static KMeansModel
    train(RDD<Vector> data, int k, int maxIterations, String initializationMode)
    Trains a k-means model using the given set of parameters.
    static KMeansModel
    train(RDD<Vector> data, int k, int maxIterations, String initializationMode, long seed)
    Trains a k-means model using the given set of parameters.

    Methods inherited from class java.lang.Object

    equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

    Methods inherited from interface org.apache.spark.internal.Logging

    initializeForcefully, initializeLogIfNecessary, initializeLogIfNecessary, initializeLogIfNecessary$default$2, isTraceEnabled, log, logDebug, logDebug, logDebug, logDebug, logError, logError, logError, logError, logInfo, logInfo, logInfo, logInfo, logName, LogStringContext, logTrace, logTrace, logTrace, logTrace, logWarning, logWarning, logWarning, logWarning, org$apache$spark$internal$Logging$$log_, org$apache$spark$internal$Logging$$log__$eq, withLogContext
  • Constructor Details

    • KMeans

      public KMeans()
      Constructs a KMeans instance with default parameters: {k: 2, maxIterations: 20, initializationMode: "k-means||", initializationSteps: 2, epsilon: 1e-4, seed: random, distanceMeasure: "euclidean"}.
  • Method Details

    • RANDOM

      public static String RANDOM()
    • K_MEANS_PARALLEL

      public static String K_MEANS_PARALLEL()
    • train

      public static KMeansModel train(RDD<Vector> data, int k, int maxIterations, String initializationMode, long seed)
      Trains a k-means model using the given set of parameters.

      Parameters:
      data - Training points as an RDD of Vector types.
      k - Number of clusters to create.
      maxIterations - Maximum number of iterations allowed.
      initializationMode - The initialization algorithm. This can either be "random" or "k-means||". (default: "k-means||")
      seed - Random seed for cluster initialization. Default is to generate seed based on system time.
      Returns:
      (undocumented)
    • train

      public static KMeansModel train(RDD<Vector> data, int k, int maxIterations, String initializationMode)
      Trains a k-means model using the given set of parameters.

      Parameters:
      data - Training points as an RDD of Vector types.
      k - Number of clusters to create.
      maxIterations - Maximum number of iterations allowed.
      initializationMode - The initialization algorithm. This can either be "random" or "k-means||". (default: "k-means||")
      Returns:
      (undocumented)
    • train

      public static KMeansModel train(RDD<Vector> data, int k, int maxIterations)
      Trains a k-means model using the specified parameters and default values for all others.
      Parameters:
      data - Training points as an RDD of Vector types.
      k - Number of clusters to create.
      maxIterations - Maximum number of iterations allowed.
      Returns:
      (undocumented)
    • getK

      public int getK()
      Number of clusters to create (k).

      Returns:
      (undocumented)
      Note:
      It is possible for fewer than k clusters to be returned, for example, if there are fewer than k distinct points to cluster.
    • setK

      public KMeans setK(int k)
      Set the number of clusters to create (k).

      Parameters:
      k - (undocumented)
      Returns:
      (undocumented)
      Note:
      It is possible for fewer than k clusters to be returned, for example, if there are fewer than k distinct points to cluster. Default: 2.
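The note about fewer than k clusters can be made concrete with a small plain-Java sketch. It assumes nothing about Spark internals: the hypothetical helper `maxDistinctCenters` only counts distinct input points, which gives an upper bound on how many distinct cluster centers any k-means run could produce.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class DistinctPointsSketch {
    /**
     * Upper bound on the number of distinct centers: the smaller of k and
     * the number of distinct points in the input.
     */
    static int maxDistinctCenters(double[][] points, int k) {
        Set<List<Double>> distinct = new HashSet<>();
        for (double[] p : points) {
            // Box the coordinates so the set compares points by value.
            List<Double> key = new ArrayList<>(p.length);
            for (double x : p) {
                key.add(x);
            }
            distinct.add(key);
        }
        return Math.min(k, distinct.size());
    }
}
```

For example, three input points of which two are identical can support at most two distinct centers, even when k is 3.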
    • getMaxIterations

      public int getMaxIterations()
      Maximum number of iterations allowed.
      Returns:
      (undocumented)
    • setMaxIterations

      public KMeans setMaxIterations(int maxIterations)
      Set maximum number of iterations allowed. Default: 20.
      Parameters:
      maxIterations - (undocumented)
      Returns:
      (undocumented)
    • getInitializationMode

      public String getInitializationMode()
      The initialization algorithm. This can be either "random" or "k-means||".
      Returns:
      (undocumented)
    • setInitializationMode

      public KMeans setInitializationMode(String initializationMode)
      Set the initialization algorithm. This can be either "random" to choose random points as initial cluster centers, or "k-means||" to use a parallel variant of k-means++ (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: k-means||.
      Parameters:
      initializationMode - (undocumented)
      Returns:
      (undocumented)
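As background for the two initialization modes, here is a plain-Java sketch of sequential k-means++ seeding. Note the substitution: this is the classic sequential k-means++ procedure, not the parallel k-means|| variant this class uses, and it is not Spark code. The class and method names are hypothetical. The first center is drawn uniformly at random; each later center is drawn with probability proportional to its squared distance from the nearest center chosen so far, which spreads the initial centers out compared with purely random selection.

```java
import java.util.Arrays;
import java.util.Random;

final class KMeansPlusPlusSketch {
    /** Squared Euclidean distance between two points of equal dimension. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Chooses k initial centers from the input points using D^2 weighting. */
    static double[][] chooseCenters(double[][] points, int k, long seed) {
        if (k < 1 || k > points.length) {
            throw new IllegalArgumentException("k must be between 1 and the number of points");
        }
        Random rng = new Random(seed);
        double[][] centers = new double[k][];
        centers[0] = points[rng.nextInt(points.length)].clone();

        // minDist[i] = squared distance from point i to its nearest chosen center.
        double[] minDist = new double[points.length];
        Arrays.fill(minDist, Double.POSITIVE_INFINITY);

        for (int c = 1; c < k; c++) {
            double total = 0.0;
            for (int i = 0; i < points.length; i++) {
                minDist[i] = Math.min(minDist[i], squaredDistance(points[i], centers[c - 1]));
                total += minDist[i];
            }
            int chosen;
            if (total == 0.0) {
                // Every point coincides with an existing center; fall back to uniform choice.
                chosen = rng.nextInt(points.length);
            } else {
                // Sample an index with probability minDist[i] / total.
                double target = rng.nextDouble() * total;
                double cumulative = 0.0;
                chosen = points.length - 1;
                for (int i = 0; i < points.length; i++) {
                    cumulative += minDist[i];
                    if (cumulative > target) {
                        chosen = i;
                        break;
                    }
                }
            }
            centers[c] = points[chosen].clone();
        }
        return centers;
    }
}
```

Passing the same seed yields the same centers, which mirrors the role of the seed parameter for cluster initialization.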
    • getInitializationSteps

      public int getInitializationSteps()
      Number of steps for the k-means|| initialization mode.
      Returns:
      (undocumented)
    • setInitializationSteps

      public KMeans setInitializationSteps(int initializationSteps)
      Set the number of steps for the k-means|| initialization mode. This is an advanced setting -- the default of 2 is almost always enough. Default: 2.
      Parameters:
      initializationSteps - (undocumented)
      Returns:
      (undocumented)
    • getEpsilon

      public double getEpsilon()
      The distance threshold within which centers are considered to have converged.
      Returns:
      (undocumented)
    • setEpsilon

      public KMeans setEpsilon(double epsilon)
      Set the distance threshold within which centers are considered to have converged. If all centers move less than this Euclidean distance, iteration stops.
      Parameters:
      epsilon - (undocumented)
      Returns:
      (undocumented)
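The convergence rule described for epsilon can be sketched in plain Java as follows. This is an illustration of the stated rule only, not Spark's implementation, and `ConvergenceSketch.converged` is a hypothetical name: iteration may stop once every center has moved less than epsilon (in Euclidean distance) between two consecutive passes.

```java
final class ConvergenceSketch {
    /**
     * Returns true when every center moved less than epsilon between
     * oldCenters and newCenters, measured by Euclidean distance.
     */
    static boolean converged(double[][] oldCenters, double[][] newCenters, double epsilon) {
        for (int c = 0; c < oldCenters.length; c++) {
            double squared = 0.0;
            for (int j = 0; j < oldCenters[c].length; j++) {
                double d = oldCenters[c][j] - newCenters[c][j];
                squared += d * d;
            }
            if (Math.sqrt(squared) >= epsilon) {
                return false; // at least one center is still moving
            }
        }
        return true;
    }
}
```

With the default epsilon of 1e-4, a center that shifts by 5e-5 counts as converged, while a shift of 0.1 does not.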
    • getSeed

      public long getSeed()
      The random seed for cluster initialization.
      Returns:
      (undocumented)
    • setSeed

      public KMeans setSeed(long seed)
      Set the random seed for cluster initialization.
      Parameters:
      seed - (undocumented)
      Returns:
      (undocumented)
    • getDistanceMeasure

      public String getDistanceMeasure()
      The distance measure used by the algorithm.
      Returns:
      (undocumented)
    • setDistanceMeasure

      public KMeans setDistanceMeasure(String distanceMeasure)
      Set the distance measure used by the algorithm.
      Parameters:
      distanceMeasure - (undocumented)
      Returns:
      (undocumented)
    • setInitialModel

      public KMeans setInitialModel(KMeansModel model)
      Set the initial starting point, bypassing the random or k-means|| initialization. The condition model.k == this.k must be met; otherwise an IllegalArgumentException is thrown.
      Parameters:
      model - (undocumented)
      Returns:
      (undocumented)
    • run

      public KMeansModel run(RDD<Vector> data)
      Trains a k-means model on the given set of points. Because the algorithm is iterative, the data should be cached for good performance.
      Parameters:
      data - (undocumented)
      Returns:
      (undocumented)