Class KMeans
Object
org.apache.spark.mllib.clustering.KMeans
- All Implemented Interfaces:
- Serializable, org.apache.spark.internal.Logging
K-means clustering with a k-means++-like initialization mode
 (the k-means|| algorithm by Bahmani et al.).
 
This is an iterative algorithm that will make multiple passes over the data, so any RDDs given to it should be cached by the user.
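To illustrate why the input is read repeatedly, the following is a minimal single-machine sketch of a k-means style iteration in plain Java. It is not Spark's implementation: the class LloydSketch and its methods are hypothetical names chosen for this example, and the data is a plain array rather than an RDD. Each pass of the outer loop scans every point once, which is the access pattern that makes caching worthwhile.

```java
import java.util.Arrays;

/** Hypothetical single-machine sketch; not Spark's implementation. */
final class LloydSketch {
    /** Squared Euclidean distance between two points of equal dimension. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Index of the center closest to the given point. */
    static int nearest(double[] point, double[][] centers) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (squaredDistance(point, centers[c]) < squaredDistance(point, centers[best])) {
                best = c;
            }
        }
        return best;
    }

    /** Runs at most maxIterations full passes over data, starting from initialCenters. */
    static double[][] fit(double[][] data, double[][] initialCenters, int maxIterations) {
        int k = initialCenters.length;
        int dim = data[0].length;
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = initialCenters[c].clone();
        }
        for (int iter = 0; iter < maxIterations; iter++) {
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            // One full pass over the data: assign each point to its nearest center.
            for (double[] p : data) {
                int c = nearest(p, centers);
                counts[c]++;
                for (int i = 0; i < dim; i++) {
                    sums[c][i] += p[i];
                }
            }
            // Move each non-empty cluster's center to the mean of its points.
            boolean changed = false;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous center
                }
                double[] updated = new double[dim];
                for (int i = 0; i < dim; i++) {
                    updated[i] = sums[c][i] / counts[c];
                }
                if (!Arrays.equals(updated, centers[c])) {
                    changed = true;
                }
                centers[c] = updated;
            }
            if (!changed) {
                break; // no center moved, so further passes would not change anything
            }
        }
        return centers;
    }
}
```

With four points forming two well-separated pairs and one initial center on each pair, this sketch settles after two passes; real inputs typically need more, up to the configured maximum.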
Nested Class Summary
Nested classes/interfaces inherited from interface org.apache.spark.internal.Logging
org.apache.spark.internal.Logging.LogStringContext, org.apache.spark.internal.Logging.SparkShellLoggingFilter
- 
Constructor Summary
Constructors
- KMeans(): Constructs a KMeans instance with default parameters: {k: 2, maxIterations: 20, initializationMode: "k-means||", initializationSteps: 2, epsilon: 1e-4, seed: random, distanceMeasure: "euclidean"}.
- 
Method Summary
Modifier and Type, Method, Description
- String getDistanceMeasure(): The distance measure used by the algorithm.
- double getEpsilon(): The distance threshold within which centers are considered to have converged.
- String getInitializationMode(): The initialization algorithm.
- int getInitializationSteps(): Number of steps for the k-means|| initialization mode.
- int getK(): Number of clusters to create (k).
- int getMaxIterations(): Maximum number of iterations allowed.
- long getSeed(): The random seed for cluster initialization.
- static String K_MEANS_PARALLEL()
- static String RANDOM()
- KMeansModel run(RDD<Vector> data): Train a k-means model on the given set of points; data should be cached for high performance, because this is an iterative algorithm.
- KMeans setDistanceMeasure(String distanceMeasure): Set the distance measure used by the algorithm.
- KMeans setEpsilon(double epsilon): Set the distance threshold within which centers are considered to have converged.
- KMeans setInitializationMode(String initializationMode): Set the initialization algorithm.
- KMeans setInitializationSteps(int initializationSteps): Set the number of steps for the k-means|| initialization mode.
- KMeans setInitialModel(KMeansModel model): Set the initial starting point, bypassing the random or k-means|| initialization. The condition model.k == this.k must be met; otherwise an IllegalArgumentException is thrown.
- KMeans setK(int k): Set the number of clusters to create (k).
- KMeans setMaxIterations(int maxIterations): Set the maximum number of iterations allowed.
- KMeans setSeed(long seed): Set the random seed for cluster initialization.
- static KMeansModel train(RDD<Vector> data, int k, int maxIterations): Trains a k-means model using the specified parameters and default values for the unspecified ones.
- static KMeansModel train(RDD<Vector> data, int k, int maxIterations, String initializationMode): Trains a k-means model using the given set of parameters.
- static KMeansModel train(RDD<Vector> data, int k, int maxIterations, String initializationMode, long seed): Trains a k-means model using the given set of parameters.
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.apache.spark.internal.Logging
initializeForcefully, initializeLogIfNecessary, initializeLogIfNecessary, initializeLogIfNecessary$default$2, isTraceEnabled, log, logBasedOnLevel, logDebug, logDebug, logDebug, logDebug, logError, logError, logError, logError, logInfo, logInfo, logInfo, logInfo, logName, LogStringContext, logTrace, logTrace, logTrace, logTrace, logWarning, logWarning, logWarning, logWarning, MDC, org$apache$spark$internal$Logging$$log_, org$apache$spark$internal$Logging$$log__$eq, withLogContext
- 
Constructor Details
- 
KMeans
public KMeans()
Constructs a KMeans instance with default parameters: {k: 2, maxIterations: 20, initializationMode: "k-means||", initializationSteps: 2, epsilon: 1e-4, seed: random, distanceMeasure: "euclidean"}.
 
- 
- 
Method Details
- 
RANDOM
public static String RANDOM()
- 
K_MEANS_PARALLEL
public static String K_MEANS_PARALLEL()
- 
train
public static KMeansModel train(RDD<Vector> data, int k, int maxIterations, String initializationMode, long seed)
Trains a k-means model using the given set of parameters.
- Parameters:
- data - Training points as an RDD of Vector types.
- k - Number of clusters to create.
- maxIterations - Maximum number of iterations allowed.
- initializationMode - The initialization algorithm. This can be either "random" or "k-means||". (default: "k-means||")
- seed - Random seed for cluster initialization. Default is to generate seed based on system time.
- Returns:
- The trained KMeansModel.
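The role of the seed parameter can be illustrated with a small plain-Java sketch of seeded random initialization. This is only an illustration of the idea that a fixed seed makes the choice of initial centers repeatable; SeededInitSketch and pickIndices are hypothetical names, and Spark's own sampling code is not shown here.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical sketch of seeded random initialization; not Spark's implementation. */
final class SeededInitSketch {
    /**
     * Picks k distinct indices in [0, n) with a seeded generator, using a partial
     * Fisher-Yates shuffle. Requires k <= n.
     */
    static int[] pickIndices(int n, int k, long seed) {
        Random rng = new Random(seed);
        int[] idx = new int[n];
        for (int i = 0; i < n; i++) {
            idx[i] = i;
        }
        for (int i = 0; i < k; i++) {
            // Swap position i with a uniformly chosen position in [i, n).
            int j = i + rng.nextInt(n - i);
            int tmp = idx[i];
            idx[i] = idx[j];
            idx[j] = tmp;
        }
        return Arrays.copyOf(idx, k);
    }
}
```

Calling pickIndices twice with the same arguments returns the same indices, which is the property a fixed seed is meant to provide when comparing runs.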
 
- 
train
public static KMeansModel train(RDD<Vector> data, int k, int maxIterations, String initializationMode)
Trains a k-means model using the given set of parameters.
- Parameters:
- data - Training points as an RDD of Vector types.
- k - Number of clusters to create.
- maxIterations - Maximum number of iterations allowed.
- initializationMode - The initialization algorithm. This can be either "random" or "k-means||". (default: "k-means||")
- Returns:
- The trained KMeansModel.
 
- 
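train
public static KMeansModel train(RDD<Vector> data, int k, int maxIterations)
Trains a k-means model using the specified parameters and default values for the unspecified ones.
- Parameters:
- data - Training points as an RDD of Vector types.
- k - Number of clusters to create.
- maxIterations - Maximum number of iterations allowed.
- Returns:
- The trained KMeansModel.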
 
- 
getK
public int getK()
Number of clusters to create (k).
- Returns:
- The number of clusters to create.
- Note:
- It is possible for fewer than k clusters to be returned, for example, if there are fewer than k distinct points to cluster.
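The note above can be made concrete with a short plain-Java sketch that counts distinct points and caps the requested k accordingly. The class DistinctPointsSketch and its methods are hypothetical and serve only to show why duplicates limit the number of distinct centers; the bound is an upper limit, not a prediction of what a given run returns.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch; shows why duplicate points limit the number of clusters. */
final class DistinctPointsSketch {
    /** Counts points that differ in at least one coordinate. */
    static int distinctCount(double[][] data) {
        Set<List<Double>> seen = new HashSet<>();
        for (double[] p : data) {
            List<Double> key = new ArrayList<>();
            for (double v : p) {
                key.add(v);
            }
            seen.add(key);
        }
        return seen.size();
    }

    /** Upper bound on the number of distinct, non-empty clusters for a requested k. */
    static int maxDistinctClusters(double[][] data, int k) {
        return Math.min(k, distinctCount(data));
    }
}
```

For the three points (1, 1), (1, 1), and (2, 2) with k = 3, the bound is 2, because only two distinct locations exist.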
 
- 
setK
public KMeans setK(int k)
Set the number of clusters to create (k).
- Parameters:
- k - Number of clusters to create.
- Returns:
- This KMeans instance.
- Note:
- It is possible for fewer than k clusters to be returned, for example, if there are fewer than k distinct points to cluster. Default: 2.
 
- 
getMaxIterations
public int getMaxIterations()
Maximum number of iterations allowed.
- Returns:
- The maximum number of iterations.
 
- 
setMaxIterations
public KMeans setMaxIterations(int maxIterations)
Set the maximum number of iterations allowed. Default: 20.
- Parameters:
- maxIterations - Maximum number of iterations allowed.
- Returns:
- This KMeans instance.
 
- 
getInitializationMode
public String getInitializationMode()
The initialization algorithm. This can be either "random" or "k-means||".
- Returns:
- The name of the initialization algorithm.
 
- 
setInitializationMode
public KMeans setInitializationMode(String initializationMode)
Set the initialization algorithm. This can be either "random" to choose random points as initial cluster centers, or "k-means||" to use a parallel variant of k-means++ (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: "k-means||".
- Parameters:
- initializationMode - The initialization algorithm, either "random" or "k-means||".
- Returns:
- This KMeans instance.
 
- 
getInitializationSteps
public int getInitializationSteps()
Number of steps for the k-means|| initialization mode.
- Returns:
- The number of initialization steps.
 
- 
setInitializationSteps
public KMeans setInitializationSteps(int initializationSteps)
Set the number of steps for the k-means|| initialization mode. This is an advanced setting; the default of 2 is almost always enough. Default: 2.
- Parameters:
- initializationSteps - Number of steps for the k-means|| initialization mode.
- Returns:
- This KMeans instance.
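To show what a "step" means in k-means|| style initialization, here is a rough plain-Java sketch of the oversampling phase described by Bahmani et al. It makes several assumptions that are not taken from this page: the oversampling factor of 2k, the class name KMeansParallelInitSketch, and the use of array indices as candidates are all choices of this sketch. The final reduction of the candidate set to exactly k centers is omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch of k-means||-style oversampling; not Spark's implementation. */
final class KMeansParallelInitSketch {
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Squared distance from a point to the closest already chosen candidate. */
    static double distanceToNearest(double[] p, double[][] data, List<Integer> chosen) {
        double best = Double.POSITIVE_INFINITY;
        for (int idx : chosen) {
            best = Math.min(best, squaredDistance(p, data[idx]));
        }
        return best;
    }

    /** Returns indices of candidate centers collected over the given number of steps. */
    static List<Integer> candidates(double[][] data, int k, int steps, long seed) {
        Random rng = new Random(seed);
        List<Integer> chosen = new ArrayList<>();
        chosen.add(rng.nextInt(data.length)); // start from one uniformly chosen point
        double oversampling = 2.0 * k;        // assumption of this sketch
        for (int step = 0; step < steps; step++) {
            // Cost of the current candidate set: sum of squared distances to the nearest candidate.
            double[] d2 = new double[data.length];
            double cost = 0.0;
            for (int i = 0; i < data.length; i++) {
                d2[i] = distanceToNearest(data[i], data, chosen);
                cost += d2[i];
            }
            if (cost == 0.0) {
                break; // every point already coincides with a candidate
            }
            // Each point is sampled independently, so this loop could run in parallel.
            for (int i = 0; i < data.length; i++) {
                if (rng.nextDouble() < oversampling * d2[i] / cost) {
                    chosen.add(i);
                }
            }
        }
        return chosen;
    }
}
```

Each step adds many candidates at once, favoring points far from the current set, which is why a small number of steps tends to be sufficient in this scheme.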
 
- 
getEpsilon
public double getEpsilon()
The distance threshold within which centers are considered to have converged.
- Returns:
- The convergence distance threshold.
 
- 
setEpsilon
public KMeans setEpsilon(double epsilon)
Set the distance threshold within which centers are considered to have converged. If all centers move less than this Euclidean distance in an iteration, the algorithm stops iterating. Default: 1e-4.
- Parameters:
- epsilon - The convergence distance threshold.
- Returns:
- This KMeans instance.
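The convergence rule described for epsilon can be sketched as a simple check between two consecutive sets of centers. This follows the wording above (every center must move less than epsilon in Euclidean distance); ConvergenceSketch and converged are hypothetical names, and the exact comparison Spark performs internally may differ at the boundary.

```java
/** Hypothetical sketch of an epsilon-based convergence check; not Spark's implementation. */
final class ConvergenceSketch {
    /** Returns true if every center moved less than epsilon between two iterations. */
    static boolean converged(double[][] oldCenters, double[][] newCenters, double epsilon) {
        for (int c = 0; c < oldCenters.length; c++) {
            double sum = 0.0;
            for (int i = 0; i < oldCenters[c].length; i++) {
                double d = oldCenters[c][i] - newCenters[c][i];
                sum += d * d;
            }
            if (Math.sqrt(sum) >= epsilon) {
                return false; // this center moved too far, so keep iterating
            }
        }
        return true;
    }
}
```

A larger epsilon stops the iteration earlier at the cost of less precise centers; a smaller one may use all allowed iterations.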
 
- 
getSeed
public long getSeed()
The random seed for cluster initialization.
- Returns:
- The random seed.
 
- 
setSeed
public KMeans setSeed(long seed)
Set the random seed for cluster initialization.
- Parameters:
- seed - Random seed for cluster initialization.
- Returns:
- This KMeans instance.
 
- 
getDistanceMeasure
public String getDistanceMeasure()
The distance measure used by the algorithm.
- Returns:
- The name of the distance measure.
 
- 
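setDistanceMeasure
public KMeans setDistanceMeasure(String distanceMeasure)
Set the distance measure used by the algorithm. Default: "euclidean".
- Parameters:
- distanceMeasure - The name of the distance measure.
- Returns:
- This KMeans instance.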
 
- 
setInitialModel
public KMeans setInitialModel(KMeansModel model)
Set the initial starting point, bypassing the random or k-means|| initialization. The condition model.k == this.k must be met; otherwise an IllegalArgumentException is thrown.
- Parameters:
- model - The model whose cluster centers are used as the starting point.
- Returns:
- This KMeans instance.
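The k-matching condition can be illustrated with a small builder-style sketch in plain Java. It assumes the check runs when the initial centers are supplied, so k has to be set first; InitialModelSketch, setInitialCenters, and the error message are hypothetical and stand in for the real classes, which are not reproduced here.

```java
/** Hypothetical builder sketch of the k-matching check; not Spark's implementation. */
final class InitialModelSketch {
    private int k = 2;                 // same default as documented for KMeans
    private double[][] initialCenters; // null means the normal initialization would run

    InitialModelSketch setK(int k) {
        this.k = k;
        return this;
    }

    /** Accepts the centers only if their count matches the currently configured k. */
    InitialModelSketch setInitialCenters(double[][] centers) {
        if (centers.length != k) {
            throw new IllegalArgumentException(
                "mismatched cluster count: expected " + k + ", got " + centers.length);
        }
        this.initialCenters = centers;
        return this;
    }

    boolean hasInitialCenters() {
        return initialCenters != null;
    }
}
```

Under this assumption, calling setK(3) before supplying three centers succeeds, while supplying three centers to an instance still at the default k of 2 fails.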
 
- 
run
public KMeansModel run(RDD<Vector> data)
Train a k-means model on the given set of points; data should be cached for high performance, because this is an iterative algorithm.
- Parameters:
- data - Training points as an RDD of Vector types.
- Returns:
- The trained KMeansModel.
 
 
-