class pyspark.mllib.clustering.KMeans

K-means clustering.

New in version 0.9.0.


Methods

train(rdd, k[, maxIterations, …])

Train a k-means clustering model.

Methods Documentation

classmethod train(rdd, k, maxIterations=100, initializationMode='k-means||', seed=None, initializationSteps=2, epsilon=0.0001, initialModel=None)

Train a k-means clustering model.

New in version 0.9.0.


Parameters

rdd : pyspark.RDD

Training points as an RDD of pyspark.mllib.linalg.Vector or convertible sequence types.


k : int

Number of clusters to create.

maxIterations : int, optional

Maximum number of iterations allowed. (default: 100)

initializationMode : str, optional

The initialization algorithm. This can be either “random” or “k-means||”. (default: “k-means||”)

seed : int, optional

Random seed value for cluster initialization. Set to None to generate a seed based on the system time. (default: None)

initializationSteps : int, optional

Number of steps for the k-means|| initialization mode. This is an advanced setting – the default of 2 is almost always enough. (default: 2)

epsilon : float, optional

Distance threshold within which a center will be considered to have converged. If all centers move less than this Euclidean distance, iterations are stopped. (default: 1e-4)

initialModel : KMeansModel, optional

Initial cluster centers can be provided as a KMeansModel object instead of being chosen by the "random" or "k-means||" initializationMode. (default: None)
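The stopping rule described for epsilon can be sketched in plain Python. This illustrates the rule as worded above and is not Spark's internal code; the helper name `has_converged` is hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math

def has_converged(old_centers, new_centers, epsilon=1e-4):
    """Return True if every center moved less than epsilon (Euclidean distance).

    Hypothetical helper: old_centers and new_centers are equal-length
    sequences of coordinate tuples, paired by position.
    """
    return all(
        math.dist(old, new) < epsilon
        for old, new in zip(old_centers, new_centers)
    )

# One center moved by 0.00005, which is below the default threshold.
tiny_move = has_converged([(0.0, 0.0)], [(0.00005, 0.0)])

# The second center moved by 0.1, so iteration would continue.
large_move = has_converged(
    [(0.0, 0.0), (1.0, 1.0)],
    [(0.0, 0.0), (1.0, 1.1)],
)
```

Under this rule a single center that is still moving is enough to keep iterating, until maxIterations is reached.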