BisectingKMeansModel

class pyspark.mllib.clustering.BisectingKMeansModel(java_model)

A clustering model derived from the bisecting k-means method.

New in version 2.0.0.

Examples

>>> from numpy import array
>>> from pyspark.mllib.clustering import BisectingKMeans
>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2)
>>> bskm = BisectingKMeans()
>>> model = bskm.train(sc.parallelize(data, 2), k=4)
>>> p = array([0.0, 0.0])
>>> model.predict(p)
0
>>> model.k
4
>>> model.computeCost(p)
0.0

Methods

`call`(name, *a)	Call a method of java_model.
`computeCost`(x)	Return the bisecting k-means cost (sum of squared distances of points to their nearest center) for this model on the given data.
`predict`(x)	Find the cluster that each of the points belongs to in this model.

Attributes

`clusterCenters`	Get the cluster centers, represented as a list of NumPy arrays.
`k`	Get the number of clusters.

Methods Documentation

call(name, *a): Call a method of java_model.

computeCost(x)

Return the bisecting k-means cost (the sum of squared distances from points to their nearest center) for this model on the given data. If given an RDD of points, returns the sum of the costs over all points.

New in version 2.0.0.

Parameters

x (pyspark.mllib.linalg.Vector or pyspark.RDD): A data point (or RDD of points) for which to compute the cost. pyspark.mllib.linalg.Vector can be replaced with equivalent objects (list, tuple, numpy.ndarray).
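The cost described above can be illustrated without a Spark cluster. The following is a minimal NumPy sketch of "sum of squared distances of points to their nearest center", not the library's implementation; the helper name `nearest_center_cost` and the two example centers are hypothetical and chosen only for illustration.

```python
import numpy as np

def nearest_center_cost(points, centers):
    """Sum of squared Euclidean distances from each point to its nearest center.

    Hypothetical helper for illustration; it is not part of pyspark.
    """
    # Accept a single point (1-D) or a batch of points (2-D).
    points = np.atleast_2d(np.asarray(points, dtype=float))
    centers = np.asarray(centers, dtype=float)
    # Pairwise squared distances, shape (n_points, n_centers).
    diffs = points[:, None, :] - centers[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    # Each point contributes the squared distance to its closest center.
    return float(sq_dists.min(axis=1).sum())

# Assumed centers for two clusters (illustrative values only).
centers = np.array([[0.5, 0.5], [8.5, 8.5]])
data = np.array([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])

single_cost = nearest_center_cost([0.0, 0.0], centers)  # cost of one point
total_cost = nearest_center_cost(data, centers)         # summed over all points
```

With a single point the result is that point's squared distance to its closest center; with several points the per-point costs are added, which mirrors the RDD case described above.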

predict(x)

Find the cluster that each of the points belongs to in this model.

New in version 2.0.0.

Parameters

x (pyspark.mllib.linalg.Vector or pyspark.RDD): A data point (or RDD of points) to determine cluster index. pyspark.mllib.linalg.Vector can be replaced with equivalent objects (list, tuple, numpy.ndarray).

Returns

int or pyspark.RDD of int: Predicted cluster index or an RDD of predicted cluster indices if the input is an RDD.
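The single-point versus many-points return shape can be sketched in plain NumPy. This is an illustration under a simplifying assumption, namely a flat nearest-center rule; the library may instead assign points by descending its internal cluster hierarchy, which need not always give the same index. The helper name `nearest_center_index` and the centers are hypothetical.

```python
import numpy as np

def nearest_center_index(x, centers):
    """Index of the closest center for one point, or a list of indices for many.

    Hypothetical helper for illustration; it assumes a flat nearest-center
    rule and is not part of pyspark.
    """
    centers = np.asarray(centers, dtype=float)
    x = np.asarray(x, dtype=float)
    batch = np.atleast_2d(x)
    # Pairwise squared distances, shape (n_points, n_centers).
    sq_dists = np.sum((batch[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    labels = sq_dists.argmin(axis=1)
    # Mirror the documented return shape: one int for a single point,
    # a collection of ints for a collection of points.
    return int(labels[0]) if x.ndim == 1 else labels.tolist()

# Assumed centers for two clusters (illustrative values only).
centers = np.array([[0.5, 0.5], [8.5, 8.5]])

one = nearest_center_index([0.0, 0.0], centers)
many = nearest_center_index([[1.0, 1.0], [9.0, 8.0], [8.0, 9.0]], centers)
```

Here `one` is a plain integer and `many` is a list with one index per input point, analogous to the int and RDD-of-int results described above.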

Attributes Documentation

clusterCenters: Get the cluster centers, represented as a list of NumPy arrays.

New in version 2.0.0.

k: Get the number of clusters.

New in version 2.0.0.
