public class StreamingKMeansModel extends KMeansModel implements org.apache.spark.internal.Logging
The update algorithm uses the "mini-batch" KMeans rule, generalized to incorporate forgetfulness (i.e. decay). The update rule (for each cluster) is:
$$ \begin{align} c_{t+1} &= \frac{c_t \cdot n_t \cdot a + x_t \cdot m_t}{n_t \cdot a + m_t} \\ n_{t+1} &= n_t \cdot a + m_t \end{align} $$
Where c_t is the previously estimated centroid for that cluster, n_t is the number of points assigned to it thus far, x_t is the centroid estimated on the current batch, and m_t is the number of points assigned to that centroid in the current batch.
The decay factor 'a' scales the contribution of the clusters as estimated thus far: when new data arrive, the existing cluster weights are discounted by a before being combined with the current batch. If a=1, all batches are weighted equally. If a=0, new centroids are determined entirely by recent data. Lower values correspond to more forgetting.
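As an illustration of this rule, the sketch below applies it to a single cluster using plain Scala arrays instead of Spark's Vector type; mergeCluster is a hypothetical helper written for this page, not part of the API.

```scala
// Illustrative sketch of the per-cluster update rule above.
// cT = c_t, nT = n_t, xT = x_t (centroid of the current batch), mT = m_t, a = decay factor.
def mergeCluster(
    cT: Array[Double], nT: Double,
    xT: Array[Double], mT: Double,
    a: Double): (Array[Double], Double) = {
  val nNext = nT * a + mT                          // n_{t+1} = n_t * a + m_t
  val cNext = cT.zip(xT).map { case (c, x) =>
    (c * nT * a + x * mT) / nNext                  // weighted average of c_t and x_t
  }
  (cNext, nNext)
}

// With a = 0.5, ten previously seen points count as five when merged with a
// five-point batch centered at (1, 1): the new centroid lands halfway.
val (c1, n1) = mergeCluster(Array(0.0, 0.0), 10.0, Array(1.0, 1.0), 5.0, 0.5)
// c1 = Array(0.5, 0.5), n1 = 10.0
```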
Decay can optionally be specified by a half life and an associated time unit. The time unit can either be a batch of data or a single data point. For data that arrived at time t, the half life h is defined such that at time t + h the discount applied to that data is 0.5. The definition remains the same whether the time unit is given as batches or points.
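Equivalently (as a consequence of the definition above), a half life h in the chosen time unit corresponds to the decay factor satisfying

$$ a^{h} = 0.5 \quad\Longrightarrow\quad a = 0.5^{1/h}, $$

so when the time unit is points, a batch containing m points discounts the existing weights by a^m rather than by a single factor of a.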
Nested classes/interfaces inherited from class org.apache.spark.mllib.clustering.KMeansModel: KMeansModel.Cluster$, KMeansModel.SaveLoadV1_0$, KMeansModel.SaveLoadV2_0$
Constructor and Description |
---|
StreamingKMeansModel(Vector[] clusterCenters, double[] clusterWeights) |
Modifier and Type | Method and Description |
---|---|
Vector[] | clusterCenters() |
double[] | clusterWeights() |
StreamingKMeansModel | update(RDD<Vector> data, double decayFactor, String timeUnit) Perform a k-means update on a batch of data. |
Methods inherited from class org.apache.spark.mllib.clustering.KMeansModel: computeCost, distanceMeasure, k, load, predict, predict, predict, save, trainingCost
Methods inherited from class java.lang.Object: equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.apache.spark.internal.Logging: $init$, initializeForcefully, initializeLogIfNecessary, initializeLogIfNecessary, initializeLogIfNecessary$default$2, initLock, isTraceEnabled, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarning, org$apache$spark$internal$Logging$$log__$eq, org$apache$spark$internal$Logging$$log_, uninitialize
public StreamingKMeansModel(Vector[] clusterCenters, double[] clusterWeights)
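A minimal construction sketch, assuming the imports below; the two centers and the uniform unit weights are illustrative starting values, not defaults.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.clustering.StreamingKMeansModel

// Two initial centers in R^2, each with weight 1.0 (an arbitrary starting state).
val centers: Array[Vector] = Array(Vectors.dense(0.0, 0.0), Vectors.dense(5.0, 5.0))
val weights: Array[Double] = Array(1.0, 1.0)
val model = new StreamingKMeansModel(centers, weights)
```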
public Vector[] clusterCenters()
Overrides: clusterCenters in class KMeansModel
public double[] clusterWeights()
public StreamingKMeansModel update(RDD<Vector> data, double decayFactor, String timeUnit)
Parameters:
data - the batch of input vectors to fold into the model
decayFactor - the decay factor a applied to the existing cluster weights
timeUnit - the unit of decay, either a batch of data or a single data point (see the class description)
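A hedged usage sketch for update, assuming a live SparkContext and that the accepted time-unit strings are "batches" and "points" (matching the two units described in the class overview); decayFactor = 1.0 weights all batches equally.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.clustering.StreamingKMeansModel

val sc = SparkContext.getOrCreate()

// Model state carried over from earlier batches (illustrative values).
val model = new StreamingKMeansModel(
  Array(Vectors.dense(0.0, 0.0), Vectors.dense(5.0, 5.0)),
  Array(1.0, 1.0))

// The latest batch of points from the stream.
val batch = sc.parallelize(Seq(
  Vectors.dense(0.2, -0.1),
  Vectors.dense(4.8, 5.3)))

// decayFactor = 1.0 means no forgetting; "batches" is assumed to be the
// time-unit string for per-batch decay (the per-point unit being "points").
val updated = model.update(batch, decayFactor = 1.0, timeUnit = "batches")
updated.clusterCenters.zip(updated.clusterWeights).foreach(println)
```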