Class StreamingKMeansModel

All Implemented Interfaces:
Serializable, org.apache.spark.internal.Logging, PMMLExportable, Saveable, scala.Serializable

public class StreamingKMeansModel extends KMeansModel implements org.apache.spark.internal.Logging
StreamingKMeansModel extends MLlib's KMeansModel for streaming algorithms, so it can keep track of a continuously updated weight associated with each cluster, and also update the model by doing a single iteration of the standard k-means algorithm.

The update algorithm uses the "mini-batch" KMeans rule, generalized to incorporate forgetfulness (i.e. decay). The update rule (for each cluster) is:

$$ \begin{align} c_{t+1} &= [(c_t * n_t * a) + (x_t * m_t)] / [n_t * a + m_t] \\ n_{t+1} &= n_t * a + m_t \end{align} $$

Where c_t is the previously estimated centroid for that cluster, n_t is the number of points assigned to it thus far, x_t is the centroid estimated on the current batch, and m_t is the number of points assigned to that centroid in the current batch.

The decay factor 'a' scales the contribution of the clusters as estimated thus far: it is applied as a discount to the existing estimate (c_t with weight n_t) whenever new data is incorporated. If a=1, all batches are weighted equally. If a=0, new centroids are determined entirely by recent data. Lower values correspond to more forgetting.
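The per-cluster update can be illustrated with a small standalone sketch in plain Java. It is not Spark's implementation: the class and method names (StreamingKMeansUpdateSketch, ClusterState, update) are hypothetical, and it assumes the new centroid is normalized by the decayed weight n_t * a + m_t (the same quantity as n_{t+1}), so that the result is a weighted average and a=0 reproduces the batch centroid exactly.

```java
import java.util.Arrays;

class StreamingKMeansUpdateSketch {

    /** One cluster's centroid together with its accumulated weight. */
    static final class ClusterState {
        final double[] centroid;
        final double weight;

        ClusterState(double[] centroid, double weight) {
            this.centroid = centroid;
            this.weight = weight;
        }
    }

    /**
     * Applies one decayed update to a single cluster.
     * c = previous centroid (c_t), n = previous weight (n_t),
     * x = centroid of the current batch (x_t), m = batch point count (m_t),
     * a = decay factor.
     */
    static ClusterState update(double[] c, double n, double[] x, double m, double a) {
        if (c.length != x.length) {
            throw new IllegalArgumentException("centroids must have the same dimension");
        }
        double discounted = n * a;          // decayed weight of the old estimate
        double newWeight = discounted + m;  // n_{t+1} = n_t * a + m_t
        if (newWeight <= 0.0) {
            // Nothing to average over: keep the previous centroid unchanged.
            return new ClusterState(c.clone(), newWeight);
        }
        double[] next = new double[c.length];
        for (int i = 0; i < c.length; i++) {
            // Weighted average of the old centroid and the batch centroid.
            next[i] = (c[i] * discounted + x[i] * m) / newWeight;
        }
        return new ClusterState(next, newWeight);
    }

    public static void main(String[] args) {
        double[] previous = {1.0, 1.0};
        double[] batch = {2.0, 4.0};
        for (double a : new double[] {1.0, 0.5, 0.0}) {
            ClusterState s = update(previous, 10.0, batch, 10.0, a);
            System.out.println("a=" + a + " centroid=" + Arrays.toString(s.centroid)
                    + " weight=" + s.weight);
        }
    }
}
```

With equal counts of 10 old and 10 new points, a=1 gives the plain average (1.5, 2.5) with weight 20, while a=0 discards the history and returns the batch centroid (2.0, 4.0) with weight 10.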

Decay can optionally be specified by a half life and an associated time unit. The time unit can be either a batch of data or a single data point. For data that arrived at time t, the half life h is defined such that at time t + h the discount applied to that data is 0.5. The definition is the same whether the time unit is given as batches or points.
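The relationship between a half life and a decay factor can be worked out in a few lines. The sketch below assumes the discount compounds multiplicatively, with one factor of a per time unit, so that a^h = 0.5 and therefore a = exp(ln(0.5) / h). The class and method names are hypothetical and are not part of the Spark API.

```java
class HalfLifeDecaySketch {

    /** Decay factor a such that a^halfLife = 0.5 (assumes per-unit multiplicative decay). */
    static double decayFactorFromHalfLife(double halfLife) {
        if (!(halfLife > 0.0)) {
            throw new IllegalArgumentException("halfLife must be positive");
        }
        return Math.exp(Math.log(0.5) / halfLife);
    }

    /** Total discount applied to data after `elapsed` time units (batches or points). */
    static double discountAfter(double decayFactor, double elapsed) {
        return Math.pow(decayFactor, elapsed);
    }

    public static void main(String[] args) {
        double a = decayFactorFromHalfLife(5.0);
        System.out.println("decay factor for half life 5: " + a);
        System.out.println("discount after 5 units: " + discountAfter(a, 5.0));
        System.out.println("discount after 10 units: " + discountAfter(a, 10.0));
    }
}
```

Under the same assumption, when the time unit is a single data point, a batch containing m points advances time by m units, so the older data would be discounted by a^m rather than by a.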

  • Constructor Details

    • StreamingKMeansModel

      public StreamingKMeansModel(Vector[] clusterCenters, double[] clusterWeights)
  • Method Details

    • clusterCenters

      public Vector[] clusterCenters()
      Overrides: clusterCenters in class KMeansModel
    • clusterWeights

      public double[] clusterWeights()
    • update

      public StreamingKMeansModel update(RDD<Vector> data, double decayFactor, String timeUnit)
      Perform a k-means update on a batch of data.
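      data - the batch of points, as vectors, used for this update
      decayFactor - the decay factor a applied to the existing cluster estimates
      timeUnit - the unit in which decay is measured, either batches or points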