Object
  org.apache.spark.mllib.clustering.KMeansModel
    org.apache.spark.mllib.clustering.StreamingKMeansModel
public class StreamingKMeansModel
:: Experimental ::
StreamingKMeansModel extends MLlib's KMeansModel for streaming algorithms, so it can keep track of a continuously updated weight associated with each cluster, and also update the model by doing a single iteration of the standard k-means algorithm.
The update algorithm uses the "mini-batch" KMeans rule, generalized to incorporate forgetfulness (i.e. decay). The update rule (for each cluster) is:
c_t+1 = [(c_t * n_t * a) + (x_t * m_t)] / [(n_t * a) + m_t]
n_t+1 = (n_t * a) + m_t
where c_t is the previously estimated centroid for that cluster, n_t is the number of points assigned to it thus far, x_t is the centroid estimated on the current batch, and m_t is the number of points assigned to that centroid in the current batch. Note that the denominator of the centroid update equals the new weight n_t+1, so the new centroid is a weighted average of the old centroid and the batch centroid.
The decay factor 'a' scales the contribution of the clusters as estimated thus far: 'a' is applied as a discount weighting on the existing cluster weights when new data are evaluated. If a=1, all batches are weighted equally. If a=0, new centroids are determined entirely by recent data. Lower values correspond to more forgetting.
Decay can optionally be specified by a half-life and an associated time unit. The time unit can either be a batch of data or a single data point. For data that arrived at time t, the half-life h is defined such that at time t + h the discount applied to that data is 0.5. The definition remains the same whether the time unit is given as batches or points.
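The update rule and the half-life conversion above can be sketched for a single cluster in plain Java. This is an illustrative, Spark-free sketch; the class and method names below are hypothetical, not part of the MLlib API:

```java
// Hypothetical single-cluster sketch of the decayed update rule; not Spark code.
public class DecayedCentroidUpdate {

    // c_t+1 = [(c_t * n_t * a) + (x_t * m_t)] / [(n_t * a) + m_t]
    static double updateCenter(double c, double n, double x, double m, double a) {
        return (c * n * a + x * m) / (n * a + m);
    }

    // n_t+1 = (n_t * a) + m_t
    static double updateWeight(double n, double m, double a) {
        return n * a + m;
    }

    // Half-life h in batches: per-batch discount a = 0.5^(1/h),
    // so after h batches the weight of the old data is halved.
    static double decayFromHalfLife(double h) {
        return Math.pow(0.5, 1.0 / h);
    }

    public static void main(String[] args) {
        double c = 0.0, n = 10.0;  // current centroid and weight
        double x = 1.0, m = 10.0;  // batch centroid and batch count
        // a = 1: old and new data weighted equally -> midpoint
        System.out.println(updateCenter(c, n, x, m, 1.0)); // 0.5
        // a = 0: centroid determined entirely by the new batch
        System.out.println(updateCenter(c, n, x, m, 0.0)); // 1.0
    }
}
```

With a half-life of one batch, `decayFromHalfLife(1.0)` gives a = 0.5, i.e. each batch halves the influence of everything seen before it.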
Constructor Summary
StreamingKMeansModel(Vector[] clusterCenters, double[] clusterWeights)
Method Summary
Vector[]              clusterCenters()
double[]              clusterWeights()
StreamingKMeansModel  update(RDD<Vector> data, double decayFactor, String timeUnit)
                        Perform a k-means update on a batch of data.
Methods inherited from class org.apache.spark.mllib.clustering.KMeansModel
  computeCost, k, load, predict, predict, predict, save
Methods inherited from class Object
  equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.apache.spark.Logging
  initializeIfNecessary, initializeLogging, isTraceEnabled, log_, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarning
Methods inherited from interface org.apache.spark.mllib.pmml.PMMLExportable
  toPMML, toPMML, toPMML, toPMML, toPMML
Constructor Detail
public StreamingKMeansModel(Vector[] clusterCenters, double[] clusterWeights)
Method Detail
public Vector[] clusterCenters()
  Overrides: clusterCenters in class KMeansModel
public double[] clusterWeights()
public StreamingKMeansModel update(RDD<Vector> data, double decayFactor, String timeUnit)
  Perform a k-means update on a batch of data.
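The per-batch semantics of update can be sketched, Spark-free, for one-dimensional data. This is a simplified single-machine sketch under stated assumptions (1-D points, nearest-center assignment by absolute distance); the real method operates on an RDD<Vector> and these names are illustrative only:

```java
import java.util.Arrays;

// Simplified single-machine sketch of a decayed mini-batch k-means update;
// not the Spark implementation.
public class MiniBatchKMeansSketch {
    double[] centers;  // c_t for each cluster (1-D for brevity)
    double[] weights;  // n_t for each cluster

    MiniBatchKMeansSketch(double[] centers, double[] weights) {
        this.centers = centers;
        this.weights = weights;
    }

    // Index of the closest center to point p.
    int nearest(double p) {
        int best = 0;
        for (int j = 1; j < centers.length; j++)
            if (Math.abs(p - centers[j]) < Math.abs(p - centers[best])) best = j;
        return best;
    }

    void update(double[] batch, double decayFactor) {
        int k = centers.length;
        double[] sums = new double[k];
        long[] counts = new long[k];
        for (double p : batch) {           // assign each point to its closest center
            int j = nearest(p);
            sums[j] += p;
            counts[j]++;
        }
        for (int j = 0; j < k; j++) {
            if (counts[j] == 0) continue;  // cluster received no points this batch
            double x = sums[j] / counts[j];            // batch centroid x_t
            double m = counts[j];                      // batch count m_t
            double discounted = weights[j] * decayFactor;  // n_t * a
            centers[j] = (centers[j] * discounted + x * m) / (discounted + m);
            weights[j] = discounted + m;               // n_t+1 = n_t * a + m_t
        }
    }

    public static void main(String[] args) {
        MiniBatchKMeansSketch model =
            new MiniBatchKMeansSketch(new double[]{0.0, 10.0}, new double[]{1.0, 1.0});
        model.update(new double[]{1.0, 1.0, 9.0, 11.0}, 0.9);
        System.out.println(Arrays.toString(model.centers));
        System.out.println(Arrays.toString(model.weights));
    }
}
```

Each call folds one batch into the model: old weights are discounted by the decay factor, then each center moves toward its batch centroid in proportion to how many points the batch contributed relative to the discounted history.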