org.apache.spark.mllib.clustering
Class StreamingKMeans

Object
  extended by org.apache.spark.mllib.clustering.StreamingKMeans
All Implemented Interfaces:
java.io.Serializable, Logging

public class StreamingKMeans
extends Object
implements Logging, scala.Serializable

:: Experimental ::

StreamingKMeans provides methods for configuring a streaming k-means analysis, training the model on streaming data, and using the model to make predictions on streaming data. See StreamingKMeansModel for details on the algorithm and update rules.

Use a builder pattern to construct a streaming k-means analysis in an application, like:


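  // Configure the analysis first; trainOn returns no value, so call it separately.
  val model = new StreamingKMeans()
    .setDecayFactor(0.5)
    .setK(3)
    .setRandomCenters(5, 100.0)
  // dstream: a DStream[Vector] of training data
  model.trainOn(dstream)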
 

See Also:
Serialized Form

Constructor Summary
StreamingKMeans()
           
StreamingKMeans(int k, double decayFactor, String timeUnit)
           
 
Method Summary
static String BATCHES()
           
 double decayFactor()
           
 int k()
           
 StreamingKMeansModel latestModel()
          Return the latest model.
static String POINTS()
           
 DStream<Object> predictOn(DStream<Vector> data)
          Use the clustering model to make predictions on batches of data from a DStream.
 JavaDStream<Integer> predictOn(JavaDStream<Vector> data)
          Java-friendly version of `predictOn`.
<K> DStream<scala.Tuple2<K,Object>> predictOnValues(DStream<scala.Tuple2<K,Vector>> data, scala.reflect.ClassTag<K> evidence$1)
          Use the model to make predictions on the values of a DStream and carry over its keys.
<K> JavaPairDStream<K,Integer> predictOnValues(JavaPairDStream<K,Vector> data)
          Java-friendly version of `predictOnValues`.
 StreamingKMeans setDecayFactor(double a)
          Set the decay factor directly (for forgetful algorithms).
 StreamingKMeans setHalfLife(double halfLife, String timeUnit)
          Set the half life and time unit ("batches" or "points") for forgetful algorithms.
 StreamingKMeans setInitialCenters(Vector[] centers, double[] weights)
          Specify initial centers directly.
 StreamingKMeans setK(int k)
          Set the number of clusters.
 StreamingKMeans setRandomCenters(int dim, double weight, long seed)
          Initialize random centers, requiring only the number of dimensions.
 String timeUnit()
           
 void trainOn(DStream<Vector> data)
          Update the clustering model by training on batches of data from a DStream.
 void trainOn(JavaDStream<Vector> data)
          Java-friendly version of `trainOn`.
 
Methods inherited from class Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface org.apache.spark.Logging
initializeIfNecessary, initializeLogging, isTraceEnabled, log_, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarning
 

Constructor Detail

StreamingKMeans

public StreamingKMeans(int k,
                       double decayFactor,
                       String timeUnit)

StreamingKMeans

public StreamingKMeans()

Method Detail

BATCHES

public static final String BATCHES()
Time unit constant "batches", for use with setHalfLife.

POINTS

public static final String POINTS()
Time unit constant "points", for use with setHalfLife.

k

public int k()

decayFactor

public double decayFactor()

timeUnit

public String timeUnit()

setK

public StreamingKMeans setK(int k)
Set the number of clusters.


setDecayFactor

public StreamingKMeans setDecayFactor(double a)
Set the decay factor directly (for forgetful algorithms).
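
As a worked illustration of what the decay factor controls, the update below is the forgetful rule as it is commonly stated for streaming k-means, applied to one cluster. It is a sketch under stated assumptions, not a quotation of this class's implementation; see StreamingKMeansModel for the authoritative form. The symbols are assumptions for this note: c_t is the current center, n_t its accumulated weight, x_t the mean of the new points assigned to the cluster, m_t their count, and a the decay factor.

```latex
\begin{aligned}
c_{t+1} &= \frac{c_t \, n_t \, a + x_t \, m_t}{n_t \, a + m_t} \\
n_{t+1} &= n_t \, a + m_t
\end{aligned}
```

With a = 1 all past data keeps its full weight; with a = 0 only the most recent data counts, so the center jumps to x_t. For example, with c_t = 0, n_t = 10, x_t = 4, m_t = 10 and a = 0.5, the new center is 40 / 15 (about 2.67) and the new weight is 15.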


setHalfLife

public StreamingKMeans setHalfLife(double halfLife,
                                   String timeUnit)
Set the half life and time unit ("batches" or "points") for forgetful algorithms.
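
The relation between a half-life and a decay factor can be sketched in plain Java. This assumes the usual conversion a = exp(ln(0.5) / halfLife), so that applying a for halfLife time units halves the weight of old data, and it assumes that with the "points" time unit the per-batch discount is a raised to the number of points in the batch. The class and method names are hypothetical and are not part of the Spark API.

```java
final class HalfLifeSketch {
    /** Decay factor a such that a^halfLife == 0.5 (assumed conversion). */
    static double decayFactorFor(double halfLife) {
        if (halfLife <= 0.0) {
            throw new IllegalArgumentException("halfLife must be positive");
        }
        return Math.exp(Math.log(0.5) / halfLife);
    }

    /** Discount applied to old cluster weights when one batch arrives. */
    static double discount(double decayFactor, String timeUnit, long pointsInBatch) {
        switch (timeUnit) {
            case "batches":
                return decayFactor;                          // one step per batch
            case "points":
                return Math.pow(decayFactor, pointsInBatch); // one step per point
            default:
                throw new IllegalArgumentException("unknown time unit: " + timeUnit);
        }
    }

    public static void main(String[] args) {
        double a = decayFactorFor(5.0);
        System.out.println("decay factor for half-life 5: " + a);
        System.out.println("discount after 5 points: " + discount(a, "points", 5));
    }
}
```

Under these assumptions, a half-life of 5 batches means data seen 5 batches ago contributes half as much as data from the current batch.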


setInitialCenters

public StreamingKMeans setInitialCenters(Vector[] centers,
                                         double[] weights)
Specify initial centers directly.


setRandomCenters

public StreamingKMeans setRandomCenters(int dim,
                                        double weight,
                                        long seed)
Initialize random centers, requiring only the number of dimensions.

Parameters:
dim - Number of dimensions
weight - Weight for each center
seed - Random seed
Returns:
this StreamingKMeans instance, so that calls can be chained
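
The following plain-Java sketch shows one way such an initialization could work. It assumes each of the k centers is drawn coordinate by coordinate from a standard Gaussian using the given seed, and that every center receives the same weight. The class and method names are hypothetical, and the random generator is java.util.Random rather than whatever generator Spark uses, so the values will not match Spark's.

```java
import java.util.Arrays;
import java.util.Random;

final class RandomCentersSketch {
    /** Draws k centers of dimension dim from a standard Gaussian (assumed scheme). */
    static double[][] randomCenters(int k, int dim, long seed) {
        Random random = new Random(seed);
        double[][] centers = new double[k][dim];
        for (int c = 0; c < k; c++) {
            for (int i = 0; i < dim; i++) {
                centers[c][i] = random.nextGaussian();
            }
        }
        return centers;
    }

    /** Gives every center the same initial weight. */
    static double[] uniformWeights(int k, double weight) {
        double[] weights = new double[k];
        Arrays.fill(weights, weight);
        return weights;
    }

    public static void main(String[] args) {
        double[][] centers = randomCenters(3, 5, 42L);
        double[] weights = uniformWeights(3, 100.0);
        System.out.println(Arrays.deepToString(centers));
        System.out.println(Arrays.toString(weights));
    }
}
```

Because only the dimension is passed in, the number of centers has to come from the configured k, which is why the builder example calls setK before setRandomCenters. A larger initial weight makes the random centers slower to move when the first batches arrive.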

latestModel

public StreamingKMeansModel latestModel()
Return the latest model.


trainOn

public void trainOn(DStream<Vector> data)
Update the clustering model by training on batches of data from a DStream. This operation registers a DStream for training the model, checks whether the cluster centers have been initialized, and updates the model using each batch of data from the stream.

Parameters:
data - DStream containing vector data
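
The per-batch update can be sketched without Spark as follows. This is a simplified, single-machine illustration under stated assumptions: points are assigned to the nearest center by squared Euclidean distance, old weights are discounted once per batch (the "batches" time unit), and each center moves to the weighted average of its old position and the new points. The class and method names are hypothetical, and any handling of empty or dying clusters is omitted.

```java
final class TrainBatchSketch {
    /** Index of the center closest to the point (squared Euclidean distance). */
    static int nearest(double[][] centers, double[] point) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centers.length; c++) {
            double dist = 0.0;
            for (int i = 0; i < point.length; i++) {
                double d = centers[c][i] - point[i];
                dist += d * d;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    /** Updates centers and weights in place using one batch of points. */
    static void trainOnBatch(double[][] centers, double[] weights,
                             double[][] batch, double decayFactor) {
        int k = centers.length;
        int dim = centers[0].length;
        double[][] sums = new double[k][dim];
        long[] counts = new long[k];
        // Assign every point to its nearest center and accumulate per-cluster sums.
        for (double[] point : batch) {
            int c = nearest(centers, point);
            counts[c]++;
            for (int i = 0; i < dim; i++) {
                sums[c][i] += point[i];
            }
        }
        for (int c = 0; c < k; c++) {
            weights[c] *= decayFactor;          // forget part of the old data
            if (counts[c] == 0) {
                continue;                       // no new points: center stays put
            }
            double newWeight = weights[c] + counts[c];
            for (int i = 0; i < dim; i++) {
                centers[c][i] = (centers[c][i] * weights[c] + sums[c][i]) / newWeight;
            }
            weights[c] = newWeight;
        }
    }

    public static void main(String[] args) {
        double[][] centers = {{0.0, 0.0}, {10.0, 10.0}};
        double[] weights = {1.0, 1.0};
        double[][] batch = {{1.0, 1.0}, {3.0, 1.0}, {9.0, 11.0}};
        trainOnBatch(centers, weights, batch, 1.0);
        System.out.println(java.util.Arrays.deepToString(centers));
    }
}
```

In the real class this work happens for every batch that arrives on the registered DStream, so trainOn itself returns nothing; the updated state is read through latestModel().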

trainOn

public void trainOn(JavaDStream<Vector> data)
Java-friendly version of `trainOn`.


predictOn

public DStream<Object> predictOn(DStream<Vector> data)
Use the clustering model to make predictions on batches of data from a DStream.

Parameters:
data - DStream containing vector data
Returns:
DStream containing the predicted cluster index for each input vector
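
What a prediction means here can be illustrated with a small plain-Java sketch: each vector is mapped to the index of its nearest cluster center. Squared Euclidean distance is assumed, and the class and method names are hypothetical rather than part of the Spark API.

```java
final class PredictSketch {
    /** Returns the index of the center nearest to the point. */
    static int predict(double[][] centers, double[] point) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centers.length; c++) {
            double dist = 0.0;
            for (int i = 0; i < point.length; i++) {
                double d = centers[c][i] - point[i];
                dist += d * d;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    /** Predicts a cluster index for every point in a batch. */
    static int[] predictBatch(double[][] centers, double[][] batch) {
        int[] labels = new int[batch.length];
        for (int p = 0; p < batch.length; p++) {
            labels[p] = predict(centers, batch[p]);
        }
        return labels;
    }

    public static void main(String[] args) {
        double[][] centers = {{0.0, 0.0}, {10.0, 10.0}};
        double[][] batch = {{1.0, 2.0}, {8.0, 9.0}};
        System.out.println(java.util.Arrays.toString(predictBatch(centers, batch)));
    }
}
```

The element type appears as Object in the Java signature because the Scala stream carries primitive Int values; the Java-friendly overload exposes them as Integer.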

predictOn

public JavaDStream<Integer> predictOn(JavaDStream<Vector> data)
Java-friendly version of `predictOn`.


predictOnValues

public <K> DStream<scala.Tuple2<K,Object>> predictOnValues(DStream<scala.Tuple2<K,Vector>> data,
                                                           scala.reflect.ClassTag<K> evidence$1)
Use the model to make predictions on the values of a DStream and carry over its keys.

Parameters:
data - DStream containing (key, feature vector) pairs
evidence$1 - implicit ClassTag for the key type K, normally supplied by the Scala compiler
Returns:
DStream containing the input keys and the predictions as values
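
The key-preserving behavior can be sketched in plain Java as a map over (key, vector) pairs that replaces each vector with its predicted cluster index and leaves the key untouched. The names below are hypothetical, a List of Map.Entry stands in for the pair stream, and nearest-center assignment by squared Euclidean distance is assumed.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class PredictOnValuesSketch {
    /** Index of the center nearest to the point. */
    static int predict(double[][] centers, double[] point) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centers.length; c++) {
            double dist = 0.0;
            for (int i = 0; i < point.length; i++) {
                double d = centers[c][i] - point[i];
                dist += d * d;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    /** Keeps each key and replaces its vector with the predicted cluster index. */
    static <K> List<Map.Entry<K, Integer>> predictOnValues(
            double[][] centers, List<Map.Entry<K, double[]>> keyed) {
        List<Map.Entry<K, Integer>> out = new ArrayList<>();
        for (Map.Entry<K, double[]> e : keyed) {
            out.add(new SimpleEntry<K, Integer>(e.getKey(), predict(centers, e.getValue())));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, double[]>> keyed = new ArrayList<>();
        keyed.add(new SimpleEntry<String, double[]>("a", new double[] {1.0, 1.0}));
        keyed.add(new SimpleEntry<String, double[]>("b", new double[] {9.0, 9.0}));
        double[][] centers = {{0.0, 0.0}, {10.0, 10.0}};
        System.out.println(predictOnValues(centers, keyed));
    }
}
```

Carrying the key through is useful when each vector has an identifier, such as a record ID or a true label, that must be joined back to the prediction.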

predictOnValues

public <K> JavaPairDStream<K,Integer> predictOnValues(JavaPairDStream<K,Vector> data)
Java-friendly version of `predictOnValues`.