org.apache.spark.mllib.clustering.StreamingKMeans

All Implemented Interfaces:: Serializable, org.apache.spark.internal.Logging

public class StreamingKMeans extends Object implements org.apache.spark.internal.Logging, Serializable

StreamingKMeans provides methods for configuring a streaming k-means analysis, training the model on streaming, and using the model to make predictions on streaming data. See KMeansModel for details on algorithm and update rules.

Use a builder pattern to construct a streaming k-means analysis in an application, like:


  val model = new StreamingKMeans()
    .setDecayFactor(0.5)
    .setK(3)
    .setRandomCenters(5, 100.0)
    .trainOn(DStream)

See Also:

Serialized Form

Nested Class Summary

Nested classes/interfaces inherited from interface org.apache.spark.internal.Logging
org.apache.spark.internal.Logging.LogStringContext, org.apache.spark.internal.Logging.SparkShellLoggingFilter
Constructor Summary

Constructors

Constructor

Description

StreamingKMeans()

StreamingKMeans(int k, double decayFactor, String timeUnit)
Method Summary

Modifier and Type

Method

Description

static final String

BATCHES()

double

decayFactor()

int

k()

StreamingKMeansModel

latestModel()

Return the latest model.

static final String

POINTS()

JavaDStream<Integer>

predictOn(JavaDStream<Vector> data)

Java-friendly version of predictOn.

DStream<Object>

predictOn(DStream<Vector> data)

Use the clustering model to make predictions on batches of data from a DStream.

<K> JavaPairDStream<K,Integer>

predictOnValues(JavaPairDStream<K,Vector> data)

Java-friendly version of predictOnValues.

<K> DStream<scala.Tuple2<K,Object>>

predictOnValues(DStream<scala.Tuple2<K,Vector>> data, scala.reflect.ClassTag<K> evidence$1)

Use the model to make predictions on the values of a DStream and carry over its keys.

StreamingKMeans

setDecayFactor(double a)

Set the forgetfulness of the previous centroids.

StreamingKMeans

setHalfLife(double halfLife, String timeUnit)

Set the half life and time unit ("batches" or "points").

StreamingKMeans

setInitialCenters(Vector[] centers, double[] weights)

Specify initial centers directly.

StreamingKMeans

setK(int k)

Set the number of clusters.

StreamingKMeans

setRandomCenters(int dim, double weight, long seed)

Initialize random centers, requiring only the number of dimensions.

String

timeUnit()

void

trainOn(JavaDStream<Vector> data)

Java-friendly version of trainOn.

void

trainOn(DStream<Vector> data)

Update the clustering model by training on batches of data from a DStream.

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.apache.spark.internal.Logging
initializeForcefully, initializeLogIfNecessary, initializeLogIfNecessary, initializeLogIfNecessary$default$2, isTraceEnabled, log, logBasedOnLevel, logDebug, logDebug, logDebug, logDebug, logError, logError, logError, logError, logInfo, logInfo, logInfo, logInfo, logName, LogStringContext, logTrace, logTrace, logTrace, logTrace, logWarning, logWarning, logWarning, logWarning, MDC, org$apache$spark$internal$Logging$$log_, org$apache$spark$internal$Logging$$log__$eq, withLogContext

Constructor Details
- StreamingKMeans
  
  public StreamingKMeans(int k, double decayFactor, String timeUnit)
- StreamingKMeans
  
  public StreamingKMeans()
Method Details
- BATCHES
  
  public static final String BATCHES()
- POINTS
  
  public static final String POINTS()
- k
  
  public int k()
- decayFactor
  
  public double decayFactor()
- timeUnit
  
  public String timeUnit()
- setK
  
  public StreamingKMeans setK(int k)
  
  Set the number of clusters.
  
  Parameters:
  
  k - (undocumented)
  
  Returns:
  
  (undocumented)
- setDecayFactor
  
  public StreamingKMeans setDecayFactor(double a)
  
  Set the forgetfulness of the previous centroids.
  
  Parameters:
  
  a - (undocumented)
  
  Returns:
  
  (undocumented)
- setHalfLife
  
  public StreamingKMeans setHalfLife(double halfLife, String timeUnit)
  
  Set the half life and time unit ("batches" or "points"). If points, then the decay factor is raised to the power of number of new points and if batches, then decay factor will be used as is.
  
  Parameters:
  
  halfLife - (undocumented)
  
  timeUnit - (undocumented)
  
  Returns:
  
  (undocumented)
- setInitialCenters
  
  public StreamingKMeans setInitialCenters(Vector[] centers, double[] weights)
  
  Specify initial centers directly.
  
  Parameters:
  
  centers - (undocumented)
  
  weights - (undocumented)
  
  Returns:
  
  (undocumented)
- setRandomCenters
  
  public StreamingKMeans setRandomCenters(int dim, double weight, long seed)
  
  Initialize random centers, requiring only the number of dimensions.
  
  Parameters:
  
  dim - Number of dimensions
  
  weight - Weight for each center
  
  seed - Random seed
  
  Returns:
  
  (undocumented)
- latestModel
  
  public StreamingKMeansModel latestModel()
  
  Return the latest model.
  
  Returns:
  
  (undocumented)
- trainOn
  
  public void trainOn(DStream<Vector> data)
  
  Update the clustering model by training on batches of data from a DStream. This operation registers a DStream for training the model, checks whether the cluster centers have been initialized, and updates the model using each batch of data from the stream.
  
  Parameters:
  
  data - DStream containing vector data
- trainOn
  
  public void trainOn(JavaDStream<Vector> data)
  
  Java-friendly version of trainOn.
  
  Parameters:
  
  data - (undocumented)
- predictOn
  
  public DStream<Object> predictOn(DStream<Vector> data)
  
  Use the clustering model to make predictions on batches of data from a DStream.
  
  Parameters:
  
  data - DStream containing vector data
  
  Returns:
  
  DStream containing predictions
- predictOn
  
  public JavaDStream<Integer> predictOn(JavaDStream<Vector> data)
  
  Java-friendly version of predictOn.
  
  Parameters:
  
  data - (undocumented)
  
  Returns:
  
  (undocumented)
- predictOnValues
  
  public <K> DStream<scala.Tuple2<K,Object>> predictOnValues(DStream<scala.Tuple2<K,Vector>> data, scala.reflect.ClassTag<K> evidence$1)
  
  Use the model to make predictions on the values of a DStream and carry over its keys.
  
  Parameters:
  
  data - DStream containing (key, feature vector) pairs
  
  evidence$1 - (undocumented)
  
  Returns:
  
  DStream containing the input keys and the predictions as values
- predictOnValues
  
  public <K> JavaPairDStream<K,Integer> predictOnValues(JavaPairDStream<K,Vector> data)
  
  Java-friendly version of predictOnValues.
  
  Parameters:
  
  data - (undocumented)
  
  Returns:
  
  (undocumented)

Class StreamingKMeans

Nested Class Summary

Nested classes/interfaces inherited from interface org.apache.spark.internal.Logging

Constructor Summary

Method Summary

Methods inherited from class java.lang.Object

Methods inherited from interface org.apache.spark.internal.Logging

Constructor Details

StreamingKMeans

StreamingKMeans

Method Details

BATCHES

POINTS

k

decayFactor

timeUnit

setK

setDecayFactor

setHalfLife

setInitialCenters

setRandomCenters

latestModel

trainOn

trainOn

predictOn

predictOn

predictOnValues

predictOnValues