Class BisectingKMeans
Object
org.apache.spark.mllib.clustering.BisectingKMeans
- All Implemented Interfaces:
- org.apache.spark.internal.Logging
A bisecting k-means algorithm based on the paper "A comparison of document clustering techniques"
 by Steinbach, Karypis, and Kumar, with modification to fit Spark.
 The algorithm starts from a single cluster that contains all points.
 Iteratively it finds divisible clusters on the bottom level and bisects each of them using
 k-means, until there are k leaf clusters in total or no leaf clusters are divisible.
 The bisecting steps of clusters on the same level are grouped together to increase parallelism.
 If bisecting all divisible clusters on the bottom level would result in more than k leaf clusters,
 larger clusters get higher priority.
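The priority rule described above can be illustrated with a small stand-alone sketch. It is not Spark's implementation: the class `SplitPrioritySketch`, the method `chooseClustersToSplit`, and the simplified divisibility test (size at least `minSize` and more than one point) are assumptions made for illustration only. Each bisection adds one leaf, so at most `k` minus the current leaf count clusters can be split on a level, and the largest divisible clusters are taken first.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration; not part of the Spark API.
final class SplitPrioritySketch {

    /**
     * Returns the indices of the leaf clusters to bisect on the current level.
     *
     * @param leafSizes point count of each current leaf cluster
     * @param minSize   assumed minimum point count for a cluster to be divisible
     * @param k         desired number of leaf clusters
     */
    static List<Integer> chooseClustersToSplit(long[] leafSizes, long minSize, int k) {
        // Each bisection replaces one leaf with two, so it adds exactly one leaf.
        int budget = k - leafSizes.length;
        List<Integer> divisible = new ArrayList<>();
        if (budget <= 0) {
            return divisible; // already at (or above) k leaves
        }
        for (int i = 0; i < leafSizes.length; i++) {
            // Simplified divisibility test (an assumption of this sketch).
            if (leafSizes[i] >= minSize && leafSizes[i] > 1) {
                divisible.add(i);
            }
        }
        // Larger clusters get higher priority; ties keep their original order.
        divisible.sort(Comparator.comparingLong((Integer idx) -> leafSizes[idx]).reversed());
        return new ArrayList<>(divisible.subList(0, Math.min(budget, divisible.size())));
    }
}
```

For example, with leaf sizes `{50, 10, 30, 2}`, `minSize = 5` and `k = 6`, only two more leaves are allowed, so the sketch selects the clusters at indices 0 and 2.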
 param: k the desired number of leaf clusters (default: 4). The actual number could be smaller if there are no divisible leaf clusters.
 param: maxIterations the max number of k-means iterations to split clusters (default: 20)
 param: minDivisibleClusterSize the minimum number of points (if greater than or equal to 1.0) or the minimum proportion of points (if less than 1.0) of a divisible cluster (default: 1)
 param: seed a random seed (default: hash value of the class name)
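The dual meaning of minDivisibleClusterSize (absolute count versus proportion) can be made concrete with a short sketch. The class `MinDivisibleSizeSketch` and the method `resolveMinSize` are hypothetical names, and rounding fractional results up is an assumption of this sketch rather than something stated on this page.

```java
// Hypothetical helper; not part of the Spark API.
final class MinDivisibleSizeSketch {

    /**
     * Converts a minDivisibleClusterSize setting into an absolute point count.
     * Values >= 1.0 are read as a number of points; values < 1.0 are read as
     * a proportion of the total number of points.
     * Assumption: fractional results are rounded up.
     */
    static long resolveMinSize(double minDivisibleClusterSize, long totalPoints) {
        if (minDivisibleClusterSize >= 1.0) {
            return (long) Math.ceil(minDivisibleClusterSize);
        }
        return (long) Math.ceil(minDivisibleClusterSize * totalPoints);
    }
}
```

Under these assumptions, a setting of `25.0` means "at least 25 points" regardless of the dataset size, while `0.25` on a dataset of 1000 points resolves to 250 points.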
- 
Nested Class Summary
Nested classes/interfaces inherited from interface org.apache.spark.internal.Logging
org.apache.spark.internal.Logging.LogStringContext, org.apache.spark.internal.Logging.SparkShellLoggingFilter
- 
Constructor Summary
Constructors
- BisectingKMeans(): Constructs with the default configuration
- 
Method Summary
Each entry lists the modifier and type, the method, and its description.
- String getDistanceMeasure(): The distance suite used by the algorithm.
- int getK(): Gets the desired number of leaf clusters.
- int getMaxIterations(): Gets the max number of k-means iterations to split clusters.
- double getMinDivisibleClusterSize(): Gets the minimum number of points (if greater than or equal to 1.0) or the minimum proportion of points (if less than 1.0) of a divisible cluster.
- long getSeed(): Gets the random seed.
- BisectingKMeansModel run(JavaRDD<Vector> data): Java-friendly version of run().
- BisectingKMeansModel run(RDD<Vector> input): Runs the bisecting k-means algorithm.
- BisectingKMeans setDistanceMeasure(String distanceMeasure): Set the distance suite used by the algorithm.
- BisectingKMeans setK(int k): Sets the desired number of leaf clusters (default: 4).
- BisectingKMeans setMaxIterations(int maxIterations): Sets the max number of k-means iterations to split clusters (default: 20).
- BisectingKMeans setMinDivisibleClusterSize(double minDivisibleClusterSize): Sets the minimum number of points (if greater than or equal to 1.0) or the minimum proportion of points (if less than 1.0) of a divisible cluster (default: 1).
- BisectingKMeans setSeed(long seed): Sets the random seed (default: hash value of the class name).
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.apache.spark.internal.Logging
initializeForcefully, initializeLogIfNecessary, initializeLogIfNecessary, initializeLogIfNecessary$default$2, isTraceEnabled, log, logBasedOnLevel, logDebug, logDebug, logDebug, logDebug, logError, logError, logError, logError, logInfo, logInfo, logInfo, logInfo, logName, LogStringContext, logTrace, logTrace, logTrace, logTrace, logWarning, logWarning, logWarning, logWarning, MDC, org$apache$spark$internal$Logging$$log_, org$apache$spark$internal$Logging$$log__$eq, withLogContext
- 
Constructor Details
BisectingKMeans
public BisectingKMeans()
Constructs with the default configuration
 
- 
- 
Method Details
setK
public BisectingKMeans setK(int k)
Sets the desired number of leaf clusters (default: 4). The actual number could be smaller if there are no divisible leaf clusters.
- Parameters:
- k - (undocumented)
- Returns:
- (undocumented)
 
- 
getK
public int getK()
Gets the desired number of leaf clusters.
- Returns:
- (undocumented)
 
- 
setMaxIterations
public BisectingKMeans setMaxIterations(int maxIterations)
Sets the max number of k-means iterations to split clusters (default: 20).
- Parameters:
- maxIterations - (undocumented)
- Returns:
- (undocumented)
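
To show what a cap on the k-means iterations of a single split looks like, here is a minimal sketch of one bisecting step on one-dimensional points. It is not Spark's code: the class `BisectStepSketch`, the method `bisect`, the 1-D data, and the choice to start the two centers at the minimum and maximum values are all assumptions made to keep the example short and deterministic.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration; not Spark's implementation.
final class BisectStepSketch {

    /**
     * Splits non-empty 1-D points into two groups with at most maxIterations
     * rounds of 2-means. Returns a two-element list: the points assigned to
     * the lower center, then the points assigned to the upper center.
     */
    static List<List<Double>> bisect(List<Double> points, int maxIterations) {
        // Assumption: centers start at the extremes instead of a seeded random choice.
        double low = Collections.min(points);
        double high = Collections.max(points);
        List<Double> left = new ArrayList<>();
        List<Double> right = new ArrayList<>();
        for (int iter = 0; iter < maxIterations; iter++) {
            left = new ArrayList<>();
            right = new ArrayList<>();
            // Assignment step: each point goes to its nearest center.
            for (double p : points) {
                if (Math.abs(p - low) <= Math.abs(p - high)) {
                    left.add(p);
                } else {
                    right.add(p);
                }
            }
            if (right.isEmpty()) {
                break; // all points coincide, so the cluster cannot be split
            }
            // Update step: move each center to the mean of its points.
            double newLow = mean(left);
            double newHigh = mean(right);
            if (newLow == low && newHigh == high) {
                break; // converged before reaching the iteration cap
            }
            low = newLow;
            high = newHigh;
        }
        List<List<Double>> result = new ArrayList<>();
        result.add(left);
        result.add(right);
        return result;
    }

    private static double mean(List<Double> values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.size();
    }
}
```

In this sketch the loop stops early once the centers no longer move, so the cap only matters for splits that are slow to converge.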
 
- 
getMaxIterations
public int getMaxIterations()
Gets the max number of k-means iterations to split clusters.
- Returns:
- (undocumented)
 
- 
setMinDivisibleClusterSize
public BisectingKMeans setMinDivisibleClusterSize(double minDivisibleClusterSize)
Sets the minimum number of points (if greater than or equal to 1.0) or the minimum proportion of points (if less than 1.0) of a divisible cluster (default: 1).
- Parameters:
- minDivisibleClusterSize - (undocumented)
- Returns:
- (undocumented)
 
- 
getMinDivisibleClusterSize
public double getMinDivisibleClusterSize()
Gets the minimum number of points (if greater than or equal to 1.0) or the minimum proportion of points (if less than 1.0) of a divisible cluster.
- Returns:
- (undocumented)
 
- 
setSeed
public BisectingKMeans setSeed(long seed)
Sets the random seed (default: hash value of the class name).
- Parameters:
- seed - (undocumented)
- Returns:
- (undocumented)
 
- 
getSeed
public long getSeed()
Gets the random seed.
- Returns:
- (undocumented)
 
- 
getDistanceMeasure
public String getDistanceMeasure()
The distance suite used by the algorithm.
- Returns:
- (undocumented)
 
- 
setDistanceMeasure
public BisectingKMeans setDistanceMeasure(String distanceMeasure)
Set the distance suite used by the algorithm.
- Parameters:
- distanceMeasure - (undocumented)
- Returns:
- (undocumented)
 
- 
run
public BisectingKMeansModel run(RDD<Vector> input)
Runs the bisecting k-means algorithm.
- Parameters:
- input - RDD of vectors
- Returns:
- model for the bisecting k-means
 
- 
run
public BisectingKMeansModel run(JavaRDD<Vector> data)
Java-friendly version of run().
- Parameters:
- data - (undocumented)
- Returns:
- (undocumented)
 
 
-