Class BisectingKMeans

Object
org.apache.spark.mllib.clustering.BisectingKMeans
All Implemented Interfaces:
org.apache.spark.internal.Logging

public class BisectingKMeans extends Object implements org.apache.spark.internal.Logging
A bisecting k-means algorithm based on the paper "A comparison of document clustering techniques" by Steinbach, Karypis, and Kumar, with modifications to fit Spark. The algorithm starts from a single cluster that contains all points. Iteratively, it finds divisible clusters on the bottom level and bisects each of them using k-means, until there are k leaf clusters in total or no leaf clusters are divisible. The bisecting steps of clusters on the same level are grouped together to increase parallelism. If bisecting all divisible clusters on the bottom level would result in more than k leaf clusters, larger clusters get higher priority.
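The priority rule in the last sentence can be illustrated with a small in-memory sketch. This is not Spark's implementation: the class `BisectionPlanner`, the method `chooseClustersToBisect`, and the simplified divisibility test (at least `minSize` points and more than one point) are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of Spark: decides which leaf clusters to
// bisect on the current level, given only their sizes.
final class BisectionPlanner {

    /**
     * @param sizes   sizes[i] is the number of points in leaf cluster i
     * @param k       desired number of leaf clusters
     * @param minSize minimum number of points for a cluster to be divisible
     * @return indices of the clusters to bisect, largest cluster first
     */
    static List<Integer> chooseClustersToBisect(long[] sizes, int k, long minSize) {
        // Each bisection replaces one leaf with two, so it adds exactly one leaf.
        int leavesNeeded = k - sizes.length;
        if (leavesNeeded <= 0) {
            return new ArrayList<>();
        }
        List<Integer> divisible = new ArrayList<>();
        for (int i = 0; i < sizes.length; i++) {
            // Assumption: divisible means "large enough and more than one point".
            if (sizes[i] >= minSize && sizes[i] > 1) {
                divisible.add(i);
            }
        }
        // Larger clusters get higher priority when not all of them can be split.
        divisible.sort(Comparator.comparingLong((Integer i) -> sizes[i]).reversed());
        int count = Math.min(leavesNeeded, divisible.size());
        return new ArrayList<>(divisible.subList(0, count));
    }
}
```

With four leaves of sizes 50, 10, 30, and 1 and k = 6, only two more leaves are needed, so the sketch selects the clusters of size 50 and 30 and leaves the cluster of size 10 untouched on this level.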

Parameters:
k - the desired number of leaf clusters (default: 4). The actual number could be smaller if there are no divisible leaf clusters.
maxIterations - the maximum number of k-means iterations used to split clusters (default: 20)
minDivisibleClusterSize - the minimum number of points (if greater than or equal to 1.0) or the minimum proportion of points (if less than 1.0) of a divisible cluster (default: 1)
seed - a random seed (default: hash value of the class name)
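The two readings of minDivisibleClusterSize can be made concrete with a short sketch. The class `DivisibleSize` and the method `minPoints` are hypothetical names, and rounding up to a whole number of points is an assumption, not something this page states.

```java
// Hypothetical helper, not part of Spark: turns minDivisibleClusterSize into
// an absolute point count for a dataset of a given size.
final class DivisibleSize {

    static long minPoints(double minDivisibleClusterSize, long totalPoints) {
        if (minDivisibleClusterSize >= 1.0) {
            // Values of 1.0 or more are read as an absolute number of points.
            return (long) Math.ceil(minDivisibleClusterSize);
        }
        // Values below 1.0 are read as a proportion of all points.
        // Assumption: fractional results are rounded up.
        return (long) Math.ceil(minDivisibleClusterSize * totalPoints);
    }
}
```

Under these assumptions, a setting of 25.0 means a cluster needs at least 25 points to be divisible regardless of dataset size, while 0.25 on a dataset of 1000 points means it needs at least 250.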

  • Nested Class Summary

    Nested classes/interfaces inherited from interface org.apache.spark.internal.Logging

    org.apache.spark.internal.Logging.LogStringContext, org.apache.spark.internal.Logging.SparkShellLoggingFilter
  • Constructor Summary

    Constructors
    Constructor
    Description
    BisectingKMeans()
    Constructs an instance with the default configuration.
  • Method Summary

    Modifier and Type
    Method
    Description
    String
    getDistanceMeasure()
    Gets the distance measure used by the algorithm.
    int
    getK()
    Gets the desired number of leaf clusters.
    int
    getMaxIterations()
    Gets the max number of k-means iterations to split clusters.
    double
    getMinDivisibleClusterSize()
    Gets the minimum number of points (if greater than or equal to 1.0) or the minimum proportion of points (if less than 1.0) of a divisible cluster.
    long
    getSeed()
    Gets the random seed.
    BisectingKMeansModel
    run(JavaRDD<Vector> data)
    Java-friendly version of run().
    BisectingKMeansModel
    run(RDD<Vector> input)
    Runs the bisecting k-means algorithm.
    BisectingKMeans
    setDistanceMeasure(String distanceMeasure)
    Sets the distance measure used by the algorithm.
    BisectingKMeans
    setK(int k)
    Sets the desired number of leaf clusters (default: 4).
    BisectingKMeans
    setMaxIterations(int maxIterations)
    Sets the max number of k-means iterations to split clusters (default: 20).
    BisectingKMeans
    setMinDivisibleClusterSize(double minDivisibleClusterSize)
    Sets the minimum number of points (if greater than or equal to 1.0) or the minimum proportion of points (if less than 1.0) of a divisible cluster (default: 1).
    BisectingKMeans
    setSeed(long seed)
    Sets the random seed (default: hash value of the class name).

    Methods inherited from class java.lang.Object

    equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

    Methods inherited from interface org.apache.spark.internal.Logging

    initializeForcefully, initializeLogIfNecessary, initializeLogIfNecessary, initializeLogIfNecessary$default$2, isTraceEnabled, log, logDebug, logDebug, logDebug, logDebug, logError, logError, logError, logError, logInfo, logInfo, logInfo, logInfo, logName, LogStringContext, logTrace, logTrace, logTrace, logTrace, logWarning, logWarning, logWarning, logWarning, org$apache$spark$internal$Logging$$log_, org$apache$spark$internal$Logging$$log__$eq, withLogContext
  • Constructor Details

    • BisectingKMeans

      public BisectingKMeans()
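      Constructs an instance with the default configuration (k = 4, maxIterations = 20, minDivisibleClusterSize = 1, seed = hash value of the class name).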
  • Method Details

    • setK

      public BisectingKMeans setK(int k)
      Sets the desired number of leaf clusters (default: 4). The actual number could be smaller if there are no divisible leaf clusters.
      Parameters:
      k - the desired number of leaf clusters
      Returns:
      this BisectingKMeans instance, so that setter calls can be chained
    • getK

      public int getK()
      Gets the desired number of leaf clusters.
      Returns:
      the desired number of leaf clusters
    • setMaxIterations

      public BisectingKMeans setMaxIterations(int maxIterations)
      Sets the max number of k-means iterations to split clusters (default: 20).
      Parameters:
      maxIterations - the max number of k-means iterations used to split a cluster
      Returns:
      this BisectingKMeans instance, so that setter calls can be chained
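To show what an iteration cap means for a single split, here is a minimal 2-means sketch on one-dimensional points. It is an illustration only: the class `BisectStep`, the method `bisect`, and the deterministic initialization at the minimum and maximum are assumptions. Spark's own initialization depends on the seed and is not reproduced here.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Spark: splits 1-D points into two groups
// with at most maxIterations rounds of Lloyd's algorithm (k-means with k = 2).
final class BisectStep {

    /** Returns the two cluster centers, lower center first. */
    static double[] bisect(double[] points, int maxIterations) {
        // Assumption: start from the smallest and largest point.
        double lo = Arrays.stream(points).min().getAsDouble();
        double hi = Arrays.stream(points).max().getAsDouble();
        for (int iter = 0; iter < maxIterations; iter++) {
            double sumLo = 0.0;
            double sumHi = 0.0;
            int nLo = 0;
            int nHi = 0;
            // Assignment step: each point goes to its nearer center.
            for (double p : points) {
                if (Math.abs(p - lo) <= Math.abs(p - hi)) {
                    sumLo += p;
                    nLo++;
                } else {
                    sumHi += p;
                    nHi++;
                }
            }
            // Update step: move each center to the mean of its points.
            double newLo = nLo > 0 ? sumLo / nLo : lo;
            double newHi = nHi > 0 ? sumHi / nHi : hi;
            if (newLo == lo && newHi == hi) {
                break; // converged before reaching the cap
            }
            lo = newLo;
            hi = newHi;
        }
        return new double[] {lo, hi};
    }
}
```

A larger cap only matters when the centers are still moving; once a round leaves both centers unchanged, the loop stops early.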
    • getMaxIterations

      public int getMaxIterations()
      Gets the max number of k-means iterations to split clusters.
      Returns:
      the max number of k-means iterations used to split a cluster
    • setMinDivisibleClusterSize

      public BisectingKMeans setMinDivisibleClusterSize(double minDivisibleClusterSize)
      Sets the minimum number of points (if greater than or equal to 1.0) or the minimum proportion of points (if less than 1.0) of a divisible cluster (default: 1).
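      Parameters:
      minDivisibleClusterSize - the minimum number of points (if greater than or equal to 1.0) or the minimum proportion of points (if less than 1.0) of a divisible cluster
      Returns:
      this BisectingKMeans instance, so that setter calls can be chained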
    • getMinDivisibleClusterSize

      public double getMinDivisibleClusterSize()
      Gets the minimum number of points (if greater than or equal to 1.0) or the minimum proportion of points (if less than 1.0) of a divisible cluster.
      Returns:
      the minimum number of points or the minimum proportion of points of a divisible cluster
    • setSeed

      public BisectingKMeans setSeed(long seed)
      Sets the random seed (default: hash value of the class name).
      Parameters:
      seed - the random seed
      Returns:
      this BisectingKMeans instance, so that setter calls can be chained
    • getSeed

      public long getSeed()
      Gets the random seed.
      Returns:
      the random seed
    • getDistanceMeasure

      public String getDistanceMeasure()
      Gets the distance measure used by the algorithm.
      Returns:
      the name of the distance measure
    • setDistanceMeasure

      public BisectingKMeans setDistanceMeasure(String distanceMeasure)
      Sets the distance measure used by the algorithm.
      Parameters:
      distanceMeasure - the name of the distance measure to use
      Returns:
      this BisectingKMeans instance, so that setter calls can be chained
    • run

      public BisectingKMeansModel run(RDD<Vector> input)
      Runs the bisecting k-means algorithm.
      Parameters:
      input - RDD of vectors
      Returns:
      the model produced by bisecting k-means
    • run

      public BisectingKMeansModel run(JavaRDD<Vector> data)
      Java-friendly version of run().
      Parameters:
      data - JavaRDD of vectors
      Returns:
      the model produced by bisecting k-means
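A minimal usage sketch of the Java-friendly API follows. It assumes Spark MLlib is on the classpath and uses a local master; the class name `BisectingKMeansUsage`, the sample points, and the chosen parameter values are illustrative only. `clusterCenters()` is a method of `BisectingKMeansModel`, which is documented separately from this page.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.clustering.BisectingKMeans;
import org.apache.spark.mllib.clustering.BisectingKMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

// Illustrative driver program; requires the Spark MLlib dependency.
final class BisectingKMeansUsage {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setAppName("BisectingKMeansUsage")
            .setMaster("local[2]"); // assumption: run locally with two threads

        try (JavaSparkContext jsc = new JavaSparkContext(conf)) {
            // Made-up sample data: two well-separated groups in 2-D.
            List<Vector> points = Arrays.asList(
                Vectors.dense(0.1, 0.1),
                Vectors.dense(0.3, 0.3),
                Vectors.dense(10.1, 10.1),
                Vectors.dense(10.3, 10.3));
            JavaRDD<Vector> data = jsc.parallelize(points);

            // Setters return the same instance, so the calls can be chained.
            BisectingKMeans bkm = new BisectingKMeans()
                .setK(2)
                .setMaxIterations(20)
                .setMinDivisibleClusterSize(1.0)
                .setSeed(42L);

            // Java-friendly overload of run() that accepts a JavaRDD.
            BisectingKMeansModel model = bkm.run(data);

            for (Vector center : model.clusterCenters()) {
                System.out.println("Cluster center: " + center);
            }
        }
    }
}
```

Fixing the seed is optional; it is set here only so that repeated runs of the sketch use the same random initialization.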