org.apache.spark.ml.clustering.BisectingKMeans

All Implemented Interfaces: Serializable, org.apache.spark.internal.Logging, BisectingKMeansParams, Params, HasDistanceMeasure, HasFeaturesCol, HasMaxIter, HasPredictionCol, HasSeed, HasWeightCol, DefaultParamsWritable, Identifiable, MLWritable

public class BisectingKMeans extends Estimator<BisectingKMeansModel> implements BisectingKMeansParams, DefaultParamsWritable

A bisecting k-means algorithm based on the paper "A comparison of document clustering techniques" by Steinbach, Karypis, and Kumar, with modifications to fit Spark. The algorithm starts from a single cluster that contains all points. It then iteratively finds the divisible clusters on the bottom level and bisects each of them using k-means, until there are k leaf clusters in total or no leaf clusters are divisible. The bisecting steps of clusters on the same level are grouped together to increase parallelism. If bisecting all divisible clusters on the bottom level would result in more than k leaf clusters, larger clusters get higher priority.
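
The level-wise procedure described above can be sketched in plain Java on one-dimensional points. This is an illustrative toy, not Spark's implementation: the class `BisectingSketch`, its helper names, the min/max seeding of the two centers, and the definition of "divisible" (large enough and containing at least two distinct values) are all assumptions made for this example. It expects a non-empty input list.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

final class BisectingSketch {

    // Assumption: a leaf is divisible if it is large enough and holds two distinct values.
    static boolean isDivisible(List<Double> leaf, int minSize) {
        return leaf.size() >= minSize && Collections.min(leaf) < Collections.max(leaf);
    }

    static double mean(List<Double> values) {
        double sum = 0.0;
        for (double v : values) sum += v;
        return sum / values.size();
    }

    // Splits one divisible cluster in two with a few rounds of 1-D 2-means.
    static List<List<Double>> bisect(List<Double> points, int maxIter) {
        // Assumption: seed the two centers with the smallest and largest value.
        double c0 = Collections.min(points);
        double c1 = Collections.max(points);
        List<Double> left = new ArrayList<>();
        List<Double> right = new ArrayList<>();
        for (int iter = 0; iter < Math.max(1, maxIter); iter++) {
            left.clear();
            right.clear();
            // Assignment step: each point joins the nearer center.
            for (double p : points) {
                if (Math.abs(p - c0) <= Math.abs(p - c1)) left.add(p); else right.add(p);
            }
            // Update step: move each center to the mean of its points.
            c0 = mean(left);
            c1 = mean(right);
        }
        return List.of(left, right);
    }

    // Grows the set of leaf clusters level by level until k leaves exist
    // or no leaf is divisible.
    static List<List<Double>> cluster(List<Double> points, int k, int minSize, int maxIter) {
        List<List<Double>> leaves = new ArrayList<>();
        leaves.add(new ArrayList<>(points));
        while (leaves.size() < k) {
            List<List<Double>> divisible = new ArrayList<>();
            for (List<Double> leaf : leaves) {
                if (isDivisible(leaf, minSize)) divisible.add(leaf);
            }
            if (divisible.isEmpty()) break;
            // Larger clusters get priority when not every split fits under k.
            divisible.sort(Comparator.comparingInt((List<Double> c) -> c.size()).reversed());
            // Each split adds exactly one leaf.
            int splits = Math.min(divisible.size(), k - leaves.size());
            for (List<Double> leaf : divisible.subList(0, splits)) {
                leaves.remove(leaf);
                leaves.addAll(bisect(leaf, maxIter));
            }
        }
        return leaves;
    }
}
```

Calling `BisectingSketch.cluster(points, 3, 1, 10)` on nine values grouped around 1, 5, and 9 first splits the data in two, then splits only the larger of the two halves, because a second split on that level would exceed k = 3. With all-identical input the result is a single leaf, which mirrors the note that the actual number of clusters can be smaller than k.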

Nested Class Summary

Nested classes/interfaces inherited from interface org.apache.spark.internal.Logging
org.apache.spark.internal.Logging.LogStringContext, org.apache.spark.internal.Logging.SparkShellLoggingFilter
Constructor Summary

Constructors

| Constructor | Description |
| --- | --- |
| BisectingKMeans() | |
| BisectingKMeans(String uid) | |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| BisectingKMeans | copy(ParamMap extra) | Creates a copy of this instance with the same UID and some extra params. |
| final Param<String> | distanceMeasure() | Param for the distance measure. |
| final Param<String> | featuresCol() | Param for features column name. |
| BisectingKMeansModel | fit(Dataset<?> dataset) | Fits a model to the input data. |
| final IntParam | k() | The desired number of leaf clusters. |
| static BisectingKMeans | load(String path) | |
| final IntParam | maxIter() | Param for maximum number of iterations (>= 0). |
| final DoubleParam | minDivisibleClusterSize() | The minimum number of points (if greater than or equal to 1.0) or the minimum proportion of points (if less than 1.0) of a divisible cluster (default: 1.0). |
| final Param<String> | predictionCol() | Param for prediction column name. |
| static MLReader<T> | read() | |
| final LongParam | seed() | Param for random seed. |
| BisectingKMeans | setDistanceMeasure(String value) | |
| BisectingKMeans | setFeaturesCol(String value) | |
| BisectingKMeans | setK(int value) | |
| BisectingKMeans | setMaxIter(int value) | |
| BisectingKMeans | setMinDivisibleClusterSize(double value) | |
| BisectingKMeans | setPredictionCol(String value) | |
| BisectingKMeans | setSeed(long value) | |
| BisectingKMeans | setWeightCol(String value) | Sets the value of param weightCol(). |
| StructType | transformSchema(StructType schema) | Check transform validity and derive the output schema from the input schema. |
| String | uid() | An immutable unique ID for the object and its derivatives. |
| final Param<String> | weightCol() | Param for weight column name. |

Methods inherited from class org.apache.spark.ml.Estimator
fit, fit, fit, fit

Methods inherited from class org.apache.spark.ml.PipelineStage
params

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.apache.spark.ml.clustering.BisectingKMeansParams
getK, getMinDivisibleClusterSize, validateAndTransformSchema

Methods inherited from interface org.apache.spark.ml.util.DefaultParamsWritable
write

Methods inherited from interface org.apache.spark.ml.param.shared.HasDistanceMeasure
getDistanceMeasure

Methods inherited from interface org.apache.spark.ml.param.shared.HasFeaturesCol
getFeaturesCol

Methods inherited from interface org.apache.spark.ml.param.shared.HasMaxIter
getMaxIter

Methods inherited from interface org.apache.spark.ml.param.shared.HasPredictionCol
getPredictionCol

Methods inherited from interface org.apache.spark.ml.param.shared.HasSeed
getSeed

Methods inherited from interface org.apache.spark.ml.param.shared.HasWeightCol
getWeightCol

Methods inherited from interface org.apache.spark.ml.util.Identifiable
toString

Methods inherited from interface org.apache.spark.internal.Logging
initializeForcefully, initializeLogIfNecessary, initializeLogIfNecessary, initializeLogIfNecessary$default$2, isTraceEnabled, log, logBasedOnLevel, logDebug, logDebug, logDebug, logDebug, logError, logError, logError, logError, logInfo, logInfo, logInfo, logInfo, logName, LogStringContext, logTrace, logTrace, logTrace, logTrace, logWarning, logWarning, logWarning, logWarning, MDC, org$apache$spark$internal$Logging$$log_, org$apache$spark$internal$Logging$$log__$eq, withLogContext

Methods inherited from interface org.apache.spark.ml.util.MLWritable
save

Methods inherited from interface org.apache.spark.ml.param.Params
clear, copyValues, defaultCopy, defaultParamMap, estimateMatadataSize, explainParam, explainParams, extractParamMap, extractParamMap, get, getDefault, getOrDefault, getParam, hasDefault, hasParam, isDefined, isSet, onParamChange, paramMap, params, set, set, set, setDefault, setDefault, shouldOwn

Constructor Details
- BisectingKMeans
  
  public BisectingKMeans(String uid)
- BisectingKMeans
  
  public BisectingKMeans()
Method Details
- load
  
  public static BisectingKMeans load(String path)
- read
  
  public static MLReader<T> read()
- k
  
  public final IntParam k()
  
  Description copied from interface: BisectingKMeansParams
  
  The desired number of leaf clusters. Must be > 1. Default: 4. The actual number could be smaller if there are no divisible leaf clusters.
  
  Specified by:
  
  k in interface BisectingKMeansParams
  
  Returns:
  
  (undocumented)
- minDivisibleClusterSize
  
  public final DoubleParam minDivisibleClusterSize()
  
  Description copied from interface: BisectingKMeansParams
  
  The minimum number of points (if greater than or equal to 1.0) or the minimum proportion of points (if less than 1.0) of a divisible cluster (default: 1.0).
  
  Specified by:
  
  minDivisibleClusterSize in interface BisectingKMeansParams
  
  Returns:
  
  (undocumented)
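  
  The two readings of minDivisibleClusterSize described above can be made concrete with a small helper. This is a sketch under stated assumptions, not Spark's code: the class `MinDivisibleSize` and the method `threshold` are hypothetical, the proportion is read as relative to the total number of points `n`, and rounding up to a whole point count is an assumption of this example.
  
  ```java
  final class MinDivisibleSize {

      /**
       * Hypothetical helper: converts the parameter value into the smallest
       * number of points a cluster needs in order to count as divisible.
       */
      static long threshold(double minDivisibleClusterSize, long n) {
          if (minDivisibleClusterSize >= 1.0) {
              // Values >= 1.0 are read as an absolute point count.
              return (long) Math.ceil(minDivisibleClusterSize);
          }
          // Values < 1.0 are read as a proportion of all n points (assumption).
          return (long) Math.ceil(minDivisibleClusterSize * n);
      }
  }
  ```
  
  Under these assumptions, a value of 50.0 means a cluster needs at least 50 points to be split regardless of dataset size, while 0.25 on a dataset of 1000 points means at least 250 points.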
- weightCol
  
  public final Param<String> weightCol()
  
  Description copied from interface: HasWeightCol
  
  Param for weight column name. If this is not set or empty, we treat all instance weights as 1.0.
  
  Specified by:
  
  weightCol in interface HasWeightCol
  
  Returns:
  
  (undocumented)
- distanceMeasure
  
  public final Param<String> distanceMeasure()
  
  Description copied from interface: HasDistanceMeasure
  
  Param for the distance measure. Supported options: 'euclidean' and 'cosine'.
  
  Specified by:
  
  distanceMeasure in interface HasDistanceMeasure
  
  Returns:
  
  (undocumented)
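  
  To show how the two supported options differ, the following plain-Java sketch computes both distances for dense vectors. It is not taken from Spark's source: the class `DistanceSketch` is hypothetical, cosine distance is taken here as 1 minus cosine similarity (a common definition), and both vectors are assumed to be non-zero and of equal length.
  
  ```java
  final class DistanceSketch {

      // Straight-line distance between two points.
      static double euclidean(double[] a, double[] b) {
          double sum = 0.0;
          for (int i = 0; i < a.length; i++) {
              double d = a[i] - b[i];
              sum += d * d;
          }
          return Math.sqrt(sum);
      }

      // 1 - cos(angle between a and b); depends on direction only, not on length.
      static double cosine(double[] a, double[] b) {
          double dot = 0.0, normA = 0.0, normB = 0.0;
          for (int i = 0; i < a.length; i++) {
              dot += a[i] * b[i];
              normA += a[i] * a[i];
              normB += b[i] * b[i];
          }
          return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
      }
  }
  ```
  
  The vectors (1, 2) and (2, 4) point in the same direction, so their cosine distance is approximately zero even though their Euclidean distance is not. This is why the choice of measure can change which points end up in the same cluster.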
- predictionCol
  
  public final Param<String> predictionCol()
  
  Description copied from interface: HasPredictionCol
  
  Param for prediction column name.
  
  Specified by:
  
  predictionCol in interface HasPredictionCol
  
  Returns:
  
  (undocumented)
- seed
  
  public final LongParam seed()
  
  Description copied from interface: HasSeed
  
  Param for random seed.
  
  Specified by:
  
  seed in interface HasSeed
  
  Returns:
  
  (undocumented)
- featuresCol
  
  public final Param<String> featuresCol()
  
  Description copied from interface: HasFeaturesCol
  
  Param for features column name.
  
  Specified by:
  
  featuresCol in interface HasFeaturesCol
  
  Returns:
  
  (undocumented)
- maxIter
  
  public final IntParam maxIter()
  
  Description copied from interface: HasMaxIter
  
  Param for maximum number of iterations (>= 0).
  
  Specified by:
  
  maxIter in interface HasMaxIter
  
  Returns:
  
  (undocumented)
- uid
  
  public String uid()
  
  Description copied from interface: Identifiable
  
  An immutable unique ID for the object and its derivatives.
  
  Specified by:
  
  uid in interface Identifiable
  
  Returns:
  
  (undocumented)
- copy
  
  public BisectingKMeans copy(ParamMap extra)
  
  Description copied from interface: Params
  
  Creates a copy of this instance with the same UID and some extra params. Subclasses should implement this method and set the return type properly. See defaultCopy().
  
  Specified by:
  
  copy in interface Params
  
  Specified by:
  
  copy in class Estimator<BisectingKMeansModel>
  
  Parameters:
  
  extra - (undocumented)
  
  Returns:
  
  (undocumented)
- setFeaturesCol
  
  public BisectingKMeans setFeaturesCol(String value)
- setPredictionCol
  
  public BisectingKMeans setPredictionCol(String value)
- setK
  
  public BisectingKMeans setK(int value)
- setMaxIter
  
  public BisectingKMeans setMaxIter(int value)
- setSeed
  
  public BisectingKMeans setSeed(long value)
- setMinDivisibleClusterSize
  
  public BisectingKMeans setMinDivisibleClusterSize(double value)
- setDistanceMeasure
  
  public BisectingKMeans setDistanceMeasure(String value)
- setWeightCol
  
  public BisectingKMeans setWeightCol(String value)
  
  Sets the value of param weightCol(). If this is not set or empty, we treat all instance weights as 1.0. Default is not set, so all instances have weight one.
  
  Parameters:
  
  value - (undocumented)
  
  Returns:
  
  (undocumented)
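  
  The effect of instance weights, and of the default weight of 1.0, can be illustrated with a weighted cluster center. This page does not describe how Spark applies the weights internally, so the sketch below only shows the usual meaning of instance weights in a centroid computation. The class `WeightedCenterSketch` is hypothetical, and passing `null` for the weights stands in for an unset weight column.
  
  ```java
  final class WeightedCenterSketch {

      /** Weighted mean of the points; null weights mean every point counts as 1.0. */
      static double[] center(double[][] points, double[] weights) {
          int dim = points[0].length;
          double[] sum = new double[dim];
          double total = 0.0;
          for (int i = 0; i < points.length; i++) {
              double w = (weights == null) ? 1.0 : weights[i];
              for (int j = 0; j < dim; j++) {
                  sum[j] += w * points[i][j];
              }
              total += w;
          }
          for (int j = 0; j < dim; j++) {
              sum[j] /= total;
          }
          return sum;
      }
  }
  ```
  
  For the points (0, 0) and (4, 0), unit weights give the center (2, 0), while weights 1 and 3 pull the center to (3, 0).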
- fit
  
  public BisectingKMeansModel fit(Dataset<?> dataset)
  
  Description copied from class: Estimator
  
  Fits a model to the input data.
  
  Specified by:
  
  fit in class Estimator<BisectingKMeansModel>
  
  Parameters:
  
  dataset - (undocumented)
  
  Returns:
  
  (undocumented)
- transformSchema
  
  public StructType transformSchema(StructType schema)
  
  Description copied from class: PipelineStage
  
  Check transform validity and derive the output schema from the input schema.
  Validity of interactions between parameters is checked during transformSchema, and an exception is raised if any parameter value is invalid. Parameter value checks that do not depend on other parameters are handled by Param.validate().
  A typical implementation should first verify the schema change and parameter validity, including complex parameter interaction checks.
  
  Specified by:
  
  transformSchema in class PipelineStage
  
  Parameters:
  
  schema - (undocumented)
  
  Returns:
  
  (undocumented)
