clustering

package clustering

Ordering

Alphabetic

Visibility

Public
Protected

Type Members

class BisectingKMeans extends Logging
A bisecting k-means algorithm based on the paper "A comparison of document clustering techniques" by Steinbach, Karypis, and Kumar, with modification to fit Spark.
A bisecting k-means algorithm based on the paper "A comparison of document clustering techniques" by Steinbach, Karypis, and Kumar, with modification to fit Spark. The algorithm starts from a single cluster that contains all points. Iteratively it finds divisible clusters on the bottom level and bisects each of them using k-means, until there are k leaf clusters in total or no leaf clusters are divisible. The bisecting steps of clusters on the same level are grouped together to increase parallelism. If bisecting all divisible clusters on the bottom level would result more than k leaf clusters, larger clusters get higher priority.
Annotations
@Since("1.6.0")
See also
Steinbach, Karypis, and Kumar, A comparison of document clustering techniques, KDD Workshop on Text Mining, 2000.
class BisectingKMeansModel extends Serializable with Saveable with Logging
Clustering model produced by BisectingKMeans.
Clustering model produced by BisectingKMeans. The prediction is done level-by-level from the root node to a leaf node, and at each node among its children the closest to the input point is selected.
Annotations
@Since("1.6.0")
class DistributedLDAModel extends LDAModel
Distributed LDA model.
Distributed LDA model. This model stores the inferred topics, the full training dataset, and the topic distributions.
Annotations
@Since("1.3.0")
final class EMLDAOptimizer extends LDAOptimizer
Optimizer for EM algorithm which stores data + parameter graph, plus algorithm parameters.
Optimizer for EM algorithm which stores data + parameter graph, plus algorithm parameters.
Currently, the underlying implementation uses Expectation-Maximization (EM), implemented according to the Asuncion et al. (2009) paper referenced below.
References:
- Original LDA paper (journal version): Blei, Ng, and Jordan. "Latent Dirichlet Allocation." JMLR, 2003.
  - This class implements their "smoothed" LDA model.
- Paper which clearly explains several algorithms, including EM: Asuncion, Welling, Smyth, and Teh. "On Smoothing and Inference for Topic Models." UAI, 2009.
Annotations
@Since("1.4.0")
class GaussianMixture extends Serializable
This class performs expectation maximization for multivariate Gaussian Mixture Models (GMMs).
This class performs expectation maximization for multivariate Gaussian Mixture Models (GMMs). A GMM represents a composite distribution of independent Gaussian distributions with associated "mixing" weights specifying each's contribution to the composite.
Given a set of sample points, this class will maximize the log-likelihood for a mixture of k Gaussians, iterating until the log-likelihood changes by less than convergenceTol, or until it has reached the max number of iterations. While this process is generally guaranteed to converge, it is not guaranteed to find a global optimum.
Annotations
@Since("1.3.0")
Note
This algorithm is limited in its number of features since it requires storing a covariance matrix which has size quadratic in the number of features. Even when the number of features does not exceed this limit, this algorithm may perform poorly on high-dimensional data. This is due to high-dimensional data (a) making it difficult to cluster at all (based on statistical/theoretical arguments) and (b) numerical issues with Gaussian distributions.
class GaussianMixtureModel extends Serializable with Saveable
Multivariate Gaussian Mixture Model (GMM) consisting of k Gaussians, where points are drawn from each Gaussian i=1..k with probability w(i); mu(i) and sigma(i) are the respective mean and covariance for each Gaussian distribution i=1..k.
Multivariate Gaussian Mixture Model (GMM) consisting of k Gaussians, where points are drawn from each Gaussian i=1..k with probability w(i); mu(i) and sigma(i) are the respective mean and covariance for each Gaussian distribution i=1..k.
Annotations
@Since("1.3.0")
class KMeans extends Serializable with Logging
K-means clustering with a k-means++ like initialization mode (the k-means|| algorithm by Bahmani et al).
K-means clustering with a k-means++ like initialization mode (the k-means|| algorithm by Bahmani et al).
This is an iterative algorithm that will make multiple passes over the data, so any RDDs given to it should be cached by the user.
Annotations
@Since("0.8.0")
class KMeansModel extends Saveable with Serializable with PMMLExportable
A clustering model for K-means.
A clustering model for K-means. Each point belongs to the cluster with the closest center.
Annotations
@Since("0.8.0")
class LDA extends Logging
Latent Dirichlet Allocation (LDA), a topic model designed for text documents.
Latent Dirichlet Allocation (LDA), a topic model designed for text documents.
Terminology:
- "word" = "term": an element of the vocabulary
- "token": instance of a term appearing in a document
- "topic": multinomial distribution over words representing some concept
References:
- Original LDA paper (journal version): Blei, Ng, and Jordan. "Latent Dirichlet Allocation." JMLR, 2003.
Annotations
@Since("1.3.0")
See also
Latent Dirichlet allocation (Wikipedia)
abstract class LDAModel extends Saveable
Latent Dirichlet Allocation (LDA) model.
Latent Dirichlet Allocation (LDA) model.
This abstraction permits for different underlying representations, including local and distributed data structures.
Annotations
@Since("1.3.0")
trait LDAOptimizer extends AnyRef
An LDAOptimizer specifies which optimization/learning/inference algorithm to use, and it can hold optimizer-specific parameters for users to set.
An LDAOptimizer specifies which optimization/learning/inference algorithm to use, and it can hold optimizer-specific parameters for users to set.
Annotations
@Since("1.4.0")
class LocalLDAModel extends LDAModel with Serializable
Local LDA model.
Local LDA model. This model stores only the inferred topics.
Annotations
@Since("1.3.0")
final class OnlineLDAOptimizer extends LDAOptimizer with Logging
An online optimizer for LDA.
An online optimizer for LDA. The Optimizer implements the Online variational Bayes LDA algorithm, which processes a subset of the corpus on each iteration, and updates the term-topic distribution adaptively.
Original Online LDA paper: Hoffman, Blei and Bach, "Online Learning for Latent Dirichlet Allocation." NIPS, 2010.
Annotations
@Since("1.4.0")
class PowerIterationClustering extends Serializable
Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by Lin and Cohen.
Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by Lin and Cohen. From the abstract: PIC finds a very low-dimensional embedding of a dataset using truncated power iteration on a normalized pair-wise similarity matrix of the data.
Annotations
@Since("1.3.0")
See also
Spectral clustering (Wikipedia)
class PowerIterationClusteringModel extends Saveable with Serializable
Model produced by PowerIterationClustering.
Model produced by PowerIterationClustering.
Annotations
@Since("1.3.0")
class StreamingKMeans extends Logging with Serializable
StreamingKMeans provides methods for configuring a streaming k-means analysis, training the model on streaming, and using the model to make predictions on streaming data.
StreamingKMeans provides methods for configuring a streaming k-means analysis, training the model on streaming, and using the model to make predictions on streaming data. See KMeansModel for details on algorithm and update rules.
Use a builder pattern to construct a streaming k-means analysis in an application, like:
```
val model = new StreamingKMeans()
  .setDecayFactor(0.5)
  .setK(3)
  .setRandomCenters(5, 100.0)
  .trainOn(DStream)
```
Annotations
@Since("1.2.0")
class StreamingKMeansModel extends KMeansModel with Logging
StreamingKMeansModel extends MLlib's KMeansModel for streaming algorithms, so it can keep track of a continuously updated weight associated with each cluster, and also update the model by doing a single iteration of the standard k-means algorithm.
StreamingKMeansModel extends MLlib's KMeansModel for streaming algorithms, so it can keep track of a continuously updated weight associated with each cluster, and also update the model by doing a single iteration of the standard k-means algorithm.
The update algorithm uses the "mini-batch" KMeans rule, generalized to incorporate forgetfulness (i.e. decay). The update rule (for each cluster) is:
$$ \begin{align} c_{t+1} &= [(c_t * n_t * a) + (x_t * m_t)] / [n_t + m_t] \\ n_{t+1} &= n_t * a + m_t \end{align} $$
Where c_t is the previously estimated centroid for that cluster, n_t is the number of points assigned to it thus far, x_t is the centroid estimated on the current batch, and m_t is the number of points assigned to that centroid in the current batch.
The decay factor 'a' scales the contribution of the clusters as estimated thus far, by applying a as a discount weighting on the current point when evaluating new incoming data. If a=1, all batches are weighted equally. If a=0, new centroids are determined entirely by recent data. Lower values correspond to more forgetting.
Decay can optionally be specified by a half life and associated time unit. The time unit can either be a batch of data or a single data point. Considering data arrived at time t, the half life h is defined such that at time t + h the discount applied to the data from t is 0.5. The definition remains the same whether the time unit is given as batches or points.
Annotations
@Since("1.2.0")

Value Members

object BisectingKMeansModel extends Loader[BisectingKMeansModel] with Serializable
Annotations
@Since("2.0.0")
object DistanceMeasure extends Serializable
Annotations
@Since("2.4.0")
object DistributedLDAModel extends Loader[DistributedLDAModel]
Distributed model fitted by LDA.
Distributed model fitted by LDA. This type of model is currently only produced by Expectation-Maximization (EM).
This model stores the inferred topics, the full training dataset, and the topic distribution for each training document.
Annotations
@Since("1.5.0")
object GaussianMixtureModel extends Loader[GaussianMixtureModel] with Serializable
Annotations
@Since("1.4.0")
object KMeans extends Serializable
Top-level methods for calling K-means clustering.
Top-level methods for calling K-means clustering.
Annotations
@Since("0.8.0")
object KMeansModel extends Loader[KMeansModel] with Serializable
Annotations
@Since("1.4.0")
object LocalLDAModel extends Loader[LocalLDAModel] with Serializable
Local (non-distributed) model fitted by LDA.
Local (non-distributed) model fitted by LDA.
This model stores the inferred topics only; it does not store info about the training dataset.
Annotations
@Since("1.5.0")
object PowerIterationClustering extends Logging with Serializable
Annotations
@Since("1.3.0")
object PowerIterationClusteringModel extends Loader[PowerIterationClusteringModel] with Serializable
Annotations
@Since("1.4.0")

Ungrouped

class BisectingKMeans extends Logging
A bisecting k-means algorithm based on the paper "A comparison of document clustering techniques" by Steinbach, Karypis, and Kumar, with modification to fit Spark.
A bisecting k-means algorithm based on the paper "A comparison of document clustering techniques" by Steinbach, Karypis, and Kumar, with modification to fit Spark. The algorithm starts from a single cluster that contains all points. Iteratively it finds divisible clusters on the bottom level and bisects each of them using k-means, until there are k leaf clusters in total or no leaf clusters are divisible. The bisecting steps of clusters on the same level are grouped together to increase parallelism. If bisecting all divisible clusters on the bottom level would result more than k leaf clusters, larger clusters get higher priority.
Annotations
@Since("1.6.0")
See also
Steinbach, Karypis, and Kumar, A comparison of document clustering techniques, KDD Workshop on Text Mining, 2000.
class BisectingKMeansModel extends Serializable with Saveable with Logging
Clustering model produced by BisectingKMeans.
Clustering model produced by BisectingKMeans. The prediction is done level-by-level from the root node to a leaf node, and at each node among its children the closest to the input point is selected.
Annotations
@Since("1.6.0")
class DistributedLDAModel extends LDAModel
Distributed LDA model.
Distributed LDA model. This model stores the inferred topics, the full training dataset, and the topic distributions.
Annotations
@Since("1.3.0")
final class EMLDAOptimizer extends LDAOptimizer
Optimizer for EM algorithm which stores data + parameter graph, plus algorithm parameters.
Optimizer for EM algorithm which stores data + parameter graph, plus algorithm parameters.
Currently, the underlying implementation uses Expectation-Maximization (EM), implemented according to the Asuncion et al. (2009) paper referenced below.
References:
- Original LDA paper (journal version): Blei, Ng, and Jordan. "Latent Dirichlet Allocation." JMLR, 2003.
  - This class implements their "smoothed" LDA model.
- Paper which clearly explains several algorithms, including EM: Asuncion, Welling, Smyth, and Teh. "On Smoothing and Inference for Topic Models." UAI, 2009.
Annotations
@Since("1.4.0")
class GaussianMixture extends Serializable
This class performs expectation maximization for multivariate Gaussian Mixture Models (GMMs).
This class performs expectation maximization for multivariate Gaussian Mixture Models (GMMs). A GMM represents a composite distribution of independent Gaussian distributions with associated "mixing" weights specifying each's contribution to the composite.
Given a set of sample points, this class will maximize the log-likelihood for a mixture of k Gaussians, iterating until the log-likelihood changes by less than convergenceTol, or until it has reached the max number of iterations. While this process is generally guaranteed to converge, it is not guaranteed to find a global optimum.
Annotations
@Since("1.3.0")
Note
This algorithm is limited in its number of features since it requires storing a covariance matrix which has size quadratic in the number of features. Even when the number of features does not exceed this limit, this algorithm may perform poorly on high-dimensional data. This is due to high-dimensional data (a) making it difficult to cluster at all (based on statistical/theoretical arguments) and (b) numerical issues with Gaussian distributions.
class GaussianMixtureModel extends Serializable with Saveable
Multivariate Gaussian Mixture Model (GMM) consisting of k Gaussians, where points are drawn from each Gaussian i=1..k with probability w(i); mu(i) and sigma(i) are the respective mean and covariance for each Gaussian distribution i=1..k.
Multivariate Gaussian Mixture Model (GMM) consisting of k Gaussians, where points are drawn from each Gaussian i=1..k with probability w(i); mu(i) and sigma(i) are the respective mean and covariance for each Gaussian distribution i=1..k.
Annotations
@Since("1.3.0")
class KMeans extends Serializable with Logging
K-means clustering with a k-means++ like initialization mode (the k-means|| algorithm by Bahmani et al).
K-means clustering with a k-means++ like initialization mode (the k-means|| algorithm by Bahmani et al).
This is an iterative algorithm that will make multiple passes over the data, so any RDDs given to it should be cached by the user.
Annotations
@Since("0.8.0")
class KMeansModel extends Saveable with Serializable with PMMLExportable
A clustering model for K-means.
A clustering model for K-means. Each point belongs to the cluster with the closest center.
Annotations
@Since("0.8.0")
class LDA extends Logging
Latent Dirichlet Allocation (LDA), a topic model designed for text documents.
Latent Dirichlet Allocation (LDA), a topic model designed for text documents.
Terminology:
- "word" = "term": an element of the vocabulary
- "token": instance of a term appearing in a document
- "topic": multinomial distribution over words representing some concept
References:
- Original LDA paper (journal version): Blei, Ng, and Jordan. "Latent Dirichlet Allocation." JMLR, 2003.
Annotations
@Since("1.3.0")
See also
Latent Dirichlet allocation (Wikipedia)
abstract class LDAModel extends Saveable
Latent Dirichlet Allocation (LDA) model.
Latent Dirichlet Allocation (LDA) model.
This abstraction permits for different underlying representations, including local and distributed data structures.
Annotations
@Since("1.3.0")
trait LDAOptimizer extends AnyRef
An LDAOptimizer specifies which optimization/learning/inference algorithm to use, and it can hold optimizer-specific parameters for users to set.
An LDAOptimizer specifies which optimization/learning/inference algorithm to use, and it can hold optimizer-specific parameters for users to set.
Annotations
@Since("1.4.0")
class LocalLDAModel extends LDAModel with Serializable
Local LDA model.
Local LDA model. This model stores only the inferred topics.
Annotations
@Since("1.3.0")
final class OnlineLDAOptimizer extends LDAOptimizer with Logging
An online optimizer for LDA.
An online optimizer for LDA. The Optimizer implements the Online variational Bayes LDA algorithm, which processes a subset of the corpus on each iteration, and updates the term-topic distribution adaptively.
Original Online LDA paper: Hoffman, Blei and Bach, "Online Learning for Latent Dirichlet Allocation." NIPS, 2010.
Annotations
@Since("1.4.0")
class PowerIterationClustering extends Serializable
Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by Lin and Cohen.
Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by Lin and Cohen. From the abstract: PIC finds a very low-dimensional embedding of a dataset using truncated power iteration on a normalized pair-wise similarity matrix of the data.
Annotations
@Since("1.3.0")
See also
Spectral clustering (Wikipedia)
class PowerIterationClusteringModel extends Saveable with Serializable
Model produced by PowerIterationClustering.
Model produced by PowerIterationClustering.
Annotations
@Since("1.3.0")
class StreamingKMeans extends Logging with Serializable
StreamingKMeans provides methods for configuring a streaming k-means analysis, training the model on streaming, and using the model to make predictions on streaming data.
StreamingKMeans provides methods for configuring a streaming k-means analysis, training the model on streaming, and using the model to make predictions on streaming data. See KMeansModel for details on algorithm and update rules.
Use a builder pattern to construct a streaming k-means analysis in an application, like:
```
val model = new StreamingKMeans()
  .setDecayFactor(0.5)
  .setK(3)
  .setRandomCenters(5, 100.0)
  .trainOn(DStream)
```
Annotations
@Since("1.2.0")
class StreamingKMeansModel extends KMeansModel with Logging
StreamingKMeansModel extends MLlib's KMeansModel for streaming algorithms, so it can keep track of a continuously updated weight associated with each cluster, and also update the model by doing a single iteration of the standard k-means algorithm.
StreamingKMeansModel extends MLlib's KMeansModel for streaming algorithms, so it can keep track of a continuously updated weight associated with each cluster, and also update the model by doing a single iteration of the standard k-means algorithm.
The update algorithm uses the "mini-batch" KMeans rule, generalized to incorporate forgetfulness (i.e. decay). The update rule (for each cluster) is:
$$ \begin{align} c_{t+1} &= [(c_t * n_t * a) + (x_t * m_t)] / [n_t + m_t] \\ n_{t+1} &= n_t * a + m_t \end{align} $$
Where c_t is the previously estimated centroid for that cluster, n_t is the number of points assigned to it thus far, x_t is the centroid estimated on the current batch, and m_t is the number of points assigned to that centroid in the current batch.
The decay factor 'a' scales the contribution of the clusters as estimated thus far, by applying a as a discount weighting on the current point when evaluating new incoming data. If a=1, all batches are weighted equally. If a=0, new centroids are determined entirely by recent data. Lower values correspond to more forgetting.
Decay can optionally be specified by a half life and associated time unit. The time unit can either be a batch of data or a single data point. Considering data arrived at time t, the half life h is defined such that at time t + h the discount applied to the data from t is 0.5. The definition remains the same whether the time unit is given as batches or points.
Annotations
@Since("1.2.0")

object BisectingKMeansModel extends Loader[BisectingKMeansModel] with Serializable
Annotations
@Since("2.0.0")
object DistanceMeasure extends Serializable
Annotations
@Since("2.4.0")
object DistributedLDAModel extends Loader[DistributedLDAModel]
Distributed model fitted by LDA.
Distributed model fitted by LDA. This type of model is currently only produced by Expectation-Maximization (EM).
This model stores the inferred topics, the full training dataset, and the topic distribution for each training document.
Annotations
@Since("1.5.0")
object GaussianMixtureModel extends Loader[GaussianMixtureModel] with Serializable
Annotations
@Since("1.4.0")
object KMeans extends Serializable
Top-level methods for calling K-means clustering.
Top-level methods for calling K-means clustering.
Annotations
@Since("0.8.0")
object KMeansModel extends Loader[KMeansModel] with Serializable
Annotations
@Since("1.4.0")
object LocalLDAModel extends Loader[LocalLDAModel] with Serializable
Local (non-distributed) model fitted by LDA.
Local (non-distributed) model fitted by LDA.
This model stores the inferred topics only; it does not store info about the training dataset.
Annotations
@Since("1.5.0")
object PowerIterationClustering extends Logging with Serializable
Annotations
@Since("1.3.0")
object PowerIterationClusteringModel extends Loader[PowerIterationClusteringModel] with Serializable
Annotations
@Since("1.4.0")

Packages

clustering

package clustering

Type Members

Value Members

Ungrouped

clustering