A bisecting k-means algorithm based on the paper "A comparison of document clustering techniques" by Steinbach, Karypis, and Kumar, modified to fit Spark.
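As a rough illustration of the bisecting idea only (not Spark's implementation), the sketch below starts with all points in one cluster and repeatedly splits one cluster in two with a small 2-means step until k clusters exist. The function names `two_means` and `bisecting_kmeans`, the restriction to one-dimensional points, the min/max seeding, and the rule of always splitting the largest cluster are all assumptions made for this example.

```python
def two_means(points, iters=10):
    """Split 1-D points into two groups with a basic 2-means loop.

    Assumption for this sketch: the centers are seeded with the
    minimum and maximum value, which keeps the result deterministic.
    """
    c0, c1 = min(points), max(points)
    left, right = list(points), []
    for _ in range(iters):
        # Assign every point to the nearer of the two centers.
        left = [p for p in points if abs(p - c0) <= abs(p - c1)]
        right = [p for p in points if abs(p - c0) > abs(p - c1)]
        if not left or not right:
            break  # degenerate split, e.g. all points identical
        # Move each center to the mean of its group.
        c0, c1 = sum(left) / len(left), sum(right) / len(right)
    return left, right


def bisecting_kmeans(points, k):
    """Return up to k clusters by repeatedly bisecting one cluster.

    Assumption for this sketch: the cluster with the most points is
    the one chosen for the next split.
    """
    clusters = [list(points)]
    while len(clusters) < k:
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        left, right = two_means(clusters[idx])
        if not right:
            break  # the chosen cluster cannot be divided further
        clusters[idx:idx + 1] = [left, right]
    return clusters


data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 20.0, 20.2, 19.8]
clusters = bisecting_kmeans(data, 3)
```

Because every step divides exactly one existing cluster, the result forms a hierarchy: stopping the loop earlier yields a coarser clustering of the same data.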
Model fitted by BisectingKMeans.
:: Experimental :: Summary of BisectingKMeans.
:: Experimental :: Summary of clustering algorithms.
Distributed model fitted by LDA.
Gaussian Mixture clustering.
This algorithm is limited in the number of features it can handle, because it must store a covariance matrix whose size is quadratic in the number of features. Even when the number of features is within this limit, the algorithm may perform poorly on high-dimensional data, both because high-dimensional data is difficult to cluster at all (on statistical and theoretical grounds) and because Gaussian distributions raise numerical issues in high dimensions.
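To make the quadratic growth concrete, the following back-of-the-envelope sketch counts the entries in full covariance matrices. It assumes dense storage, one full matrix per Gaussian, and 8 bytes per entry; the helper names are hypothetical, and the actual in-memory layout used by Spark may differ.

```python
def covariance_entries(num_features, k=1):
    """Entries in k dense num_features x num_features covariance matrices."""
    return k * num_features * num_features


def covariance_bytes(num_features, k=1, bytes_per_entry=8):
    """Approximate storage, assuming one 8-byte double per entry."""
    return covariance_entries(num_features, k) * bytes_per_entry


# Multiplying the feature count by 10 multiplies the storage by 100.
small = covariance_bytes(1_000, k=2)   # 16,000,000 bytes
large = covariance_bytes(10_000, k=2)  # 1,600,000,000 bytes
```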
Multivariate Gaussian Mixture Model (GMM) consisting of k Gaussians, where each point is drawn from Gaussian i with probability weights(i).
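The generative process described above can be sketched as follows: first choose a component i with probability weights(i), then draw a point from that component. For brevity this example assumes univariate Gaussians (a mean and a standard deviation per component) rather than the multivariate case, and the function name `sample_mixture` is hypothetical.

```python
import random


def sample_mixture(weights, means, stddevs, n, seed=0):
    """Draw n (component, value) pairs from a 1-D Gaussian mixture."""
    rng = random.Random(seed)
    components = range(len(weights))
    samples = []
    for _ in range(n):
        # Pick component i with probability proportional to weights[i].
        i = rng.choices(components, weights=weights)[0]
        # Draw the value from the chosen Gaussian.
        samples.append((i, rng.gauss(means[i], stddevs[i])))
    return samples


draws = sample_mixture([0.25, 0.75], [0.0, 10.0], [1.0, 1.0], n=10_000)
share_of_second = sum(1 for i, _ in draws if i == 1) / len(draws)
```

With enough draws, `share_of_second` should land close to the second weight, 0.75.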
:: Experimental :: Summary of GaussianMixture.
K-means clustering with support for k-means|| initialization proposed by Bahmani et al.
Model fitted by KMeans.
:: Experimental :: Summary of KMeans.
Latent Dirichlet Allocation (LDA), a topic model designed for text documents.
Model fitted by LDA.
Local (non-distributed) model fitted by LDA.
:: Experimental :: Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by Lin and Cohen.