A bisecting k-means algorithm based on the paper "A comparison of document clustering techniques" by Steinbach, Karypis, and Kumar, with modifications to fit Spark.
Model fitted by BisectingKMeans.
:: Experimental :: Summary of BisectingKMeans.
:: Experimental :: Summary of clustering algorithms.
Distributed model fitted by LDA.
Gaussian Mixture clustering.
This algorithm may perform poorly on high-dimensional data (data with many features), for two reasons: (a) high-dimensional data is difficult to cluster at all (based on statistical/theoretical arguments), and (b) Gaussian distributions run into numerical issues in high dimensions.
Multivariate Gaussian Mixture Model (GMM) consisting of k Gaussians, where points are drawn from each Gaussian i with probability weights(i).
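As a brief illustration of the mixture described above, the density of a point x under a model with k Gaussians can be written as a weighted sum of the component densities. The symbols here are notational assumptions, not names from the source: w_i stands for weights(i), and mu_i and Sigma_i denote the mean and covariance of Gaussian i. The index runs from 1 to k as a mathematical convention only and does not reflect how weights is indexed in code.

```latex
% Mixture density: each component i contributes its Gaussian density,
% scaled by its mixing weight w_i (written weights(i) in the text above).
p(x) = \sum_{i=1}^{k} w_i \, \mathcal{N}\!\left(x \mid \mu_i, \Sigma_i\right),
\qquad w_i \ge 0, \qquad \sum_{i=1}^{k} w_i = 1 .
```

Under this reading, drawing a point amounts to first picking a component i with probability w_i and then sampling from that component's Gaussian.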
:: Experimental :: Summary of GaussianMixture.
K-means clustering with support for k-means|| initialization proposed by Bahmani et al.
Model fitted by KMeans.
:: Experimental :: Summary of KMeans.
Latent Dirichlet Allocation (LDA), a topic model designed for text documents.
Model fitted by LDA.
Local (non-distributed) model fitted by LDA.