MLlib (RDD-based)

Classification

LogisticRegressionModel(weights, intercept, …)

Classification model trained using Multinomial/Binary Logistic Regression.

LogisticRegressionWithSGD

Train a classification model for Binary Logistic Regression using Stochastic Gradient Descent.

LogisticRegressionWithLBFGS

Train a classification model for Multinomial/Binary Logistic Regression using Limited-memory BFGS.

SVMModel(weights, intercept)

Model for Support Vector Machines (SVMs).

SVMWithSGD

Train a Support Vector Machine (SVM) using Stochastic Gradient Descent.

NaiveBayesModel(labels, pi, theta)

Model for Naive Bayes classifiers.

NaiveBayes

Train a Multinomial Naive Bayes model.

StreamingLogisticRegressionWithSGD([…])

Train or predict a logistic regression model on streaming data.

Clustering

BisectingKMeansModel(java_model)

A clustering model derived from the bisecting k-means method.

BisectingKMeans

A bisecting k-means algorithm based on the paper “A comparison of document clustering techniques” by Steinbach, Karypis, and Kumar, with modification to fit Spark.

KMeansModel(centers)

A clustering model derived from the k-means method.

KMeans

K-means clustering.

GaussianMixtureModel(java_model)

A clustering model derived from the Gaussian Mixture Model method.

GaussianMixture

Learning algorithm for Gaussian Mixtures using the expectation-maximization algorithm.

PowerIterationClusteringModel(java_model)

Model produced by PowerIterationClustering.

PowerIterationClustering

Power Iteration Clustering (PIC), a scalable graph clustering algorithm.

StreamingKMeans([k, decayFactor, timeUnit])

Provides methods to set k, decayFactor, and timeUnit, which configure the KMeans algorithm for fitting and predicting on incoming DStreams.

StreamingKMeansModel(clusterCenters, …)

Clustering model which can perform an online update of the centroids.
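To make the idea of an online centroid update concrete, here is a minimal plain-Python sketch. It assumes the commonly documented streaming k-means rule, in which the old center is discounted by a decay factor and blended with the mean of the new batch. The helper name `update_center` and its parameters are hypothetical and are not part of the MLlib API.

```python
def update_center(center, weight, batch_mean, batch_count, decay=1.0):
    """Blend an existing center with the mean of a new batch (hypothetical helper).

    Assumed rule: new = (center * weight * decay + batch_mean * batch_count)
                        / (weight * decay + batch_count)
    """
    discounted = weight * decay
    total = discounted + batch_count
    # Nothing to blend if neither the old center nor the batch carries weight.
    if total == 0:
        return list(center)
    return [(c * discounted + x * batch_count) / total
            for c, x in zip(center, batch_mean)]

# decay=1.0 keeps all history; decay=0.0 forgets it and uses only the new batch.
kept = update_center([0.0, 0.0], weight=2.0, batch_mean=[3.0, 6.0], batch_count=1.0, decay=1.0)
forgot = update_center([0.0, 0.0], weight=2.0, batch_mean=[3.0, 6.0], batch_count=1.0, decay=0.0)
```

In this sketch a decay of 1.0 weights all past data equally, while smaller values let recent batches dominate.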

LDA

Train Latent Dirichlet Allocation (LDA) model.

LDAModel(java_model)

A clustering model derived from the LDA method.

Evaluation

BinaryClassificationMetrics(scoreAndLabels)

Evaluator for binary classification.

RegressionMetrics(predictionAndObservations)

Evaluator for regression.

MulticlassMetrics(predictionAndLabels)

Evaluator for multiclass classification.

RankingMetrics(predictionAndLabels)

Evaluator for ranking algorithms.

Feature

Normalizer([p])

Normalizes samples individually to unit Lp norm.
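As a rough illustration of what scaling a sample to unit Lp norm means, the following plain-Python sketch normalizes one sample at a time. The helper name `lp_normalize` is hypothetical; it only mirrors the arithmetic and is not the MLlib implementation.

```python
import math

def lp_normalize(sample, p=2.0):
    """Scale one sample so that its Lp norm equals 1 (hypothetical helper)."""
    if p == math.inf:
        norm = max((abs(x) for x in sample), default=0.0)
    else:
        norm = sum(abs(x) ** p for x in sample) ** (1.0 / p)
    # Leave an all-zero sample unchanged to avoid dividing by zero.
    if norm == 0.0:
        return list(sample)
    return [x / norm for x in sample]

unit_l2 = lp_normalize([3.0, 4.0])          # L2 norm of the input is 5
unit_l1 = lp_normalize([1.0, 3.0], p=1.0)   # L1 norm of the input is 4
```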

StandardScalerModel(java_model)

Represents a StandardScaler model that can transform vectors.

StandardScaler([withMean, withStd])

Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set.
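The following plain-Python sketch shows the kind of computation this description refers to: column means and standard deviations are taken from the training rows and then applied to each row. The helper name `standardize_columns` is hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption made for this sketch.

```python
import statistics

def standardize_columns(rows, with_mean=False, with_std=True):
    """Center and/or scale each column using statistics of `rows` (hypothetical helper)."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    # Assumption: sample standard deviation with an n - 1 denominator.
    stds = [statistics.stdev(col) for col in columns]
    result = []
    for row in rows:
        new_row = []
        for x, mu, sd in zip(row, means, stds):
            if with_mean:
                x = x - mu
            # Constant columns (sd == 0) are left unscaled in this sketch.
            if with_std and sd != 0.0:
                x = x / sd
            new_row.append(x)
        result.append(new_row)
    return result

data = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
standardized = standardize_columns(data, with_mean=True, with_std=True)
```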

HashingTF([numFeatures])

Maps a sequence of terms to their term frequencies using the hashing trick.
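A minimal sketch of the hashing trick, in plain Python: each term is hashed to a bucket index, and the bucket count becomes the term frequency. The helper name `hashing_tf` is hypothetical, and CRC32 is only a stand-in hash chosen for determinism; the hash function MLlib uses may differ.

```python
import zlib

def hashing_tf(terms, num_features=16):
    """Map terms to a fixed-length term-frequency list via hashing (hypothetical helper)."""
    counts = [0.0] * num_features
    for term in terms:
        # Assumption: CRC32 stands in for the real hash function.
        index = zlib.crc32(term.encode("utf-8")) % num_features
        counts[index] += 1.0
    return counts

tf = hashing_tf(["spark", "mllib", "spark"], num_features=16)
```

Because distinct terms can hash to the same index, a larger `num_features` reduces collisions at the cost of longer vectors.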

IDFModel(java_model)

Represents an IDF model that can transform term frequency vectors.

IDF([minDocFreq])

Inverse document frequency (IDF).
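For orientation, the sketch below computes IDF weights in plain Python, assuming the formulation idf = log((m + 1) / (d(t) + 1)), where m is the number of documents and d(t) is the number of documents containing term t. The helper name `idf_weights` is hypothetical, and treating terms below `min_doc_freq` as having weight 0 is an assumption about how the minDocFreq filter behaves.

```python
import math

def idf_weights(doc_freqs, num_docs, min_doc_freq=0):
    """Compute one IDF weight per term from document frequencies (hypothetical helper)."""
    weights = []
    for df in doc_freqs:
        if df < min_doc_freq:
            # Assumption: terms seen in too few documents get weight 0.
            weights.append(0.0)
        else:
            weights.append(math.log((num_docs + 1) / (df + 1)))
    return weights

# Three terms appearing in 3, 1, and 0 of 3 documents.
weights = idf_weights([3, 1, 0], num_docs=3, min_doc_freq=1)
```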

Word2Vec()

Word2Vec creates vector representation of words in a text corpus.

Word2VecModel(java_model)

Class for Word2Vec model.

ChiSqSelector([numTopFeatures, …])

Creates a ChiSquared feature selector.

ChiSqSelectorModel(java_model)

Represents a Chi Squared selector model.

ElementwiseProduct(scalingVector)

Scales each column of the vector by the supplied weight vector.

Frequency Pattern Mining

FPGrowth

A Parallel FP-growth algorithm to mine frequent itemsets.

FPGrowthModel(java_model)

An FP-Growth model for mining frequent itemsets using the Parallel FP-Growth algorithm.

PrefixSpan

A parallel PrefixSpan algorithm to mine frequent sequential patterns.

PrefixSpanModel(java_model)

Model fitted by PrefixSpan.

Vector and Matrix

Vector

DenseVector(ar)

A dense vector represented by a value array.

SparseVector(size, *args)

A simple sparse vector class for passing data to MLlib.

Vectors

Factory methods for working with vectors.

Matrix(numRows, numCols[, isTransposed])

DenseMatrix(numRows, numCols, values[, …])

Column-major dense matrix.
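Column-major storage means the values array lists the first column top to bottom, then the second column, and so on. The plain-Python sketch below shows the resulting index arithmetic; the helper names `dense_get` and `dense_to_rows` are hypothetical and are not MLlib methods.

```python
def dense_get(values, num_rows, i, j):
    """Return element (i, j) of a column-major matrix (hypothetical helper)."""
    return values[i + j * num_rows]

def dense_to_rows(values, num_rows, num_cols):
    """Expand a column-major values array into a list of rows (hypothetical helper)."""
    return [[dense_get(values, num_rows, i, j) for j in range(num_cols)]
            for i in range(num_rows)]

# A 2 x 3 matrix whose columns are (1, 2), (3, 4), and (5, 6).
as_rows = dense_to_rows([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], num_rows=2, num_cols=3)
```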

SparseMatrix(numRows, numCols, colPtrs, …)

Sparse Matrix stored in CSC format.
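In compressed sparse column (CSC) format, column j owns the entries at positions colPtrs[j] up to but not including colPtrs[j + 1] in the row-index and value arrays. The plain-Python sketch below decodes such a layout into dense rows; the helper name `csc_to_rows` is hypothetical and is not part of MLlib.

```python
def csc_to_rows(num_rows, num_cols, col_ptrs, row_indices, values):
    """Decode a CSC layout into a dense list of rows (hypothetical helper)."""
    dense = [[0.0] * num_cols for _ in range(num_rows)]
    for j in range(num_cols):
        # Entries col_ptrs[j] .. col_ptrs[j + 1] - 1 belong to column j.
        for k in range(col_ptrs[j], col_ptrs[j + 1]):
            dense[row_indices[k]][j] = values[k]
    return dense

# A 2 x 2 matrix with three stored entries.
decoded = csc_to_rows(2, 2, [0, 2, 3], [0, 1, 1], [2.0, 3.0, 4.0])
```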

Matrices

QRDecomposition(Q, R)

Represents QR factors.

Distributed Representation

BlockMatrix(blocks, rowsPerBlock, colsPerBlock)

Represents a distributed matrix in blocks of local matrices.

CoordinateMatrix(entries[, numRows, numCols])

Represents a matrix in coordinate format.

DistributedMatrix

Represents a distributively stored matrix backed by one or more RDDs.

IndexedRow(index, vector)

Represents a row of an IndexedRowMatrix.

IndexedRowMatrix(rows[, numRows, numCols])

Represents a row-oriented distributed Matrix with indexed rows.

MatrixEntry(i, j, value)

Represents an entry of a CoordinateMatrix.

RowMatrix(rows[, numRows, numCols])

Represents a row-oriented distributed Matrix with no meaningful row indices.

SingularValueDecomposition(java_model)

Represents singular value decomposition (SVD) factors.

Random

RandomRDDs

Generator methods for creating RDDs of i.i.d. samples from some distribution.

Recommendation

MatrixFactorizationModel(java_model)

A matrix factorization model trained by regularized alternating least-squares.

ALS

Alternating Least Squares matrix factorization.

Rating

Represents a (user, product, rating) tuple.

Regression

LabeledPoint(label, features)

Class that represents the features and labels of a data point.

LinearModel(weights, intercept)

A linear model that has a vector of coefficients and an intercept.

LinearRegressionModel(weights, intercept)

A linear regression model derived from a least-squares fit.

LinearRegressionWithSGD

Train a linear regression model with no regularization using Stochastic Gradient Descent.

RidgeRegressionModel(weights, intercept)

A linear regression model derived from a least-squares fit with an L2 penalty term.

RidgeRegressionWithSGD

Train a regression model with L2-regularization using Stochastic Gradient Descent.

LassoModel(weights, intercept)

A linear regression model derived from a least-squares fit with an L1 penalty term.

LassoWithSGD

Train a regression model with L1-regularization using Stochastic Gradient Descent.

IsotonicRegressionModel(boundaries, …)

Regression model for isotonic regression.

IsotonicRegression

Isotonic regression.
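Isotonic regression fits a nondecreasing sequence that is closest to the data in the least-squares sense. One standard way to compute it is the pool adjacent violators algorithm, sketched below in plain Python for unit weights. The helper name `pava` is hypothetical, and this single-machine sketch is not the MLlib implementation.

```python
def pava(values):
    """Nondecreasing least-squares fit via pool adjacent violators (hypothetical helper)."""
    blocks = []  # each block holds [sum of values, number of values]
    for v in values:
        blocks.append([v, 1])
        # Merge backwards while the previous block's mean exceeds the current one's.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    fitted = []
    for s, n in blocks:
        # Every point in a pooled block receives the block mean.
        fitted.extend([s / n] * n)
    return fitted

fitted = pava([1.0, 3.0, 2.0, 4.0])
```

Here the out-of-order pair 3.0, 2.0 is pooled into its mean, which restores a nondecreasing sequence.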

StreamingLinearAlgorithm(model)

Base class that every streaming linear algorithm must inherit from.

StreamingLinearRegressionWithSGD([stepSize, …])

Train or predict a linear regression model on streaming data.

Statistics

Statistics

MultivariateStatisticalSummary(java_model)

Trait for multivariate statistical summary of a data matrix.

ChiSqTestResult(java_model)

Contains test results for the chi-squared hypothesis test.

MultivariateGaussian

Represents a (mu, sigma) tuple.

KernelDensity()

Estimate probability density at required points given an RDD of samples from the population.

KolmogorovSmirnovTestResult(java_model)

Contains test results for the Kolmogorov-Smirnov test.

Tree

DecisionTreeModel(java_model)

A decision tree model for classification or regression.

DecisionTree

Learning algorithm for a decision tree model for classification or regression.

RandomForestModel(java_model)

Represents a random forest model.

RandomForest

Learning algorithm for a random forest model for classification or regression.

GradientBoostedTreesModel(java_model)

Represents a gradient-boosted tree model.

GradientBoostedTrees

Learning algorithm for a gradient-boosted trees model for classification or regression.

Utilities

JavaLoader

Mixin for classes which can load saved models using their Scala implementation.

JavaSaveable

Mixin for models that provide save() through their Scala implementation.

LinearDataGenerator

Utils for generating linear data.

Loader

Mixin for classes which can load saved models from files.

MLUtils

Helper methods to load, save and pre-process data used in MLlib.

Saveable

Mixin for models and transformers which may be saved as files.