MLlib (RDD-based)

Classification

LogisticRegressionModel(weights, intercept, ...)

Classification model trained using Multinomial/Binary Logistic Regression.
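As a rough illustration of how a binary model built from weights and an intercept turns features into a class label, the sketch below applies the logistic function to the linear margin and compares the score with a threshold. It is plain Python rather than a call into MLlib; the function names and the 0.5 default threshold are assumptions made for this example.

```python
import math

def logistic_score(weights, intercept, features):
    """Probability-like score: sigmoid of the linear margin w.x + b."""
    margin = sum(w * x for w, x in zip(weights, features)) + intercept
    return 1.0 / (1.0 + math.exp(-margin))

def predict_binary(weights, intercept, features, threshold=0.5):
    """Return 1 when the score exceeds the threshold, else 0 (threshold is this sketch's choice)."""
    return 1 if logistic_score(weights, intercept, features) > threshold else 0

print(predict_binary([1.0, -1.0], 0.0, [2.0, 0.0]))
print(predict_binary([1.0, -1.0], 0.0, [0.0, 2.0]))
```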

LogisticRegressionWithSGD()

Train a classification model for Binary Logistic Regression using Stochastic Gradient Descent.

LogisticRegressionWithLBFGS()

Train a classification model for Multinomial/Binary Logistic Regression using Limited-memory BFGS.

SVMModel(weights, intercept)

Model for Support Vector Machines (SVMs).

SVMWithSGD()

Train a Support Vector Machine (SVM) using Stochastic Gradient Descent.

NaiveBayesModel(labels, pi, theta)

Model for Naive Bayes classifiers.

NaiveBayes()

Train a Multinomial Naive Bayes model.

StreamingLogisticRegressionWithSGD([...])

Train or predict a logistic regression model on streaming data.

Clustering

BisectingKMeansModel(java_model)

A clustering model derived from the bisecting k-means method.

BisectingKMeans()

A bisecting k-means algorithm based on the paper "A comparison of document clustering techniques" by Steinbach, Karypis, and Kumar, with modification to fit Spark.

KMeansModel(centers)

A clustering model derived from the k-means method.

KMeans()

K-means clustering.
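For readers unfamiliar with the method, k-means alternates between assigning each point to its nearest center and moving each center to the mean of its assigned points. The following is a minimal single-machine sketch of that loop in plain Python. It is not MLlib's distributed implementation, and the function names, fixed iteration count, and caller-supplied initial centers are assumptions of this example.

```python
def closest_center(point, centers):
    """Index of the center with the smallest squared Euclidean distance to `point`."""
    return min(
        range(len(centers)),
        key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centers[c])),
    )

def kmeans(points, initial_centers, iterations=10):
    """Run a fixed number of assign/update rounds and return the final centers."""
    centers = [list(c) for c in initial_centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for point in points:
            clusters[closest_center(point, centers)].append(point)
        for c, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers

points = [[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]]
print(kmeans(points, [[0.0, 0.0], [9.0, 9.0]]))
```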

GaussianMixtureModel(java_model)

A clustering model derived from the Gaussian Mixture Model method.

GaussianMixture()

Learning algorithm for Gaussian Mixtures using the expectation-maximization algorithm.

PowerIterationClusteringModel(java_model)

Model produced by PowerIterationClustering.

PowerIterationClustering()

Power Iteration Clustering (PIC), a scalable graph clustering algorithm.

StreamingKMeans([k, decayFactor, timeUnit])

Provides methods to set k, decayFactor, and timeUnit to configure the KMeans algorithm for fitting and predicting on incoming DStreams.

StreamingKMeansModel(clusterCenters, ...)

Clustering model which can perform an online update of the centroids.

LDA()

Train a Latent Dirichlet Allocation (LDA) model.

LDAModel(java_model)

A clustering model derived from the LDA method.

Evaluation

BinaryClassificationMetrics(scoreAndLabels)

Evaluator for binary classification.

RegressionMetrics(predictionAndObservations)

Evaluator for regression.
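To make the input shape concrete, the sketch below computes a few standard regression metrics from (prediction, observation) pairs in plain Python. It illustrates the usual textbook definitions only; the function name and the returned dictionary keys are assumptions of this example, not the evaluator's attribute names.

```python
import math

def regression_metrics(prediction_and_observations):
    """Compute MSE, RMSE, MAE and R^2 from (prediction, observation) pairs."""
    pairs = list(prediction_and_observations)
    n = len(pairs)
    errors = [obs - pred for pred, obs in pairs]
    sse = sum(e * e for e in errors)
    mean_obs = sum(obs for _, obs in pairs) / n
    ss_tot = sum((obs - mean_obs) ** 2 for _, obs in pairs)
    return {
        "mse": sse / n,
        "rmse": math.sqrt(sse / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1.0 - sse / ss_tot,  # undefined when all observations are equal
    }

pairs = [(2.5, 3.0), (0.0, -0.5), (2.0, 2.0), (8.0, 7.0)]
print(regression_metrics(pairs))
```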

MulticlassMetrics(predictionAndLabels)

Evaluator for multiclass classification.

RankingMetrics(predictionAndLabels)

Evaluator for ranking algorithms.
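One common ranking measure is precision at k: the fraction of the first k predicted items that appear in the ground-truth set. The sketch below computes it for a single (predicted ranking, relevant items) pair in plain Python. The function name is hypothetical, and averaging over many queries is left out.

```python
def precision_at(predicted, relevant, k):
    """Fraction of the top-k predicted items that are in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant)
    hits = sum(1 for item in predicted[:k] if item in relevant)
    return hits / k

print(precision_at([1, 6, 2, 7, 8], [1, 2, 3, 4, 5], 5))
```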

Feature

Normalizer([p])

Normalizes samples individually to unit Lp norm.
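The operation can be pictured as dividing each sample by its own Lp norm. The sketch below does this for a single vector in plain Python, with p = 2 as the default and p = infinity handled as the maximum absolute value. The function name and the choice to return a zero vector unchanged are assumptions of this example.

```python
def normalize(vector, p=2.0):
    """Return a copy of `vector` scaled to unit L^p norm (p >= 1 or infinity)."""
    if p == float("inf"):
        norm = max((abs(x) for x in vector), default=0.0)
    else:
        norm = sum(abs(x) ** p for x in vector) ** (1.0 / p)
    if norm == 0.0:
        return list(vector)  # nothing to scale
    return [x / norm for x in vector]

print(normalize([3.0, 4.0]))
print(normalize([3.0, 4.0], p=float("inf")))
```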

StandardScalerModel(java_model)

Represents a StandardScaler model that can transform vectors.

StandardScaler([withMean, withStd])

Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set.
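A minimal sketch of the idea, in plain Python: compute per-column mean and standard deviation on the training rows, then centre and scale each row with those statistics. The function names are hypothetical. Using the sample standard deviation (dividing by n - 1) and mapping a zero-variance column to 0.0 are assumptions of this example, as are the default flag values.

```python
import math

def fit_standard_scaler(rows):
    """Return per-column (means, sample standard deviations); needs at least two rows."""
    n, dim = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dim)]
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
        for j in range(dim)
    ]
    return means, stds

def standardize(row, means, stds, with_mean=True, with_std=True):
    """Centre and/or scale one row using previously fitted statistics."""
    out = []
    for x, m, s in zip(row, means, stds):
        if with_mean:
            x = x - m
        if with_std:
            x = x / s if s != 0.0 else 0.0
        out.append(x)
    return out

training = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
means, stds = fit_standard_scaler(training)
print([standardize(r, means, stds) for r in training])
```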

HashingTF([numFeatures])

Maps a sequence of terms to their term frequencies using the hashing trick.
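The hashing trick replaces a term-to-index dictionary with a hash function: each term is hashed, the hash is reduced modulo the number of features, and the count at that index is incremented. The sketch below shows this in plain Python. It uses zlib.crc32 as a stand-in hash so results are reproducible between runs, which is an assumption of this example and not a statement about the hash MLlib uses.

```python
import zlib
from collections import Counter

def term_index(term, num_features):
    """Map a term to a bucket in [0, num_features) with a deterministic hash."""
    return zlib.crc32(term.encode("utf-8")) % num_features

def hashing_tf(terms, num_features=16):
    """Return a sparse {index: count} term-frequency mapping for one document."""
    counts = Counter(term_index(t, num_features) for t in terms)
    return dict(sorted(counts.items()))

print(hashing_tf(["spark", "mllib", "spark", "rdd"]))
```

Distinct terms may collide in the same bucket; a larger num_features makes that less likely at the cost of a wider vector.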

IDFModel(java_model)

Represents an IDF model that can transform term frequency vectors.

IDF([minDocFreq])

Inverse document frequency (IDF).

Word2Vec()

Word2Vec creates vector representations of words in a text corpus.

Word2VecModel(java_model)

Class for Word2Vec model.

ChiSqSelector([numTopFeatures, ...])

Creates a ChiSquared feature selector.

ChiSqSelectorModel(java_model)

Represents a Chi Squared selector model.

ElementwiseProduct(scalingVector)

Scales each column of the vector with the supplied weight vector.

Frequency Pattern Mining

FPGrowth()

A parallel FP-growth algorithm to mine frequent itemsets.

FPGrowthModel(java_model)

An FP-growth model for mining frequent itemsets using the parallel FP-growth algorithm.

PrefixSpan()

A parallel PrefixSpan algorithm to mine frequent sequential patterns.

PrefixSpanModel(java_model)

Model fitted by PrefixSpan.

Vector and Matrix

Vector()

DenseVector(ar)

A dense vector represented by a value array.

SparseVector(size, *args)

A simple sparse vector class for passing data to MLlib.
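A sparse vector stores only its size plus the indices and values of the non-zero entries. The plain-Python sketch below shows that representation expanded to a dense list, and a dot product that walks two sorted index lists in step. The function names are hypothetical, and sorted, duplicate-free indices are assumed.

```python
def sparse_to_dense(size, indices, values):
    """Expand (size, indices, values) into a dense list of floats."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

def sparse_dot(idx_a, val_a, idx_b, val_b):
    """Dot product of two sparse vectors given as sorted index/value lists."""
    i = j = 0
    total = 0.0
    while i < len(idx_a) and j < len(idx_b):
        if idx_a[i] == idx_b[j]:
            total += val_a[i] * val_b[j]
            i += 1
            j += 1
        elif idx_a[i] < idx_b[j]:
            i += 1
        else:
            j += 1
    return total

print(sparse_to_dense(4, [1, 3], [1.0, 5.5]))
print(sparse_dot([1, 3], [1.0, 5.5], [0, 3], [2.0, 2.0]))
```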

Vectors()

Factory methods for working with vectors.

Matrix(numRows, numCols[, isTransposed])

DenseMatrix(numRows, numCols, values[, ...])

Column-major dense matrix.
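Column-major means the values array lists the first column top to bottom, then the second column, and so on, so entry (i, j) sits at position i + j * numRows. The plain-Python sketch below illustrates that indexing. The function names are hypothetical, and the isTransposed flag is ignored here.

```python
def dense_entry(num_rows, values, i, j):
    """Entry (i, j) of a column-major matrix stored in a flat list."""
    return values[i + j * num_rows]

def dense_to_rows(num_rows, num_cols, values):
    """Rebuild the matrix as a list of rows for display."""
    return [
        [dense_entry(num_rows, values, i, j) for j in range(num_cols)]
        for i in range(num_rows)
    ]

print(dense_to_rows(2, 3, [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]))
```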

SparseMatrix(numRows, numCols, colPtrs, ...)

Sparse Matrix stored in CSC format.
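In compressed sparse column (CSC) format, colPtrs[j] and colPtrs[j + 1] delimit the slice of the row-index and value arrays that belongs to column j. The plain-Python sketch below expands such a matrix into rows to show how the three arrays fit together. The function name and the rowIndices and values argument names are assumptions filling in the elided part of the signature for illustration only.

```python
def csc_to_rows(num_rows, num_cols, col_ptrs, row_indices, values):
    """Expand a CSC matrix into a dense list of rows."""
    dense = [[0.0] * num_cols for _ in range(num_rows)]
    for j in range(num_cols):
        # Entries col_ptrs[j] .. col_ptrs[j + 1] - 1 belong to column j.
        for k in range(col_ptrs[j], col_ptrs[j + 1]):
            dense[row_indices[k]][j] = values[k]
    return dense

# A 3 x 2 matrix with one entry in column 0 and two entries in column 1.
print(csc_to_rows(3, 2, [0, 1, 3], [0, 1, 2], [9.0, 6.0, 8.0]))
```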

Matrices()

QRDecomposition(Q, R)

Represents QR factors.

Distributed Representation

BlockMatrix(blocks, rowsPerBlock, colsPerBlock)

Represents a distributed matrix in blocks of local matrices.

CoordinateMatrix(entries[, numRows, numCols])

Represents a matrix in coordinate format.

DistributedMatrix()

Represents a distributively stored matrix backed by one or more RDDs.

IndexedRow(index, vector)

Represents a row of an IndexedRowMatrix.

IndexedRowMatrix(rows[, numRows, numCols])

Represents a row-oriented distributed Matrix with indexed rows.

MatrixEntry(i, j, value)

Represents an entry of a CoordinateMatrix.

RowMatrix(rows[, numRows, numCols])

Represents a row-oriented distributed Matrix with no meaningful row indices.

SingularValueDecomposition(java_model)

Represents singular value decomposition (SVD) factors.

Random

RandomRDDs()

Generator methods for creating RDDs consisting of i.i.d. samples from some distribution.

Recommendation

MatrixFactorizationModel(java_model)

A matrix factorization model trained by regularized alternating least-squares.

ALS()

Alternating Least Squares matrix factorization.

Rating(user, product, rating)

Represents a (user, product, rating) tuple.

Regression

LabeledPoint(label, features)

Class that represents the features and labels of a data point.

LinearModel(weights, intercept)

A linear model that has a vector of coefficients and an intercept.

LinearRegressionModel(weights, intercept)

A linear regression model derived from a least-squares fit.

LinearRegressionWithSGD()

Train a linear regression model with no regularization using Stochastic Gradient Descent.

RidgeRegressionModel(weights, intercept)

A linear regression model derived from a least-squares fit with an L2 penalty term.

RidgeRegressionWithSGD()

Train a regression model with L2-regularization using Stochastic Gradient Descent.

LassoModel(weights, intercept)

A linear regression model derived from a least-squares fit with an L1 penalty term.

LassoWithSGD()

Train a regression model with L1-regularization using Stochastic Gradient Descent.

IsotonicRegressionModel(boundaries, ...)

Regression model for isotonic regression.

IsotonicRegression()

Isotonic regression.

StreamingLinearAlgorithm(model)

Base class that must be inherited by any streaming linear algorithm.

StreamingLinearRegressionWithSGD([stepSize, ...])

Train or predict a linear regression model on streaming data.

Statistics

Statistics()

MultivariateStatisticalSummary(java_model)

Trait for multivariate statistical summary of a data matrix.

ChiSqTestResult(java_model)

Contains test results for the chi-squared hypothesis test.

MultivariateGaussian(mu, sigma)

Represents a (mu, sigma) tuple.

KernelDensity()

Estimate probability density at required points given an RDD of samples from the population.
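A kernel density estimate places a smooth bump on every sample and averages the bumps at each query point. The plain-Python sketch below assumes a Gaussian kernel whose standard deviation is the bandwidth. The function name and the default bandwidth of 1.0 are choices made for this example, and the samples are a local list rather than an RDD.

```python
import math

def kernel_density(samples, points, bandwidth=1.0):
    """Gaussian kernel density estimate of `samples`, evaluated at each of `points`."""
    n = len(samples)
    scale = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        scale * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
        for x in points
    ]

print(kernel_density([1.0, 2.0, 2.5, 4.0], [0.0, 2.0, 5.0], bandwidth=0.8))
```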

KolmogorovSmirnovTestResult(java_model)

Contains test results for the Kolmogorov-Smirnov test.

Tree

DecisionTreeModel(java_model)

A decision tree model for classification or regression.

DecisionTree()

Learning algorithm for a decision tree model for classification or regression.

RandomForestModel(java_model)

Represents a random forest model.

RandomForest()

Learning algorithm for a random forest model for classification or regression.

GradientBoostedTreesModel(java_model)

Represents a gradient-boosted tree model.

GradientBoostedTrees()

Learning algorithm for a gradient-boosted trees model for classification or regression.

Utilities

JavaLoader()

Mixin for classes which can load saved models using their Scala implementation.

JavaSaveable()

Mixin for models that provide save() through their Scala implementation.

LinearDataGenerator()

Utils for generating linear data.

Loader()

Mixin for classes which can load saved models from files.

MLUtils()

Helper methods to load, save and pre-process data used in MLlib.

Saveable()

Mixin for models and transformers which may be saved as files.