# MLlib (RDD-based)

## Classification

| Class | Description |
| --- | --- |
| `LogisticRegressionModel(weights, intercept, …)` | Classification model trained using Multinomial/Binary Logistic Regression. |
| `LogisticRegressionWithSGD` | Train a classification model for Binary Logistic Regression using Stochastic Gradient Descent. |
| `LogisticRegressionWithLBFGS` | Train a classification model for Multinomial/Binary Logistic Regression using Limited-memory BFGS. |
| `SVMModel(weights, intercept)` | Model for Support Vector Machines (SVMs). |
| `SVMWithSGD` | Train a Support Vector Machine (SVM) using Stochastic Gradient Descent. |
| `NaiveBayesModel(labels, pi, theta)` | Model for Naive Bayes classifiers. |
| `NaiveBayes` | Train a Multinomial Naive Bayes model. |
| `StreamingLogisticRegressionWithSGD` | Train or predict a logistic regression model on streaming data. |
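
As a rough illustration of what a binary logistic regression model with `weights` and `intercept` computes at prediction time, the sketch below passes the weighted sum of the features through the logistic function. It is plain Python, not MLlib code; the function name `predict_binary` and the default threshold of 0.5 are assumptions made for this example.

```python
import math

def predict_binary(weights, intercept, features, threshold=0.5):
    """Return (probability, label) for one feature vector.

    Hypothetical helper for illustration; not part of pyspark.mllib.
    """
    # Linear margin: w . x + b
    margin = sum(w * x for w, x in zip(weights, features)) + intercept
    # The logistic function maps the margin to a probability in (0, 1)
    probability = 1.0 / (1.0 + math.exp(-margin))
    label = 1 if probability > threshold else 0
    return probability, label

# margin = 2*1 + (-1)*3 + 0.5 = -0.5, so the predicted label is 0
prob, label = predict_binary([2.0, -1.0], 0.5, [1.0, 3.0])
```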

## Clustering

| Class | Description |
| --- | --- |
| `BisectingKMeansModel(java_model)` | A clustering model derived from the bisecting k-means method. |
| `BisectingKMeans` | A bisecting k-means algorithm based on the paper “A comparison of document clustering techniques” by Steinbach, Karypis, and Kumar, with modification to fit Spark. |
| `KMeansModel(centers)` | A clustering model derived from the k-means method. |
| `KMeans` | K-means clustering. |
| `GaussianMixtureModel(java_model)` | A clustering model derived from the Gaussian Mixture Model method. |
| `GaussianMixture` | Learning algorithm for Gaussian Mixtures using the expectation-maximization algorithm. |
| `PowerIterationClusteringModel(java_model)` | Model produced by PowerIterationClustering. |
| `PowerIterationClustering` | Power Iteration Clustering (PIC), a scalable graph clustering algorithm. |
| `StreamingKMeans([k, decayFactor, timeUnit])` | Provides methods to set k, decayFactor, timeUnit to configure the KMeans algorithm for fitting and predicting on incoming dstreams. |
| `StreamingKMeansModel(clusterCenters, …)` | Clustering model which can perform an online update of the centroids. |
| `LDA` | Train Latent Dirichlet Allocation (LDA) model. |
| `LDAModel(java_model)` | A clustering model derived from the LDA method. |
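
To make the role of the `centers` held by a k-means model concrete, here is a minimal plain-Python sketch of assigning a point to its closest center by squared Euclidean distance. The helper name `nearest_center` is hypothetical and the code does not call MLlib.

```python
def nearest_center(centers, point):
    """Return the index of the center closest to `point`.

    Hypothetical helper for illustration; not part of pyspark.mllib.
    """
    def squared_distance(center):
        return sum((c - p) ** 2 for c, p in zip(center, point))

    # Pick the index whose center has the smallest squared distance
    return min(range(len(centers)), key=lambda i: squared_distance(centers[i]))

centers = [[0.0, 0.0], [10.0, 10.0]]
cluster_a = nearest_center(centers, [1.0, 2.0])   # closest to centers[0]
cluster_b = nearest_center(centers, [9.0, 8.0])   # closest to centers[1]
```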

## Evaluation

| Class | Description |
| --- | --- |
| `BinaryClassificationMetrics(scoreAndLabels)` | Evaluator for binary classification. |
| `RegressionMetrics(predictionAndObservations)` | Evaluator for regression. |
| `MulticlassMetrics(predictionAndLabels)` | Evaluator for multiclass classification. |
| `RankingMetrics(predictionAndLabels)` | Evaluator for ranking algorithms. |
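
The regression evaluator takes pairs of predictions and observations. The sketch below shows, in plain Python, the standard definitions of three common error measures computed from such pairs. The function `regression_summary` and its dictionary keys are hypothetical names for this example, not the MLlib API.

```python
import math

def regression_summary(prediction_and_observations):
    """Compute MSE, RMSE and MAE from (prediction, observation) pairs.

    Hypothetical helper for illustration; not part of pyspark.mllib.
    """
    pairs = list(prediction_and_observations)
    n = len(pairs)
    errors = [prediction - observation for prediction, observation in pairs]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae}

summary = regression_summary([(2.5, 3.0), (0.0, -0.5), (2.0, 2.0), (8.0, 7.0)])
```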

## Feature

| Class | Description |
| --- | --- |
| `Normalizer([p])` | Normalizes samples individually to unit Lp norm. |
| `StandardScalerModel(java_model)` | Represents a StandardScaler model that can transform vectors. |
| `StandardScaler([withMean, withStd])` | Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set. |
| `HashingTF([numFeatures])` | Maps a sequence of terms to their term frequencies using the hashing trick. |
| `IDFModel(java_model)` | Represents an IDF model that can transform term frequency vectors. |
| `IDF([minDocFreq])` | Inverse document frequency (IDF). |
| `Word2Vec` | Creates vector representations of words in a text corpus. |
| `Word2VecModel(java_model)` | Class for the Word2Vec model. |
| `ChiSqSelector([numTopFeatures, …])` | Creates a ChiSquared feature selector. |
| `ChiSqSelectorModel(java_model)` | Represents a Chi Squared selector model. |
| `ElementwiseProduct(scalingVector)` | Scales each element of the vector by the supplied weight vector (an element-wise product). |
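
Two of the transformations listed above are easy to sketch in plain Python: scaling a sample to unit Lp norm, and mapping terms to term frequencies with the hashing trick. The names `lp_normalize` and `hashing_tf` are hypothetical, and the use of `zlib.crc32` as the hash function is an assumption made so the example is deterministic; MLlib uses its own hash function, so bucket indices will differ.

```python
import zlib

def lp_normalize(vector, p=2.0):
    """Scale `vector` so that its Lp norm equals 1 (finite p only)."""
    norm = sum(abs(x) ** p for x in vector) ** (1.0 / p)
    if norm == 0.0:
        # An all-zero vector cannot be rescaled; return it unchanged
        return list(vector)
    return [x / norm for x in vector]

def hashing_tf(terms, num_features=16):
    """Map terms to {bucket index: term frequency} with the hashing trick.

    Assumption: crc32 stands in for the hash function; different terms
    may collide in the same bucket.
    """
    counts = {}
    for term in terms:
        index = zlib.crc32(term.encode("utf-8")) % num_features
        counts[index] = counts.get(index, 0.0) + 1.0
    return counts

unit = lp_normalize([3.0, 4.0])               # L2 norm of [3, 4] is 5
tf = hashing_tf(["spark", "mllib", "spark"])  # "spark" is counted twice
```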

## Frequency Pattern Mining

| Class | Description |
| --- | --- |
| `FPGrowth` | A parallel FP-growth algorithm to mine frequent itemsets. |
| `FPGrowthModel(java_model)` | An FP-Growth model for mining frequent itemsets using the Parallel FP-Growth algorithm. |
| `PrefixSpan` | A parallel PrefixSpan algorithm to mine frequent sequential patterns. |
| `PrefixSpanModel(java_model)` | Model fitted by PrefixSpan. |
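
To clarify what "frequent itemsets" means, the sketch below counts every itemset by brute force and keeps those whose support (the fraction of transactions containing them) reaches a minimum. This is not the FP-growth algorithm, which avoids this exhaustive candidate enumeration; it only illustrates the output being mined. The function name `frequent_itemsets` is hypothetical.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for itemsets with support >= min_support.

    Brute-force illustration only: cost grows exponentially with
    transaction length.
    """
    n = len(transactions)
    counts = {}
    for transaction in transactions:
        items = sorted(set(transaction))
        # Enumerate every non-empty subset of the transaction
        for size in range(1, len(items) + 1):
            for itemset in combinations(items, size):
                counts[itemset] = counts.get(itemset, 0) + 1
    return {s: c for s, c in counts.items() if c / n >= min_support}

result = frequent_itemsets(
    [["a", "b"], ["a", "c"], ["a", "b", "c"], ["b"]],
    min_support=0.5,
)
```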

## Vector and Matrix

| Class | Description |
| --- | --- |
| `Vector` | |
| `DenseVector` | A dense vector represented by a value array. |
| `SparseVector(size, *args)` | A simple sparse vector class for passing data to MLlib. |
| `Vectors` | Factory methods for working with vectors. |
| `Matrix(numRows, numCols[, isTransposed])` | |
| `DenseMatrix(numRows, numCols, values[, …])` | Column-major dense matrix. |
| `SparseMatrix(numRows, numCols, colPtrs, …)` | Sparse Matrix stored in CSC format. |
| `Matrices` | |
| `QRDecomposition` | Represents QR factors. |
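
The two storage layouts named above can be illustrated without Spark. In column-major order, the values of a dense matrix are listed one column after another. In CSC (compressed sparse column) format, `col_ptrs[j]` and `col_ptrs[j + 1]` delimit the stored entries of column `j`. The helper names and the `row_indices` argument below are assumptions made for this sketch, not MLlib signatures.

```python
def dense_entry(num_rows, values, i, j):
    """Entry (i, j) of a dense matrix whose values are in column-major order."""
    return values[j * num_rows + i]

def csc_to_rows(num_rows, num_cols, col_ptrs, row_indices, values):
    """Expand a CSC-encoded matrix into a list of dense rows."""
    rows = [[0.0] * num_cols for _ in range(num_rows)]
    for j in range(num_cols):
        # Stored entries of column j occupy positions col_ptrs[j]..col_ptrs[j+1]-1
        for k in range(col_ptrs[j], col_ptrs[j + 1]):
            rows[row_indices[k]][j] = values[k]
    return rows

# The 3 x 2 matrix [[1, 4], [2, 5], [3, 6]] stored column by column
dense_values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
entry = dense_entry(3, dense_values, 1, 1)  # row 1, column 1

# The 3 x 2 matrix [[9, 0], [0, 8], [0, 6]] in CSC form
sparse_rows = csc_to_rows(3, 2, [0, 1, 3], [0, 1, 2], [9.0, 8.0, 6.0])
```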

### Distributed Representation

| Class | Description |
| --- | --- |
| `BlockMatrix(blocks, rowsPerBlock, colsPerBlock)` | Represents a distributed matrix in blocks of local matrices. |
| `CoordinateMatrix(entries[, numRows, numCols])` | Represents a matrix in coordinate format. |
| `DistributedMatrix` | Represents a matrix stored in distributed form, backed by one or more RDDs. |
| `IndexedRow(index, vector)` | Represents a row of an IndexedRowMatrix. |
| `IndexedRowMatrix(rows[, numRows, numCols])` | Represents a row-oriented distributed Matrix with indexed rows. |
| `MatrixEntry(i, j, value)` | Represents an entry of a CoordinateMatrix. |
| `RowMatrix(rows[, numRows, numCols])` | Represents a row-oriented distributed Matrix with no meaningful row indices. |
| `SingularValueDecomposition(java_model)` | Represents singular value decomposition (SVD) factors. |
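
The relationship between coordinate format and indexed rows can be sketched locally: each `(i, j, value)` entry is grouped under its row index `i`. The code below is plain Python operating on an in-memory list rather than an RDD, and `entries_to_indexed_rows` is a hypothetical helper name.

```python
def entries_to_indexed_rows(entries):
    """Group (i, j, value) entries into (row index, {column: value}) pairs.

    Local illustration only; the MLlib classes operate on RDDs.
    """
    rows = {}
    for i, j, value in entries:
        rows.setdefault(i, {})[j] = value
    # Sort by row index so the output order is deterministic
    return sorted(rows.items())

indexed_rows = entries_to_indexed_rows([(0, 1, 2.0), (2, 0, 5.0), (0, 3, 1.5)])
```

Rows with no stored entries (row 1 here) simply do not appear in the result.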

## Random

| Class | Description |
| --- | --- |
| `RandomRDDs` | Generator methods for creating RDDs composed of i.i.d. samples from some distribution. |

## Recommendation

| Class | Description |
| --- | --- |
| `MatrixFactorizationModel(java_model)` | A matrix factorization model trained by regularized alternating least-squares. |
| `ALS` | Alternating Least Squares matrix factorization. |
| `Rating` | Represents a (user, product, rating) tuple. |

## Regression

| Class | Description |
| --- | --- |
| `LabeledPoint(label, features)` | Class that represents the features and labels of a data point. |
| `LinearModel(weights, intercept)` | A linear model that has a vector of coefficients and an intercept. |
| `LinearRegressionModel(weights, intercept)` | A linear regression model derived from a least-squares fit. |
| `LinearRegressionWithSGD` | Train a linear regression model with no regularization using Stochastic Gradient Descent. |
| `RidgeRegressionModel(weights, intercept)` | A linear regression model derived from a least-squares fit with an L2 penalty term. |
| `RidgeRegressionWithSGD` | Train a regression model with L2-regularization using Stochastic Gradient Descent. |
| `LassoModel(weights, intercept)` | A linear regression model derived from a least-squares fit with an L1 penalty term. |
| `LassoWithSGD` | Train a regression model with L1-regularization using Stochastic Gradient Descent. |
| `IsotonicRegressionModel(boundaries, …)` | Regression model for isotonic regression. |
| `IsotonicRegression` | Isotonic regression. |
| `StreamingLinearAlgorithm` | Base class that every streaming linear algorithm must inherit from. |
| `StreamingLinearRegressionWithSGD([stepSize, …])` | Train or predict a linear regression model on streaming data. |
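
The plain, ridge, and lasso variants above differ only in the penalty added to the least-squares loss. The following plain-Python sketch evaluates such an objective for a fixed set of weights. The function names, the `penalty` strings, and the exact scaling (half squared error averaged over points, half squared L2 norm) are assumptions chosen for illustration and should not be read as MLlib's exact objective.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def penalized_objective(weights, intercept, points, reg_param=0.0, penalty=None):
    """Average half squared error plus an optional L1 or L2 penalty.

    `points` is a list of (features, label) pairs. Hypothetical helper;
    the scaling conventions here are assumptions.
    """
    n = len(points)
    data_loss = sum(
        0.5 * (dot(weights, x) + intercept - y) ** 2 for x, y in points
    ) / n
    if penalty == "l2":      # ridge-style penalty
        reg = 0.5 * dot(weights, weights)
    elif penalty == "l1":    # lasso-style penalty
        reg = sum(abs(w) for w in weights)
    else:                    # no regularization
        reg = 0.0
    return data_loss + reg_param * reg

points = [([1.0, 1.0], 0.0), ([2.0, 0.0], 2.0)]
weights, intercept = [1.0, -2.0], 0.5

plain = penalized_objective(weights, intercept, points)
ridge = penalized_objective(weights, intercept, points, 0.1, "l2")
lasso = penalized_objective(weights, intercept, points, 0.1, "l1")
```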

## Statistics

| Class | Description |
| --- | --- |
| `Statistics` | |
| `MultivariateStatisticalSummary(java_model)` | Trait for multivariate statistical summary of a data matrix. |
| `ChiSqTestResult(java_model)` | Contains test results for the chi-squared hypothesis test. |
| `MultivariateGaussian` | Represents a (mu, sigma) tuple. |
| `KernelDensity` | Estimate probability density at required points given an RDD of samples from the population. |
| `KolmogorovSmirnovTestResult(java_model)` | Contains test results for the Kolmogorov-Smirnov test. |
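
For context on the chi-squared test results listed above, the sketch below computes the standard Pearson chi-squared statistic and degrees of freedom for a goodness-of-fit comparison. It is plain Python with a hypothetical function name, and it stops short of the p-value, which requires the chi-squared distribution function.

```python
def chi_squared_statistic(observed, expected):
    """Pearson statistic: sum over categories of (O - E)^2 / E.

    Hypothetical helper for illustration; not part of pyspark.mllib.
    """
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [18.0, 22.0, 20.0]
expected = [20.0, 20.0, 20.0]

statistic = chi_squared_statistic(observed, expected)
degrees_of_freedom = len(observed) - 1
```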

## Tree

| Class | Description |
| --- | --- |
| `DecisionTreeModel(java_model)` | A decision tree model for classification or regression. |
| `DecisionTree` | Learning algorithm for a decision tree model for classification or regression. |
| `RandomForestModel(java_model)` | Represents a random forest model. |
| `RandomForest` | Learning algorithm for a random forest model for classification or regression. |
| `GradientBoostedTreesModel(java_model)` | Represents a gradient-boosted tree model. |
| `GradientBoostedTrees` | Learning algorithm for a gradient boosted trees model for classification or regression. |

## Utilities

| Class | Description |
| --- | --- |
| `JavaLoader` | Mixin for classes which can load saved models using their Scala implementation. |
| `JavaSaveable` | Mixin for models that provide save() through their Scala implementation. |
| `LinearDataGenerator` | Utils for generating linear data. |
| `Loader` | Mixin for classes which can load saved models from files. |
| `MLUtils` | Helper methods to load, save and pre-process data used in MLlib. |
| `Saveable` | Mixin for models and transformers which may be saved as files. |