# Machine Learning Library (MLlib)

MLlib is Spark's implementation of some common machine learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as underlying optimization primitives:

- Basics
  - data types
  - summary statistics
- Classification and regression
- Collaborative filtering
  - alternating least squares (ALS)
- Clustering
  - k-means
- Dimensionality reduction
  - singular value decomposition (SVD)
  - principal component analysis (PCA)
- Optimization
  - stochastic gradient descent
  - limited-memory BFGS (L-BFGS)

MLlib is a new component under active development.
The APIs marked `Experimental`/`DeveloperApi` may change in future releases,
and we will provide a migration guide between releases.

# Dependencies

MLlib uses the linear algebra package Breeze, which depends on
`netlib-java` and `jblas`. `netlib-java` and `jblas` depend on native Fortran routines.
You need to install the
gfortran runtime library if it is not
already present on your nodes. MLlib will throw a linking error if it cannot detect these libraries
automatically. Due to license issues, we do not include `netlib-java`’s native libraries in MLlib’s
dependency set. If no native library is available at runtime, you will see a warning message. To
use native libraries from `netlib-java`, please include the artifact
`com.github.fommil.netlib:all:1.1.2`
as a dependency of your project or build your own (see
instructions).
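
As a sketch, declaring that dependency in a Maven `pom.xml` might look as follows. The coordinates come from the text above; the `<type>pom</type>` element is my assumption, based on the `all` artifact being packaged as a POM rather than a JAR — adapt to your own build tool as needed.

```xml
<!-- Sketch of a pom.xml dependency entry.
     Assumption: the `all` artifact uses POM packaging,
     so <type>pom</type> is required. -->
<dependency>
  <groupId>com.github.fommil.netlib</groupId>
  <artifactId>all</artifactId>
  <version>1.1.2</version>
  <type>pom</type>
</dependency>
```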

To use MLlib in Python, you will need NumPy version 1.4 or newer.

# Migration Guide

## From 0.9 to 1.0

In MLlib v1.0, we support both dense and sparse input in a unified way, which introduces a few breaking changes. If your data is sparse, please store it in a sparse format instead of a dense one to take advantage of sparsity in both storage and computation.

In Scala, we used to represent a feature vector by `Array[Double]`, which is replaced by
`Vector` in v1.0. Algorithms that used
to accept `RDD[Array[Double]]` now take
`RDD[Vector]`. `LabeledPoint` is now a wrapper of `(Double, Vector)`
instead of `(Double, Array[Double])`. Converting
`Array[Double]` to `Vector` is straightforward:

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
val array: Array[Double] = ... // a double array
val vector: Vector = Vectors.dense(array) // a dense vector
```

`Vectors` provides factory methods to create sparse vectors.

*Note*. Scala imports `scala.collection.immutable.Vector` by default, so you have to import `org.apache.spark.mllib.linalg.Vector` explicitly to use MLlib’s `Vector`.

In Java, we used to represent a feature vector by `double[]`, which is replaced by
`Vector` in v1.0. Algorithms that used
to accept `RDD<double[]>` now take
`RDD<Vector>`. `LabeledPoint` is now a wrapper of `(double, Vector)`
instead of `(double, double[])`. Converting `double[]` to
`Vector` is straightforward:

```java
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
double[] array = ... // a double array
Vector vector = Vectors.dense(array); // a dense vector
```

`Vectors` provides factory methods to create sparse vectors.

In Python, we used to represent a labeled feature vector in a NumPy array, where the first entry corresponds to
the label and the rest are features. This representation is replaced by the class
`LabeledPoint`, which takes both
dense and sparse feature vectors.

```python
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint
# Create a labeled point with a positive label and a dense feature vector.
pos = LabeledPoint(1.0, [1.0, 0.0, 3.0])
# Create a labeled point with a negative label and a sparse feature vector.
neg = LabeledPoint(0.0, SparseVector(3, [0, 2], [1.0, 3.0]))
```