Class  Description 

Binarizer 
Binarize a column of continuous features given a threshold.

BucketedRandomProjectionLSH 
This BucketedRandomProjectionLSH implements Locality Sensitive Hashing functions for
Euclidean distance metrics.
BucketedRandomProjectionLSHModel 
Model produced by BucketedRandomProjectionLSH, where multiple random vectors are stored.
Bucketizer 
Bucketizer maps a column of continuous features to a column of feature buckets. 
ChiSqSelector
Deprecated. Use UnivariateFeatureSelector instead.

ChiSqSelectorModel 
Model fitted by ChiSqSelector.
ChiSqSelectorModel.ChiSqSelectorModelWriter  
ColumnPruner 
Utility transformer for removing temporary columns from a DataFrame.

CountVectorizer 
Extracts a vocabulary from document collections and generates a CountVectorizerModel.
CountVectorizerModel 
Converts a text document to a sparse vector of token counts.

DCT 
A feature transformer that takes the 1D discrete cosine transform of a real vector.

Dot  
ElementwiseProduct 
Outputs the Hadamard product (i.e., the elementwise product) of each input vector with a
provided "weight" vector.

EmptyTerm 
Placeholder term for the result of undefined interactions, e.g.

FeatureHasher 
Feature hashing projects a set of categorical or numerical features into a feature vector of
specified dimension (typically substantially smaller than that of the original feature
space).

HashingTF 
Maps a sequence of terms to their term frequencies using the hashing trick.

IDF 
Compute the Inverse Document Frequency (IDF) given a collection of documents.

IDFModel 
Model fitted by IDF.
Imputer 
Imputation estimator for completing missing values, using the mean, median or mode
of the columns in which the missing values are located.

ImputerModel 
Model fitted by Imputer.
IndexToString 
A Transformer that maps a column of indices back to a new column of corresponding
string values.
Interaction 
Implements the feature interaction transform.

LabeledPoint 
Class that represents the features and label of a data point.

MaxAbsScaler 
Rescale each feature individually to the range [-1, 1] by dividing by the largest
absolute value in each feature.

MaxAbsScalerModel
Model fitted by MaxAbsScaler.
MinHashLSH 
LSH class for Jaccard distance.

MinHashLSHModel 
Model produced by MinHashLSH, where multiple hash functions are stored.
MinMaxScaler
Rescale each feature individually to a common range [min, max] linearly using column summary
statistics, which is also known as min-max normalization or rescaling.

MinMaxScalerModel
Model fitted by MinMaxScaler.
NGram
A feature transformer that converts the input array of strings into an array of n-grams.

Normalizer
Normalize a vector to have unit norm using the given p-norm.

OneHotEncoder
A one-hot encoder that maps a column of category indices to a column of binary vectors, with
at most a single one-value per row that indicates the input category index.

OneHotEncoderCommon
Provides some helper methods used by OneHotEncoder.
OneHotEncoderModel 
param: categorySizes Original number of categories for each feature being encoded.

PCA 
PCA trains a model to project vectors to a lower dimensional space of the top k
principal components.
PCAModel
Model fitted by PCA.
PolynomialExpansion 
Perform feature expansion in a polynomial space.

QuantileDiscretizer 
QuantileDiscretizer takes a column with continuous features and outputs a column with binned
categorical features. 
RegexTokenizer 
A regex-based tokenizer that extracts tokens either by using the provided regex pattern to split
the text (default) or by repeatedly matching the regex (if gaps is false).
RFormula
Implements the transforms required for fitting a dataset against an R model formula.

RFormulaModel
Model fitted by RFormula.
RFormulaParser 
Limited implementation of R formula parsing.

RobustScaler 
Scale features using statistics that are robust to outliers.

RobustScalerModel 
Model fitted by RobustScaler.
SQLTransformer
Implements the transformations that are defined by a SQL statement.

StandardScaler 
Standardizes features by removing the mean and scaling to unit variance using column summary
statistics on the samples in the training set.

StandardScalerModel 
Model fitted by StandardScaler.
StopWordsRemover 
A feature transformer that filters out stop words from input.

StringIndexer 
A label indexer that maps string column(s) of labels to ML column(s) of label indices.

StringIndexerAggregator 
A SQL Aggregator used by StringIndexer to count labels in string columns during fitting.
StringIndexerModel
Model fitted by StringIndexer.
Tokenizer 
A tokenizer that converts the input string to lowercase and then splits it by white spaces.

UnivariateFeatureSelector 
Feature selector based on univariate statistical tests against labels.

UnivariateFeatureSelectorModel 
Model fitted by UnivariateFeatureSelector.
VarianceThresholdSelector
Feature selector that removes all low-variance features.

VarianceThresholdSelectorModel
Model fitted by VarianceThresholdSelector.
VectorAssembler 
A feature transformer that merges multiple columns into a vector column.

VectorAttributeRewriter 
Utility transformer that rewrites Vector attribute names via prefix replacement.

VectorIndexer 
Class for indexing categorical feature columns in a dataset of Vector.
VectorIndexerModel
Model fitted by VectorIndexer.
VectorSizeHint 
A feature transformer that adds size information to the metadata of a vector column.

VectorSlicer 
This class takes a feature vector and outputs a new feature vector with a subarray of the
original features.

Word2Vec 
Word2Vec trains a model of
Map(String, Vector) , i.e. 
Word2VecModel 
Model fitted by Word2Vec.
Word2VecModel.Data$  
Word2VecModel.Word2VecModelWriter$ 
Most feature transformers are implemented as Transformers, which transform one Dataset
into another, e.g., HashingTF.
Some feature transformers are implemented as Estimators, because the
transformation requires some aggregated information of the dataset, e.g., document
frequencies in IDF.
For those feature transformers, calling Estimator.fit is required to obtain the model first,
e.g., IDFModel, in order to apply the transformation.
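The fit-then-transform split described above can be illustrated without Spark. The following is a minimal plain-Java sketch, not Spark's implementation: the class name IdfSketch and its methods are hypothetical, and the smoothed weight log((m + 1) / (df + 1)) is an assumption of this sketch. The point is that fit needs a pass over every document to collect document frequencies, while transform works row by row once those weights exist.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical plain-Java illustration of the Estimator/Model split; not Spark code. */
final class IdfSketch {

    /** "Fit": one pass over all documents to derive a weight per term index. */
    static double[] fit(List<double[]> termFrequencies, int numFeatures) {
        double[] docFreq = new double[numFeatures];
        for (double[] tf : termFrequencies) {
            for (int i = 0; i < numFeatures; i++) {
                if (tf[i] > 0.0) {
                    docFreq[i] += 1.0; // the term at index i occurs in this document
                }
            }
        }
        double m = termFrequencies.size();
        double[] idf = new double[numFeatures];
        for (int i = 0; i < numFeatures; i++) {
            // Smoothed inverse document frequency (formula assumed for this sketch).
            idf[i] = Math.log((m + 1.0) / (docFreq[i] + 1.0));
        }
        return idf;
    }

    /** "Transform": a per-row operation that only needs the fitted weights. */
    static double[] transform(double[] tf, double[] idf) {
        double[] out = new double[tf.length];
        for (int i = 0; i < tf.length; i++) {
            out[i] = tf[i] * idf[i];
        }
        return out;
    }

    public static void main(String[] args) {
        List<double[]> docs = Arrays.asList(
            new double[] {1.0, 0.0, 2.0},
            new double[] {1.0, 1.0, 0.0},
            new double[] {1.0, 0.0, 0.0});
        double[] idf = fit(docs, 3);                       // aggregated over the dataset
        double[] scaled = transform(docs.get(0), idf);     // applied to a single row
        System.out.println(Arrays.toString(scaled));
    }
}
```

In this sketch a term that occurs in every document gets weight zero, which is why the aggregate statistics must be computed before any single row can be transformed.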
The transformation is usually done by appending new columns to the input Dataset, so all
input columns are carried over.
We try to make each transformer minimal, so it becomes flexible to assemble feature
transformation pipelines.
Pipeline can be used to chain feature transformers, and VectorAssembler can be used to
combine multiple feature transformations, for example:
  import java.util.Arrays;

  import org.apache.spark.api.java.JavaRDD;
  import static org.apache.spark.sql.types.DataTypes.*;
  import org.apache.spark.sql.types.StructType;
  import org.apache.spark.sql.Dataset;
  import org.apache.spark.sql.RowFactory;
  import org.apache.spark.sql.Row;

  import org.apache.spark.ml.feature.*;
  import org.apache.spark.ml.Pipeline;
  import org.apache.spark.ml.PipelineStage;
  import org.apache.spark.ml.PipelineModel;

  // jsc and jsql are not defined in this snippet; they are assumed to be an existing
  // JavaSparkContext and an existing SQL entry point (e.g. a SparkSession).

  // a DataFrame with three columns: id (integer), text (string), and rating (double).
  StructType schema = createStructType(
    Arrays.asList(
      createStructField("id", IntegerType, false),
      createStructField("text", StringType, false),
      createStructField("rating", DoubleType, false)));
  JavaRDD<Row> rowRDD = jsc.parallelize(
    Arrays.asList(
      RowFactory.create(0, "Hi I heard about Spark", 3.0),
      RowFactory.create(1, "I wish Java could use case classes", 4.0),
      RowFactory.create(2, "Logistic regression models are neat", 4.0)));
  Dataset<Row> dataset = jsql.createDataFrame(rowRDD, schema);

  // define feature transformers
  RegexTokenizer tok = new RegexTokenizer()
    .setInputCol("text")
    .setOutputCol("words");
  StopWordsRemover sw = new StopWordsRemover()
    .setInputCol("words")
    .setOutputCol("filtered_words");
  HashingTF tf = new HashingTF()
    .setInputCol("filtered_words")
    .setOutputCol("tf")
    .setNumFeatures(10000);
  IDF idf = new IDF()
    .setInputCol("tf")
    .setOutputCol("tf_idf");
  VectorAssembler assembler = new VectorAssembler()
    .setInputCols(new String[] {"tf_idf", "rating"})
    .setOutputCol("features");

  // assemble and fit the feature transformation pipeline
  Pipeline pipeline = new Pipeline()
    .setStages(new PipelineStage[] {tok, sw, tf, idf, assembler});
  PipelineModel model = pipeline.fit(dataset);

  // save transformed features with raw data
  model.transform(dataset)
    .select("id", "text", "rating", "features")
    .write().format("parquet").save("/output/path");
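The HashingTF stage in the pipeline above relies on the hashing trick: each term is mapped to a fixed-size index by hashing, so no vocabulary has to be built first. The following plain-Java sketch shows the idea only. The class HashingTrickSketch is hypothetical, and it uses String.hashCode as a stand-in hash function, which is an assumption of this sketch and not what Spark's HashingTF uses internally.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical plain-Java illustration of the hashing trick; not Spark code. */
final class HashingTrickSketch {

    /** Counts terms into a fixed-size vector, using a hash of each term as its index. */
    static double[] termFrequencies(List<String> terms, int numFeatures) {
        double[] counts = new double[numFeatures];
        for (String term : terms) {
            // floorMod keeps the index non-negative even when hashCode() is negative.
            int index = Math.floorMod(term.hashCode(), numFeatures);
            counts[index] += 1.0; // distinct terms may collide on the same index
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("logistic", "regression", "models", "models");
        double[] tf = termFrequencies(words, 16);
        System.out.println(Arrays.toString(tf));
    }
}
```

A larger numFeatures lowers the chance that two different terms share an index, at the cost of a wider (though typically sparse) vector, which is the trade-off behind a setting such as setNumFeatures(10000).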
Some feature transformers implemented in MLlib are inspired by those implemented in scikit-learn.
The major difference is that most scikit-learn feature transformers operate eagerly on the entire
input dataset, while MLlib's feature transformers operate lazily on individual columns,
which is more efficient and flexible for handling large and complex datasets.