org.apache.spark.ml.feature (Spark 4.1.0 JavaDoc)

package org.apache.spark.ml.feature

Feature transformers

The `ml.feature` package provides common feature transformers that help convert raw data or features into forms more suitable for model fitting. Most feature transformers are implemented as Transformers, which transform one Dataset into another, e.g., HashingTF. Some feature transformers are implemented as Estimators, because the transformation requires some aggregated information about the dataset, e.g., document frequencies in IDF. For those feature transformers, calling

Estimator.fit(org.apache.spark.sql.Dataset<?>, org.apache.spark.ml.param.ParamPair<?>, org.apache.spark.ml.param.ParamPair<?>...)

is required to obtain the model first, e.g., IDFModel, before the transformation can be applied. The transformation is usually done by appending new columns to the input Dataset, so all input columns are carried over. We try to keep each transformer minimal, so that it is easy to assemble feature transformation pipelines. Pipeline can be used to chain feature transformers, and VectorAssembler can be used to combine multiple feature transformations, for example:

 
   import java.util.Arrays;

   import org.apache.spark.api.java.JavaRDD;
   import static org.apache.spark.sql.types.DataTypes.*;
   import org.apache.spark.sql.types.StructType;
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.RowFactory;
   import org.apache.spark.sql.Row;

   import org.apache.spark.ml.feature.*;
   import org.apache.spark.ml.Pipeline;
   import org.apache.spark.ml.PipelineStage;
   import org.apache.spark.ml.PipelineModel;

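  // Assumes an existing JavaSparkContext `jsc` and an existing SparkSession (or SQLContext) `jsql`;
  // neither is created in this snippet.
  // a DataFrame with three columns: id (integer), text (string), and rating (double).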
  StructType schema = createStructType(
    Arrays.asList(
      createStructField("id", IntegerType, false),
      createStructField("text", StringType, false),
      createStructField("rating", DoubleType, false)));
  JavaRDD<Row> rowRDD = jsc.parallelize(
    Arrays.asList(
      RowFactory.create(0, "Hi I heard about Spark", 3.0),
      RowFactory.create(1, "I wish Java could use case classes", 4.0),
      RowFactory.create(2, "Logistic regression models are neat", 4.0)));
  Dataset<Row> dataset = jsql.createDataFrame(rowRDD, schema);
  // define feature transformers
  RegexTokenizer tok = new RegexTokenizer()
    .setInputCol("text")
    .setOutputCol("words");
  StopWordsRemover sw = new StopWordsRemover()
    .setInputCol("words")
    .setOutputCol("filtered_words");
  HashingTF tf = new HashingTF()
    .setInputCol("filtered_words")
    .setOutputCol("tf")
    .setNumFeatures(10000);
  IDF idf = new IDF()
    .setInputCol("tf")
    .setOutputCol("tf_idf");
  VectorAssembler assembler = new VectorAssembler()
    .setInputCols(new String[] {"tf_idf", "rating"})
    .setOutputCol("features");

  // assemble and fit the feature transformation pipeline
  Pipeline pipeline = new Pipeline()
    .setStages(new PipelineStage[] {tok, sw, tf, idf, assembler});
  PipelineModel model = pipeline.fit(dataset);

  // save transformed features with raw data
  model.transform(dataset)
    .select("id", "text", "rating", "features")
    .write().format("parquet").save("/output/path");
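
As a rough illustration of why some feature transformers are Estimators, the following plain-Java sketch mimics the fit-then-transform flow on an in-memory list of tokenized documents. ToyIdf and ToyIdfModel are hypothetical names, not Spark classes. The smoothed formula log((m + 1) / (df + 1)) is the one documented for IDF; everything else (maps instead of vectors, no Dataset) is a simplifying assumption.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Toy, in-memory analogue of the Estimator -> Model pattern.
// These classes are hypothetical and are not part of Spark.
final class ToyIdf {

  /** "Estimator" step: one pass over all documents to collect document frequencies. */
  static ToyIdfModel fit(List<List<String>> docs) {
    Map<String, Integer> docFreq = new HashMap<>();
    for (List<String> doc : docs) {
      // Count each term at most once per document.
      for (String term : new HashSet<>(doc)) {
        docFreq.merge(term, 1, Integer::sum);
      }
    }
    return new ToyIdfModel(docs.size(), docFreq);
  }

  /** "Model" step: holds the aggregated statistics and transforms one document at a time. */
  static final class ToyIdfModel {
    private final int numDocs;
    private final Map<String, Integer> docFreq;

    ToyIdfModel(int numDocs, Map<String, Integer> docFreq) {
      this.numDocs = numDocs;
      this.docFreq = docFreq;
    }

    /** Smoothed IDF: log((m + 1) / (df + 1)). */
    double idf(String term) {
      int df = docFreq.getOrDefault(term, 0);
      return Math.log((numDocs + 1.0) / (df + 1.0));
    }

    /** Returns term -> (term count * idf) for one tokenized document. */
    Map<String, Double> transform(List<String> doc) {
      Map<String, Double> out = new HashMap<>();
      for (String term : doc) {
        // Adding idf once per occurrence yields tf * idf.
        out.merge(term, idf(term), Double::sum);
      }
      return out;
    }
  }

  public static void main(String[] args) {
    List<List<String>> docs = List.of(
        List.of("spark", "is", "neat"),
        List.of("java", "is", "verbose"));
    ToyIdfModel model = fit(docs);
    System.out.println(model.transform(docs.get(0)));
  }
}
```

The point of the split is that fit needs to see every document to compute document frequencies, whereas transform only needs the stored statistics and a single row.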

Some feature transformers implemented in MLlib are inspired by those implemented in scikit-learn. The major difference is that most scikit-learn feature transformers operate eagerly on the entire input dataset, while MLlib's feature transformers operate lazily on individual columns, which makes them more efficient and flexible when handling large and complex datasets.

See Also:

scikit-learn.preprocessing

Related Packages

Package

Description

org.apache.spark.ml

DataFrame-based machine learning APIs to let users quickly assemble and configure practical machine learning pipelines.

Class

Description

Binarizer

Binarize a column of continuous features given a threshold.

BucketedRandomProjectionLSH

This BucketedRandomProjectionLSH implements Locality Sensitive Hashing functions for Euclidean distance metrics.

BucketedRandomProjectionLSHModel

Model produced by BucketedRandomProjectionLSH, where multiple random vectors are stored.

BucketedRandomProjectionLSHModel.Data$

BucketedRandomProjectionLSHParams

Params for BucketedRandomProjectionLSH.

Bucketizer

Bucketizer maps a column of continuous features to a column of feature buckets.

ChiSqSelector

Deprecated.
Use UnivariateFeatureSelector instead.

ChiSqSelectorModel

Model fitted by ChiSqSelector.

ChiSqSelectorModel.ChiSqSelectorModelWriter

ChiSqSelectorModel.Data$

ColumnPruner

Utility transformer for removing temporary columns from a DataFrame.

ColumnPruner.Data$

CountVectorizer

Extracts a vocabulary from document collections and generates a CountVectorizerModel.

CountVectorizerModel

Converts a text document to a sparse vector of token counts.

CountVectorizerModel.Data$

CountVectorizerParams

Params for CountVectorizer and CountVectorizerModel.

DCT

A feature transformer that takes the 1D discrete cosine transform of a real vector.

Dot

ElementwiseProduct

Outputs the Hadamard product (i.e., the element-wise product) of each input vector with a provided "weight" vector.

EmptyTerm

Placeholder term for the result of undefined interactions, e.g.

FeatureHasher

Feature hashing projects a set of categorical or numerical features into a feature vector of specified dimension (typically substantially smaller than that of the original feature space).

HashingTF

Maps a sequence of terms to their term frequencies using the hashing trick.

IDF

Compute the Inverse Document Frequency (IDF) given a collection of documents.

IDFBase

Params for IDF and IDFModel.

IDFModel

Model fitted by IDF.

IDFModel.Data$

Imputer

Imputation estimator for completing missing values, using the mean, median or mode of the columns in which the missing values are located.

ImputerModel

Model fitted by Imputer.

ImputerParams

Params for Imputer and ImputerModel.

IndexToString

A Transformer that maps a column of indices back to a new column of corresponding string values.

InteractableTerm

A term that may be part of an interaction, e.g.

Interaction

Implements the feature interaction transform.

LabeledPoint

Class that represents the features and label of a data point.

LSHParams

Params for LSH.

MaxAbsScaler

Rescale each feature individually to the range [-1, 1] by dividing by the largest absolute value of that feature.

MaxAbsScalerModel

Model fitted by MaxAbsScaler.

MaxAbsScalerModel.Data$

MaxAbsScalerParams

Params for MaxAbsScaler and MaxAbsScalerModel.

MinHashLSH

LSH class for Jaccard distance.

MinHashLSHModel

Model produced by MinHashLSH, where multiple hash functions are stored.

MinHashLSHModel.Data$

MinMaxScaler

Rescale each feature individually to a common range [min, max] linearly using column summary statistics, which is also known as min-max normalization or rescaling.

MinMaxScalerModel

Model fitted by MinMaxScaler.

MinMaxScalerModel.Data$

MinMaxScalerParams

Params for MinMaxScaler and MinMaxScalerModel.

NGram

A feature transformer that converts the input array of strings into an array of n-grams.

Normalizer

Normalize a vector to have unit norm using the given p-norm.

OneHotEncoder

A one-hot encoder that maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index.

OneHotEncoderBase

Private trait for params and common methods for OneHotEncoder and OneHotEncoderModel.

OneHotEncoderCommon

Provides some helper methods used by OneHotEncoder.

OneHotEncoderModel

Model fitted by OneHotEncoder. The categorySizes parameter holds the original number of categories for each feature being encoded.

OneHotEncoderModel.Data$

PCA

PCA trains a model to project vectors to a lower dimensional space of the top k principal components, where k is set by PCA.k.

PCAModel

Model fitted by PCA.

PCAModel.Data$

PCAParams

Params for PCA and PCAModel.

PolynomialExpansion

Perform feature expansion in a polynomial space.

QuantileDiscretizer

QuantileDiscretizer takes a column with continuous features and outputs a column with binned categorical features.

QuantileDiscretizerBase

Params for QuantileDiscretizer.

RegexTokenizer

A regex based tokenizer that extracts tokens either by using the provided regex pattern to split the text (default) or repeatedly matching the regex (if gaps is false).

RFormula

Implements the transforms required for fitting a dataset against an R model formula.

RFormulaBase

Base trait for RFormula and RFormulaModel.

RFormulaModel

Model fitted by RFormula.

RFormulaParser

Limited implementation of R formula parsing.

RobustScaler

Scale features using statistics that are robust to outliers.

RobustScalerModel

Model fitted by RobustScaler.

RobustScalerModel.Data$

RobustScalerParams

Params for RobustScaler and RobustScalerModel.

SelectorParams

Params for Selector and SelectorModel.

SQLTransformer

Implements the transformations which are defined by SQL statement.

StandardScaler

Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set.

StandardScalerModel

Model fitted by StandardScaler.

StandardScalerModel.Data$

StandardScalerParams

Params for StandardScaler and StandardScalerModel.

StopWordsRemover

A feature transformer that filters out stop words from input.

StringIndexer

A label indexer that maps string column(s) of labels to ML column(s) of label indices.

StringIndexerBase

Base trait for StringIndexer and StringIndexerModel.

StringIndexerModel

Model fitted by StringIndexer.

StringIndexerModel.Data$

TargetEncoder

Target Encoding maps a column of categorical indices into a numerical feature derived from the target.

TargetEncoderBase

Private trait for params and common methods for TargetEncoder and TargetEncoderModel.

TargetEncoderModel

Model fitted by TargetEncoder. The stats parameter is an array of statistics for each input feature.

TargetEncoderModel.Data$

Term

R formula terms.

Tokenizer

A tokenizer that converts the input string to lowercase and then splits it by white spaces.

UnivariateFeatureSelector

Feature selector based on univariate statistical tests against labels.

UnivariateFeatureSelectorModel

Model fitted by UnivariateFeatureSelector.

UnivariateFeatureSelectorModel.Data$

UnivariateFeatureSelectorParams

Params for UnivariateFeatureSelector and UnivariateFeatureSelectorModel.

VarianceThresholdSelector

Feature selector that removes all low-variance features.

VarianceThresholdSelectorModel

Model fitted by VarianceThresholdSelector.

VarianceThresholdSelectorModel.Data$

VarianceThresholdSelectorParams

Params for VarianceThresholdSelector and VarianceThresholdSelectorModel.

VectorAssembler

A feature transformer that merges multiple columns into a vector column.

VectorAttributeRewriter

Utility transformer that rewrites Vector attribute names via prefix replacement.

VectorAttributeRewriter.Data$

VectorIndexer

Class for indexing categorical feature columns in a dataset of Vector.

VectorIndexerModel

Model fitted by VectorIndexer.

VectorIndexerModel.Data$

VectorIndexerParams

Private trait for params for VectorIndexer and VectorIndexerModel.

VectorSizeHint

A feature transformer that adds size information to the metadata of a vector column.

VectorSlicer

This class takes a feature vector and outputs a new feature vector with a subarray of the original features.

Word2Vec

Word2Vec trains a model of Map(String, Vector), i.e.

Word2VecBase

Params for Word2Vec and Word2VecModel.

Word2VecModel

Model fitted by Word2Vec.

Word2VecModel.Data$

Word2VecModel.Word2VecModelWriter$
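
To make two of the one-line descriptions above more concrete, here is a minimal plain-Java sketch of the hashing trick (as described for HashingTF) and of thresholding (as described for Binarizer). ToyFeatureOps is a hypothetical helper, not a Spark class. It uses String.hashCode and dense arrays purely for brevity; Spark's HashingTF uses MurmurHash3 and produces sparse vectors, so bucket indices will differ from Spark's.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; hypothetical helper, not Spark code.
final class ToyFeatureOps {

  /**
   * Hashing trick: each term is mapped to a bucket index by hashing, and that
   * bucket's count is incremented. String.hashCode is an assumption of this
   * sketch, chosen for brevity.
   */
  static double[] hashingTf(List<String> terms, int numFeatures) {
    double[] counts = new double[numFeatures];
    for (String term : terms) {
      // floorMod keeps the index non-negative even for negative hash codes.
      int index = Math.floorMod(term.hashCode(), numFeatures);
      counts[index] += 1.0;
    }
    return counts;
  }

  /** Thresholding: values strictly greater than the threshold become 1.0, all others 0.0. */
  static double[] binarize(double[] values, double threshold) {
    double[] out = new double[values.length];
    for (int i = 0; i < values.length; i++) {
      out[i] = values[i] > threshold ? 1.0 : 0.0;
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(hashingTf(List.of("a", "b", "a"), 16)));
    System.out.println(Arrays.toString(binarize(new double[] {0.1, 0.5, 0.9}, 0.5)));
  }
}
```

Because distinct terms can hash to the same bucket, a small numFeatures trades memory for collisions; the pipeline example earlier on this page uses 10000 buckets.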
