pyspark.ml package

ML Pipeline APIs

DataFrame-based machine learning APIs that let users quickly assemble and configure practical machine learning pipelines.

class pyspark.ml.Transformer[source]

Abstract class for transformers that transform one dataset into another.

New in version 1.3.0.

clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map
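The merge order above (default param values < user-supplied values < extra) can be illustrated with ordinary dictionaries. This is a plain-Python sketch, not pyspark code: the helper name merge_param_maps and the string keys are assumptions made for illustration, whereas real param maps are keyed by Param objects.

```python
# Plain-Python sketch of the merge order; merge_param_maps is a hypothetical helper.

def merge_param_maps(defaults, user_supplied, extra=None):
    """Merge three param maps; later maps win when keys conflict."""
    merged = dict(defaults)        # lowest priority: default values
    merged.update(user_supplied)   # user-supplied values override defaults
    merged.update(extra or {})     # extra values override everything else
    return merged


defaults = {"threshold": 0.0, "outputCol": "out"}
user_supplied = {"threshold": 1.0}
extra = {"outputCol": "vector"}

merged = merge_param_maps(defaults, user_supplied, extra)
print(merged)  # {'threshold': 1.0, 'outputCol': 'vector'}
```

Neither input map is modified, which matches the idea of producing a new, flat param map.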

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getParam(paramName)

Gets a param by its name.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.
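The difference between isSet, hasDefault, isDefined and getOrDefault is easier to see with a small model. The class below is a hypothetical plain-Python stand-in, not the pyspark implementation; its snake_case method names and string keys are assumptions chosen for illustration.

```python
class TinyParams:
    """Hypothetical stand-in for a component with params; not the pyspark class."""

    def __init__(self, defaults):
        self._defaults = dict(defaults)  # default param values
        self._user = {}                  # values explicitly set by the user

    def set(self, name, value):
        self._user[name] = value

    def clear(self, name):
        # Only removes an explicitly set value; defaults are untouched.
        self._user.pop(name, None)

    def is_set(self, name):
        return name in self._user

    def has_default(self, name):
        return name in self._defaults

    def is_defined(self, name):
        return self.is_set(name) or self.has_default(name)

    def get_or_default(self, name):
        if name in self._user:
            return self._user[name]
        if name in self._defaults:
            return self._defaults[name]
        raise KeyError(f"param {name!r} is neither set nor has a default")


params = TinyParams(defaults={"threshold": 0.0})
params.set("threshold", 1.0)
print(params.is_set("threshold"), params.get_or_default("threshold"))  # True 1.0
params.clear("threshold")
print(params.is_set("threshold"), params.get_or_default("threshold"))  # False 0.0
print(params.is_defined("inputCol"))  # False
```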

property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

set(param, value)

Sets a parameter in the embedded param map.

transform(dataset, params=None)[source]

Transforms the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params.

Returns

transformed dataset

New in version 1.3.0.

class pyspark.ml.UnaryTransformer[source]

Abstract class for transformers that take one input column, apply transformation, and output the result as a new column.

New in version 2.3.0.

clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

abstract createTransformFunc()[source]

Creates the transform function using the given param map. The input param map already takes the embedded param map into account, so the param values should be determined solely by the input param map.
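The idea of a unary transformer (one input column, one function, one new output column) can be sketched without Spark. The example below is a plain-Python illustration under stated assumptions: rows are modelled as dictionaries, and make_transform is a hypothetical helper, not part of pyspark.

```python
# Plain-Python sketch; rows are dicts and make_transform is a hypothetical helper.

def make_transform(func, input_col, output_col):
    """Return a function that adds output_col = func(row[input_col]) to each row."""
    def transform(rows):
        # Build new rows so the input rows are left unchanged.
        return [{**row, output_col: func(row[input_col])} for row in rows]
    return transform


rows = [{"text": "a b"}, {"text": "c"}]
tokenize = make_transform(str.split, input_col="text", output_col="tokens")
result = tokenize(rows)
print(result)
# [{'text': 'a b', 'tokens': ['a', 'b']}, {'text': 'c', 'tokens': ['c']}]
```

In this sketch the per-value function plays the role of the transform function, while the column names play the role of inputCol and outputCol.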

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map

getInputCol()

Gets the value of inputCol or its default value.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets the value of outputCol or its default value.

getParam(paramName)

Gets a param by its name.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

inputCol = Param(parent='undefined', name='inputCol', doc='input column name.')

isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')

abstract outputDataType()[source]

Returns the data type of the output column.

property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

set(param, value)

Sets a parameter in the embedded param map.

setInputCol(value)[source]

Sets the value of inputCol.

setOutputCol(value)[source]

Sets the value of outputCol.

transform(dataset, params=None)

Transforms the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params.

Returns

transformed dataset

New in version 1.3.0.

transformSchema(schema)[source]

abstract validateInputType(inputType)[source]

Validates the input type. Throws an exception if it is invalid.

class pyspark.ml.Estimator[source]

Abstract class for estimators that fit models to data.

New in version 1.3.0.

clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map

fit(dataset, params=None)[source]

Fits a model to the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.

Returns

fitted model(s)

New in version 1.3.0.

fitMultiple(dataset, paramMaps)[source]

Fits a model to the input dataset for each param map in paramMaps.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame.

  • paramMaps – A Sequence of param maps.

Returns

A thread-safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model), where model was fit using paramMaps[index]. The index values may not be sequential.

New in version 2.3.0.
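A consumer of fitMultiple should place each model by the index it arrives with, because results are not guaranteed to come back in order. The sketch below is a plain-Python illustration of that contract, not the pyspark implementation: ModelIterator and fit_one are hypothetical names, and the "model" is just a tuple standing in for real training.

```python
import threading


class ModelIterator:
    """Hypothetical thread-safe iterator yielding (index, model) pairs."""

    def __init__(self, fit_one, param_maps):
        self._fit_one = fit_one
        self._param_maps = list(param_maps)
        self._next_index = 0
        self._lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        # Only the index hand-out is locked, so several threads can fit at once.
        with self._lock:
            index = self._next_index
            if index >= len(self._param_maps):
                raise StopIteration
            self._next_index += 1
        return index, self._fit_one(self._param_maps[index])


def fit_one(param_map):
    # Stand-in for real training: the "model" is just a labelled tuple.
    return ("model", param_map["regParam"])


param_maps = [{"regParam": 0.0}, {"regParam": 0.1}, {"regParam": 1.0}]
models = [None] * len(param_maps)
for index, model in ModelIterator(fit_one, param_maps):
    models[index] = model  # place each model by its index, not by arrival order
print(models)
```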

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getParam(paramName)

Gets a param by its name.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

set(param, value)

Sets a parameter in the embedded param map.

class pyspark.ml.Model[source]

Abstract class for models that are fitted by estimators.

New in version 1.4.0.

clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getParam(paramName)

Gets a param by its name.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

set(param, value)

Sets a parameter in the embedded param map.

transform(dataset, params=None)

Transforms the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params.

Returns

transformed dataset

New in version 1.3.0.

class pyspark.ml.Pipeline(stages=None)[source]

A simple pipeline, which acts as an estimator. A Pipeline consists of a sequence of stages, each of which is either an Estimator or a Transformer. When Pipeline.fit() is called, the stages are executed in order. If a stage is an Estimator, its Estimator.fit() method will be called on the input dataset to fit a model. Then the model, which is a transformer, will be used to transform the dataset as the input to the next stage. If a stage is a Transformer, its Transformer.transform() method will be called to produce the dataset for the next stage. The fitted model from a Pipeline is a PipelineModel, which consists of fitted models and transformers, corresponding to the pipeline stages. If stages is an empty list, the pipeline acts as an identity transformer.
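The fitting procedure described above can be sketched in plain Python. This is a simplified illustration, not the pyspark implementation: datasets are lists of floats, the stage classes (Scale, CenterEstimator, CenterModel) and the two helper functions are hypothetical, and a stage is treated as an estimator simply when it has a fit method.

```python
class Scale:
    """Hypothetical transformer: multiplies every value by a factor."""
    def __init__(self, factor):
        self.factor = factor

    def transform(self, data):
        return [x * self.factor for x in data]


class CenterModel:
    """Hypothetical fitted model: subtracts a learned mean."""
    def __init__(self, mean):
        self.mean = mean

    def transform(self, data):
        return [x - self.mean for x in data]


class CenterEstimator:
    """Hypothetical estimator: learns the mean of its input."""
    def fit(self, data):
        return CenterModel(sum(data) / len(data))


def fit_pipeline(stages, data):
    """Run stages in order, returning the fitted models and transformers."""
    fitted = []
    for stage in stages:
        if hasattr(stage, "fit"):
            # Estimator: fit on the current data; the model becomes the stage.
            stage = stage.fit(data)
        fitted.append(stage)
        # The transformed data is the input to the next stage.
        data = stage.transform(data)
    return fitted


def transform_pipeline(fitted, data):
    for stage in fitted:
        data = stage.transform(data)
    return data


data = [1.0, 2.0, 3.0]
fitted = fit_pipeline([Scale(2.0), CenterEstimator()], data)
print(transform_pipeline(fitted, data))  # [-2.0, 0.0, 2.0]
print(transform_pipeline(fit_pipeline([], data), data))  # [1.0, 2.0, 3.0]
```

The second call shows the empty-stages case, where the fitted pipeline returns its input unchanged.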

New in version 1.3.0.

clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)[source]

Creates a copy of this instance.

Parameters

extra – extra parameters

Returns

new instance

New in version 1.4.0.

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map

fit(dataset, params=None)

Fits a model to the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.

Returns

fitted model(s)

New in version 1.3.0.

fitMultiple(dataset, paramMaps)

Fits a model to the input dataset for each param map in paramMaps.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame.

  • paramMaps – A Sequence of param maps.

Returns

A thread-safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model), where model was fit using paramMaps[index]. The index values may not be sequential.

New in version 2.3.0.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getParam(paramName)

Gets a param by its name.

getStages()[source]

Gets the pipeline stages.

New in version 1.3.0.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()[source]

Returns an MLReader instance for this class.

New in version 2.0.0.

save(path)

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param, value)

Sets a parameter in the embedded param map.

setParams(self, stages=None)[source]

Sets params for Pipeline.

New in version 1.3.0.

setStages(value)[source]

Sets the pipeline stages.

Parameters

value – a list of transformers or estimators

Returns

the pipeline instance

New in version 1.3.0.

stages = Param(parent='undefined', name='stages', doc='a list of pipeline stages')

write()[source]

Returns an MLWriter instance for this ML instance.

New in version 2.0.0.

class pyspark.ml.PipelineModel(stages)[source]

Represents a compiled pipeline with transformers and fitted models.

New in version 1.3.0.

clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)[source]

Creates a copy of this instance.

Parameters

extra – extra parameters

Returns

new instance

New in version 1.4.0.

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getParam(paramName)

Gets a param by its name.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()[source]

Returns an MLReader instance for this class.

New in version 2.0.0.

save(path)

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param, value)

Sets a parameter in the embedded param map.

transform(dataset, params=None)

Transforms the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params.

Returns

transformed dataset

New in version 1.3.0.

write()[source]

Returns an MLWriter instance for this ML instance.

New in version 2.0.0.

pyspark.ml.param module

class pyspark.ml.param.Param(parent, name, doc, typeConverter=None)[source]

A param with self-contained documentation.

New in version 1.3.0.

class pyspark.ml.param.Params[source]

Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.

New in version 1.3.0.

clear(param)[source]

Clears a param from the param map if it has been explicitly set.

copy(extra=None)[source]

Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

explainParam(param)[source]

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()[source]

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)[source]

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map

getOrDefault(param)[source]

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getParam(paramName)[source]

Gets a param by its name.

hasDefault(param)[source]

Checks whether a param has a default value.

hasParam(paramName)[source]

Tests whether this instance contains a param with a given (string) name.

isDefined(param)[source]

Checks whether a param is explicitly set by user or has a default value.

isSet(param)[source]

Checks whether a param is explicitly set by user.

property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

set(param, value)[source]

Sets a parameter in the embedded param map.

class pyspark.ml.param.TypeConverters[source]

Factory methods for common type conversion functions for Param.typeConverter.

New in version 2.0.0.

static identity(value)[source]

Dummy converter that just returns value.

static toBoolean(value)[source]

Convert a value to a boolean, if possible.

static toFloat(value)[source]

Convert a value to a float, if possible.

static toInt(value)[source]

Convert a value to an int, if possible.

static toList(value)[source]

Convert a value to a list, if possible.

static toListFloat(value)[source]

Convert a value to list of floats, if possible.

static toListInt(value)[source]

Convert a value to list of ints, if possible.

static toListListFloat(value)[source]

Convert a value to list of list of floats, if possible.

static toListString(value)[source]

Convert a value to list of strings, if possible.

static toMatrix(value)[source]

Convert a value to a MLlib Matrix, if possible.

static toString(value)[source]

Convert a value to a string, if possible.

static toVector(value)[source]

Convert a value to a MLlib Vector, if possible.
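A type converter is a function that either returns a value of the expected type or raises an error. The sketch below shows the general shape of an integer converter in plain Python. It is a hypothetical illustration, not the source of TypeConverters.toInt: the function name to_int and the exact set of accepted inputs are assumptions, and the real converter may accept or reject different values.

```python
# Hypothetical converter sketch; the accepted inputs are an assumption.

def to_int(value):
    """Return value as an int if it represents a whole number, else raise TypeError."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python; this sketch rejects it explicitly.
        raise TypeError(f"could not convert {value!r} to int")
    if isinstance(value, int):
        return value
    if isinstance(value, float) and value.is_integer():
        return int(value)
    raise TypeError(f"could not convert {value!r} to int")


print(to_int(3))    # 3
print(to_int(3.0))  # 3
```

Values such as 3.5 or "3" raise TypeError in this sketch rather than being rounded or parsed.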

pyspark.ml.feature module

class pyspark.ml.feature.Binarizer(threshold=0.0, inputCol=None, outputCol=None, thresholds=None, inputCols=None, outputCols=None)[source]

Binarizes a column of continuous features given a threshold. Since 3.0.0, Binarizer can map multiple columns at once by setting the inputCols parameter. Note that when both the inputCol and inputCols parameters are set, an exception will be thrown. The threshold parameter is used for single-column usage, and thresholds is used for multiple columns.

>>> df = spark.createDataFrame([(0.5,)], ["values"])
>>> binarizer = Binarizer(threshold=1.0, inputCol="values", outputCol="features")
>>> binarizer.setThreshold(1.0)
Binarizer...
>>> binarizer.setInputCol("values")
Binarizer...
>>> binarizer.setOutputCol("features")
Binarizer...
>>> binarizer.transform(df).head().features
0.0
>>> binarizer.setParams(outputCol="freqs").transform(df).head().freqs
0.0
>>> params = {binarizer.threshold: -0.5, binarizer.outputCol: "vector"}
>>> binarizer.transform(df, params).head().vector
1.0
>>> binarizerPath = temp_path + "/binarizer"
>>> binarizer.save(binarizerPath)
>>> loadedBinarizer = Binarizer.load(binarizerPath)
>>> loadedBinarizer.getThreshold() == binarizer.getThreshold()
True
>>> loadedBinarizer.transform(df).take(1) == binarizer.transform(df).take(1)
True
>>> df2 = spark.createDataFrame([(0.5, 0.3)], ["values1", "values2"])
>>> binarizer2 = Binarizer(thresholds=[0.0, 1.0])
>>> binarizer2.setInputCols(["values1", "values2"]).setOutputCols(["output1", "output2"])
Binarizer...
>>> binarizer2.transform(df2).show()
+-------+-------+-------+-------+
|values1|values2|output1|output2|
+-------+-------+-------+-------+
|    0.5|    0.3|    1.0|    0.0|
+-------+-------+-------+-------+
...

New in version 1.4.0.
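The thresholding rule itself (values greater than the threshold become 1.0, values equal to or less than it become 0.0) can be checked without Spark. This is a plain-Python sketch of that rule only; binarize and binarize_columns are hypothetical helper names, not part of pyspark.

```python
# Plain-Python sketch of the thresholding rule; both helpers are hypothetical.

def binarize(value, threshold=0.0):
    """Return 1.0 if value is strictly greater than threshold, else 0.0."""
    return 1.0 if value > threshold else 0.0


def binarize_columns(values, thresholds):
    """Apply one threshold per column, mirroring the multi-column usage."""
    return [binarize(v, t) for v, t in zip(values, thresholds)]


print(binarize(0.5, threshold=1.0))    # 0.0
print(binarize(0.5, threshold=-0.5))   # 1.0
print(binarize(1.0, threshold=1.0))    # 0.0, equality maps to 0.0
print(binarize_columns([0.5, 0.3], [0.0, 1.0]))  # [1.0, 0.0]
```

The inputs reuse the values from the doctest, so the printed results line up with the outputs shown there.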

clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map

getInputCol()

Gets the value of inputCol or its default value.

getInputCols()

Gets the value of inputCols or its default value.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets the value of outputCol or its default value.

getOutputCols()

Gets the value of outputCols or its default value.

getParam(paramName)

Gets a param by its name.

getThreshold()

Gets the value of threshold or its default value.

getThresholds()

Gets the value of thresholds or its default value.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

inputCol = Param(parent='undefined', name='inputCol', doc='input column name.')

inputCols = Param(parent='undefined', name='inputCols', doc='input column names.')

isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')

outputCols = Param(parent='undefined', name='outputCols', doc='output column names.')

property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

save(path)

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param, value)

Sets a parameter in the embedded param map.

setInputCol(value)[source]

Sets the value of inputCol.

setInputCols(value)[source]

Sets the value of inputCols.

New in version 3.0.0.

setOutputCol(value)[source]

Sets the value of outputCol.

setOutputCols(value)[source]

Sets the value of outputCols.

New in version 3.0.0.

setParams(self, threshold=0.0, inputCol=None, outputCol=None, thresholds=None, inputCols=None, outputCols=None)[source]

Sets params for this Binarizer.

New in version 1.4.0.

setThreshold(value)[source]

Sets the value of threshold.

New in version 1.4.0.

setThresholds(value)[source]

Sets the value of thresholds.

New in version 3.0.0.

threshold = Param(parent='undefined', name='threshold', doc='Param for threshold used to binarize continuous features. The features greater than the threshold will be binarized to 1.0. The features equal to or less than the threshold will be binarized to 0.0')

thresholds = Param(parent='undefined', name='thresholds', doc='Param for array of threshold used to binarize continuous features. This is for multiple columns input. If transforming multiple columns and thresholds is not set, but threshold is set, then threshold will be applied across all columns.')

transform(dataset, params=None)

Transforms the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params.

Returns

transformed dataset

New in version 1.3.0.

write()

Returns an MLWriter instance for this ML instance.

class pyspark.ml.feature.BucketedRandomProjectionLSH(inputCol=None, outputCol=None, seed=None, numHashTables=1, bucketLength=None)[source]

LSH class for Euclidean distance metrics. The input is dense or sparse vectors, each of which represents a point in the Euclidean distance space. The output will be vectors of configurable dimension. Hash values in the same dimension are calculated by the same hash function.

>>> from pyspark.ml.linalg import Vectors
>>> from pyspark.sql.functions import col
>>> data = [(0, Vectors.dense([-1.0, -1.0 ]),),
...         (1, Vectors.dense([-1.0, 1.0 ]),),
...         (2, Vectors.dense([1.0, -1.0 ]),),
...         (3, Vectors.dense([1.0, 1.0]),)]
>>> df = spark.createDataFrame(data, ["id", "features"])
>>> brp = BucketedRandomProjectionLSH()
>>> brp.setInputCol("features")
BucketedRandomProjectionLSH...
>>> brp.setOutputCol("hashes")
BucketedRandomProjectionLSH...
>>> brp.setSeed(12345)
BucketedRandomProjectionLSH...
>>> brp.setBucketLength(1.0)
BucketedRandomProjectionLSH...
>>> model = brp.fit(df)
>>> model.getBucketLength()
1.0
>>> model.setOutputCol("hashes")
BucketedRandomProjectionLSHModel...
>>> model.transform(df).head()
Row(id=0, features=DenseVector([-1.0, -1.0]), hashes=[DenseVector([-1.0])])
>>> data2 = [(4, Vectors.dense([2.0, 2.0 ]),),
...          (5, Vectors.dense([2.0, 3.0 ]),),
...          (6, Vectors.dense([3.0, 2.0 ]),),
...          (7, Vectors.dense([3.0, 3.0]),)]
>>> df2 = spark.createDataFrame(data2, ["id", "features"])
>>> model.approxNearestNeighbors(df2, Vectors.dense([1.0, 2.0]), 1).collect()
[Row(id=4, features=DenseVector([2.0, 2.0]), hashes=[DenseVector([1.0])], distCol=1.0)]
>>> model.approxSimilarityJoin(df, df2, 3.0, distCol="EuclideanDistance").select(
...     col("datasetA.id").alias("idA"),
...     col("datasetB.id").alias("idB"),
...     col("EuclideanDistance")).show()
+---+---+-----------------+
|idA|idB|EuclideanDistance|
+---+---+-----------------+
|  3|  6| 2.23606797749979|
+---+---+-----------------+
...
>>> model.approxSimilarityJoin(df, df2, 3, distCol="EuclideanDistance").select(
...     col("datasetA.id").alias("idA"),
...     col("datasetB.id").alias("idB"),
...     col("EuclideanDistance")).show()
+---+---+-----------------+
|idA|idB|EuclideanDistance|
+---+---+-----------------+
|  3|  6| 2.23606797749979|
+---+---+-----------------+
...
>>> brpPath = temp_path + "/brp"
>>> brp.save(brpPath)
>>> brp2 = BucketedRandomProjectionLSH.load(brpPath)
>>> brp2.getBucketLength() == brp.getBucketLength()
True
>>> modelPath = temp_path + "/brp-model"
>>> model.save(modelPath)
>>> model2 = BucketedRandomProjectionLSHModel.load(modelPath)
>>> model.transform(df).head().hashes == model2.transform(df).head().hashes
True

New in version 2.2.0.

bucketLength = Param(parent='undefined', name='bucketLength', doc='the length of each hash bucket, a larger bucket lowers the false negative rate.')

clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map

fit(dataset, params=None)

Fits a model to the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.

Returns

fitted model(s)

New in version 1.3.0.

fitMultiple(dataset, paramMaps)

Fits a model to the input dataset for each param map in paramMaps.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame.

  • paramMaps – A Sequence of param maps.

Returns

A thread-safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model), where model was fit using paramMaps[index]. The index values may not be sequential.

New in version 2.3.0.

getBucketLength()

Gets the value of bucketLength or its default value.

New in version 2.2.0.

getInputCol()

Gets the value of inputCol or its default value.

getNumHashTables()

Gets the value of numHashTables or its default value.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets the value of outputCol or its default value.

getParam(paramName)

Gets a param by its name.

getSeed()

Gets the value of seed or its default value.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

inputCol = Param(parent='undefined', name='inputCol', doc='input column name.')
isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.
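
The difference between isSet, hasDefault, and isDefined can be illustrated with a toy class. This is a sketch of the semantics described above, not the pyspark.ml.param.Params implementation; the class `ToyParams` is hypothetical and keys params by name for brevity.

```python
class ToyParams:
    """Toy model of a param holder with separate default and user-supplied maps."""

    def __init__(self, defaults):
        self._defaults = dict(defaults)
        self._user = {}

    def set(self, name, value):
        self._user[name] = value

    def clear(self, name):
        # Only removes an explicitly set value; the default is untouched.
        self._user.pop(name, None)

    def isSet(self, name):
        return name in self._user

    def hasDefault(self, name):
        return name in self._defaults

    def isDefined(self, name):
        return self.isSet(name) or self.hasDefault(name)

    def getOrDefault(self, name):
        if name in self._user:
            return self._user[name]
        if name in self._defaults:
            return self._defaults[name]
        raise KeyError(name)


params = ToyParams(defaults={"numHashTables": 1})
params.set("bucketLength", 2.0)
```

In this toy, `numHashTables` is defined but not set, `bucketLength` is set but has no default, and a param that is neither set nor defaulted makes `getOrDefault` raise.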

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

numHashTables = Param(parent='undefined', name='numHashTables', doc='number of hash tables, where increasing number of hash tables lowers the false negative rate, and decreasing it improves the running performance.')
outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

save(path)

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

seed = Param(parent='undefined', name='seed', doc='random seed.')
set(param, value)

Sets a parameter in the embedded param map.

setBucketLength(value)[source]

Sets the value of bucketLength.

New in version 2.2.0.

setInputCol(value)

Sets the value of inputCol.

setNumHashTables(value)

Sets the value of numHashTables.

setOutputCol(value)

Sets the value of outputCol.

setParams(self, inputCol=None, outputCol=None, seed=None, numHashTables=1, bucketLength=None)[source]

Sets params for this BucketedRandomProjectionLSH.

New in version 2.2.0.

setSeed(value)[source]

Sets the value of seed.

write()

Returns an MLWriter instance for this ML instance.

class pyspark.ml.feature.BucketedRandomProjectionLSHModel(java_model=None)[source]

Model fitted by BucketedRandomProjectionLSH, where multiple random vectors are stored. The vectors are normalized to be unit vectors and each vector is used in a hash function: \(h_i(x) = \lfloor r_i \cdot x / \text{bucketLength} \rfloor\) where \(r_i\) is the i-th random unit vector. The number of buckets will be (max L2 norm of input vectors) / bucketLength.

New in version 2.2.0.
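
The hash function above can be worked through in a few lines of plain Python. This is a sketch of the formula only: the projection vectors are fixed by hand here, whereas the fitted model draws them randomly, and the helper names `normalize` and `hash_vector` are hypothetical.

```python
import math


def normalize(vector):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(c * c for c in vector))
    return [c / norm for c in vector]


def hash_vector(x, unit_vectors, bucket_length):
    """Compute h_i(x) = floor(r_i . x / bucketLength) for every unit vector r_i."""
    return [
        math.floor(sum(r_c * x_c for r_c, x_c in zip(r, x)) / bucket_length)
        for r in unit_vectors
    ]


# Two hand-picked projection directions (assumed, not random).
unit_vectors = [normalize([3.0, 4.0]), normalize([1.0, 0.0])]
hashes = hash_vector([2.0, -1.0], unit_vectors, bucket_length=0.5)
```

Each entry of `hashes` is the bucket index for one hash function; a larger bucket length maps more nearby points to the same index.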

approxNearestNeighbors(dataset, key, numNearestNeighbors, distCol='distCol')

Given a large dataset and an item, approximately find at most numNearestNeighbors items which have the closest distance to the item. If the outputCol is missing, the method will transform the data; if the outputCol exists, it will use that. This allows caching of the transformed data when necessary.

Note

This method is experimental and will likely change behavior in the next release.

Parameters
  • dataset – The dataset to search for nearest neighbors of the key.

  • key – Feature vector representing the item to search for.

  • numNearestNeighbors – The maximum number of nearest neighbors.

  • distCol – Output column for storing the distance between each result row and the key. Use “distCol” as default value if it’s not specified.

Returns

A dataset containing at most numNearestNeighbors items closest to the key. A column “distCol” is added to show the distance between each row and the key.

approxSimilarityJoin(datasetA, datasetB, threshold, distCol='distCol')

Join two datasets to approximately find all pairs of rows whose distance is smaller than the threshold. If the outputCol is missing, the method will transform the data; if the outputCol exists, it will use that. This allows caching of the transformed data when necessary.

Parameters
  • datasetA – One of the datasets to join.

  • datasetB – Another dataset to join.

  • threshold – The threshold for the distance of row pairs.

  • distCol – Output column for storing the distance between each pair of rows. Use “distCol” as default value if it’s not specified.

Returns

A joined dataset containing pairs of rows. The original rows are in columns “datasetA” and “datasetB”, and a column “distCol” is added to show the distance between each pair.

bucketLength = Param(parent='undefined', name='bucketLength', doc='the length of each hash bucket, a larger bucket lowers the false negative rate.')
clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map

getBucketLength()

Gets the value of bucketLength or its default value.

New in version 2.2.0.

getInputCol()

Gets the value of inputCol or its default value.

getNumHashTables()

Gets the value of numHashTables or its default value.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets the value of outputCol or its default value.

getParam(paramName)

Gets a param by its name.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

inputCol = Param(parent='undefined', name='inputCol', doc='input column name.')
isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

numHashTables = Param(parent='undefined', name='numHashTables', doc='number of hash tables, where increasing number of hash tables lowers the false negative rate, and decreasing it improves the running performance.')
outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

save(path)

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param, value)

Sets a parameter in the embedded param map.

setInputCol(value)

Sets the value of inputCol.

setOutputCol(value)

Sets the value of outputCol.

transform(dataset, params=None)

Transforms the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params.

Returns

transformed dataset

New in version 1.3.0.

write()

Returns an MLWriter instance for this ML instance.

class pyspark.ml.feature.Bucketizer(splits=None, inputCol=None, outputCol=None, handleInvalid='error', splitsArray=None, inputCols=None, outputCols=None)[source]

Maps a column of continuous features to a column of feature buckets. Since 3.0.0, Bucketizer can map multiple columns at once by setting the inputCols parameter. Note that when both the inputCol and inputCols parameters are set, an Exception will be thrown. The splits parameter is only used for single column usage, and splitsArray is for multiple columns.

>>> values = [(0.1, 0.0), (0.4, 1.0), (1.2, 1.3), (1.5, float("nan")),
...     (float("nan"), 1.0), (float("nan"), 0.0)]
>>> df = spark.createDataFrame(values, ["values1", "values2"])
>>> bucketizer = Bucketizer()
>>> bucketizer.setSplits([-float("inf"), 0.5, 1.4, float("inf")])
Bucketizer...
>>> bucketizer.setInputCol("values1")
Bucketizer...
>>> bucketizer.setOutputCol("buckets")
Bucketizer...
>>> bucketed = bucketizer.setHandleInvalid("keep").transform(df.select("values1"))
>>> bucketed.show(truncate=False)
+-------+-------+
|values1|buckets|
+-------+-------+
|0.1    |0.0    |
|0.4    |0.0    |
|1.2    |1.0    |
|1.5    |2.0    |
|NaN    |3.0    |
|NaN    |3.0    |
+-------+-------+
...
>>> bucketizer.setParams(outputCol="b").transform(df).head().b
0.0
>>> bucketizerPath = temp_path + "/bucketizer"
>>> bucketizer.save(bucketizerPath)
>>> loadedBucketizer = Bucketizer.load(bucketizerPath)
>>> loadedBucketizer.getSplits() == bucketizer.getSplits()
True
>>> loadedBucketizer.transform(df).take(1) == bucketizer.transform(df).take(1)
True
>>> bucketed = bucketizer.setHandleInvalid("skip").transform(df).collect()
>>> len(bucketed)
4
>>> bucketizer2 = Bucketizer(splitsArray=
...     [[-float("inf"), 0.5, 1.4, float("inf")], [-float("inf"), 0.5, float("inf")]],
...     inputCols=["values1", "values2"], outputCols=["buckets1", "buckets2"])
>>> bucketed2 = bucketizer2.setHandleInvalid("keep").transform(df)
>>> bucketed2.show(truncate=False)
+-------+-------+--------+--------+
|values1|values2|buckets1|buckets2|
+-------+-------+--------+--------+
|0.1    |0.0    |0.0     |0.0     |
|0.4    |1.0    |0.0     |1.0     |
|1.2    |1.3    |1.0     |1.0     |
|1.5    |NaN    |2.0     |2.0     |
|NaN    |1.0    |3.0     |1.0     |
|NaN    |0.0    |3.0     |0.0     |
+-------+-------+--------+--------+
...

New in version 1.4.0.
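
The bucket assignment rule can be sketched for a single value in plain Python. This illustrates the semantics documented for splits and handleInvalid (half-open buckets [x,y), a closed last bucket, and an extra bucket for NaN under 'keep'); it is not Spark's implementation, and the helper `bucketize` is hypothetical.

```python
import math
from bisect import bisect_right


def bucketize(value, splits, handle_invalid="error"):
    """Return the bucket index for value, or None when the row would be skipped."""
    num_buckets = len(splits) - 1
    if math.isnan(value):
        if handle_invalid == "keep":
            return float(num_buckets)   # special additional bucket
        if handle_invalid == "skip":
            return None                 # row is filtered out
        raise ValueError("NaN encountered with handleInvalid='error'")
    if value < splits[0] or value > splits[-1]:
        # Values outside the splits are always treated as errors.
        raise ValueError("value %r is outside the splits" % value)
    if value == splits[-1]:
        return float(num_buckets - 1)   # last bucket also includes its upper bound
    return float(bisect_right(splits, value) - 1)


splits = [-float("inf"), 0.5, 1.4, float("inf")]
buckets = [bucketize(v, splits, "keep") for v in (0.1, 0.4, 1.2, 1.5, float("nan"))]
```

With the splits from the example above, the values 0.1, 0.4, 1.2, 1.5, and NaN fall into buckets 0.0, 0.0, 1.0, 2.0, and 3.0, which matches the table shown for the "values1" column.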

clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map

getHandleInvalid()

Gets the value of handleInvalid or its default value.

getInputCol()

Gets the value of inputCol or its default value.

getInputCols()

Gets the value of inputCols or its default value.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets the value of outputCol or its default value.

getOutputCols()

Gets the value of outputCols or its default value.

getParam(paramName)

Gets a param by its name.

getSplits()[source]

Gets the value of splits or its default value.

New in version 1.4.0.

getSplitsArray()[source]

Gets the array of split points or its default value.

New in version 3.0.0.

handleInvalid = Param(parent='undefined', name='handleInvalid', doc="how to handle invalid entries containing NaN values. Values outside the splits will always be treated as errors. Options are 'skip' (filter out rows with invalid values), 'error' (throw an error), or 'keep' (keep invalid values in a special additional bucket). Note that in the multiple column case, the invalid handling is applied to all columns. That said for 'error' it will throw an error if any invalids are found in any column, for 'skip' it will skip rows with any invalids in any columns, etc.")
hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

inputCol = Param(parent='undefined', name='inputCol', doc='input column name.')
inputCols = Param(parent='undefined', name='inputCols', doc='input column names.')
isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
outputCols = Param(parent='undefined', name='outputCols', doc='output column names.')
property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

save(path)

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param, value)

Sets a parameter in the embedded param map.

setHandleInvalid(value)[source]

Sets the value of handleInvalid.

setInputCol(value)[source]

Sets the value of inputCol.

setInputCols(value)[source]

Sets the value of inputCols.

New in version 3.0.0.

setOutputCol(value)[source]

Sets the value of outputCol.

setOutputCols(value)[source]

Sets the value of outputCols.

New in version 3.0.0.

setParams(self, splits=None, inputCol=None, outputCol=None, handleInvalid="error", splitsArray=None, inputCols=None, outputCols=None)[source]

Sets params for this Bucketizer.

New in version 1.4.0.

setSplits(value)[source]

Sets the value of splits.

New in version 1.4.0.

setSplitsArray(value)[source]

Sets the value of splitsArray.

New in version 3.0.0.

splits = Param(parent='undefined', name='splits', doc='Split points for mapping continuous features into buckets. With n+1 splits, there are n buckets. A bucket defined by splits x,y holds values in the range [x,y) except the last bucket, which also includes y. The splits should be of length >= 3 and strictly increasing. Values at -inf, inf must be explicitly provided to cover all Double values; otherwise, values outside the splits specified will be treated as errors.')
splitsArray = Param(parent='undefined', name='splitsArray', doc='The array of split points for mapping continuous features into buckets for multiple columns. For each input column, with n+1 splits, there are n buckets. A bucket defined by splits x,y holds values in the range [x,y) except the last bucket, which also includes y. The splits should be of length >= 3 and strictly increasing. Values at -inf, inf must be explicitly provided to cover all Double values; otherwise, values outside the splits specified will be treated as errors.')
transform(dataset, params=None)

Transforms the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params.

Returns

transformed dataset

New in version 1.3.0.

write()

Returns an MLWriter instance for this ML instance.

class pyspark.ml.feature.ChiSqSelector(numTopFeatures=50, featuresCol='features', outputCol=None, labelCol='label', selectorType='numTopFeatures', percentile=0.1, fpr=0.05, fdr=0.05, fwe=0.05)[source]

Chi-Squared feature selection, which selects categorical features to use for predicting a categorical label. The selector supports different selection methods: numTopFeatures, percentile, fpr, fdr, fwe.

  • numTopFeatures chooses a fixed number of top features according to a chi-squared test.

  • percentile is similar but chooses a fraction of all features instead of a fixed number.

  • fpr chooses all features whose p-values are below a threshold, thus controlling the false positive rate of selection.

  • fdr uses the Benjamini-Hochberg procedure to choose all features whose false discovery rate is below a threshold.

  • fwe chooses all features whose p-values are below a threshold. The threshold is scaled by 1/numFeatures, thus controlling the family-wise error rate of selection.

By default, the selection method is numTopFeatures, with the default number of top features set to 50.
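
The five selection methods listed above differ only in how they turn per-feature p-values into a set of kept indices. The following is a sketch under stated assumptions: the p-values are taken as given (no chi-squared test is run), the helper `select_features` is hypothetical, and the fdr branch follows the textbook Benjamini-Hochberg step-up rule, which may differ in detail from Spark's code.

```python
def select_features(p_values, selector_type="numTopFeatures",
                    num_top_features=50, percentile=0.1,
                    fpr=0.05, fdr=0.05, fwe=0.05):
    """Return the sorted indices of the features kept by the chosen method."""
    n = len(p_values)
    # Feature indices ordered by ascending p-value (most significant first).
    ranked = sorted(range(n), key=lambda i: p_values[i])
    if selector_type == "numTopFeatures":
        kept = ranked[:num_top_features]
    elif selector_type == "percentile":
        kept = ranked[:int(n * percentile)]
    elif selector_type == "fpr":
        kept = [i for i in ranked if p_values[i] < fpr]
    elif selector_type == "fdr":
        # Benjamini-Hochberg: find the largest rank k with p_(k) <= fdr * k / n,
        # then keep the k features with the smallest p-values.
        passing = [rank for rank, i in enumerate(ranked, start=1)
                   if p_values[i] <= fdr * rank / n]
        kept = ranked[:max(passing)] if passing else []
    elif selector_type == "fwe":
        # Threshold scaled by 1/numFeatures.
        kept = [i for i in ranked if p_values[i] < fwe / n]
    else:
        raise ValueError("unknown selectorType: %r" % selector_type)
    return sorted(kept)


# Hypothetical p-values for four features.
p_values = [0.5, 0.01, 0.04, 0.2]
by_fpr = select_features(p_values, "fpr")
by_fwe = select_features(p_values, "fwe")
```

With these p-values, fpr keeps both features below 0.05, while fwe keeps only the feature below 0.05 / 4 = 0.0125, which shows how the family-wise correction is stricter.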

>>> from pyspark.ml.linalg import Vectors
>>> df = spark.createDataFrame(
...    [(Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0),
...     (Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0),
...     (Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0)],
...    ["features", "label"])
>>> selector = ChiSqSelector(numTopFeatures=1, outputCol="selectedFeatures")
>>> model = selector.fit(df)
>>> model.getFeaturesCol()
'features'
>>> model.setFeaturesCol("features")
ChiSqSelectorModel...
>>> model.transform(df).head().selectedFeatures
DenseVector([18.0])
>>> model.selectedFeatures
[2]
>>> chiSqSelectorPath = temp_path + "/chi-sq-selector"
>>> selector.save(chiSqSelectorPath)
>>> loadedSelector = ChiSqSelector.load(chiSqSelectorPath)
>>> loadedSelector.getNumTopFeatures() == selector.getNumTopFeatures()
True
>>> modelPath = temp_path + "/chi-sq-selector-model"
>>> model.save(modelPath)
>>> loadedModel = ChiSqSelectorModel.load(modelPath)
>>> loadedModel.selectedFeatures == model.selectedFeatures
True
>>> loadedModel.transform(df).take(1) == model.transform(df).take(1)
True

New in version 2.0.0.

clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map

fdr = Param(parent='undefined', name='fdr', doc='The upper bound of the expected false discovery rate.')
featuresCol = Param(parent='undefined', name='featuresCol', doc='features column name.')
fit(dataset, params=None)

Fits a model to the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.

Returns

fitted model(s)

New in version 1.3.0.

fitMultiple(dataset, paramMaps)

Fits a model to the input dataset for each param map in paramMaps.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame.

  • paramMaps – A Sequence of param maps.

Returns

A thread-safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model), where model was fit using paramMaps[index]. The index values may not arrive in sequential order.

New in version 2.3.0.

fpr = Param(parent='undefined', name='fpr', doc='The highest p-value for features to be kept.')
fwe = Param(parent='undefined', name='fwe', doc='The upper bound of the expected family-wise error rate.')
getFdr()

Gets the value of fdr or its default value.

New in version 2.2.0.

getFeaturesCol()

Gets the value of featuresCol or its default value.

getFpr()

Gets the value of fpr or its default value.

New in version 2.1.0.

getFwe()

Gets the value of fwe or its default value.

New in version 2.2.0.

getLabelCol()

Gets the value of labelCol or its default value.

getNumTopFeatures()

Gets the value of numTopFeatures or its default value.

New in version 2.0.0.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets the value of outputCol or its default value.

getParam(paramName)

Gets a param by its name.

getPercentile()

Gets the value of percentile or its default value.

New in version 2.1.0.

getSelectorType()

Gets the value of selectorType or its default value.

New in version 2.1.0.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

labelCol = Param(parent='undefined', name='labelCol', doc='label column name.')
classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

numTopFeatures = Param(parent='undefined', name='numTopFeatures', doc='Number of features that selector will select, ordered by ascending p-value. If the number of features is < numTopFeatures, then this will select all features.')
outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

percentile = Param(parent='undefined', name='percentile', doc='Percentile of features that selector will select, ordered by ascending p-value.')
classmethod read()

Returns an MLReader instance for this class.

save(path)

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

selectorType = Param(parent='undefined', name='selectorType', doc='The selector type of the ChisqSelector. Supported options: numTopFeatures (default), percentile, fpr, fdr, fwe.')
set(param, value)

Sets a parameter in the embedded param map.

setFdr(value)[source]

Sets the value of fdr. Only applicable when selectorType = “fdr”.

New in version 2.2.0.

setFeaturesCol(value)[source]

Sets the value of featuresCol.

setFpr(value)[source]

Sets the value of fpr. Only applicable when selectorType = “fpr”.

New in version 2.1.0.

setFwe(value)[source]

Sets the value of fwe. Only applicable when selectorType = “fwe”.

New in version 2.2.0.

setLabelCol(value)[source]

Sets the value of labelCol.

setNumTopFeatures(value)[source]

Sets the value of numTopFeatures. Only applicable when selectorType = “numTopFeatures”.

New in version 2.0.0.

setOutputCol(value)[source]

Sets the value of outputCol.

setParams(self, numTopFeatures=50, featuresCol="features", outputCol=None, labelCol="labels", selectorType="numTopFeatures", percentile=0.1, fpr=0.05, fdr=0.05, fwe=0.05)[source]

Sets params for this ChiSqSelector.

New in version 2.0.0.

setPercentile(value)[source]

Sets the value of percentile. Only applicable when selectorType = “percentile”.

New in version 2.1.0.

setSelectorType(value)[source]

Sets the value of selectorType.

New in version 2.1.0.

write()

Returns an MLWriter instance for this ML instance.

class pyspark.ml.feature.ChiSqSelectorModel(java_model=None)[source]

Model fitted by ChiSqSelector.

New in version 2.0.0.

clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map

fdr = Param(parent='undefined', name='fdr', doc='The upper bound of the expected false discovery rate.')
featuresCol = Param(parent='undefined', name='featuresCol', doc='features column name.')
fpr = Param(parent='undefined', name='fpr', doc='The highest p-value for features to be kept.')
fwe = Param(parent='undefined', name='fwe', doc='The upper bound of the expected family-wise error rate.')
getFdr()

Gets the value of fdr or its default value.

New in version 2.2.0.

getFeaturesCol()

Gets the value of featuresCol or its default value.

getFpr()

Gets the value of fpr or its default value.

New in version 2.1.0.

getFwe()

Gets the value of fwe or its default value.

New in version 2.2.0.

getLabelCol()

Gets the value of labelCol or its default value.

getNumTopFeatures()

Gets the value of numTopFeatures or its default value.

New in version 2.0.0.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets the value of outputCol or its default value.

getParam(paramName)

Gets a param by its name.

getPercentile()

Gets the value of percentile or its default value.

New in version 2.1.0.

getSelectorType()

Gets the value of selectorType or its default value.

New in version 2.1.0.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

labelCol = Param(parent='undefined', name='labelCol', doc='label column name.')
classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

numTopFeatures = Param(parent='undefined', name='numTopFeatures', doc='Number of features that selector will select, ordered by ascending p-value. If the number of features is < numTopFeatures, then this will select all features.')
outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

percentile = Param(parent='undefined', name='percentile', doc='Percentile of features that selector will select, ordered by ascending p-value.')
classmethod read()

Returns an MLReader instance for this class.

save(path)

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

property selectedFeatures

List of indices to select (filter).

New in version 2.0.0.

selectorType = Param(parent='undefined', name='selectorType', doc='The selector type of the ChisqSelector. Supported options: numTopFeatures (default), percentile, fpr, fdr, fwe.')
set(param, value)

Sets a parameter in the embedded param map.

setFeaturesCol(value)[source]

Sets the value of featuresCol.

New in version 3.0.0.

setOutputCol(value)[source]

Sets the value of outputCol.

New in version 3.0.0.

transform(dataset, params=None)

Transforms the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params.

Returns

transformed dataset

New in version 1.3.0.

write()

Returns an MLWriter instance for this ML instance.

class pyspark.ml.feature.CountVectorizer(minTF=1.0, minDF=1.0, maxDF=9223372036854775807, vocabSize=262144, binary=False, inputCol=None, outputCol=None)[source]

Extracts a vocabulary from document collections and generates a CountVectorizerModel.

>>> df = spark.createDataFrame(
...    [(0, ["a", "b", "c"]), (1, ["a", "b", "b", "c", "a"])],
...    ["label", "raw"])
>>> cv = CountVectorizer()
>>> cv.setInputCol("raw")
CountVectorizer...
>>> cv.setOutputCol("vectors")
CountVectorizer...
>>> model = cv.fit(df)
>>> model.setInputCol("raw")
CountVectorizerModel...
>>> model.transform(df).show(truncate=False)
+-----+---------------+-------------------------+
|label|raw            |vectors                  |
+-----+---------------+-------------------------+
|0    |[a, b, c]      |(3,[0,1,2],[1.0,1.0,1.0])|
|1    |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
+-----+---------------+-------------------------+
...
>>> sorted(model.vocabulary) == ['a', 'b', 'c']
True
>>> countVectorizerPath = temp_path + "/count-vectorizer"
>>> cv.save(countVectorizerPath)
>>> loadedCv = CountVectorizer.load(countVectorizerPath)
>>> loadedCv.getMinDF() == cv.getMinDF()
True
>>> loadedCv.getMinTF() == cv.getMinTF()
True
>>> loadedCv.getVocabSize() == cv.getVocabSize()
True
>>> modelPath = temp_path + "/count-vectorizer-model"
>>> model.save(modelPath)
>>> loadedModel = CountVectorizerModel.load(modelPath)
>>> loadedModel.vocabulary == model.vocabulary
True
>>> loadedModel.transform(df).take(1) == model.transform(df).take(1)
True
>>> fromVocabModel = CountVectorizerModel.from_vocabulary(["a", "b", "c"],
...     inputCol="raw", outputCol="vectors")
>>> fromVocabModel.transform(df).show(truncate=False)
+-----+---------------+-------------------------+
|label|raw            |vectors                  |
+-----+---------------+-------------------------+
|0    |[a, b, c]      |(3,[0,1,2],[1.0,1.0,1.0])|
|1    |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
+-----+---------------+-------------------------+
...

New in version 1.6.0.
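
What the estimator and its model do conceptually can be sketched with the standard library: build a vocabulary from the corpus, then turn each document into a sparse count vector of the form (size, indices, values). This is an illustration only, not Spark's implementation. The helpers `fit_vocabulary` and `to_sparse_counts` are hypothetical, `min_df` is treated as an absolute document count, and ties in term frequency are broken alphabetically, which is an assumption.

```python
from collections import Counter


def fit_vocabulary(documents, min_df=1, vocab_size=262144):
    """Pick vocabulary terms, most frequent first, that appear in >= min_df documents."""
    term_counts = Counter()
    doc_counts = Counter()
    for tokens in documents:
        term_counts.update(tokens)
        doc_counts.update(set(tokens))   # count each term once per document
    eligible = [t for t in term_counts if doc_counts[t] >= min_df]
    eligible.sort(key=lambda t: (-term_counts[t], t))
    return eligible[:vocab_size]


def to_sparse_counts(tokens, vocabulary):
    """Return (size, indices, values) for the terms of one document."""
    index = {term: i for i, term in enumerate(vocabulary)}
    counts = Counter(index[t] for t in tokens if t in index)
    indices = sorted(counts)
    return (len(vocabulary), indices, [float(counts[i]) for i in indices])


documents = [["a", "b", "c"], ["a", "b", "b", "c", "a"]]
vocabulary = fit_vocabulary(documents)
vectors = [to_sparse_counts(tokens, vocabulary) for tokens in documents]
```

For the two documents in the example above, the sketch yields the same sparse triples that appear in the "vectors" column of the table.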

binary = Param(parent='undefined', name='binary', doc='Binary toggle to control the output vector values. If True, all nonzero counts (after minTF filter applied) are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts. Default False')
clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map

fit(dataset, params=None)

Fits a model to the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.

Returns

fitted model(s)

New in version 1.3.0.

fitMultiple(dataset, paramMaps)

Fits a model to the input dataset for each param map in paramMaps.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame.

  • paramMaps – A Sequence of param maps.

Returns

A thread safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model) where model was fit using paramMaps[index]. index values may not be sequential.

New in version 2.3.0.
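
Because fitMultiple returns (index, model) pairs whose index values may not be sequential, callers typically place each model by its index rather than by arrival order. The sketch below shows that consumption pattern with a stand-in generator; fit_multiple, the dictionary "models", and the reversed order are all assumptions made for illustration, not pyspark behavior.

```python
def fit_multiple(dataset, param_maps):
    """Stand-in for fitMultiple: yields (index, model) pairs out of order."""
    for index in reversed(range(len(param_maps))):
        # A hypothetical "model" that just records what it was fit with.
        model = {"params": param_maps[index], "n_rows": len(dataset)}
        yield index, model


param_maps = [{"minDF": 1.0}, {"minDF": 2.0}, {"minDF": 3.0}]
dataset = [["a", "b"], ["a"]]

# Slot each model by its index so models[i] matches param_maps[i].
models = [None] * len(param_maps)
for index, model in fit_multiple(dataset, param_maps):
    models[index] = model
```

After the loop, models[i] was fit using param_maps[i] regardless of the order in which the pairs arrived.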

getBinary()

Gets the value of binary or its default value.

New in version 2.0.0.

getInputCol()

Gets the value of inputCol or its default value.

getMaxDF()

Gets the value of maxDF or its default value.

New in version 2.4.0.

getMinDF()

Gets the value of minDF or its default value.

New in version 1.6.0.

getMinTF()

Gets the value of minTF or its default value.

New in version 1.6.0.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets the value of outputCol or its default value.

getParam(paramName)

Gets a param by its name.

getVocabSize()

Gets the value of vocabSize or its default value.

New in version 1.6.0.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

inputCol = Param(parent='undefined', name='inputCol', doc='input column name.')
isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.
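
The relationship between isSet, hasDefault, isDefined, getOrDefault, and clear can be summarized with two dictionaries, one for defaults and one for user-supplied values. This is a simplified sketch under that assumption; the class ParamState, its method names, and the choice of KeyError are hypothetical and not part of pyspark.

```python
class ParamState:
    """Hypothetical stand-in for the default and user-supplied param maps."""

    def __init__(self, defaults):
        self._defaults = dict(defaults)
        self._user = {}

    def set(self, name, value):
        self._user[name] = value

    def clear(self, name):
        # Only removes an explicitly set value; defaults are untouched.
        self._user.pop(name, None)

    def is_set(self, name):
        return name in self._user

    def has_default(self, name):
        return name in self._defaults

    def is_defined(self, name):
        return self.is_set(name) or self.has_default(name)

    def get_or_default(self, name):
        if name in self._user:
            return self._user[name]
        if name in self._defaults:
            return self._defaults[name]
        raise KeyError(name)  # neither set nor defaulted


state = ParamState({"binary": False})
state.set("inputCol", "words")
```

Here "binary" is defined but not set, "inputCol" is set but has no default, and clearing "inputCol" leaves it undefined.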

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

maxDF = Param(parent='undefined', name='maxDF', doc='Specifies the maximum number of different documents a term could appear in to be included in the vocabulary. A term that appears more than the threshold will be ignored. If this is an integer >= 1, this specifies the maximum number of documents the term could appear in; if this is a double in [0,1), then this specifies the maximum fraction of documents the term could appear in. Default (2^63) - 1')
minDF = Param(parent='undefined', name='minDF', doc='Specifies the minimum number of different documents a term must appear in to be included in the vocabulary. If this is an integer >= 1, this specifies the number of documents the term must appear in; if this is a double in [0,1), then this specifies the fraction of documents. Default 1.0')
minTF = Param(parent='undefined', name='minTF', doc="Filter to ignore rare words in a document. For each document, terms with frequency/count less than the given threshold are ignored. If this is an integer >= 1, then this specifies a count (of times the term must appear in the document); if this is a double in [0,1), then this specifies a fraction (out of the document's token count). Note that the parameter is only used in transform of CountVectorizerModel and does not affect fitting. Default 1.0")
outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

save(path)

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param, value)

Sets a parameter in the embedded param map.

setBinary(value)[source]

Sets the value of binary.

New in version 2.0.0.

setInputCol(value)[source]

Sets the value of inputCol.

setMaxDF(value)[source]

Sets the value of maxDF.

New in version 2.4.0.

setMinDF(value)[source]

Sets the value of minDF.

New in version 1.6.0.

setMinTF(value)[source]

Sets the value of minTF.

New in version 1.6.0.

setOutputCol(value)[source]

Sets the value of outputCol.

setParams(self, minTF=1.0, minDF=1.0, maxDF=2 ** 63 - 1, vocabSize=1 << 18, binary=False, inputCol=None, outputCol=None)[source]

Sets params for this CountVectorizer.

New in version 1.6.0.
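
The minDF and maxDF docs above distinguish an integer >= 1 (a document count) from a double in [0,1) (a fraction of documents). A plain-Python sketch of that rule follows. The helper names are hypothetical, the alphabetical ordering of the result is this sketch's choice, and the vocabSize cap is deliberately left out.

```python
from collections import Counter


def resolve_threshold(value, num_docs):
    """A value >= 1 is a document count; a value in [0, 1) is a fraction."""
    return value if value >= 1 else value * num_docs


def select_vocabulary(docs, min_df=1.0, max_df=2 ** 63 - 1):
    """Keep terms whose document frequency lies within [min_df, max_df]."""
    num_docs = len(docs)
    # Count each term once per document it appears in.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    low = resolve_threshold(min_df, num_docs)
    high = resolve_threshold(max_df, num_docs)
    return sorted(t for t, df in doc_freq.items() if low <= df <= high)


docs = [["a", "b", "c"], ["a", "b"], ["a"], ["a", "d"]]
print(select_vocabulary(docs, min_df=2))    # ['a', 'b']
print(select_vocabulary(docs, min_df=0.5))  # ['a', 'b'] (0.5 * 4 docs = 2)
print(select_vocabulary(docs, max_df=3))    # ['b', 'c', 'd']
```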

setVocabSize(value)[source]

Sets the value of vocabSize.

New in version 1.6.0.

vocabSize = Param(parent='undefined', name='vocabSize', doc='max size of the vocabulary. Default 1 << 18.')
write()

Returns an MLWriter instance for this ML instance.

class pyspark.ml.feature.CountVectorizerModel(java_model=None)[source]

Model fitted by CountVectorizer.

New in version 1.6.0.

binary = Param(parent='undefined', name='binary', doc='Binary toggle to control the output vector values. If True, all nonzero counts (after minTF filter applied) are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts. Default False')
clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map

classmethod from_vocabulary(vocabulary, inputCol, outputCol=None, minTF=None, binary=None)[source]

Constructs the model directly from a vocabulary list of strings. Requires an active SparkContext.

New in version 2.4.0.
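
What a model built from a fixed vocabulary does at transform time can be sketched without Spark: count in-vocabulary tokens, apply the minTF filter, then optionally binarize. This is an illustrative sketch, not the pyspark implementation; the function count_vectorize and its dense list output are assumptions (Spark produces a vector column).

```python
from collections import Counter


def count_vectorize(tokens, vocabulary, min_tf=1.0, binary=False):
    """Map a token list to a count vector over a fixed vocabulary."""
    counts = Counter(t for t in tokens if t in vocabulary)
    # min_tf >= 1 is a count; min_tf in [0, 1) is a fraction of the
    # document's token count.
    threshold = min_tf if min_tf >= 1 else min_tf * len(tokens)
    vector = [0.0] * len(vocabulary)
    for i, term in enumerate(vocabulary):
        c = counts.get(term, 0)
        if c > 0 and c >= threshold:
            # binary is applied after the minTF filter.
            vector[i] = 1.0 if binary else float(c)
    return vector


vocabulary = ["a", "b", "c"]
tokens = ["a", "b", "a", "c", "a", "z"]
print(count_vectorize(tokens, vocabulary))               # [3.0, 1.0, 1.0]
print(count_vectorize(tokens, vocabulary, min_tf=2))     # [3.0, 0.0, 0.0]
print(count_vectorize(tokens, vocabulary, binary=True))  # [1.0, 1.0, 1.0]
```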

getBinary()

Gets the value of binary or its default value.

New in version 2.0.0.

getInputCol()

Gets the value of inputCol or its default value.

getMaxDF()

Gets the value of maxDF or its default value.

New in version 2.4.0.

getMinDF()

Gets the value of minDF or its default value.

New in version 1.6.0.

getMinTF()

Gets the value of minTF or its default value.

New in version 1.6.0.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets the value of outputCol or its default value.

getParam(paramName)

Gets a param by its name.

getVocabSize()

Gets the value of vocabSize or its default value.

New in version 1.6.0.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

inputCol = Param(parent='undefined', name='inputCol', doc='input column name.')
isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

maxDF = Param(parent='undefined', name='maxDF', doc='Specifies the maximum number of different documents a term could appear in to be included in the vocabulary. A term that appears more than the threshold will be ignored. If this is an integer >= 1, this specifies the maximum number of documents the term could appear in; if this is a double in [0,1), then this specifies the maximum fraction of documents the term could appear in. Default (2^63) - 1')
minDF = Param(parent='undefined', name='minDF', doc='Specifies the minimum number of different documents a term must appear in to be included in the vocabulary. If this is an integer >= 1, this specifies the number of documents the term must appear in; if this is a double in [0,1), then this specifies the fraction of documents. Default 1.0')
minTF = Param(parent='undefined', name='minTF', doc="Filter to ignore rare words in a document. For each document, terms with frequency/count less than the given threshold are ignored. If this is an integer >= 1, then this specifies a count (of times the term must appear in the document); if this is a double in [0,1), then this specifies a fraction (out of the document's token count). Note that the parameter is only used in transform of CountVectorizerModel and does not affect fitting. Default 1.0")
outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

save(path)

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param, value)

Sets a parameter in the embedded param map.

setBinary(value)[source]

Sets the value of binary.

New in version 2.4.0.

setInputCol(value)[source]

Sets the value of inputCol.

New in version 3.0.0.

setMinTF(value)[source]

Sets the value of minTF.

New in version 2.4.0.

setOutputCol(value)[source]

Sets the value of outputCol.

New in version 3.0.0.

transform(dataset, params=None)

Transforms the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params.

Returns

transformed dataset

New in version 1.3.0.

vocabSize = Param(parent='undefined', name='vocabSize', doc='max size of the vocabulary. Default 1 << 18.')
property vocabulary

An array of terms in the vocabulary.

New in version 1.6.0.

write()

Returns an MLWriter instance for this ML instance.

class pyspark.ml.feature.DCT(inverse=False, inputCol=None, outputCol=None)[source]

A feature transformer that takes the 1D discrete cosine transform of a real vector. No zero padding is performed on the input vector. It returns a real vector of the same length representing the DCT. The return vector is scaled such that the transform matrix is unitary (aka scaled DCT-II).

>>> from pyspark.ml.linalg import Vectors
>>> df1 = spark.createDataFrame([(Vectors.dense([5.0, 8.0, 6.0]),)], ["vec"])
>>> dct = DCT()
>>> dct.setInverse(False)
DCT...
>>> dct.setInputCol("vec")
DCT...
>>> dct.setOutputCol("resultVec")
DCT...
>>> df2 = dct.transform(df1)
>>> df2.head().resultVec
DenseVector([10.969..., -0.707..., -2.041...])
>>> df3 = DCT(inverse=True, inputCol="resultVec", outputCol="origVec").transform(df2)
>>> df3.head().origVec
DenseVector([5.0, 8.0, 6.0])
>>> dctPath = temp_path + "/dct"
>>> dct.save(dctPath)
>>> loadedDct = DCT.load(dctPath)
>>> loadedDct.transform(df1).take(1) == dct.transform(df1).take(1)
True
>>> loadedDct.getInverse()
False

New in version 1.6.0.
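
The scaled DCT-II described above can be reproduced in plain Python to check the values in the example. This is a direct O(n^2) sketch of the unitary transform and its inverse, not the routine Spark uses internally; the function names dct and idct are chosen for illustration.

```python
import math


def dct(x):
    """Unitary (scaled) DCT-II of a real sequence."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        total = sum(x[i] * math.cos(math.pi / n * (i + 0.5) * k) for i in range(n))
        out.append(scale * total)
    return out


def idct(coeffs):
    """Inverse of dct; the transpose of the same unitary matrix."""
    n = len(coeffs)
    out = []
    for i in range(n):
        total = 0.0
        for k in range(n):
            scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
            total += scale * coeffs[k] * math.cos(math.pi / n * (i + 0.5) * k)
        out.append(total)
    return out


result = dct([5.0, 8.0, 6.0])
restored = idct(result)
```

For the input [5.0, 8.0, 6.0], result is approximately [10.969, -0.707, -2.041], and restored recovers the input up to floating-point error.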

clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map

getInputCol()

Gets the value of inputCol or its default value.

getInverse()[source]

Gets the value of inverse or its default value.

New in version 1.6.0.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets the value of outputCol or its default value.

getParam(paramName)

Gets a param by its name.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

inputCol = Param(parent='undefined', name='inputCol', doc='input column name.')
inverse = Param(parent='undefined', name='inverse', doc='Set transformer to perform inverse DCT, default False.')
isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

save(path)

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param, value)

Sets a parameter in the embedded param map.

setInputCol(value)[source]

Sets the value of inputCol.

setInverse(value)[source]

Sets the value of inverse.

New in version 1.6.0.

setOutputCol(value)[source]

Sets the value of outputCol.

setParams(self, inverse=False, inputCol=None, outputCol=None)[source]

Sets params for this DCT.

New in version 1.6.0.

transform(dataset, params=None)

Transforms the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params.

Returns

transformed dataset

New in version 1.3.0.

write()

Returns an MLWriter instance for this ML instance.

class pyspark.ml.feature.ElementwiseProduct(scalingVec=None, inputCol=None, outputCol=None)[source]

Outputs the Hadamard product (i.e., the element-wise product) of each input vector with a provided “weight” vector. In other words, it scales each column of the dataset by a scalar multiplier.

>>> from pyspark.ml.linalg import Vectors
>>> df = spark.createDataFrame([(Vectors.dense([2.0, 1.0, 3.0]),)], ["values"])
>>> ep = ElementwiseProduct()
>>> ep.setScalingVec(Vectors.dense([1.0, 2.0, 3.0]))
ElementwiseProduct...
>>> ep.setInputCol("values")
ElementwiseProduct...
>>> ep.setOutputCol("eprod")
ElementwiseProduct...
>>> ep.transform(df).head().eprod
DenseVector([2.0, 2.0, 9.0])
>>> ep.setParams(scalingVec=Vectors.dense([2.0, 3.0, 5.0])).transform(df).head().eprod
DenseVector([4.0, 3.0, 15.0])
>>> elementwiseProductPath = temp_path + "/elementwise-product"
>>> ep.save(elementwiseProductPath)
>>> loadedEp = ElementwiseProduct.load(elementwiseProductPath)
>>> loadedEp.getScalingVec() == ep.getScalingVec()
True
>>> loadedEp.transform(df).take(1) == ep.transform(df).take(1)
True

New in version 1.5.0.
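
The Hadamard product itself is a one-line operation. The sketch below mirrors the two results in the example above using plain lists; the function name elementwise_product and the length check are assumptions for illustration.

```python
def elementwise_product(vector, scaling_vec):
    """Multiply each component of vector by the matching weight."""
    if len(vector) != len(scaling_vec):
        raise ValueError("vectors must have the same length")
    return [v * w for v, w in zip(vector, scaling_vec)]


values = [2.0, 1.0, 3.0]
print(elementwise_product(values, [1.0, 2.0, 3.0]))  # [2.0, 2.0, 9.0]
print(elementwise_product(values, [2.0, 3.0, 5.0]))  # [4.0, 3.0, 15.0]
```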

clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map

getInputCol()

Gets the value of inputCol or its default value.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets the value of outputCol or its default value.

getParam(paramName)

Gets a param by its name.

getScalingVec()[source]

Gets the value of scalingVec or its default value.

New in version 2.0.0.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

inputCol = Param(parent='undefined', name='inputCol', doc='input column name.')
isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

save(path)

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

scalingVec = Param(parent='undefined', name='scalingVec', doc='Vector for hadamard product.')
set(param, value)

Sets a parameter in the embedded param map.

setInputCol(value)[source]

Sets the value of inputCol.

setOutputCol(value)[source]

Sets the value of outputCol.

setParams(self, scalingVec=None, inputCol=None, outputCol=None)[source]

Sets params for this ElementwiseProduct.

New in version 1.5.0.

setScalingVec(value)[source]

Sets the value of scalingVec.

New in version 2.0.0.

transform(dataset, params=None)

Transforms the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params.

Returns

transformed dataset

New in version 1.3.0.

write()

Returns an MLWriter instance for this ML instance.

class pyspark.ml.feature.FeatureHasher(numFeatures=262144, inputCols=None, outputCol=None, categoricalCols=None)[source]

Feature hashing projects a set of categorical or numerical features into a feature vector of specified dimension (typically substantially smaller than that of the original feature space). This is done using the hashing trick (https://en.wikipedia.org/wiki/Feature_hashing) to map features to indices in the feature vector.

The FeatureHasher transformer operates on multiple columns. Each column may contain either numeric or categorical features. Behavior and handling of column data types is as follows:

  • Numeric columns:

    For numeric features, the hash value of the column name is used to map the feature value to its index in the feature vector. By default, numeric features are not treated as categorical (even when they are integers). To treat them as categorical, specify the relevant columns in categoricalCols.

  • String columns:

    For categorical features, the hash value of the string “column_name=value” is used to map to the vector index, with an indicator value of 1.0. Thus, categorical features are “one-hot” encoded (similarly to using OneHotEncoder with dropLast=false).

  • Boolean columns:

    Boolean values are treated in the same way as string columns. That is, boolean features are represented as “column_name=true” or “column_name=false”, with an indicator value of 1.0.

Null (missing) values are ignored (implicitly zero in the resulting feature vector).

Since a simple modulo is used to transform the hash function to a vector index, it is advisable to use a power of two as the numFeatures parameter; otherwise the features will not be mapped evenly to the vector indices.

>>> data = [(2.0, True, "1", "foo"), (3.0, False, "2", "bar")]
>>> cols = ["real", "bool", "stringNum", "string"]
>>> df = spark.createDataFrame(data, cols)
>>> hasher = FeatureHasher()
>>> hasher.setInputCols(cols)
FeatureHasher...
>>> hasher.setOutputCol("features")
FeatureHasher...
>>> hasher.transform(df).head().features
SparseVector(262144, {174475: 2.0, 247670: 1.0, 257907: 1.0, 262126: 1.0})
>>> hasher.setCategoricalCols(["real"]).transform(df).head().features
SparseVector(262144, {171257: 1.0, 247670: 1.0, 257907: 1.0, 262126: 1.0})
>>> hasherPath = temp_path + "/hasher"
>>> hasher.save(hasherPath)
>>> loadedHasher = FeatureHasher.load(hasherPath)
>>> loadedHasher.getNumFeatures() == hasher.getNumFeatures()
True
>>> loadedHasher.transform(df).head().features == hasher.transform(df).head().features
True

New in version 2.3.0.
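
The per-type rules listed above (numeric columns hash the column name, string and boolean columns hash "column_name=value" with an indicator of 1.0, nulls are skipped) can be sketched in plain Python. This sketch uses zlib.crc32 as a stand-in hash, which is an assumption made to stay within the standard library, so its indices will not match the Spark output shown above. The function names are hypothetical, and colliding features are summed here.

```python
import zlib


def hash_index(key, num_features):
    """Stand-in hash (CRC32) reduced to a vector index by modulo."""
    return zlib.crc32(key.encode("utf-8")) % num_features


def hash_features(row, num_features=1 << 18, categorical_cols=()):
    """Return a sparse {index: value} mapping for one row (a dict)."""
    features = {}
    for col, value in row.items():
        if value is None:
            continue  # nulls are ignored, i.e. implicitly zero
        is_numeric = isinstance(value, (int, float)) and not isinstance(value, bool)
        if is_numeric and col not in categorical_cols:
            key, weight = col, float(value)
        else:
            text = str(value).lower() if isinstance(value, bool) else str(value)
            key, weight = f"{col}={text}", 1.0
        idx = hash_index(key, num_features)
        features[idx] = features.get(idx, 0.0) + weight
    return features


row = {"real": 2.0, "bool": True, "stringNum": "1", "string": "foo"}
numeric = hash_features(row)
categorical = hash_features(row, categorical_cols=["real"])
```

Treating "real" as categorical replaces the entry of weight 2.0 under the key "real" with an indicator of 1.0 under the key "real=2.0".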

categoricalCols = Param(parent='undefined', name='categoricalCols', doc='numeric columns to treat as categorical')
clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map

getCategoricalCols()[source]

Gets the value of categoricalCols or its default value.

New in version 2.3.0.

getInputCols()

Gets the value of inputCols or its default value.

getNumFeatures()

Gets the value of numFeatures or its default value.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets the value of outputCol or its default value.

getParam(paramName)

Gets a param by its name.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

inputCols = Param(parent='undefined', name='inputCols', doc='input column names.')
isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

numFeatures = Param(parent='undefined', name='numFeatures', doc='Number of features. Should be greater than 0.')
outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

save(path)

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param, value)

Sets a parameter in the embedded param map.

setCategoricalCols(value)[source]

Sets the value of categoricalCols.

New in version 2.3.0.

setInputCols(value)[source]

Sets the value of inputCols.

setNumFeatures(value)[source]

Sets the value of numFeatures.

setOutputCol(value)[source]

Sets the value of outputCol.

setParams(self, numFeatures=1 << 18, inputCols=None, outputCol=None, categoricalCols=None)[source]

Sets params for this FeatureHasher.

New in version 2.3.0.

transform(dataset, params=None)

Transforms the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params.

Returns

transformed dataset

New in version 1.3.0.

write()

Returns an MLWriter instance for this ML instance.

class pyspark.ml.feature.HashingTF(numFeatures=262144, binary=False, inputCol=None, outputCol=None)[source]

Maps a sequence of terms to their term frequencies using the hashing trick. Currently we use Austin Appleby’s MurmurHash 3 algorithm (MurmurHash3_x86_32) to calculate the hash code value for the term object. Since a simple modulo is used to transform the hash function to a column index, it is advisable to use a power of two as the numFeatures parameter; otherwise the features will not be mapped evenly to the columns.

>>> df = spark.createDataFrame([(["a", "b", "c"],)], ["words"])
>>> hashingTF = HashingTF(inputCol="words", outputCol="features")
>>> hashingTF.setNumFeatures(10)
HashingTF...
>>> hashingTF.transform(df).head().features
SparseVector(10, {5: 1.0, 7: 1.0, 8: 1.0})
>>> hashingTF.setParams(outputCol="freqs").transform(df).head().freqs
SparseVector(10, {5: 1.0, 7: 1.0, 8: 1.0})
>>> params = {hashingTF.numFeatures: 5, hashingTF.outputCol: "vector"}
>>> hashingTF.transform(df, params).head().vector
SparseVector(5, {0: 1.0, 2: 1.0, 3: 1.0})
>>> hashingTFPath = temp_path + "/hashing-tf"
>>> hashingTF.save(hashingTFPath)
>>> loadedHashingTF = HashingTF.load(hashingTFPath)
>>> loadedHashingTF.getNumFeatures() == hashingTF.getNumFeatures()
True
>>> loadedHashingTF.transform(df).take(1) == hashingTF.transform(df).take(1)
True
>>> hashingTF.indexOf("b")
5

New in version 1.3.0.
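
The hashing trick for term frequencies, including the binary toggle, can be sketched as follows. As an assumption for portability, zlib.crc32 stands in for MurmurHash3_x86_32, so the indices differ from those in the example above. The function names hashing_tf and index_of are hypothetical.

```python
import zlib


def index_of(term, num_features):
    """Stand-in for indexOf: hash the term, then reduce by modulo."""
    return zlib.crc32(term.encode("utf-8")) % num_features


def hashing_tf(terms, num_features=1 << 18, binary=False):
    """Return a sparse {index: frequency} mapping for one document."""
    vector = {}
    for term in terms:
        idx = index_of(term, num_features)
        # With binary=True every nonzero count is reported as 1.0.
        vector[idx] = 1.0 if binary else vector.get(idx, 0.0) + 1.0
    return vector


terms = ["a", "b", "a", "c"]
counts = hashing_tf(terms, num_features=16)
flags = hashing_tf(terms, num_features=16, binary=True)
```

Distinct terms that hash to the same index share a slot, so a small numFeatures trades memory for collisions.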

binary = Param(parent='undefined', name='binary', doc='If True, all non zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts. Default False.')
clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map

getBinary()[source]

Gets the value of binary or its default value.

New in version 2.0.0.

getInputCol()

Gets the value of inputCol or its default value.

getNumFeatures()

Gets the value of numFeatures or its default value.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets the value of outputCol or its default value.

getParam(paramName)

Gets a param by its name.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

indexOf(term)[source]

Returns the index of the input term.

New in version 3.0.0.

inputCol = Param(parent='undefined', name='inputCol', doc='input column name.')
isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

numFeatures = Param(parent='undefined', name='numFeatures', doc='Number of features. Should be greater than 0.')
outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

save(path)

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param, value)

Sets a parameter in the embedded param map.

setBinary(value)[source]

Sets the value of binary.

New in version 2.0.0.

setInputCol(value)[source]

Sets the value of inputCol.

setNumFeatures(value)[source]

Sets the value of numFeatures.

setOutputCol(value)[source]

Sets the value of outputCol.

setParams(self, numFeatures=1 << 18, binary=False, inputCol=None, outputCol=None)[source]

Sets params for this HashingTF.

New in version 1.3.0.

transform(dataset, params=None)

Transforms the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params.

Returns

transformed dataset

New in version 1.3.0.

write()

Returns an MLWriter instance for this ML instance.

class pyspark.ml.feature.IDF(minDocFreq=0, inputCol=None, outputCol=None)[source]

Compute the Inverse Document Frequency (IDF) given a collection of documents.

>>> from pyspark.ml.linalg import DenseVector
>>> df = spark.createDataFrame([(DenseVector([1.0, 2.0]),),
...     (DenseVector([0.0, 1.0]),), (DenseVector([3.0, 0.2]),)], ["tf"])
>>> idf = IDF(minDocFreq=3)
>>> idf.setInputCol("tf")
IDF...
>>> idf.setOutputCol("idf")
IDF...
>>> model = idf.fit(df)
>>> model.setOutputCol("idf")
IDFModel...
>>> model.getMinDocFreq()
3
>>> model.idf
DenseVector([0.0, 0.0])
>>> model.docFreq
[0, 3]
>>> model.numDocs == df.count()
True
>>> model.transform(df).head().idf
DenseVector([0.0, 0.0])
>>> idf.setParams(outputCol="freqs").fit(df).transform(df).collect()[1].freqs
DenseVector([0.0, 0.0])
>>> params = {idf.minDocFreq: 1, idf.outputCol: "vector"}
>>> idf.fit(df, params).transform(df).head().vector
DenseVector([0.2877, 0.0])
>>> idfPath = temp_path + "/idf"
>>> idf.save(idfPath)
>>> loadedIdf = IDF.load(idfPath)
>>> loadedIdf.getMinDocFreq() == idf.getMinDocFreq()
True
>>> modelPath = temp_path + "/idf-model"
>>> model.save(modelPath)
>>> loadedModel = IDFModel.load(modelPath)
>>> loadedModel.transform(df).head().idf == model.transform(df).head().idf
True

New in version 1.4.0.
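
The numbers in the example above are consistent with the smoothed formula idf = log((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents containing the term, with terms below minDocFreq assigned an IDF of 0. The sketch below assumes that formula and uses plain lists; fit_idf and transform are hypothetical helper names, and the sketch reports the raw IDF weights rather than the model's docFreq attribute.

```python
import math


def fit_idf(docs, min_doc_freq=0):
    """Compute one IDF weight per term from dense term-frequency rows."""
    num_docs = len(docs)
    num_terms = len(docs[0])
    doc_freq = [sum(1 for d in docs if d[j] > 0) for j in range(num_terms)]
    return [
        math.log((num_docs + 1) / (df + 1)) if df >= min_doc_freq else 0.0
        for df in doc_freq
    ]


def transform(doc, idf):
    """Scale each term frequency by its IDF weight."""
    return [tf * w for tf, w in zip(doc, idf)]


docs = [[1.0, 2.0], [0.0, 1.0], [3.0, 0.2]]
strict = fit_idf(docs, min_doc_freq=3)   # [0.0, 0.0]
relaxed = fit_idf(docs, min_doc_freq=1)  # first weight is log(4/3)
first_row = transform(docs[0], relaxed)
```

With min_doc_freq=1, the first term appears in 2 of 3 documents, giving log(4/3), roughly 0.2877; the second term appears in every document, giving log(1) = 0.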

clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map

fit(dataset, params=None)

Fits a model to the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.

Returns

fitted model(s)

New in version 1.3.0.

fitMultiple(dataset, paramMaps)

Fits a model to the input dataset for each param map in paramMaps.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame.

  • paramMaps – A Sequence of param maps.

Returns

A thread-safe iterable that contains one model for each param map. Each call to next(modelIterator) returns (index, model), where model was fit using paramMaps[index]. The index values may not be sequential.

New in version 2.3.0.
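
Because the index values may arrive out of order, callers typically store each model by its index rather than appending. The sketch below shows that pattern with a hypothetical stand-in iterator; `ModelIterator` and `fit_one` are illustrative names and the "models" are plain strings, so this is not the object fitMultiple actually returns.

```python
import threading


class ModelIterator:
    """Hypothetical stand-in for the iterable returned by fitMultiple."""

    def __init__(self, fit_one, num_models):
        self._fit_one = fit_one
        self._num_models = num_models
        self._counter = 0
        self._lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        # The lock makes claiming an index atomic, so several threads can
        # call next() on the same iterator without fitting a map twice.
        with self._lock:
            index = self._counter
            if index >= self._num_models:
                raise StopIteration
            self._counter += 1
        return index, self._fit_one(index)


param_maps = [{"minDocFreq": 1}, {"minDocFreq": 3}]


def fit_one(index):
    # Placeholder for fitting a real model with param_maps[index].
    return "model(minDocFreq=%d)" % param_maps[index]["minDocFreq"]


# Store by index so the result lines up with param_maps in any arrival order.
models = [None] * len(param_maps)
for index, model in ModelIterator(fit_one, len(param_maps)):
    models[index] = model
```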

getInputCol()

Gets the value of inputCol or its default value.

getMinDocFreq()

Gets the value of minDocFreq or its default value.

New in version 1.4.0.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets the value of outputCol or its default value.

getParam(paramName)

Gets a param by its name.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

inputCol = Param(parent='undefined', name='inputCol', doc='input column name.')
isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.
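
The difference between isSet, hasDefault, isDefined and getOrDefault can be sketched with two dictionaries, one for defaults and one for user-supplied values. `MiniParams` is a hypothetical class written for illustration; it uses string keys where the real API uses Param objects, and the KeyError it raises is this sketch's choice of error.

```python
class MiniParams:
    """Hypothetical stand-in for the param bookkeeping described above."""

    def __init__(self, defaults=None):
        self._defaults = dict(defaults or {})  # default param values
        self._user = {}                        # values explicitly set by the user

    def set(self, name, value):
        self._user[name] = value
        return self

    def clear(self, name):
        # Removes only an explicitly set value; any default is untouched.
        self._user.pop(name, None)

    def isSet(self, name):
        return name in self._user

    def hasDefault(self, name):
        return name in self._defaults

    def isDefined(self, name):
        return self.isSet(name) or self.hasDefault(name)

    def getOrDefault(self, name):
        # The user-supplied value wins; otherwise fall back to the default.
        if name in self._user:
            return self._user[name]
        return self._defaults[name]  # raises KeyError if neither is set


params = MiniParams(defaults={"minDocFreq": 0})
params.set("outputCol", "idf")
```

In this sketch `minDocFreq` is defined but not set, `outputCol` is both set and defined, and `inputCol` is neither, so `params.getOrDefault("inputCol")` raises an error.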

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

minDocFreq = Param(parent='undefined', name='minDocFreq', doc='minimum number of documents in which a term should appear for filtering')
outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

save(path)

Saves this ML instance to the given path, a shortcut of write().save(path).

set(param, value)

Sets a parameter in the embedded param map.

setInputCol(value)[source]

Sets the value of inputCol.

setMinDocFreq(value)[source]

Sets the value of minDocFreq.

New in version 1.4.0.

setOutputCol(value)[source]

Sets the value of outputCol.

setParams(self, minDocFreq=0, inputCol=None, outputCol=None)[source]

Sets params for this IDF.

New in version 1.4.0.

write()

Returns an MLWriter instance for this ML instance.

class pyspark.ml.feature.IDFModel(java_model=None)[source]

Model fitted by IDF.

New in version 1.4.0.

clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with the extra params, so both the Python wrapper and the Java pipeline component are copied.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

property docFreq

Returns the document frequency.

New in version 3.0.0.

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map

getInputCol()

Gets the value of inputCol or its default value.

getMinDocFreq()

Gets the value of minDocFreq or its default value.

New in version 1.4.0.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets the value of outputCol or its default value.

getParam(paramName)

Gets a param by its name.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

property idf

Returns the IDF vector.

New in version 2.0.0.

inputCol = Param(parent='undefined', name='inputCol', doc='input column name.')
isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

minDocFreq = Param(parent='undefined', name='minDocFreq', doc='minimum number of documents in which a term should appear for filtering')
property numDocs

Returns the number of documents evaluated to compute the IDF.

New in version 3.0.0.

outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

save(path)

Saves this ML instance to the given path, a shortcut of write().save(path).

set(param, value)

Sets a parameter in the embedded param map.

setInputCol(value)[source]

Sets the value of inputCol.

New in version 3.0.0.

setOutputCol(value)[source]

Sets the value of outputCol.

New in version 3.0.0.

transform(dataset, params=None)

Transforms the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params.

Returns

transformed dataset

New in version 1.3.0.

write()

Returns an MLWriter instance for this ML instance.

class pyspark.ml.feature.Imputer(strategy='mean', missingValue=nan, inputCols=None, outputCols=None, inputCol=None, outputCol=None, relativeError=0.001)[source]

Imputation estimator for completing missing values, using either the mean or the median of the column in which the missing values are located. The input columns should be of numeric type. Imputer currently does not support categorical features and may produce incorrect values for them.

Note that the mean/median value is computed after filtering out missing values. All Null values in the input columns are treated as missing, and so are also imputed. To compute the median, pyspark.sql.DataFrame.approxQuantile() is used with the relative error given by relativeError (0.001 by default).

>>> df = spark.createDataFrame([(1.0, float("nan")), (2.0, float("nan")), (float("nan"), 3.0),
...                             (4.0, 4.0), (5.0, 5.0)], ["a", "b"])
>>> imputer = Imputer()
>>> imputer.setInputCols(["a", "b"])
Imputer...
>>> imputer.setOutputCols(["out_a", "out_b"])
Imputer...
>>> imputer.getRelativeError()
0.001
>>> model = imputer.fit(df)
>>> model.setInputCols(["a", "b"])
ImputerModel...
>>> model.getStrategy()
'mean'
>>> model.surrogateDF.show()
+---+---+
|  a|  b|
+---+---+
|3.0|4.0|
+---+---+
...
>>> model.transform(df).show()
+---+---+-----+-----+
|  a|  b|out_a|out_b|
+---+---+-----+-----+
|1.0|NaN|  1.0|  4.0|
|2.0|NaN|  2.0|  4.0|
|NaN|3.0|  3.0|  3.0|
...
>>> imputer.setStrategy("median").setMissingValue(1.0).fit(df).transform(df).show()
+---+---+-----+-----+
|  a|  b|out_a|out_b|
+---+---+-----+-----+
|1.0|NaN|  4.0|  NaN|
...
>>> df1 = spark.createDataFrame([(1.0,), (2.0,), (float("nan"),), (4.0,), (5.0,)], ["a"])
>>> imputer1 = Imputer(inputCol="a", outputCol="out_a")
>>> model1 = imputer1.fit(df1)
>>> model1.surrogateDF.show()
+---+
|  a|
+---+
|3.0|
+---+
...
>>> model1.transform(df1).show()
+---+-----+
|  a|out_a|
+---+-----+
|1.0|  1.0|
|2.0|  2.0|
|NaN|  3.0|
...
>>> imputer1.setStrategy("median").setMissingValue(1.0).fit(df1).transform(df1).show()
+---+-----+
|  a|out_a|
+---+-----+
|1.0|  4.0|
...
>>> df2 = spark.createDataFrame([(float("nan"),), (float("nan"),), (3.0,), (4.0,), (5.0,)],
...                             ["b"])
>>> imputer2 = Imputer(inputCol="b", outputCol="out_b")
>>> model2 = imputer2.fit(df2)
>>> model2.surrogateDF.show()
+---+
|  b|
+---+
|4.0|
+---+
...
>>> model2.transform(df2).show()
+---+-----+
|  b|out_b|
+---+-----+
|NaN|  4.0|
|NaN|  4.0|
|3.0|  3.0|
...
>>> imputer2.setStrategy("median").setMissingValue(1.0).fit(df2).transform(df2).show()
+---+-----+
|  b|out_b|
+---+-----+
|NaN|  NaN|
...
>>> imputerPath = temp_path + "/imputer"
>>> imputer.save(imputerPath)
>>> loadedImputer = Imputer.load(imputerPath)
>>> loadedImputer.getStrategy() == imputer.getStrategy()
True
>>> loadedImputer.getMissingValue()
1.0
>>> modelPath = temp_path + "/imputer-model"
>>> model.save(modelPath)
>>> loadedModel = ImputerModel.load(modelPath)
>>> loadedModel.transform(df).head().out_a == model.transform(df).head().out_a
True

New in version 2.2.0.
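
The surrogate values 3.0 and 4.0 in the example above follow directly from the "filter, then average" rule. The following pure-Python sketch reproduces them for the mean strategy only. It assumes the default missingValue (NaN), treats None like a Null, and does not attempt the approximate median; `mean_surrogate` and `impute` are hypothetical helper names, not part of the API.

```python
import math


def _is_missing(value):
    # Assumption: default missingValue (NaN); None plays the role of Null.
    return value is None or math.isnan(value)


def mean_surrogate(column):
    """Mean of the column after dropping missing entries."""
    present = [v for v in column if not _is_missing(v)]
    return sum(present) / len(present)


def impute(column, surrogate):
    """Replace every missing entry with the surrogate value."""
    return [surrogate if _is_missing(v) else v for v in column]


nan = float("nan")
a = [1.0, 2.0, nan, 4.0, 5.0]
b = [nan, nan, 3.0, 4.0, 5.0]

surrogate_a = mean_surrogate(a)
surrogate_b = mean_surrogate(b)
out_a = impute(a, surrogate_a)
out_b = impute(b, surrogate_b)
```

`surrogate_a` and `surrogate_b` correspond to the single row of `surrogateDF`, and `out_a` and `out_b` to the output columns of the first `transform` call. The sketch does not handle a column whose values are all missing.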

clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with the extra params, so both the Python wrapper and the Java pipeline component are copied.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map

fit(dataset, params=None)

Fits a model to the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.

Returns

fitted model(s)

New in version 1.3.0.

fitMultiple(dataset, paramMaps)

Fits a model to the input dataset for each param map in paramMaps.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame.

  • paramMaps – A Sequence of param maps.

Returns

A thread-safe iterable that contains one model for each param map. Each call to next(modelIterator) returns (index, model), where model was fit using paramMaps[index]. The index values may not be sequential.

New in version 2.3.0.

getInputCol()

Gets the value of inputCol or its default value.

getInputCols()

Gets the value of inputCols or its default value.

getMissingValue()

Gets the value of missingValue or its default value.

New in version 2.2.0.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets the value of outputCol or its default value.

getOutputCols()

Gets the value of outputCols or its default value.

getParam(paramName)

Gets a param by its name.

getRelativeError()

Gets the value of relativeError or its default value.

getStrategy()

Gets the value of strategy or its default value.

New in version 2.2.0.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

inputCol = Param(parent='undefined', name='inputCol', doc='input column name.')
inputCols = Param(parent='undefined', name='inputCols', doc='input column names.')
isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

missingValue = Param(parent='undefined', name='missingValue', doc='The placeholder for the missing values. All occurrences of missingValue will be imputed.')
outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
outputCols = Param(parent='undefined', name='outputCols', doc='output column names.')
property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

relativeError = Param(parent='undefined', name='relativeError', doc='the relative target precision for the approximate quantile algorithm. Must be in the range [0, 1]')
save(path)

Saves this ML instance to the given path, a shortcut of write().save(path).

set(param, value)

Sets a parameter in the embedded param map.

setInputCol(value)[source]

Sets the value of inputCol.

New in version 3.0.0.

setInputCols(value)[source]

Sets the value of inputCols.

New in version 2.2.0.

setMissingValue(value)[source]

Sets the value of missingValue.

New in version 2.2.0.

setOutputCol(value)[source]

Sets the value of outputCol.

New in version 3.0.0.

setOutputCols(value)[source]

Sets the value of outputCols.

New in version 2.2.0.

setParams(self, strategy="mean", missingValue=float("nan"), inputCols=None, outputCols=None, inputCol=None, outputCol=None, relativeError=0.001)[source]

Sets params for this Imputer.

New in version 2.2.0.

setRelativeError(value)[source]

Sets the value of relativeError.

New in version 3.0.0.

setStrategy(value)[source]

Sets the value of strategy.

New in version 2.2.0.

strategy = Param(parent='undefined', name='strategy', doc='strategy for imputation. If mean, then replace missing values using the mean value of the feature. If median, then replace missing values using the median value of the feature.')
write()

Returns an MLWriter instance for this ML instance.

class pyspark.ml.feature.ImputerModel(java_model=None)[source]

Model fitted by Imputer.

New in version 2.2.0.

clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with the extra params, so both the Python wrapper and the Java pipeline component are copied.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map

getInputCol()

Gets the value of inputCol or its default value.

getInputCols()

Gets the value of inputCols or its default value.

getMissingValue()

Gets the value of missingValue or its default value.

New in version 2.2.0.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets the value of outputCol or its default value.

getOutputCols()

Gets the value of outputCols or its default value.

getParam(paramName)

Gets a param by its name.

getRelativeError()

Gets the value of relativeError or its default value.

getStrategy()

Gets the value of strategy or its default value.

New in version 2.2.0.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

inputCol = Param(parent='undefined', name='inputCol', doc='input column name.')
inputCols = Param(parent='undefined', name='inputCols', doc='input column names.')
isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

missingValue = Param(parent='undefined', name='missingValue', doc='The placeholder for the missing values. All occurrences of missingValue will be imputed.')
outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
outputCols = Param(parent='undefined', name='outputCols', doc='output column names.')
property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

relativeError = Param(parent='undefined', name='relativeError', doc='the relative target precision for the approximate quantile algorithm. Must be in the range [0, 1]')
save(path)

Saves this ML instance to the given path, a shortcut of write().save(path).

set(param, value)

Sets a parameter in the embedded param map.

setInputCol(value)[source]

Sets the value of inputCol.

New in version 3.0.0.

setInputCols(value)[source]

Sets the value of inputCols.

New in version 3.0.0.

setOutputCol(value)[source]

Sets the value of outputCol.

New in version 3.0.0.

setOutputCols(value)[source]

Sets the value of outputCols.

New in version 3.0.0.

strategy = Param(parent='undefined', name='strategy', doc='strategy for imputation. If mean, then replace missing values using the mean value of the feature. If median, then replace missing values using the median value of the feature.')
property surrogateDF

Returns a DataFrame containing inputCols and their corresponding surrogates, which are used to replace the missing values in the input DataFrame.

New in version 2.2.0.

transform(dataset, params=None)

Transforms the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params.

Returns

transformed dataset

New in version 1.3.0.

write()

Returns an MLWriter instance for this ML instance.

class pyspark.ml.feature.IndexToString(inputCol=None, outputCol=None, labels=None)[source]

A Transformer that maps a column of indices back to a new column of corresponding string values. The index-string mapping is either from the ML attributes of the input column, or from user-supplied labels (which take precedence over ML attributes). See StringIndexer for converting strings into indices.

New in version 1.6.0.
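
Conceptually, the mapping is a lookup of each index in the labels array. The sketch below shows only that idea in pure Python; `index_to_string` and the label values are hypothetical, and the real transformer operates on DataFrame columns and can also read the labels from column metadata.

```python
def index_to_string(indices, labels):
    """Map each numeric index back to the label stored at that position."""
    return [labels[int(i)] for i in indices]


# Hypothetical labels; position in the list is the index each label maps to.
labels = ["a", "b", "c"]
indexed = [0.0, 2.0, 1.0, 0.0]

restored = index_to_string(indexed, labels)
```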

clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with the extra params, so both the Python wrapper and the Java pipeline component are copied.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map

getInputCol()

Gets the value of inputCol or its default value.

getLabels()[source]

Gets the value of labels or its default value.

New in version 1.6.0.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets the value of outputCol or its default value.

getParam(paramName)

Gets a param by its name.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

inputCol = Param(parent='undefined', name='inputCol', doc='input column name.')
isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

labels = Param(parent='undefined', name='labels', doc='Optional array of labels specifying index-string mapping. If not provided or if empty, then metadata from inputCol is used instead.')
classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

save(path)

Saves this ML instance to the given path, a shortcut of write().save(path).

set(param, value)

Sets a parameter in the embedded param map.

setInputCol(value)[source]

Sets the value of inputCol.

setLabels(value)[source]

Sets the value of labels.

New in version 1.6.0.

setOutputCol(value)[source]

Sets the value of outputCol.

setParams(self, inputCol=None, outputCol=None, labels=None)[source]

Sets params for this IndexToString.

New in version 1.6.0.

transform(dataset, params=None)

Transforms the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params.

Returns

transformed dataset

New in version 1.3.0.

write()

Returns an MLWriter instance for this ML instance.

class pyspark.ml.feature.Interaction(inputCols=None, outputCol=None)[source]

Implements the feature interaction transform. This transformer takes in Double and Vector type columns and outputs a flattened vector of their feature interactions. To handle interaction, we first one-hot encode any nominal features. Then, a vector of the feature cross-products is produced.

For example, given the input feature values Double(2) and Vector(3, 4), the output would be Vector(6, 8) if all input features were numeric. If the first feature was instead nominal with four categories, the output would then be Vector(0, 0, 0, 0, 3, 4, 0, 0).
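
Both outputs in the worked example above can be checked with a few lines of pure Python. This is a sketch of the cross-product idea, not Spark's implementation; `one_hot` and `interact` are hypothetical helpers, and the nominal value 2 is assumed to be a zero-based category index.

```python
from itertools import product


def one_hot(value, num_categories):
    """Encode a zero-based category index as a one-hot list."""
    encoded = [0.0] * num_categories
    encoded[int(value)] = 1.0
    return encoded


def interact(*features):
    """Flattened cross-product of the given feature vectors, left to right."""
    result = [1.0]
    for feature in features:
        # Every value accumulated so far is multiplied by every entry of the
        # next feature; earlier features vary slowest in the output.
        result = [x * y for x, y in product(result, feature)]
    return result


numeric_case = interact([2.0], [3.0, 4.0])
nominal_case = interact(one_hot(2, 4), [3.0, 4.0])
```

`numeric_case` treats the first feature as a plain number, while `nominal_case` one-hot encodes it into four slots first, which is why only the third pair of output entries is non-zero.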

>>> df = spark.createDataFrame([(0.0, 1.0), (2.0, 3.0)], ["a", "b"])
>>> interaction = Interaction()
>>> interaction.setInputCols(["a", "b"])
Interaction...
>>> interaction.setOutputCol("ab")
Interaction...
>>> interaction.transform(df).show()
+---+---+-----+
|  a|  b|   ab|
+---+---+-----+
|0.0|1.0|[0.0]|
|2.0|3.0|[6.0]|
+---+---+-----+
...
>>> interactionPath = temp_path + "/interaction"
>>> interaction.save(interactionPath)
>>> loadedInteraction = Interaction.load(interactionPath)
>>> loadedInteraction.transform(df).head().ab == interaction.transform(df).head().ab
True

New in version 3.0.0.

clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with the extra params, so both the Python wrapper and the Java pipeline component are copied.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map

getInputCols()

Gets the value of inputCols or its default value.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets the value of outputCol or its default value.

getParam(paramName)

Gets a param by its name.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

inputCols = Param(parent='undefined', name='inputCols', doc='input column names.')
isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

save(path)

Saves this ML instance to the given path, a shortcut of write().save(path).

set(param, value)

Sets a parameter in the embedded param map.

setInputCols(value)[source]

Sets the value of inputCols.

New in version 3.0.0.

setOutputCol(value)[source]

Sets the value of outputCol.

New in version 3.0.0.

setParams(self, inputCols=None, outputCol=None)[source]

Sets params for this Interaction.

New in version 3.0.0.

transform(dataset, params=None)

Transforms the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params.

Returns

transformed dataset

New in version 1.3.0.

write()

Returns an MLWriter instance for this ML instance.

class pyspark.ml.feature.MaxAbsScaler(inputCol=None, outputCol=None)[source]

Rescales each feature individually to the range [-1, 1] by dividing by the maximum absolute value of that feature. It does not shift or center the data, and thus does not destroy any sparsity.

>>> from pyspark.ml.linalg import Vectors
>>> df = spark.createDataFrame([(Vectors.dense([1.0]),), (Vectors.dense([2.0]),)], ["a"])
>>> maScaler = MaxAbsScaler(outputCol="scaled")
>>> maScaler.setInputCol("a")
MaxAbsScaler...
>>> model = maScaler.fit(df)
>>> model.setOutputCol("scaledOutput")
MaxAbsScalerModel...
>>> model.transform(df).show()
+-----+------------+
|    a|scaledOutput|
+-----+------------+
|[1.0]|       [0.5]|
|[2.0]|       [1.0]|
+-----+------------+
...
>>> scalerPath = temp_path + "/max-abs-scaler"
>>> maScaler.save(scalerPath)
>>> loadedMAScaler = MaxAbsScaler.load(scalerPath)
>>> loadedMAScaler.getInputCol() == maScaler.getInputCol()
True
>>> loadedMAScaler.getOutputCol() == maScaler.getOutputCol()
True
>>> modelPath = temp_path + "/max-abs-scaler-model"
>>> model.save(modelPath)
>>> loadedModel = MaxAbsScalerModel.load(modelPath)
>>> loadedModel.maxAbs == model.maxAbs
True
>>> loadedModel.transform(df).take(1) == model.transform(df).take(1)
True

New in version 2.0.0.
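
The scaled values in the example above come from a per-feature division by the largest absolute value seen during fitting. A minimal pure-Python sketch of that arithmetic follows. `fit_max_abs` and `scale` are hypothetical helpers, and dividing by 1.0 when a feature is all zeros is this sketch's assumption for avoiding division by zero.

```python
def fit_max_abs(rows):
    """Largest absolute value of each feature across all rows."""
    num_features = len(rows[0])
    return [max(abs(row[j]) for row in rows) for j in range(num_features)]


def scale(rows, max_abs):
    """Divide each feature by its maximum absolute value."""
    # Assumption: an all-zero feature is divided by 1.0 and stays zero.
    divisors = [m if m != 0.0 else 1.0 for m in max_abs]
    return [[x / d for x, d in zip(row, divisors)] for row in rows]


rows = [[1.0], [2.0]]
max_abs = fit_max_abs(rows)
scaled = scale(rows, max_abs)
```

Because each value is only divided and never shifted, zeros stay zero, which is why sparsity is preserved.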

clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with the extra params, so both the Python wrapper and the Java pipeline component are copied.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map

fit(dataset, params=None)

Fits a model to the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.

Returns

fitted model(s)

New in version 1.3.0.

fitMultiple(dataset, paramMaps)

Fits a model to the input dataset for each param map in paramMaps.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame.

  • paramMaps – A Sequence of param maps.

Returns

A thread-safe iterable that contains one model for each param map. Each call to next(modelIterator) returns (index, model), where model was fit using paramMaps[index]. The index values may not be sequential.

New in version 2.3.0.

getInputCol()

Gets the value of inputCol or its default value.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets the value of outputCol or its default value.

getParam(paramName)

Gets a param by its name.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

inputCol = Param(parent='undefined', name='inputCol', doc='input column name.')
isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

save(path)

Saves this ML instance to the given path, a shortcut of write().save(path).

set(param, value)

Sets a parameter in the embedded param map.

setInputCol(value)[source]

Sets the value of inputCol.

setOutputCol(value)[source]

Sets the value of outputCol.

setParams(self, inputCol=None, outputCol=None)[source]

Sets params for this MaxAbsScaler.

New in version 2.0.0.

write()

Returns an MLWriter instance for this ML instance.

class pyspark.ml.feature.MaxAbsScalerModel(java_model=None)[source]

Model fitted by MaxAbsScaler.

New in version 2.0.0.

clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with the extra params, so both the Python wrapper and the Java pipeline component are copied.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map
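
The merge ordering described above can be illustrated with plain dictionaries. This is a minimal sketch, not the library's implementation: the function name extract_param_map and the use of string keys in place of Param objects are assumptions made for illustration only.

```python
def extract_param_map(defaults, user_supplied, extra=None):
    """Merge three param dictionaries; later sources win on conflicts.

    Hypothetical helper: plain strings stand in for Param objects.
    """
    merged = dict(defaults)        # lowest priority: default values
    merged.update(user_supplied)   # user-supplied values override defaults
    merged.update(extra or {})     # extra values override everything else
    return merged


defaults = {"outputCol": "default_out"}
user_supplied = {"inputCol": "features", "outputCol": "user_out"}
extra = {"outputCol": "extra_out"}

merged = extract_param_map(defaults, user_supplied, extra)
print(merged["outputCol"])  # the value from `extra` wins
```

Because extra is applied last, it is the natural place for one-off overrides that should not change the instance's own settings.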

getInputCol()

Gets the value of inputCol or its default value.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets the value of outputCol or its default value.

getParam(paramName)

Gets a param by its name.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

inputCol = Param(parent='undefined', name='inputCol', doc='input column name.')
isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

property maxAbs

Max Abs vector.

New in version 2.0.0.
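
As a rough illustration of what the max-abs vector is used for, the sketch below computes a per-feature maximum absolute value and divides each row by it, which maps every feature into [-1, 1]. The helper names and the choice to leave all-zero features at 0.0 are assumptions of this sketch, not a description of the library's internals.

```python
def max_abs_per_feature(rows):
    """Return the largest absolute value seen in each feature position."""
    return [max(abs(v) for v in column) for column in zip(*rows)]


def scale_row(row, max_abs):
    """Divide each feature by its max-abs value.

    Assumption: a feature whose max-abs value is zero is left at 0.0.
    """
    return [v / m if m != 0.0 else 0.0 for v, m in zip(row, max_abs)]


rows = [[1.0, -8.0], [2.0, 4.0], [-4.0, 0.0]]
max_abs = max_abs_per_feature(rows)              # plays the role of maxAbs
scaled = [scale_row(r, max_abs) for r in rows]   # every entry lies in [-1, 1]
print(max_abs, scaled)
```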

outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

save(path)

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param, value)

Sets a parameter in the embedded param map.

setInputCol(value)[source]

Sets the value of inputCol.

New in version 3.0.0.

setOutputCol(value)[source]

Sets the value of outputCol.

New in version 3.0.0.

transform(dataset, params=None)

Transforms the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params.

Returns

transformed dataset

New in version 1.3.0.

write()

Returns an MLWriter instance for this ML instance.

class pyspark.ml.feature.MinHashLSH(inputCol=None, outputCol=None, seed=None, numHashTables=1)[source]

LSH class for Jaccard distance. The input can be dense or sparse vectors, but it is more efficient if it is sparse. For example, Vectors.sparse(10, [(2, 1.0), (3, 1.0), (5, 1.0)]) means there are 10 elements in the space. This set contains elements 2, 3, and 5. Also, any input vector must have at least 1 non-zero index, and all non-zero values are treated as binary “1” values.

>>> from pyspark.ml.feature import MinHashLSH, MinHashLSHModel
>>> from pyspark.ml.linalg import Vectors
>>> from pyspark.sql.functions import col
>>> data = [(0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0]),),
...         (1, Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0]),),
...         (2, Vectors.sparse(6, [0, 2, 4], [1.0, 1.0, 1.0]),)]
>>> df = spark.createDataFrame(data, ["id", "features"])
>>> mh = MinHashLSH()
>>> mh.setInputCol("features")
MinHashLSH...
>>> mh.setOutputCol("hashes")
MinHashLSH...
>>> mh.setSeed(12345)
MinHashLSH...
>>> model = mh.fit(df)
>>> model.setInputCol("features")
MinHashLSHModel...
>>> model.transform(df).head()
Row(id=0, features=SparseVector(6, {0: 1.0, 1: 1.0, 2: 1.0}), hashes=[DenseVector([6179668...
>>> data2 = [(3, Vectors.sparse(6, [1, 3, 5], [1.0, 1.0, 1.0]),),
...          (4, Vectors.sparse(6, [2, 3, 5], [1.0, 1.0, 1.0]),),
...          (5, Vectors.sparse(6, [1, 2, 4], [1.0, 1.0, 1.0]),)]
>>> df2 = spark.createDataFrame(data2, ["id", "features"])
>>> key = Vectors.sparse(6, [1, 2], [1.0, 1.0])
>>> model.approxNearestNeighbors(df2, key, 1).collect()
[Row(id=5, features=SparseVector(6, {1: 1.0, 2: 1.0, 4: 1.0}), hashes=[DenseVector([6179668...
>>> model.approxSimilarityJoin(df, df2, 0.6, distCol="JaccardDistance").select(
...     col("datasetA.id").alias("idA"),
...     col("datasetB.id").alias("idB"),
...     col("JaccardDistance")).show()
+---+---+---------------+
|idA|idB|JaccardDistance|
+---+---+---------------+
|  0|  5|            0.5|
|  1|  4|            0.5|
+---+---+---------------+
...
>>> mhPath = temp_path + "/mh"
>>> mh.save(mhPath)
>>> mh2 = MinHashLSH.load(mhPath)
>>> mh2.getOutputCol() == mh.getOutputCol()
True
>>> modelPath = temp_path + "/mh-model"
>>> model.save(modelPath)
>>> model2 = MinHashLSHModel.load(modelPath)
>>> model.transform(df).head().hashes == model2.transform(df).head().hashes
True

New in version 2.2.0.
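
The Jaccard distance values in the example above can be checked by hand. The sketch below treats each sparse vector as the set of its non-zero indices, as the class description states; the function name jaccard_distance is a hypothetical helper, not part of pyspark.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two collections of non-zero indices."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        raise ValueError("at least one input must have a non-zero index")
    return 1.0 - len(a & b) / len(union)


# Rows 0 and 5 from the example: {0, 1, 2} and {1, 2, 4}
print(jaccard_distance([0, 1, 2], [1, 2, 4]))  # 0.5
# Rows 1 and 4 from the example: {2, 3, 4} and {2, 3, 5}
print(jaccard_distance([2, 3, 4], [2, 3, 5]))  # 0.5
```

Both pairs share two of four distinct indices, which is why they appear with a distance of 0.5 in the join output for a threshold of 0.6.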

clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map

fit(dataset, params=None)

Fits a model to the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.

Returns

fitted model(s)

New in version 1.3.0.

fitMultiple(dataset, paramMaps)

Fits a model to the input dataset for each param map in paramMaps.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame.

  • paramMaps – A Sequence of param maps.

Returns

A thread-safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model), where model was fit using paramMaps[index]. The index values may not arrive in sequential order.

New in version 2.3.0.
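
Because the (index, model) pairs may arrive out of order, a caller usually places each model by its index rather than appending. The following sketch shows that pattern with a stand-in iterator; collect_models and the string "models" are hypothetical and only illustrate the bookkeeping.

```python
def collect_models(model_iterator, num_param_maps):
    """Place each model at the position of the param map it was fit with."""
    models = [None] * num_param_maps
    for index, model in model_iterator:
        models[index] = model  # index refers to paramMaps[index]
    return models


# Stand-in for the iterable returned by fitMultiple (assumption: strings
# replace real fitted models, and results arrive out of order).
fake_iterator = iter([(2, "model-c"), (0, "model-a"), (1, "model-b")])
models = collect_models(fake_iterator, 3)
print(models)
```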

getInputCol()

Gets the value of inputCol or its default value.

getNumHashTables()

Gets the value of numHashTables or its default value.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets the value of outputCol or its default value.

getParam(paramName)

Gets a param by its name.

getSeed()

Gets the value of seed or its default value.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

inputCol = Param(parent='undefined', name='inputCol', doc='input column name.')
isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

numHashTables = Param(parent='undefined', name='numHashTables', doc='number of hash tables, where increasing number of hash tables lowers the false negative rate, and decreasing it improves the running performance.')
outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

save(path)

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

seed = Param(parent='undefined', name='seed', doc='random seed.')
set(param, value)

Sets a parameter in the embedded param map.

setInputCol(value)

Sets the value of inputCol.

setNumHashTables(value)

Sets the value of numHashTables.

setOutputCol(value)

Sets the value of outputCol.

setParams(self, inputCol=None, outputCol=None, seed=None, numHashTables=1)[source]

Sets params for this MinHashLSH.

New in version 2.2.0.

setSeed(value)[source]

Sets the value of seed.

write()

Returns an MLWriter instance for this ML instance.

class pyspark.ml.feature.MinHashLSHModel(java_model=None)[source]

Model produced by MinHashLSH, where multiple hash functions are stored. Each hash function is picked from the following family of hash functions, where \(a_i\) and \(b_i\) are randomly chosen integers less than prime: \(h_i(x) = ((x \cdot a_i + b_i) \mod prime)\). This hash family is approximately min-wise independent according to the reference.

See also

Tom Bohman, Colin Cooper, and Alan Frieze. “Min-wise independent linear permutations.” Electronic Journal of Combinatorics 7 (2000): R26.

New in version 2.2.0.
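
The hash family above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the model's actual code: the prime, the way coefficients are drawn, and the helper names are all chosen for this sketch, and each hash table's value is taken as the minimum of \(h_i(x)\) over the non-zero indices of the input.

```python
import random

PRIME = 2038074743  # assumption: a large prime picked for this sketch


def make_hash_functions(num_hash_tables, seed):
    """Draw one (a_i, b_i) coefficient pair per hash table."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME))
            for _ in range(num_hash_tables)]


def min_hash_signature(nonzero_indices, hash_functions):
    """Return one min-hash value per hash table for a set of indices."""
    if not nonzero_indices:
        raise ValueError("input must have at least one non-zero index")
    return [min((x * a + b) % PRIME for x in nonzero_indices)
            for a, b in hash_functions]


hash_functions = make_hash_functions(num_hash_tables=3, seed=12345)
sig_a = min_hash_signature([0, 1, 2], hash_functions)
sig_b = min_hash_signature([2, 1, 0], hash_functions)
print(sig_a == sig_b)  # the signature depends only on the set of indices
```

Sets with a large overlap are likely to share the minimizing index in at least one table, which is what makes equal hash values a useful candidate filter.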

approxNearestNeighbors(dataset, key, numNearestNeighbors, distCol='distCol')

Given a large dataset and an item, approximately find at most k items (where k is numNearestNeighbors) which have the closest distance to the item. If the outputCol is missing, the method will transform the data; if the outputCol exists, it will use that. This allows caching of the transformed data when necessary.

Note

This method is experimental and will likely change behavior in the next release.

Parameters
  • dataset – The dataset to search for nearest neighbors of the key.

  • key – Feature vector representing the item to search for.

  • numNearestNeighbors – The maximum number of nearest neighbors.

  • distCol – Output column for storing the distance between each result row and the key. Use “distCol” as default value if it’s not specified.

Returns

A dataset containing at most k items closest to the key. A column “distCol” is added to show the distance between each row and the key.

approxSimilarityJoin(datasetA, datasetB, threshold, distCol='distCol')

Join two datasets to approximately find all pairs of rows whose distance is smaller than the threshold. If the outputCol is missing, the method will transform the data; if the outputCol exists, it will use that. This allows caching of the transformed data when necessary.

Parameters
  • datasetA – One of the datasets to join.

  • datasetB – Another dataset to join.

  • threshold – The threshold for the distance of row pairs.

  • distCol – Output column for storing the distance between each pair of rows. Use “distCol” as default value if it’s not specified.

Returns

A joined dataset containing pairs of rows. The original rows are in columns “datasetA” and “datasetB”, and a column “distCol” is added to show the distance between each pair.
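
Conceptually, an approximate similarity join keeps only pairs that collide in at least one hash table and then filters those candidates by their true distance. The sketch below shows that idea on in-memory lists; it is a simplification under stated assumptions (a nested loop instead of a distributed join, a hypothetical prime and coefficient scheme, and an extra row with id 6 that duplicates row 0) and does not reproduce the library's implementation.

```python
import random

PRIME = 2038074743  # assumption: a large prime picked for this sketch


def signature(indices, coeffs):
    """One min-hash value per hash table."""
    return [min((x * a + b) % PRIME for x in indices) for a, b in coeffs]


def jaccard_distance(a, b):
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)


def approx_similarity_join(rows_a, rows_b, threshold, coeffs):
    """Return (idA, idB, distance) for colliding pairs below the threshold."""
    hashed_b = [(id_b, feats_b, signature(feats_b, coeffs))
                for id_b, feats_b in rows_b]
    result = []
    for id_a, feats_a in rows_a:
        sig_a = signature(feats_a, coeffs)
        for id_b, feats_b, sig_b in hashed_b:
            # Candidate pair: at least one hash table produced the same value.
            if any(ha == hb for ha, hb in zip(sig_a, sig_b)):
                dist = jaccard_distance(feats_a, feats_b)
                if dist < threshold:
                    result.append((id_a, id_b, dist))
    return result


rng = random.Random(12345)
coeffs = [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(5)]
rows_a = [(0, [0, 1, 2]), (1, [2, 3, 4]), (2, [0, 2, 4])]
rows_b = [(3, [1, 3, 5]), (4, [2, 3, 5]), (5, [1, 2, 4]), (6, [0, 1, 2])]
pairs = approx_similarity_join(rows_a, rows_b, 0.6, coeffs)
print(pairs)
```

Identical rows always collide, so the pair (0, 6) is guaranteed to be found; other close pairs may be missed when no table collides, which is the false-negative rate that numHashTables controls.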

clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map

getInputCol()

Gets the value of inputCol or its default value.

getNumHashTables()

Gets the value of numHashTables or its default value.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets the value of outputCol or its default value.

getParam(paramName)

Gets a param by its name.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

inputCol = Param(parent='undefined', name='inputCol', doc='input column name.')
isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

numHashTables = Param(parent='undefined', name='numHashTables', doc='number of hash tables, where increasing number of hash tables lowers the false negative rate, and decreasing it improves the running performance.')
outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

save(path)

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param, value)

Sets a parameter in the embedded param map.

setInputCol(value)

Sets the value of inputCol.

setOutputCol(value)

Sets the value of outputCol.

transform(dataset, params=None)

Transforms the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params.

Returns

transformed dataset

New in version 1.3.0.

write()

Returns an MLWriter instance for this ML instance.

class pyspark.ml.feature.MinMaxScaler(min=0.0, max=1.0, inputCol=None, outputCol=None)[source]

Rescale each feature individually to a common range [min, max] linearly using column summary statistics, which is also known as min-max normalization or rescaling. The rescaled value of each entry e_i of feature E is calculated as:

Rescaled(e_i) = (e_i - E_min) / (E_max - E_min) * (max - min) + min

For the case E_max == E_min, Rescaled(e_i) = 0.5 * (max + min)

Note

Since zero values will probably be transformed to non-zero values, output of the transformer will be DenseVector even for sparse input.
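
The rescaling formula, including the E_max == E_min case, can be written out for a single feature column as follows. This is a plain-Python sketch of the arithmetic only; the function name min_max_rescale and its argument names are assumptions made for illustration.

```python
def min_max_rescale(values, out_min=0.0, out_max=1.0):
    """Apply the min-max formula to one feature column."""
    e_min, e_max = min(values), max(values)
    if e_max == e_min:
        # Constant feature: every entry maps to the midpoint of the range.
        return [0.5 * (out_max + out_min)] * len(values)
    return [(v - e_min) / (e_max - e_min) * (out_max - out_min) + out_min
            for v in values]


print(min_max_rescale([0.0, 2.0]))                    # [0.0, 1.0]
print(min_max_rescale([0.0, 5.0, 10.0], -1.0, 1.0))   # [-1.0, 0.0, 1.0]
print(min_max_rescale([3.0, 3.0], -1.0, 5.0))         # [2.0, 2.0]
```

The first call mirrors the single-column example that follows: the original minimum 0.0 maps to 0.0 and the original maximum 2.0 maps to 1.0.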

>>> from pyspark.ml.feature import MinMaxScaler, MinMaxScalerModel
>>> from pyspark.ml.linalg import Vectors
>>> df = spark.createDataFrame([(Vectors.dense([0.0]),), (Vectors.dense([2.0]),)], ["a"])
>>> mmScaler = MinMaxScaler(outputCol="scaled")
>>> mmScaler.setInputCol("a")
MinMaxScaler...
>>> model = mmScaler.fit(df)
>>> model.setOutputCol("scaledOutput")
MinMaxScalerModel...
>>> model.originalMin
DenseVector([0.0])
>>> model.originalMax
DenseVector([2.0])
>>> model.transform(df).show()
+-----+------------+
|    a|scaledOutput|
+-----+------------+
|[0.0]|       [0.0]|
|[2.0]|       [1.0]|
+-----+------------+
...
>>> minMaxScalerPath = temp_path + "/min-max-scaler"
>>> mmScaler.save(minMaxScalerPath)
>>> loadedMMScaler = MinMaxScaler.load(minMaxScalerPath)
>>> loadedMMScaler.getMin() == mmScaler.getMin()
True
>>> loadedMMScaler.getMax() == mmScaler.getMax()
True
>>> modelPath = temp_path + "/min-max-scaler-model"
>>> model.save(modelPath)
>>> loadedModel = MinMaxScalerModel.load(modelPath)
>>> loadedModel.originalMin == model.originalMin
True
>>> loadedModel.originalMax == model.originalMax
True
>>> loadedModel.transform(df).take(1) == model.transform(df).take(1)
True

New in version 1.6.0.

clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map

fit(dataset, params=None)

Fits a model to the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.

Returns

fitted model(s)

New in version 1.3.0.

fitMultiple(dataset, paramMaps)

Fits a model to the input dataset for each param map in paramMaps.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame.

  • paramMaps – A Sequence of param maps.

Returns

A thread-safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model), where model was fit using paramMaps[index]. The index values may not arrive in sequential order.

New in version 2.3.0.

getInputCol()

Gets the value of inputCol or its default value.

getMax()

Gets the value of max or its default value.

New in version 1.6.0.

getMin()

Gets the value of min or its default value.

New in version 1.6.0.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets the value of outputCol or its default value.

getParam(paramName)

Gets a param by its name.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

inputCol = Param(parent='undefined', name='inputCol', doc='input column name.')
isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

max = Param(parent='undefined', name='max', doc='Upper bound of the output feature range')
min = Param(parent='undefined', name='min', doc='Lower bound of the output feature range')
outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

save(path)

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param, value)

Sets a parameter in the embedded param map.

setInputCol(value)[source]

Sets the value of inputCol.

setMax(value)[source]

Sets the value of max.

New in version 1.6.0.

setMin(value)[source]

Sets the value of min.

New in version 1.6.0.

setOutputCol(value)[source]

Sets the value of outputCol.

setParams(self, min=0.0, max=1.0, inputCol=None, outputCol=None)[source]

Sets params for this MinMaxScaler.

New in version 1.6.0.

write()

Returns an MLWriter instance for this ML instance.

class pyspark.ml.feature.MinMaxScalerModel(java_model=None)[source]

Model fitted by MinMaxScaler.

New in version 1.6.0.

clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map

getInputCol()

Gets the value of inputCol or its default value.

getMax()

Gets the value of max or its default value.

New in version 1.6.0.

getMin()

Gets the value of min or its default value.

New in version 1.6.0.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets the value of outputCol or its default value.

getParam(paramName)

Gets a param by its name.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

inputCol = Param(parent='undefined', name='inputCol', doc='input column name.')
isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

max = Param(parent='undefined', name='max', doc='Upper bound of the output feature range')
min = Param(parent='undefined', name='min', doc='Lower bound of the output feature range')
property originalMax

Max value for each original column during fitting.

New in version 2.0.0.

property originalMin

Min value for each original column during fitting.

New in version 2.0.0.

outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

save(path)

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param, value)

Sets a parameter in the embedded param map.

setInputCol(value)[source]

Sets the value of inputCol.

New in version 3.0.0.

setMax(value)[source]

Sets the value of max.

New in version 3.0.0.

setMin(value)[source]

Sets the value of min.

New in version 3.0.0.

setOutputCol(value)[source]

Sets the value of outputCol.

New in version 3.0.0.

transform(dataset, params=None)

Transforms the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params.

Returns

transformed dataset

New in version 1.3.0.

write()

Returns an MLWriter instance for this ML instance.

class pyspark.ml.feature.NGram(n=2, inputCol=None, outputCol=None)[source]

A feature transformer that converts the input array of strings into an array of n-grams. Null values in the input array are ignored. It returns an array of n-grams where each n-gram is represented by a space-separated string of words. When the input is empty, an empty array is returned. When the input array length is less than n (number of elements per n-gram), no n-grams are returned.

>>> from pyspark.ml.feature import NGram
>>> from pyspark.sql import Row
>>> df = spark.createDataFrame([Row(inputTokens=["a", "b", "c", "d", "e"])])
>>> ngram = NGram(n=2)
>>> ngram.setInputCol("inputTokens")
NGram...
>>> ngram.setOutputCol("nGrams")
NGram...
>>> ngram.transform(df).head()
Row(inputTokens=['a', 'b', 'c', 'd', 'e'], nGrams=['a b', 'b c', 'c d', 'd e'])
>>> # Change n-gram length
>>> ngram.setParams(n=4).transform(df).head()
Row(inputTokens=['a', 'b', 'c', 'd', 'e'], nGrams=['a b c d', 'b c d e'])
>>> # Temporarily modify output column.
>>> ngram.transform(df, {ngram.outputCol: "output"}).head()
Row(inputTokens=['a', 'b', 'c', 'd', 'e'], output=['a b c d', 'b c d e'])
>>> ngram.transform(df).head()
Row(inputTokens=['a', 'b', 'c', 'd', 'e'], nGrams=['a b c d', 'b c d e'])
>>> # Must use keyword arguments to specify params.
>>> ngram.setParams("text")
Traceback (most recent call last):
    ...
TypeError: Method setParams forces keyword arguments.
>>> ngramPath = temp_path + "/ngram"
>>> ngram.save(ngramPath)
>>> loadedNGram = NGram.load(ngramPath)
>>> loadedNGram.getN() == ngram.getN()
True
>>> loadedNGram.transform(df).take(1) == ngram.transform(df).take(1)
True

New in version 1.5.0.
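
The sliding-window behaviour described for this transformer can be mimicked on a plain list of tokens. This is a minimal sketch that assumes the input contains no null entries; the function name ngrams is a hypothetical helper.

```python
def ngrams(tokens, n=2):
    """Join every run of n consecutive tokens with a single space."""
    if n < 1:
        raise ValueError("n must be >= 1")
    # When len(tokens) < n the range is empty, so no n-grams are produced.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["a", "b", "c", "d", "e"]
print(ngrams(tokens, 2))  # ['a b', 'b c', 'c d', 'd e']
print(ngrams(tokens, 4))  # ['a b c d', 'b c d e']
print(ngrams(["a"], 2))   # []
```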

clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map

getInputCol()

Gets the value of inputCol or its default value.

getN()[source]

Gets the value of n or its default value.

New in version 1.5.0.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets the value of outputCol or its default value.

getParam(paramName)

Gets a param by its name.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

inputCol = Param(parent='undefined', name='inputCol', doc='input column name.')
isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

n = Param(parent='undefined', name='n', doc='number of elements per n-gram (>=1)')
outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

save(path)

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param, value)

Sets a parameter in the embedded param map.

setInputCol(value)[source]

Sets the value of inputCol.

setN(value)[source]

Sets the value of n.

New in version 1.5.0.

setOutputCol(value)[source]

Sets the value of outputCol.

setParams(self, n=2, inputCol=None, outputCol=None)[source]

Sets params for this NGram.

New in version 1.5.0.

transform(dataset, params=None)

Transforms the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params.

Returns

transformed dataset

New in version 1.3.0.

write()

Returns an MLWriter instance for this ML instance.

class pyspark.ml.feature.Normalizer(p=2.0, inputCol=None, outputCol=None)[source]

Normalize a vector to have unit norm using the given p-norm.

>>> from pyspark.ml.feature import Normalizer
>>> from pyspark.ml.linalg import Vectors
>>> svec = Vectors.sparse(4, {1: 4.0, 3: 3.0})
>>> df = spark.createDataFrame([(Vectors.dense([3.0, -4.0]), svec)], ["dense", "sparse"])
>>> normalizer = Normalizer(p=2.0)
>>> normalizer.setInputCol("dense")
Normalizer...
>>> normalizer.setOutputCol("features")
Normalizer...
>>> normalizer.transform(df).head().features
DenseVector([0.6, -0.8])
>>> normalizer.setParams(inputCol="sparse", outputCol="freqs").transform(df).head().freqs
SparseVector(4, {1: 0.8, 3: 0.6})
>>> params = {normalizer.p: 1.0, normalizer.inputCol: "dense", normalizer.outputCol: "vector"}
>>> normalizer.transform(df, params).head().vector
DenseVector([0.4286, -0.5714])
>>> normalizerPath = temp_path + "/normalizer"
>>> normalizer.save(normalizerPath)
>>> loadedNormalizer = Normalizer.load(normalizerPath)
>>> loadedNormalizer.getP() == normalizer.getP()
True
>>> loadedNormalizer.transform(df).take(1) == normalizer.transform(df).take(1)
True

New in version 1.4.0.
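
The numbers in the example above follow directly from the p-norm definition. The sketch below reproduces them for the dense vector [3.0, -4.0]; the helper names are hypothetical, and returning a zero vector unchanged is an assumption of this sketch.

```python
import math


def p_norm(values, p):
    """Compute the p-norm of a vector for p >= 1 (p may be math.inf)."""
    if p == math.inf:
        return max(abs(v) for v in values)
    return sum(abs(v) ** p for v in values) ** (1.0 / p)


def normalize(values, p=2.0):
    """Scale a vector so that its p-norm equals 1."""
    norm = p_norm(values, p)
    if norm == 0.0:
        return list(values)  # assumption: a zero vector is returned as-is
    return [v / norm for v in values]


print(normalize([3.0, -4.0], p=2.0))  # approximately [0.6, -0.8]
print(normalize([3.0, -4.0], p=1.0))  # approximately [0.4286, -0.5714]
```

With p=2.0 the norm is 5.0, and with p=1.0 it is 7.0, which gives the two outputs shown in the example.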

clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map

getInputCol()

Gets the value of inputCol or its default value.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets the value of outputCol or its default value.

getP()[source]

Gets the value of p or its default value.

New in version 1.4.0.

getParam(paramName)

Gets a param by its name.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

inputCol = Param(parent='undefined', name='inputCol', doc='input column name.')
isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
p = Param(parent='undefined', name='p', doc='the p norm value.')
property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

save(path)

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param, value)

Sets a parameter in the embedded param map.

setInputCol(value)[source]

Sets the value of inputCol.

setOutputCol(value)[source]

Sets the value of outputCol.

setP(value)[source]

Sets the value of p.

New in version 1.4.0.

setParams(self, p=2.0, inputCol=None, outputCol=None)[source]

Sets params for this Normalizer.

New in version 1.4.0.

transform(dataset, params=None)

Transforms the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params.

Returns

transformed dataset

New in version 1.3.0.

write()

Returns an MLWriter instance for this ML instance.

class pyspark.ml.feature.OneHotEncoder(inputCols=None, outputCols=None, handleInvalid='error', dropLast=True, inputCol=None, outputCol=None)[source]

A one-hot encoder that maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index. For example, with 5 categories, an input value of 2.0 maps to an output vector of [0.0, 0.0, 1.0, 0.0]. The last category is not included by default (configurable via dropLast), because including it would make the vector entries sum to one and hence be linearly dependent. So an input value of 4.0 maps to [0.0, 0.0, 0.0, 0.0].

Note

This is different from scikit-learn’s OneHotEncoder, which keeps all categories. The output vectors are sparse.

When handleInvalid is configured to ‘keep’, an extra “category” indicating invalid values is added as the last category. So when dropLast is true, invalid values are encoded as an all-zeros vector.
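
The interaction between dropLast and handleInvalid described above can be sketched in plain Python. This is an illustration of the stated rules only, not Spark's implementation; the function one_hot and its argument names are hypothetical, and it returns a dense list where Spark returns a sparse vector.

```python
def one_hot(index, num_categories, drop_last=True, handle_invalid="error"):
    """Return a dense one-hot list for a category index (illustrative only)."""
    valid = 0 <= index < num_categories
    if not valid and handle_invalid != "keep":
        raise ValueError(f"invalid category index: {index}")

    # 'keep' appends one extra category that collects invalid values.
    size = num_categories + (1 if handle_invalid == "keep" else 0)
    position = int(index) if valid else size - 1

    # dropLast removes the final slot, so whatever category sat there
    # is encoded as all zeros.
    if drop_last:
        size -= 1

    vector = [0.0] * size
    if position < size:
        vector[position] = 1.0
    return vector


print(one_hot(2.0, 5))                         # [0.0, 0.0, 1.0, 0.0]
print(one_hot(4.0, 5))                         # [0.0, 0.0, 0.0, 0.0]
print(one_hot(7.0, 5, handle_invalid="keep"))  # [0.0, 0.0, 0.0, 0.0, 0.0]
```

With handle_invalid="keep" and drop_last=True, the dropped slot is the extra invalid category, so in this sketch the valid index 4.0 still gets its own one-value.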

Note

When encoding multiple columns using the inputCols and outputCols params, input and output columns come in pairs, specified by their order in the arrays, and each pair is treated independently.

See also

StringIndexer for converting categorical values into category indices

>>> from pyspark.ml.linalg import Vectors
>>> df = spark.createDataFrame([(0.0,), (1.0,), (2.0,)], ["input"])
>>> ohe = OneHotEncoder()
>>> ohe.setInputCols(["input"])
OneHotEncoder...
>>> ohe.setOutputCols(["output"])
OneHotEncoder...
>>> model = ohe.fit(df)
>>> model.setOutputCols(["output"])
OneHotEncoderModel...
>>> model.getHandleInvalid()
'error'
>>> model.transform(df).head().output
SparseVector(2, {0: 1.0})
>>> single_col_ohe = OneHotEncoder(inputCol="input", outputCol="output")
>>> single_col_model = single_col_ohe.fit(df)
>>> single_col_model.transform(df).head().output
SparseVector(2, {0: 1.0})
>>> ohePath = temp_path + "/ohe"
>>> ohe.save(ohePath)
>>> loadedOHE = OneHotEncoder.load(ohePath)
>>> loadedOHE.getInputCols() == ohe.getInputCols()
True
>>> modelPath = temp_path + "/ohe-model"
>>> model.save(modelPath)
>>> loadedModel = OneHotEncoderModel.load(modelPath)
>>> loadedModel.categorySizes == model.categorySizes
True
>>> loadedModel.transform(df).take(1) == model.transform(df).take(1)
True

New in version 2.3.0.

clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

dropLast = Param(parent='undefined', name='dropLast', doc='whether to drop the last category')
explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, then merges them with extra values from the input into a flat param map. On conflict the later value wins, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map

fit(dataset, params=None)

Fits a model to the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.

Returns

fitted model(s)

New in version 1.3.0.

fitMultiple(dataset, paramMaps)

Fits a model to the input dataset for each param map in paramMaps.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame.

  • paramMaps – A Sequence of param maps.

Returns

A thread-safe iterable that contains one model for each param map. Each call to next(modelIterator) returns (index, model), where model was fit using paramMaps[index]. The index values may not be sequential.

New in version 2.3.0.
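
Because the (index, model) pairs from fitMultiple may arrive out of order, callers typically place each model by its index rather than by arrival order. The sketch below shows that consumption pattern with a hypothetical ModelIterator and a stand-in fit_one function; neither is a pyspark API, and the string-keyed param maps are an assumption for illustration.

```python
import threading


class ModelIterator:
    """Hypothetical thread-safe iterator yielding (index, model) pairs."""

    def __init__(self, fit_one, param_maps):
        self._fit_one = fit_one
        self._param_maps = param_maps
        self._pending = list(range(len(param_maps)))
        self._lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        # The lock ensures that each index is handed out exactly once,
        # even when several threads call next() concurrently.
        with self._lock:
            if not self._pending:
                raise StopIteration
            # Popping from the end deliberately yields indices in reverse,
            # to show that they need not be sequential.
            index = self._pending.pop()
        return index, self._fit_one(self._param_maps[index])


def fit_one(param_map):
    # Stand-in for real training: the "model" only records its params.
    return {"fitted_with": param_map}


param_maps = [{"dropLast": True}, {"dropLast": False}, {"handleInvalid": "keep"}]
models = [None] * len(param_maps)
for index, model in ModelIterator(fit_one, param_maps):
    models[index] = model  # place each model by index, not by arrival order
```

After the loop, models[i] corresponds to param_maps[i] regardless of the order in which the pairs were produced.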

getDropLast()

Gets the value of dropLast or its default value.

New in version 2.3.0.

getHandleInvalid()

Gets the value of handleInvalid or its default value.

getInputCol()

Gets the value of inputCol or its default value.

getInputCols()

Gets the value of inputCols or its default value.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
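
The lookup rules behind getOrDefault, isSet, hasDefault, and isDefined can be modelled with two dictionaries. The class below is a hypothetical stand-in written for illustration, not the pyspark Params class. The use of KeyError is an assumption, since the text above only says that an error is raised.

```python
class SimpleParams:
    """Hypothetical stand-in for a Params object; not part of pyspark."""

    def __init__(self, defaults=None):
        self._defaults = dict(defaults or {})
        self._user = {}

    def set(self, name, value):
        self._user[name] = value
        return self

    def clear(self, name):
        # Removes only an explicitly set value; the default is untouched.
        self._user.pop(name, None)

    def is_set(self, name):
        return name in self._user

    def has_default(self, name):
        return name in self._defaults

    def is_defined(self, name):
        return self.is_set(name) or self.has_default(name)

    def get_or_default(self, name):
        # The user-supplied value wins; otherwise fall back to the default.
        if name in self._user:
            return self._user[name]
        if name in self._defaults:
            return self._defaults[name]
        raise KeyError(f"param {name!r} has neither a value nor a default")


params = SimpleParams(defaults={"handleInvalid": "error", "dropLast": True})
params.set("handleInvalid", "keep")
print(params.get_or_default("handleInvalid"))  # keep
params.clear("handleInvalid")
print(params.get_or_default("handleInvalid"))  # error
```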

getOutputCol()

Gets the value of outputCol or its default value.

getOutputCols()

Gets the value of outputCols or its default value.

getParam(paramName)

Gets a param by its name.

handleInvalid = Param(parent='undefined', name='handleInvalid', doc="How to handle invalid data during transform(). Options are 'keep' (invalid data presented as an extra categorical feature) or error (throw an error). Note that this Param is only used during transform; during fitting, invalid data will result in an error.")
hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

inputCol = Param(parent='undefined', name='inputCol', doc='input column name.')
inputCols = Param(parent='undefined', name='inputCols', doc='input column names.')
isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
outputCols = Param(parent='undefined', name='outputCols', doc='output column names.')
property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

save(path)

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param, value)

Sets a parameter in the embedded param map.

setDropLast(value)[source]

Sets the value of dropLast.

New in version 2.3.0.

setHandleInvalid(value)[source]

Sets the value of handleInvalid.

New in version 3.0.0.

setInputCol(value)[source]

Sets the value of inputCol.

New in version 3.0.0.

setInputCols(value)[source]

Sets the value of inputCols.

New in version 3.0.0.

setOutputCol(value)[source]

Sets the value of outputCol.

New in version 3.0.0.

setOutputCols(value)[source]

Sets the value of outputCols.

New in version 3.0.0.

setParams(self, inputCols=None, outputCols=None, handleInvalid="error", dropLast=True, inputCol=None, outputCol=None)[source]

Sets params for this OneHotEncoder.

New in version 2.3.0.

write()

Returns an MLWriter instance for this ML instance.

class pyspark.ml.feature.OneHotEncoderModel(java_model=None)[source]

Model fitted by OneHotEncoder.

New in version 2.3.0.

property categorySizes

Original number of categories for each feature being encoded. The array contains one value for each input column, in order.

New in version 2.3.0.

clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

dropLast = Param(parent='undefined', name='dropLast', doc='whether to drop the last category')
explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, then merges them with extra values from the input into a flat param map. On conflict the later value wins, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map

getDropLast()

Gets the value of dropLast or its default value.

New in version 2.3.0.

getHandleInvalid()

Gets the value of handleInvalid or its default value.

getInputCol()

Gets the value of inputCol or its default value.

getInputCols()

Gets the value of inputCols or its default value.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets the value of outputCol or its default value.

getOutputCols()

Gets the value of outputCols or its default value.

getParam(paramName)

Gets a param by its name.

handleInvalid = Param(parent='undefined', name='handleInvalid', doc="How to handle invalid data during transform(). Options are 'keep' (invalid data presented as an extra categorical feature) or error (throw an error). Note that this Param is only used during transform; during fitting, invalid data will result in an error.")
hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

inputCol = Param(parent='undefined', name='inputCol', doc='input column name.')
inputCols = Param(parent='undefined', name='inputCols', doc='input column names.')
isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
outputCols = Param(parent='undefined', name='outputCols', doc='output column names.')
property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

save(path)

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param, value)

Sets a parameter in the embedded param map.

setDropLast(value)[source]

Sets the value of dropLast.

New in version 3.0.0.

setHandleInvalid(value)[source]

Sets the value of handleInvalid.

New in version 3.0.0.

setInputCol(value)[source]

Sets the value of inputCol.

New in version 3.0.0.

setInputCols(value)[source]

Sets the value of inputCols.

New in version 3.0.0.

setOutputCol(value)[source]

Sets the value of outputCol.

New in version 3.0.0.

setOutputCols(value)[source]

Sets the value of outputCols.

New in version 3.0.0.

transform(dataset, params=None)

Transforms the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params.

Returns

transformed dataset

New in version 1.3.0.

write()

Returns an MLWriter instance for this ML instance.

class pyspark.ml.feature.PCA(k=None, inputCol=None, outputCol=None)[source]

PCA trains a model to project vectors to a lower dimensional space of the top k principal components.
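
To make the idea of a principal component concrete, the following pure-Python sketch finds the single direction of greatest variance (the k = 1 case) by power iteration on the sample covariance matrix. It illustrates the concept under stated assumptions and is not how Spark computes PCA; the function names and the toy data are hypothetical.

```python
import math


def top_principal_component(rows, iterations=200):
    """Return the unit-length direction of greatest variance.

    Assumes the data has non-zero variance and that the starting vector
    is not orthogonal to the leading eigenvector.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]

    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]

    # Power iteration: repeatedly multiply by cov and renormalise.
    vec = [1.0] * d
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec


def project(row, component):
    """Project one row onto a single component (dot product)."""
    return sum(x * c for x, c in zip(row, component))


# Toy data lying exactly on the line y = 2x.
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
pc1 = top_principal_component(rows)
print(pc1)                    # approximately [0.447, 0.894]
print(project(rows[0], pc1))  # approximately 2.236
```

All of the variance in this toy data lies along (1, 2), so the recovered component is that direction scaled to unit length.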

>>> from pyspark.ml.linalg import Vectors
>>> data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),
...     (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
...     (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
>>> df = spark.createDataFrame(data,["features"])
>>> pca = PCA(k=2, inputCol="features")
>>> pca.setOutputCol("pca_features")
PCA...
>>> model = pca.fit(df)
>>> model.getK()
2
>>> model.setOutputCol("output")
PCAModel...
>>> model.transform(df).collect()[0].output
DenseVector([1.648..., -4.013...])
>>> model.explainedVariance
DenseVector([0.794..., 0.205...])
>>> pcaPath = temp_path + "/pca"
>>> pca.save(pcaPath)
>>> loadedPca = PCA.load(pcaPath)
>>> loadedPca.getK() == pca.getK()
True
>>> modelPath = temp_path + "/pca-model"
>>> model.save(modelPath)
>>> loadedModel = PCAModel.load(modelPath)
>>> loadedModel.pc == model.pc
True
>>> loadedModel.explainedVariance == model.explainedVariance
True
>>> loadedModel.transform(df).take(1) == model.transform(df).take(1)
True

New in version 1.5.0.

clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, then merges them with extra values from the input into a flat param map. On conflict the later value wins, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map

fit(dataset, params=None)

Fits a model to the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.

Returns

fitted model(s)

New in version 1.3.0.

fitMultiple(dataset, paramMaps)

Fits a model to the input dataset for each param map in paramMaps.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame.

  • paramMaps – A Sequence of param maps.

Returns

A thread-safe iterable that contains one model for each param map. Each call to next(modelIterator) returns (index, model), where model was fit using paramMaps[index]. The index values may not be sequential.

New in version 2.3.0.

getInputCol()

Gets the value of inputCol or its default value.

getK()

Gets the value of k or its default value.

New in version 1.5.0.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets the value of outputCol or its default value.

getParam(paramName)

Gets a param by its name.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

inputCol = Param(parent='undefined', name='inputCol', doc='input column name.')
isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

k = Param(parent='undefined', name='k', doc='the number of principal components')
classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

save(path)

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param, value)

Sets a parameter in the embedded param map.

setInputCol(value)[source]

Sets the value of inputCol.

setK(value)[source]

Sets the value of k.

New in version 1.5.0.

setOutputCol(value)[source]

Sets the value of outputCol.

setParams(self, k=None, inputCol=None, outputCol=None)[source]

Sets params for this PCA.

New in version 1.5.0.

write()

Returns an MLWriter instance for this ML instance.

class pyspark.ml.feature.PCAModel(java_model=None)[source]

Model fitted by PCA. Transforms vectors to a lower dimensional space.

New in version 1.5.0.

clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

property explainedVariance

Returns a vector of proportions of variance explained by each principal component.

New in version 2.0.0.
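
A proportion of explained variance is commonly computed as a component's eigenvalue divided by the sum of all eigenvalues of the covariance matrix. The sketch below assumes that definition; the helper name and the eigenvalues are hypothetical and chosen only for illustration.

```python
def explained_variance(eigenvalues, k):
    """Proportion of total variance captured by each of the top k components."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    return [value / total for value in ordered[:k]]


# Hypothetical eigenvalues of a covariance matrix.
eigenvalues = [6.0, 3.0, 1.0]
proportions = explained_variance(eigenvalues, k=2)
print(proportions)  # [0.6, 0.3]
```

The proportions for the top k components sum to less than one whenever the discarded components carry some variance, as the third eigenvalue does here.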

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, then merges them with extra values from the input into a flat param map. On conflict the later value wins, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map

getInputCol()

Gets the value of inputCol or its default value.

getK()

Gets the value of k or its default value.

New in version 1.5.0.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets the value of outputCol or its default value.

getParam(paramName)

Gets a param by its name.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

inputCol = Param(parent='undefined', name='inputCol', doc='input column name.')
isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

k = Param(parent='undefined', name='k', doc='the number of principal components')
classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

property pc

Returns a principal components Matrix. Each column is one principal component.

New in version 2.0.0.

classmethod read()

Returns an MLReader instance for this class.

save(path)

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param, value)

Sets a parameter in the embedded param map.

setInputCol(value)[source]

Sets the value of inputCol.

New in version 3.0.0.

setOutputCol(value)[source]

Sets the value of outputCol.

New in version 3.0.0.

transform(dataset, params=None)

Transforms the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params.

Returns

transformed dataset

New in version 1.3.0.

write()

Returns an MLWriter instance for this ML instance.

class pyspark.ml.feature.PolynomialExpansion(degree=2, inputCol=None, outputCol=None)[source]

Performs feature expansion in a polynomial space. As the Wikipedia article on polynomial expansion puts it, “In mathematics, an expansion of a product of sums expresses it as a sum of products by using the fact that multiplication distributes over addition”. For example, expanding the 2-variable feature vector (x, y) with degree 2 yields (x, x * x, y, x * y, y * y).
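
The term ordering in the (x, y) example can be reproduced with a short recursive sketch in plain Python. The function polynomial_expansion below is hypothetical and written only to match that example. Whether it agrees with Spark's ordering for longer vectors or higher degrees is an assumption.

```python
def polynomial_expansion(values, degree):
    """Expand a feature list into all monomials of total degree 1..degree."""

    def expand(last_idx, remaining_degree, multiplier):
        # Base case: no variables or no degree left, so emit the product so far.
        if remaining_degree == 0 or last_idx < 0:
            yield multiplier
            return
        alpha = multiplier
        # Spend 0, 1, ..., remaining_degree powers on the current variable
        # and distribute the rest over the earlier variables.
        for used in range(remaining_degree + 1):
            yield from expand(last_idx - 1, remaining_degree - used, alpha)
            alpha *= values[last_idx]

    terms = list(expand(len(values) - 1, degree, 1.0))
    return terms[1:]  # drop the leading constant term 1.0


print(polynomial_expansion([0.5, 2.0], 2))
# [0.5, 0.25, 2.0, 1.0, 4.0]
```

With x = 0.5 and y = 2.0 the output lists x, x * x, y, x * y, y * y in that order.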

>>> from pyspark.ml.linalg import Vectors
>>> df = spark.createDataFrame([(Vectors.dense([0.5, 2.0]),)], ["dense"])
>>> px = PolynomialExpansion(degree=2)
>>> px.setInputCol("dense")
PolynomialExpansion...
>>> px.setOutputCol("expanded")
PolynomialExpansion...
>>> px.transform(df).head().expanded
DenseVector([0.5, 0.25, 2.0, 1.0, 4.0])
>>> px.setParams(outputCol="test").transform(df).head().test
DenseVector([0.5, 0.25, 2.0, 1.0, 4.0])
>>> polyExpansionPath = temp_path + "/poly-expansion"
>>> px.save(polyExpansionPath)
>>> loadedPx = PolynomialExpansion.load(polyExpansionPath)
>>> loadedPx.getDegree() == px.getDegree()
True
>>> loadedPx.transform(df).take(1) == px.transform(df).take(1)
True

New in version 1.4.0.

clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

degree = Param(parent='undefined', name='degree', doc='the polynomial degree to expand (>= 1)')
explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, then merges them with extra values from the input into a flat param map. On conflict the later value wins, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map

getDegree()[source]

Gets the value of degree or its default value.

New in version 1.4.0.

getInputCol()

Gets the value of inputCol or its default value.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets the value of outputCol or its default value.

getParam(paramName)

Gets a param by its name.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

inputCol = Param(parent='undefined', name='inputCol', doc='input column name.')
isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

save(path)

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param, value)

Sets a parameter in the embedded param map.

setDegree(value)[source]

Sets the value of degree.

New in version 1.4.0.

setInputCol(value)[source]

Sets the value of inputCol.

setOutputCol(value)[source]

Sets the value of outputCol.

setParams(self, degree=2, inputCol=None, outputCol=None)[source]

Sets params for this PolynomialExpansion.

New in version 1.4.0.

transform(dataset, params=None)

Transforms the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params.

Returns

transformed dataset

New in version 1.3.0.

write()

Returns an MLWriter instance for this ML instance.

class pyspark.ml.feature.QuantileDiscretizer(numBuckets=2, inputCol=None, outputCol=None, relativeError=0.001, handleInvalid='error', numBucketsArray=None, inputCols=None, outputCols=None)[source]

QuantileDiscretizer takes a column with continuous features and outputs a column with binned categorical features. The number of bins can be set using the numBuckets parameter. The number of buckets actually used may be smaller than this value, for example, if there are too few distinct values of the input to create enough distinct quantiles. Since 3.0.0, QuantileDiscretizer can map multiple columns at once by setting the inputCols parameter. If both the inputCol and inputCols parameters are set, an exception is thrown. To specify the number of buckets for each column, set the numBucketsArray parameter; if the number of buckets should be the same across all columns, numBuckets can be set as a convenience.

NaN handling: by default, QuantileDiscretizer raises an error when it finds NaN values in the dataset, but the user can choose to either keep or remove NaN values by setting the handleInvalid parameter. If the user chooses to keep NaN values, they are placed into their own bucket; for example, if 4 buckets are used, non-NaN data is put into buckets[0-3] and NaNs are counted in a special bucket[4].

Algorithm: The bin ranges are chosen using an approximate algorithm (see the documentation for approxQuantile() for a detailed description). The precision of the approximation can be controlled with the relativeError parameter. The lower and upper bin bounds will be -Infinity and +Infinity, covering all real values.
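
Once the split points are known, assigning buckets and handling NaN values can be sketched in plain Python. The function bucketize and its argument names are hypothetical. Treating each bucket as a half-open range [lower, upper) is an assumption, chosen to agree with the doctest output below.

```python
import bisect
import math


def bucketize(values, splits, handle_invalid="error"):
    """Assign each value a bucket index given sorted split points."""
    if handle_invalid not in ("error", "keep", "skip"):
        raise ValueError(f"unknown handle_invalid: {handle_invalid}")
    num_buckets = len(splits) - 1
    result = []
    for value in values:
        if math.isnan(value):
            if handle_invalid == "error":
                raise ValueError("NaN found; use 'keep' or 'skip' to allow it")
            if handle_invalid == "keep":
                # NaNs go into one extra bucket after the regular ones.
                result.append(float(num_buckets))
            continue  # for "skip", the value is dropped entirely
        # Buckets are [lower, upper); clamp so the top bound stays in range.
        index = min(bisect.bisect_right(splits, value) - 1, num_buckets - 1)
        result.append(float(index))
    return result


splits = [-math.inf, 0.4, math.inf]
values = [0.1, 0.4, 1.2, 1.5, math.nan, math.nan]

print(bucketize(values, splits, "keep"))  # [0.0, 1.0, 1.0, 1.0, 2.0, 2.0]
print(bucketize(values, splits, "skip"))  # [0.0, 1.0, 1.0, 1.0]
```

With two regular buckets, "keep" places the NaN values in bucket 2, and "skip" leaves four rows.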

>>> values = [(0.1,), (0.4,), (1.2,), (1.5,), (float("nan"),), (float("nan"),)]
>>> df1 = spark.createDataFrame(values, ["values"])
>>> qds1 = QuantileDiscretizer(inputCol="values", outputCol="buckets")
>>> qds1.setNumBuckets(2)
QuantileDiscretizer...
>>> qds1.setRelativeError(0.01)
QuantileDiscretizer...
>>> qds1.setHandleInvalid("error")
QuantileDiscretizer...
>>> qds1.getRelativeError()
0.01
>>> bucketizer = qds1.fit(df1)
>>> qds1.setHandleInvalid("keep").fit(df1).transform(df1).count()
6
>>> qds1.setHandleInvalid("skip").fit(df1).transform(df1).count()
4
>>> splits = bucketizer.getSplits()
>>> splits[0]
-inf
>>> print("%2.1f" % round(splits[1], 1))
0.4
>>> bucketed = bucketizer.transform(df1).head()
>>> bucketed.buckets
0.0
>>> quantileDiscretizerPath = temp_path + "/quantile-discretizer"
>>> qds1.save(quantileDiscretizerPath)
>>> loadedQds = QuantileDiscretizer.load(quantileDiscretizerPath)
>>> loadedQds.getNumBuckets() == qds1.getNumBuckets()
True
>>> inputs = [(0.1, 0.0), (0.4, 1.0), (1.2, 1.3), (1.5, 1.5),
...     (float("nan"), float("nan")), (float("nan"), float("nan"))]
>>> df2 = spark.createDataFrame(inputs, ["input1", "input2"])
>>> qds2 = QuantileDiscretizer(relativeError=0.01, handleInvalid="error", numBuckets=2,
...     inputCols=["input1", "input2"], outputCols=["output1", "output2"])
>>> qds2.getRelativeError()
0.01
>>> qds2.setHandleInvalid("keep").fit(df2).transform(df2).show()
+------+------+-------+-------+
|input1|input2|output1|output2|
+------+------+-------+-------+
|   0.1|   0.0|    0.0|    0.0|
|   0.4|   1.0|    1.0|    1.0|
|   1.2|   1.3|    1.0|    1.0|
|   1.5|   1.5|    1.0|    1.0|
|   NaN|   NaN|    2.0|    2.0|
|   NaN|   NaN|    2.0|    2.0|
+------+------+-------+-------+
...
>>> qds3 = QuantileDiscretizer(relativeError=0.01, handleInvalid="error",
...      numBucketsArray=[5, 10], inputCols=["input1", "input2"],
...      outputCols=["output1", "output2"])
>>> qds3.setHandleInvalid("skip").fit(df2).transform(df2).show()
+------+------+-------+-------+
|input1|input2|output1|output2|
+------+------+-------+-------+
|   0.1|   0.0|    1.0|    1.0|
|   0.4|   1.0|    2.0|    2.0|
|   1.2|   1.3|    3.0|    3.0|
|   1.5|   1.5|    4.0|    4.0|
+------+------+-------+-------+
...

New in version 2.0.0.

clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, then merges them with extra values from the input into a flat param map. On conflict the later value wins, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map

fit(dataset, params=None)

Fits a model to the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.

Returns

fitted model(s)

New in version 1.3.0.

fitMultiple(dataset, paramMaps)

Fits a model to the input dataset for each param map in paramMaps.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame.

  • paramMaps – A Sequence of param maps.

Returns

A thread-safe iterable that contains one model for each param map. Each call to next(modelIterator) returns (index, model), where model was fit using paramMaps[index]. The index values may not be sequential.

New in version 2.3.0.

getHandleInvalid()

Gets the value of handleInvalid or its default value.

getInputCol()

Gets the value of inputCol or its default value.

getInputCols()

Gets the value of inputCols or its default value.

getNumBuckets()[source]

Gets the value of numBuckets or its default value.

New in version 2.0.0.

getNumBucketsArray()[source]

Gets the value of numBucketsArray or its default value.

New in version 3.0.0.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets the value of outputCol or its default value.

getOutputCols()

Gets the value of outputCols or its default value.

getParam(paramName)

Gets a param by its name.

getRelativeError()

Gets the value of relativeError or its default value.

handleInvalid = Param(parent='undefined', name='handleInvalid', doc="how to handle invalid entries. Options are skip (filter out rows with invalid values), error (throw an error), or keep (keep invalid values in a special additional bucket). Note that in the multiple columns case, the invalid handling is applied to all columns. That said for 'error' it will throw an error if any invalids are found in any columns, for 'skip' it will skip rows with any invalids in any columns, etc.")
hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

inputCol = Param(parent='undefined', name='inputCol', doc='input column name.')
inputCols = Param(parent='undefined', name='inputCols', doc='input column names.')
isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

numBuckets = Param(parent='undefined', name='numBuckets', doc='Maximum number of buckets (quantiles, or categories) into which data points are grouped. Must be >= 2.')
numBucketsArray = Param(parent='undefined', name='numBucketsArray', doc='Array of number of buckets (quantiles, or categories) into which data points are grouped. This is for multiple columns input. If transforming multiple columns and numBucketsArray is not set, but numBuckets is set, then numBuckets will be applied across all columns.')
outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
outputCols = Param(parent='undefined', name='outputCols', doc='output column names.')
property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

relativeError = Param(parent='undefined', name='relativeError', doc='the relative target precision for the approximate quantile algorithm. Must be in the range [0, 1]')
save(path)

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param, value)

Sets a parameter in the embedded param map.

setHandleInvalid(value)[source]

Sets the value of handleInvalid.

setInputCol(value)[source]

Sets the value of inputCol.

setInputCols(value)[source]

Sets the value of inputCols.

New in version 3.0.0.

setNumBuckets(value)[source]

Sets the value of numBuckets.

New in version 2.0.0.

setNumBucketsArray(value)[source]

Sets the value of numBucketsArray.

New in version 3.0.0.

setOutputCol(value)[source]

Sets the value of outputCol.

setOutputCols(value)[source]

Sets the value of outputCols.

New in version 3.0.0.

setParams(self, numBuckets=2, inputCol=None, outputCol=None, relativeError=0.001, handleInvalid="error", numBucketsArray=None, inputCols=None, outputCols=None)[source]

Sets params for this QuantileDiscretizer.

New in version 2.0.0.

setRelativeError(value)[source]

Sets the value of relativeError.

New in version 2.0.0.

write()

Returns an MLWriter instance for this ML instance.

class pyspark.ml.feature.RobustScaler(lower=0.25, upper=0.75, withCentering=False, withScaling=True, inputCol=None, outputCol=None, relativeError=0.001)[source]

RobustScaler removes the median and scales the data according to the quantile range. The quantile range is by default the IQR (interquartile range, the range between the 1st quartile = 25th percentile and the 3rd quartile = 75th percentile) but can be configured. Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Median and quantile range are then stored to be used on later data using the transform method. Note that NaN values are ignored in the computation of medians and ranges.

>>> from pyspark.ml.linalg import Vectors
>>> data = [(0, Vectors.dense([0.0, 0.0]),),
...         (1, Vectors.dense([1.0, -1.0]),),
...         (2, Vectors.dense([2.0, -2.0]),),
...         (3, Vectors.dense([3.0, -3.0]),),
...         (4, Vectors.dense([4.0, -4.0]),),]
>>> df = spark.createDataFrame(data, ["id", "features"])
>>> scaler = RobustScaler()
>>> scaler.setInputCol("features")
RobustScaler...
>>> scaler.setOutputCol("scaled")
RobustScaler...
>>> model = scaler.fit(df)
>>> model.setOutputCol("output")
RobustScalerModel...
>>> model.median
DenseVector([2.0, -2.0])
>>> model.range
DenseVector([2.0, 2.0])
>>> model.transform(df).collect()[1].output
DenseVector([0.5, -0.5])
>>> scalerPath = temp_path + "/robust-scaler"
>>> scaler.save(scalerPath)
>>> loadedScaler = RobustScaler.load(scalerPath)
>>> loadedScaler.getWithCentering() == scaler.getWithCentering()
True
>>> loadedScaler.getWithScaling() == scaler.getWithScaling()
True
>>> modelPath = temp_path + "/robust-scaler-model"
>>> model.save(modelPath)
>>> loadedModel = RobustScalerModel.load(modelPath)
>>> loadedModel.median == model.median
True
>>> loadedModel.range == model.range
True
>>> loadedModel.transform(df).take(1) == model.transform(df).take(1)
True

New in version 3.0.0.
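The numbers in the example above can be reproduced by hand. The following plain-Python sketch is not the Spark implementation: it uses a simple nearest-rank quantile as a stand-in for the approximate quantile algorithm, and the helper names and the zero-range handling are assumptions made for illustration.

```python
import math


def nearest_rank_quantile(values, q):
    """Return the sample at the nearest rank for quantile q (0 < q <= 1)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def fit_robust_scaler(rows, lower=0.25, upper=0.75):
    """Compute per-feature median and quantile range from a list of rows."""
    columns = list(zip(*rows))
    medians = [nearest_rank_quantile(col, 0.5) for col in columns]
    ranges = [
        nearest_rank_quantile(col, upper) - nearest_rank_quantile(col, lower)
        for col in columns
    ]
    return medians, ranges


def robust_scale(row, medians, ranges, with_centering=False, with_scaling=True):
    """Scale one row using previously fitted statistics."""
    scaled = []
    for value, median, rng in zip(row, medians, ranges):
        if with_centering:
            value = value - median
        # Leaving the value unscaled for a zero range is a choice of this sketch.
        if with_scaling and rng != 0:
            value = value / rng
        scaled.append(value)
    return scaled


rows = [[0.0, 0.0], [1.0, -1.0], [2.0, -2.0], [3.0, -3.0], [4.0, -4.0]]
medians, ranges = fit_robust_scaler(rows)
scaled_row = robust_scale(rows[1], medians, ranges)
```

With the defaults (withCentering=False, withScaling=True), the second row [1.0, -1.0] is only divided by the range [2.0, 2.0], which gives the [0.5, -0.5] shown in the doctest.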

clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map
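The precedence default < user-supplied < extra behaves like successive dictionary updates. A minimal sketch, assuming string keys for readability (real param maps are keyed by Param objects) and a hypothetical helper named extract_param_map:

```python
def extract_param_map(defaults, user_supplied, extra=None):
    """Merge three layers of param values; later layers win on conflict."""
    merged = dict(defaults)
    merged.update(user_supplied)
    merged.update(extra or {})
    return merged


defaults = {"lower": 0.25, "upper": 0.75, "withCentering": False}
user_supplied = {"upper": 0.9}
extra = {"upper": 0.95, "withCentering": True}

merged = extract_param_map(defaults, user_supplied, extra)
```

Here upper resolves to 0.95 because extra overrides the user-supplied 0.9, which in turn overrides the default 0.75; lower keeps its default because no later layer mentions it.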

fit(dataset, params=None)

Fits a model to the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.

Returns

fitted model(s)

New in version 1.3.0.

fitMultiple(dataset, paramMaps)

Fits a model to the input dataset for each param map in paramMaps.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame.

  • paramMaps – A Sequence of param maps.

Returns

A thread-safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model) where model was fit using paramMaps[index]. index values may not be sequential.

New in version 2.3.0.

getInputCol()

Gets the value of inputCol or its default value.

getLower()

Gets the value of lower or its default value.

New in version 3.0.0.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets the value of outputCol or its default value.

getParam(paramName)

Gets a param by its name.

getRelativeError()

Gets the value of relativeError or its default value.

getUpper()

Gets the value of upper or its default value.

New in version 3.0.0.

getWithCentering()

Gets the value of withCentering or its default value.

New in version 3.0.0.

getWithScaling()

Gets the value of withScaling or its default value.

New in version 3.0.0.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

inputCol = Param(parent='undefined', name='inputCol', doc='input column name.')
isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.
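The difference between hasDefault, isSet, and isDefined is easiest to see side by side. The class below, TinyParams, is a hypothetical stand-in written for this illustration; it is not the PySpark Params class, and the param name notConfigured is made up.

```python
class TinyParams:
    """Hypothetical stand-in for a Params holder; not the PySpark class."""

    def __init__(self, defaults):
        self._defaults = dict(defaults)
        self._user = {}

    def set(self, name, value):
        self._user[name] = value

    def clear(self, name):
        # Removes only the explicitly set value; defaults are untouched.
        self._user.pop(name, None)

    def has_default(self, name):
        return name in self._defaults

    def is_set(self, name):
        return name in self._user

    def is_defined(self, name):
        return self.is_set(name) or self.has_default(name)

    def get_or_default(self, name):
        if name in self._user:
            return self._user[name]
        if name in self._defaults:
            return self._defaults[name]
        raise KeyError("param {!r} is neither set nor has a default".format(name))


params = TinyParams(defaults={"relativeError": 0.001})
params.set("inputCol", "features")

# (has_default, is_set, is_defined) for each param name.
states = {
    name: (params.has_default(name), params.is_set(name), params.is_defined(name))
    for name in ("relativeError", "inputCol", "notConfigured")
}
```

A param with only a default is defined but not set; a param set by the user is both; a param with neither makes get_or_default raise, mirroring the documented getOrDefault behaviour.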

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

lower = Param(parent='undefined', name='lower', doc='Lower quantile to calculate quantile range')
outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

relativeError = Param(parent='undefined', name='relativeError', doc='the relative target precision for the approximate quantile algorithm. Must be in the range [0, 1]')
save(path)

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param, value)

Sets a parameter in the embedded param map.

setInputCol(value)[source]

Sets the value of inputCol.

New in version 3.0.0.

setLower(value)[source]

Sets the value of lower.

New in version 3.0.0.

setOutputCol(value)[source]

Sets the value of outputCol.

New in version 3.0.0.

setParams(self, lower=0.25, upper=0.75, withCentering=False, withScaling=True, inputCol=None, outputCol=None, relativeError=0.001)[source]

Sets params for this RobustScaler.

New in version 3.0.0.

setRelativeError(value)[source]

Sets the value of relativeError.

New in version 3.0.0.

setUpper(value)[source]

Sets the value of upper.

New in version 3.0.0.

setWithCentering(value)[source]

Sets the value of withCentering.

New in version 3.0.0.

setWithScaling(value)[source]

Sets the value of withScaling.

New in version 3.0.0.

upper = Param(parent='undefined', name='upper', doc='Upper quantile to calculate quantile range')
withCentering = Param(parent='undefined', name='withCentering', doc='Whether to center data with median')
withScaling = Param(parent='undefined', name='withScaling', doc='Whether to scale the data to quantile range')
write()

Returns an MLWriter instance for this ML instance.

class pyspark.ml.feature.RobustScalerModel(java_model=None)[source]

Model fitted by RobustScaler.

New in version 3.0.0.

clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map

getInputCol()

Gets the value of inputCol or its default value.

getLower()

Gets the value of lower or its default value.

New in version 3.0.0.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets the value of outputCol or its default value.

getParam(paramName)

Gets a param by its name.

getRelativeError()

Gets the value of relativeError or its default value.

getUpper()

Gets the value of upper or its default value.

New in version 3.0.0.

getWithCentering()

Gets the value of withCentering or its default value.

New in version 3.0.0.

getWithScaling()

Gets the value of withScaling or its default value.

New in version 3.0.0.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

inputCol = Param(parent='undefined', name='inputCol', doc='input column name.')
isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

lower = Param(parent='undefined', name='lower', doc='Lower quantile to calculate quantile range')
property median

Median of the RobustScalerModel.

New in version 3.0.0.

outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

property range

Quantile range of the RobustScalerModel.

New in version 3.0.0.

classmethod read()

Returns an MLReader instance for this class.

relativeError = Param(parent='undefined', name='relativeError', doc='the relative target precision for the approximate quantile algorithm. Must be in the range [0, 1]')
save(path)

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param, value)

Sets a parameter in the embedded param map.

setInputCol(value)[source]

Sets the value of inputCol.

New in version 3.0.0.

setOutputCol(value)[source]

Sets the value of outputCol.

New in version 3.0.0.

transform(dataset, params=None)

Transforms the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params.

Returns

transformed dataset

New in version 1.3.0.

upper = Param(parent='undefined', name='upper', doc='Upper quantile to calculate quantile range')
withCentering = Param(parent='undefined', name='withCentering', doc='Whether to center data with median')
withScaling = Param(parent='undefined', name='withScaling', doc='Whether to scale the data to quantile range')
write()

Returns an MLWriter instance for this ML instance.

class pyspark.ml.feature.RegexTokenizer(minTokenLength=1, gaps=True, pattern='\s+', inputCol=None, outputCol=None, toLowercase=True)[source]

A regex based tokenizer that extracts tokens either by using the provided regex pattern (in Java dialect) to split the text (default) or repeatedly matching the regex (if gaps is false). Optional parameters also allow filtering tokens using a minimal length. It returns an array of strings that can be empty.

>>> df = spark.createDataFrame([("A B  c",)], ["text"])
>>> reTokenizer = RegexTokenizer()
>>> reTokenizer.setInputCol("text")
RegexTokenizer...
>>> reTokenizer.setOutputCol("words")
RegexTokenizer...
>>> reTokenizer.transform(df).head()
Row(text='A B  c', words=['a', 'b', 'c'])
>>> # Change a parameter.
>>> reTokenizer.setParams(outputCol="tokens").transform(df).head()
Row(text='A B  c', tokens=['a', 'b', 'c'])
>>> # Temporarily modify a parameter.
>>> reTokenizer.transform(df, {reTokenizer.outputCol: "words"}).head()
Row(text='A B  c', words=['a', 'b', 'c'])
>>> reTokenizer.transform(df).head()
Row(text='A B  c', tokens=['a', 'b', 'c'])
>>> # Must use keyword arguments to specify params.
>>> reTokenizer.setParams("text")
Traceback (most recent call last):
    ...
TypeError: Method setParams forces keyword arguments.
>>> regexTokenizerPath = temp_path + "/regex-tokenizer"
>>> reTokenizer.save(regexTokenizerPath)
>>> loadedReTokenizer = RegexTokenizer.load(regexTokenizerPath)
>>> loadedReTokenizer.getMinTokenLength() == reTokenizer.getMinTokenLength()
True
>>> loadedReTokenizer.getGaps() == reTokenizer.getGaps()
True
>>> loadedReTokenizer.transform(df).take(1) == reTokenizer.transform(df).take(1)
True

New in version 1.4.0.
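The two tokenizing modes can be approximated with Python's re module. This is only a sketch: RegexTokenizer uses Java's regex dialect, which differs from Python's in some features, and the function name regex_tokenize is hypothetical. For simple patterns such as \s+ and \w+ the two dialects should agree.

```python
import re


def regex_tokenize(text, pattern=r"\s+", gaps=True, min_token_length=1,
                   to_lowercase=True):
    """Tokenize text by splitting on pattern (gaps=True) or matching it."""
    if to_lowercase:
        text = text.lower()
    if gaps:
        # The pattern describes the separators between tokens.
        tokens = re.split(pattern, text)
    else:
        # The pattern describes the tokens themselves.
        tokens = [match.group(0) for match in re.finditer(pattern, text)]
    # Drop tokens shorter than the minimum length (this also removes the
    # empty strings that splitting can produce at the edges).
    return [token for token in tokens if len(token) >= min_token_length]


split_tokens = regex_tokenize("A B  c")
matched_tokens = regex_tokenize("A B  c", pattern=r"\w+", gaps=False)
long_tokens = regex_tokenize("Spark ML is fun", min_token_length=3)
```

Splitting on whitespace and matching word characters give the same tokens for this input; raising the minimum token length filters out the short words.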

clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map

gaps = Param(parent='undefined', name='gaps', doc='whether regex splits on gaps (True) or matches tokens (False)')
getGaps()[source]

Gets the value of gaps or its default value.

New in version 1.4.0.

getInputCol()

Gets the value of inputCol or its default value.

getMinTokenLength()[source]

Gets the value of minTokenLength or its default value.

New in version 1.4.0.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets the value of outputCol or its default value.

getParam(paramName)

Gets a param by its name.

getPattern()[source]

Gets the value of pattern or its default value.

New in version 1.4.0.

getToLowercase()[source]

Gets the value of toLowercase or its default value.

New in version 2.0.0.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

inputCol = Param(parent='undefined', name='inputCol', doc='input column name.')
isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

minTokenLength = Param(parent='undefined', name='minTokenLength', doc='minimum token length (>= 0)')
outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

pattern = Param(parent='undefined', name='pattern', doc='regex pattern (Java dialect) used for tokenizing')
classmethod read()

Returns an MLReader instance for this class.

save(path)

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param, value)

Sets a parameter in the embedded param map.

setGaps(value)[source]

Sets the value of gaps.

New in version 1.4.0.

setInputCol(value)[source]

Sets the value of inputCol.

setMinTokenLength(value)[source]

Sets the value of minTokenLength.

New in version 1.4.0.

setOutputCol(value)[source]

Sets the value of outputCol.

setParams(self, minTokenLength=1, gaps=True, pattern="\s+", inputCol=None, outputCol=None, toLowercase=True)[source]

Sets params for this RegexTokenizer.

New in version 1.4.0.

setPattern(value)[source]

Sets the value of pattern.

New in version 1.4.0.

setToLowercase(value)[source]

Sets the value of toLowercase.

New in version 2.0.0.

toLowercase = Param(parent='undefined', name='toLowercase', doc='whether to convert all characters to lowercase before tokenizing')
transform(dataset, params=None)

Transforms the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params.

Returns

transformed dataset

New in version 1.3.0.

write()

Returns an MLWriter instance for this ML instance.

class pyspark.ml.feature.RFormula(formula=None, featuresCol='features', labelCol='label', forceIndexLabel=False, stringIndexerOrderType='frequencyDesc', handleInvalid='error')[source]

Implements the transforms required for fitting a dataset against an R model formula. Currently we support a limited subset of the R operators, including ‘~’, ‘.’, ‘:’, ‘+’, ‘-’, ‘*’, and ‘^’. Also see the R formula docs.

>>> df = spark.createDataFrame([
...     (1.0, 1.0, "a"),
...     (0.0, 2.0, "b"),
...     (0.0, 0.0, "a")
... ], ["y", "x", "s"])
>>> rf = RFormula(formula="y ~ x + s")
>>> model = rf.fit(df)
>>> model.getLabelCol()
'label'
>>> model.transform(df).show()
+---+---+---+---------+-----+
|  y|  x|  s| features|label|
+---+---+---+---------+-----+
|1.0|1.0|  a|[1.0,1.0]|  1.0|
|0.0|2.0|  b|[2.0,0.0]|  0.0|
|0.0|0.0|  a|[0.0,1.0]|  0.0|
+---+---+---+---------+-----+
...
>>> rf.fit(df, {rf.formula: "y ~ . - s"}).transform(df).show()
+---+---+---+--------+-----+
|  y|  x|  s|features|label|
+---+---+---+--------+-----+
|1.0|1.0|  a|   [1.0]|  1.0|
|0.0|2.0|  b|   [2.0]|  0.0|
|0.0|0.0|  a|   [0.0]|  0.0|
+---+---+---+--------+-----+
...
>>> rFormulaPath = temp_path + "/rFormula"
>>> rf.save(rFormulaPath)
>>> loadedRF = RFormula.load(rFormulaPath)
>>> loadedRF.getFormula() == rf.getFormula()
True
>>> loadedRF.getFeaturesCol() == rf.getFeaturesCol()
True
>>> loadedRF.getLabelCol() == rf.getLabelCol()
True
>>> loadedRF.getHandleInvalid() == rf.getHandleInvalid()
True
>>> str(loadedRF)
'RFormula(y ~ x + s) (uid=...)'
>>> modelPath = temp_path + "/rFormulaModel"
>>> model.save(modelPath)
>>> loadedModel = RFormulaModel.load(modelPath)
>>> loadedModel.uid == model.uid
True
>>> loadedModel.transform(df).show()
+---+---+---+---------+-----+
|  y|  x|  s| features|label|
+---+---+---+---------+-----+
|1.0|1.0|  a|[1.0,1.0]|  1.0|
|0.0|2.0|  b|[2.0,0.0]|  0.0|
|0.0|0.0|  a|[0.0,1.0]|  0.0|
+---+---+---+---------+-----+
...
>>> str(loadedModel)
'RFormulaModel(ResolvedRFormula(label=y, terms=[x,s], hasIntercept=true)) (uid=...)'

New in version 1.5.0.
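The features column in the example above can be reconstructed by hand for the formula "y ~ x + s". The sketch below is a plain-Python illustration under stated assumptions, not the Spark implementation: the numeric term is passed through, the string term is one-hot encoded after ordering categories by descending frequency, and the last category is dropped. The helper name encode_rows and the alphabetical tie-break are assumptions.

```python
from collections import Counter


def encode_rows(rows, numeric_col, string_col, label_col):
    """Build (features, label) pairs for a formula like 'label ~ numeric + string'."""
    counts = Counter(row[string_col] for row in rows)
    # Order categories by descending frequency; ties broken alphabetically
    # (the tie-break is an assumption of this sketch).
    ordered = sorted(counts, key=lambda category: (-counts[category], category))
    kept = ordered[:-1]  # the last category after ordering is dropped

    encoded = []
    for row in rows:
        one_hot = [1.0 if row[string_col] == category else 0.0 for category in kept]
        encoded.append(([float(row[numeric_col])] + one_hot, float(row[label_col])))
    return encoded


rows = [
    {"y": 1.0, "x": 1.0, "s": "a"},
    {"y": 0.0, "x": 2.0, "s": "b"},
    {"y": 0.0, "x": 0.0, "s": "a"},
]
encoded = encode_rows(rows, numeric_col="x", string_col="s", label_col="y")
```

Category "a" occurs twice and "b" once, so "b" is last in the ordering and is dropped; the single indicator that remains is 1.0 for "a" and 0.0 for "b", matching the features shown in the doctest output.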

clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map

featuresCol = Param(parent='undefined', name='featuresCol', doc='features column name.')
fit(dataset, params=None)

Fits a model to the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.

Returns

fitted model(s)

New in version 1.3.0.

fitMultiple(dataset, paramMaps)

Fits a model to the input dataset for each param map in paramMaps.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame.

  • paramMaps – A Sequence of param maps.

Returns

A thread-safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model) where model was fit using paramMaps[index]. index values may not be sequential.

New in version 2.3.0.

forceIndexLabel = Param(parent='undefined', name='forceIndexLabel', doc='Force to index label whether it is numeric or string')
formula = Param(parent='undefined', name='formula', doc='R model formula')
getFeaturesCol()

Gets the value of featuresCol or its default value.

getForceIndexLabel()

Gets the value of forceIndexLabel.

New in version 2.1.0.

getFormula()

Gets the value of formula.

New in version 1.5.0.

getHandleInvalid()

Gets the value of handleInvalid or its default value.

getLabelCol()

Gets the value of labelCol or its default value.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getParam(paramName)

Gets a param by its name.

getStringIndexerOrderType()

Gets the value of stringIndexerOrderType or its default value ‘frequencyDesc’.

New in version 2.3.0.

handleInvalid = Param(parent='undefined', name='handleInvalid', doc="how to handle invalid entries. Options are 'skip' (filter out rows with invalid values), 'error' (throw an error), or 'keep' (put invalid data in a special additional bucket, at index numLabels).")
hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

labelCol = Param(parent='undefined', name='labelCol', doc='label column name.')
classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

save(path)

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param, value)

Sets a parameter in the embedded param map.

setFeaturesCol(value)[source]

Sets the value of featuresCol.

setForceIndexLabel(value)[source]

Sets the value of forceIndexLabel.

New in version 2.1.0.

setFormula(value)[source]

Sets the value of formula.

New in version 1.5.0.

setHandleInvalid(value)[source]

Sets the value of handleInvalid.

setLabelCol(value)[source]

Sets the value of labelCol.

setParams(self, formula=None, featuresCol="features", labelCol="label", forceIndexLabel=False, stringIndexerOrderType="frequencyDesc", handleInvalid="error")[source]

Sets params for RFormula.

New in version 1.5.0.

setStringIndexerOrderType(value)[source]

Sets the value of stringIndexerOrderType.

New in version 2.3.0.

stringIndexerOrderType = Param(parent='undefined', name='stringIndexerOrderType', doc='How to order categories of a string feature column used by StringIndexer. The last category after ordering is dropped when encoding strings. Supported options: frequencyDesc, frequencyAsc, alphabetDesc, alphabetAsc. The default value is frequencyDesc. When the ordering is set to alphabetDesc, RFormula drops the same category as R when encoding strings.')
write()

Returns an MLWriter instance for this ML instance.

class pyspark.ml.feature.RFormulaModel(java_model=None)[source]

Model fitted by RFormula. Fitting is required to determine the factor levels of formula terms.

New in version 1.5.0.

clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map

featuresCol = Param(parent='undefined', name='featuresCol', doc='features column name.')
forceIndexLabel = Param(parent='undefined', name='forceIndexLabel', doc='Force to index label whether it is numeric or string')
formula = Param(parent='undefined', name='formula', doc='R model formula')
getFeaturesCol()

Gets the value of featuresCol or its default value.

getForceIndexLabel()

Gets the value of forceIndexLabel.

New in version 2.1.0.

getFormula()

Gets the value of formula.

New in version 1.5.0.

getHandleInvalid()

Gets the value of handleInvalid or its default value.

getLabelCol()

Gets the value of labelCol or its default value.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getParam(paramName)

Gets a param by its name.

getStringIndexerOrderType()

Gets the value of stringIndexerOrderType or its default value ‘frequencyDesc’.

New in version 2.3.0.

handleInvalid = Param(parent='undefined', name='handleInvalid', doc="how to handle invalid entries. Options are 'skip' (filter out rows with invalid values), 'error' (throw an error), or 'keep' (put invalid data in a special additional bucket, at index numLabels).")
hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

labelCol = Param(parent='undefined', name='labelCol', doc='label column name.')
classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

save(path)

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param, value)

Sets a parameter in the embedded param map.

stringIndexerOrderType = Param(parent='undefined', name='stringIndexerOrderType', doc='How to order categories of a string feature column used by StringIndexer. The last category after ordering is dropped when encoding strings. Supported options: frequencyDesc, frequencyAsc, alphabetDesc, alphabetAsc. The default value is frequencyDesc. When the ordering is set to alphabetDesc, RFormula drops the same category as R when encoding strings.')
transform(dataset, params=None)

Transforms the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params.

Returns

transformed dataset

New in version 1.3.0.

write()

Returns an MLWriter instance for this ML instance.

class pyspark.ml.feature.SQLTransformer(statement=None)[source]

Implements the transformations defined by a SQL statement. Currently only SQL syntax of the form ‘SELECT … FROM __THIS__’ is supported, where ‘__THIS__’ represents the underlying table of the input dataset.

>>> df = spark.createDataFrame([(0, 1.0, 3.0), (2, 2.0, 5.0)], ["id", "v1", "v2"])
>>> sqlTrans = SQLTransformer(
...     statement="SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__")
>>> sqlTrans.transform(df).head()
Row(id=0, v1=1.0, v2=3.0, v3=4.0, v4=3.0)
>>> sqlTransformerPath = temp_path + "/sql-transformer"
>>> sqlTrans.save(sqlTransformerPath)
>>> loadedSqlTrans = SQLTransformer.load(sqlTransformerPath)
>>> loadedSqlTrans.getStatement() == sqlTrans.getStatement()
True
>>> loadedSqlTrans.transform(df).take(1) == sqlTrans.transform(df).take(1)
True
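
One way to picture the ‘__THIS__’ placeholder is as a stand-in for a table name that is filled in just before the statement runs. The sketch below illustrates that idea with Python's built-in sqlite3 module instead of Spark; run_statement, PLACEHOLDER, and the table name input_table are hypothetical names chosen for this illustration and are not part of the PySpark API.

```python
import sqlite3

PLACEHOLDER = "__THIS__"  # same token the SQLTransformer statement uses


def run_statement(rows, statement, table_name="input_table"):
    """Load rows into a temporary table, then run the statement against it.

    Illustration only: this uses sqlite3, not Spark SQL.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(f"CREATE TABLE {table_name} (id INTEGER, v1 REAL, v2 REAL)")
        conn.executemany(f"INSERT INTO {table_name} VALUES (?, ?, ?)", rows)
        # Substitute the placeholder with the real table name.
        resolved = statement.replace(PLACEHOLDER, table_name)
        return conn.execute(resolved).fetchall()
    finally:
        conn.close()


rows = [(0, 1.0, 3.0), (2, 2.0, 5.0)]
result = run_statement(
    rows, "SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__")
print(result[0])
```

The first row printed here carries the same values as the Row shown in the doctest above (id=0, v1=1.0, v2=3.0, v3=4.0, v4=3.0).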

New in version 1.6.0.

clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getParam(paramName)

Gets a param by its name.

getStatement()[source]

Gets the value of statement or its default value.

New in version 1.6.0.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

save(path)

Saves this ML instance to the given path, a shortcut of write().save(path).

set(param, value)

Sets a parameter in the embedded param map.

setParams(statement=None)[source]

Sets params for this SQLTransformer.

New in version 1.6.0.

setStatement(value)[source]

Sets the value of statement.

New in version 1.6.0.

statement = Param(parent='undefined', name='statement', doc='SQL statement')

transform(dataset, params=None)

Transforms the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params.

Returns

transformed dataset

New in version 1.3.0.

write()

Returns an MLWriter instance for this ML instance.

class pyspark.ml.feature.StandardScaler(withMean=False, withStd=True, inputCol=None, outputCol=None)[source]

Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set.

The “unit std” is computed using the corrected sample standard deviation, which is computed as the square root of the unbiased sample variance.

>>> from pyspark.ml.linalg import Vectors
>>> df = spark.createDataFrame([(Vectors.dense([0.0]),), (Vectors.dense([2.0]),)], ["a"])
>>> standardScaler = StandardScaler()
>>> standardScaler.setInputCol("a")
StandardScaler...
>>> standardScaler.setOutputCol("scaled")
StandardScaler...
>>> model = standardScaler.fit(df)
>>> model.getInputCol()
'a'
>>> model.setOutputCol("output")
StandardScalerModel...
>>> model.mean
DenseVector([1.0])
>>> model.std
DenseVector([1.4142])
>>> model.transform(df).collect()[1].output
DenseVector([1.4142])
>>> standardScalerPath = temp_path + "/standard-scaler"
>>> standardScaler.save(standardScalerPath)
>>> loadedStandardScaler = StandardScaler.load(standardScalerPath)
>>> loadedStandardScaler.getWithMean() == standardScaler.getWithMean()
True
>>> loadedStandardScaler.getWithStd() == standardScaler.getWithStd()
True
>>> modelPath = temp_path + "/standard-scaler-model"
>>> model.save(modelPath)
>>> loadedModel = StandardScalerModel.load(modelPath)
>>> loadedModel.std == model.std
True
>>> loadedModel.mean == model.mean
True
>>> loadedModel.transform(df).take(1) == model.transform(df).take(1)
True
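
The numbers in the example above can be checked by hand. The sketch below recomputes them in plain Python, assuming the default settings withMean=False and withStd=True; corrected_sample_std is a hypothetical helper written for this illustration, not a PySpark function.

```python
import math


def corrected_sample_std(values):
    """Return (mean, std), where std is the square root of the
    unbiased sample variance (divisor n - 1)."""
    n = len(values)
    mean = sum(values) / n
    unbiased_variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    return mean, math.sqrt(unbiased_variance)


values = [0.0, 2.0]  # the single feature from the two rows of df
mean, std = corrected_sample_std(values)

# With withMean=False and withStd=True, each value is only divided by std.
scaled = [x / std for x in values]

print(mean, round(std, 4), round(scaled[1], 4))
```

For the two samples 0.0 and 2.0, the unbiased variance is (1 + 1) / (2 - 1) = 2, so the standard deviation is about 1.4142, and the second row scales to 2.0 / 1.4142, which is again about 1.4142.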

New in version 1.4.0.

clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map
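
The precedence rule (default param values < user-supplied values < extra) can be sketched with plain dictionaries. This is an illustration of the ordering only, not the PySpark implementation; merge_param_maps is a hypothetical helper, and string keys stand in for Param objects.

```python
def merge_param_maps(defaults, user_supplied, extra=None):
    """Merge three layers so that later layers win on conflicts."""
    merged = dict(defaults)        # lowest priority
    merged.update(user_supplied)   # overrides defaults
    merged.update(extra or {})     # highest priority
    return merged


merged = merge_param_maps(
    defaults={"withMean": False, "withStd": True},
    user_supplied={"withMean": True},
    extra={"withStd": False, "inputCol": "a"},
)
print(merged)
```

Here withMean takes the user-supplied value, withStd takes the value from extra, and inputCol appears only because extra provides it.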

fit(dataset, params=None)

Fits a model to the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.

Returns

fitted model(s)

New in version 1.3.0.

fitMultiple(dataset, paramMaps)

Fits a model to the input dataset for each param map in paramMaps.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame.

  • paramMaps – A Sequence of param maps.

Returns

A thread-safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model), where model was fit using paramMaps[index]. The index values may not be sequential.

New in version 2.3.0.
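
Because the (index, model) pairs may arrive out of order, a caller that wants the models aligned with paramMaps can place each model by its index. The sketch below shows that pattern in plain Python; collect_in_order is a hypothetical helper, and fake_fit_multiple is a stand-in that merely imitates the shape of the returned iterable, with strings in place of fitted models.

```python
import random


def collect_in_order(model_iterator, num_param_maps):
    """Place each model at the position of the param map it was fit with."""
    models = [None] * num_param_maps
    for index, model in model_iterator:
        models[index] = model
    return models


def fake_fit_multiple(param_maps):
    """Hypothetical stand-in: yields (index, model) pairs in shuffled order."""
    indexed = [(i, f"model-for-{pm}") for i, pm in enumerate(param_maps)]
    random.Random(0).shuffle(indexed)
    return iter(indexed)


param_maps = ["pm0", "pm1", "pm2"]
models = collect_in_order(fake_fit_multiple(param_maps), len(param_maps))
print(models)
```

Whatever order the pairs arrive in, models[i] ends up holding the model that corresponds to param_maps[i].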

getInputCol()

Gets the value of inputCol or its default value.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets the value of outputCol or its default value.

getParam(paramName)

Gets a param by its name.

getWithMean()

Gets the value of withMean or its default value.

New in version 1.4.0.

getWithStd()

Gets the value of withStd or its default value.

New in version 1.4.0.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

inputCol = Param(parent='undefined', name='inputCol', doc='input column name.')

isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.
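
The relationship between isSet, hasDefault, isDefined, and getOrDefault can be sketched with a small stand-in class. TinyParams and its method names are hypothetical and exist only for this illustration; the choice of KeyError for the "neither is set" case is also an assumption of the sketch.

```python
class TinyParams:
    """Hypothetical stand-in for a params holder; not the PySpark class."""

    def __init__(self, defaults=None):
        self._defaults = dict(defaults or {})
        self._user_supplied = {}

    def set(self, name, value):
        self._user_supplied[name] = value

    def is_set(self, name):
        # True only for values the user set explicitly.
        return name in self._user_supplied

    def has_default(self, name):
        return name in self._defaults

    def is_defined(self, name):
        # Defined means: explicitly set OR has a default.
        return self.is_set(name) or self.has_default(name)

    def get_or_default(self, name):
        if self.is_set(name):
            return self._user_supplied[name]
        if self.has_default(name):
            return self._defaults[name]
        raise KeyError(name)  # assumption: error type chosen for the sketch


params = TinyParams(defaults={"withStd": True})
params.set("inputCol", "a")

print(params.is_set("withStd"), params.is_defined("withStd"))
print(params.get_or_default("inputCol"), params.is_defined("outputCol"))
```

In this sketch withStd is defined but not set, inputCol is both set and defined, and outputCol is neither, so asking for its value raises an error.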

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')

property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

save(path)

Saves this ML instance to the given path, a shortcut of write().save(path).

set(param, value)

Sets a parameter in the embedded param map.

setInputCol(value)[source]

Sets the value of inputCol.

setOutputCol(value)