PCA

class pyspark.ml.feature.PCA(*, k=None, inputCol=None, outputCol=None)[source]

PCA trains a model to project vectors to a lower dimensional space of the top k principal components.

New in version 1.5.0.

Examples

>>> from pyspark.ml.linalg import Vectors
>>> data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),
...     (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
...     (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
>>> df = spark.createDataFrame(data,["features"])
>>> pca = PCA(k=2, inputCol="features")
>>> pca.setOutputCol("pca_features")
PCA...
>>> model = pca.fit(df)
>>> model.getK()
2
>>> model.setOutputCol("output")
PCAModel...
>>> model.transform(df).collect()[0].output
DenseVector([1.648..., -4.013...])
>>> model.explainedVariance
DenseVector([0.794..., 0.205...])
>>> pcaPath = temp_path + "/pca"
>>> pca.save(pcaPath)
>>> loadedPca = PCA.load(pcaPath)
>>> loadedPca.getK() == pca.getK()
True
>>> modelPath = temp_path + "/pca-model"
>>> model.save(modelPath)
>>> loadedModel = PCAModel.load(modelPath)
>>> loadedModel.pc == model.pc
True
>>> loadedModel.explainedVariance == model.explainedVariance
True
>>> loadedModel.transform(df).take(1) == model.transform(df).take(1)
True

Methods

clear(param)

Clears a param from the param map if it has been explicitly set.

copy([extra])

Creates a copy of this instance with the same uid and some extra params.

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optionally default values and user-supplied values.

extractParamMap([extra])

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

fit(dataset[, params])

Fits a model to the input dataset with optional parameters.

fitMultiple(dataset, paramMaps)

Fits a model to the input dataset for each param map in paramMaps.

getInputCol()

Gets the value of inputCol or its default value.

getK()

Gets the value of k or its default value.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value.

getOutputCol()

Gets the value of outputCol or its default value.

getParam(paramName)

Gets a param by its name.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

read()

Returns an MLReader instance for this class.

save(path)

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param, value)

Sets a parameter in the embedded param map.

setInputCol(value)

Sets the value of inputCol.

setK(value)

Sets the value of k.

setOutputCol(value)

Sets the value of outputCol.

setParams(self, \*[, k, inputCol, outputCol])

Set params for this PCA.

write()

Returns an MLWriter instance for this ML instance.

Attributes

inputCol

k

outputCol

params

Returns all params ordered by name.

Methods Documentation

clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then make a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.

Parameters:
extradict, optional

Extra parameters to copy to the new instance

Returns:
JavaParams

Copy of this instance

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optionally default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters:
extradict, optional

extra param values

Returns:
dict

merged param map

fit(dataset, params=None)

Fits a model to the input dataset with optional parameters.

New in version 1.3.0.

Parameters:
datasetpyspark.sql.DataFrame

input dataset.

paramsdict or list or tuple, optional

an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.

Returns:
Transformer or a list of Transformer

fitted model(s)

fitMultiple(dataset, paramMaps)

Fits a model to the input dataset for each param map in paramMaps.

New in version 2.3.0.

Parameters:
datasetpyspark.sql.DataFrame

input dataset.

paramMapscollections.abc.Sequence

A Sequence of param maps.

Returns:
_FitMultipleIterator

A thread safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model) where model was fit using paramMaps[index]. index values may not be sequential.

getInputCol()

Gets the value of inputCol or its default value.

getK()

Gets the value of k or its default value.

New in version 1.5.0.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets the value of outputCol or its default value.

getParam(paramName)

Gets a param by its name.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

classmethod read()

Returns an MLReader instance for this class.

save(path)

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param, value)

Sets a parameter in the embedded param map.

setInputCol(value)[source]

Sets the value of inputCol.

setK(value)[source]

Sets the value of k.

New in version 1.5.0.

setOutputCol(value)[source]

Sets the value of outputCol.

setParams(self, \*, k=None, inputCol=None, outputCol=None)[source]

Set params for this PCA.

New in version 1.5.0.

write()

Returns an MLWriter instance for this ML instance.

Attributes Documentation

inputCol = Param(parent='undefined', name='inputCol', doc='input column name.')
k = Param(parent='undefined', name='k', doc='the number of principal components')
outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.