pyspark.ml package
ML Pipeline APIs
DataFrame-based machine learning APIs to let users quickly assemble and configure practical machine learning pipelines.
class pyspark.ml.Transformer
Abstract class for transformers that transform one dataset into another.
New in version 1.3.0.
- copy(extra=None): Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient.
  Parameters: extra – Extra parameters to copy to the new instance
  Returns: Copy of this instance
- explainParam(param): Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams(): Returns the documentation of all params with their optional default values and user-supplied values.
- extractParamMap(extra=None): Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
  Parameters: extra – extra param values
  Returns: merged param map
- getOrDefault(param): Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getParam(paramName): Gets a param by its name.
- hasDefault(param): Checks whether a param has a default value.
- hasParam(paramName): Tests whether this instance contains a param with a given (string) name.
- isDefined(param): Checks whether a param is explicitly set by the user or has a default value.
- isSet(param): Checks whether a param is explicitly set by the user.
- params: Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
- transform(dataset, params=None): Transforms the input dataset with optional parameters.
  Parameters:
  - dataset – input dataset, which is an instance of pyspark.sql.DataFrame
  - params – an optional param map that overrides embedded params.
  Returns: transformed dataset
  New in version 1.3.0.
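A minimal sketch (not from the reference entries above) of a custom Transformer: concrete PySpark transformers are assumed to override an internal _transform hook that transform() delegates to, and the class, column names, and the SparkSession named spark are illustrative, as in the doctests later in this module.

from pyspark.ml import Transformer
from pyspark.sql.functions import upper

class UpperCaser(Transformer):
    """Adds an upper-cased copy of `inputCol` as `outputCol` (illustrative only)."""

    def __init__(self, inputCol, outputCol):
        super(UpperCaser, self).__init__()
        self.inputCol = inputCol
        self.outputCol = outputCol

    def _transform(self, dataset):
        # transform(dataset) on the base class is assumed to call this hook
        return dataset.withColumn(self.outputCol, upper(dataset[self.inputCol]))

df = spark.createDataFrame([("spark",)], ["text"])
UpperCaser("text", "TEXT").transform(df).show()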
class pyspark.ml.Estimator
Abstract class for estimators that fit models to data.
New in version 1.3.0.
- copy(extra=None): Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient.
  Parameters: extra – Extra parameters to copy to the new instance
  Returns: Copy of this instance
- explainParam(param): Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams(): Returns the documentation of all params with their optional default values and user-supplied values.
- extractParamMap(extra=None): Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
  Parameters: extra – extra param values
  Returns: merged param map
- fit(dataset, params=None): Fits a model to the input dataset with optional parameters.
  Parameters:
  - dataset – input dataset, which is an instance of pyspark.sql.DataFrame
  - params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
  Returns: fitted model(s)
  New in version 1.3.0.
- getOrDefault(param): Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getParam(paramName): Gets a param by its name.
- hasDefault(param): Checks whether a param has a default value.
- hasParam(paramName): Tests whether this instance contains a param with a given (string) name.
- isDefined(param): Checks whether a param is explicitly set by the user or has a default value.
- isSet(param): Checks whether a param is explicitly set by the user.
- params: Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
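A small sketch of the two forms of fit() described above, using CountVectorizer (an Estimator documented later in this module); the data, column names, and the SparkSession named spark are illustrative.

from pyspark.ml.feature import CountVectorizer

df = spark.createDataFrame([(["a", "b", "a"],), (["b", "c"],)], ["raw"])
cv = CountVectorizer(inputCol="raw", outputCol="vectors")

# A single param map overrides the embedded params for this call only.
model = cv.fit(df, {cv.minDF: 2.0})

# A list of param maps returns one fitted model per map.
models = cv.fit(df, [{cv.vocabSize: 2}, {cv.vocabSize: 3}])
print(len(models))  # 2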
class pyspark.ml.Model
Abstract class for models that are fitted by estimators.
New in version 1.4.0.
- copy(extra=None): Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient.
  Parameters: extra – Extra parameters to copy to the new instance
  Returns: Copy of this instance
- explainParam(param): Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams(): Returns the documentation of all params with their optional default values and user-supplied values.
- extractParamMap(extra=None): Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
  Parameters: extra – extra param values
  Returns: merged param map
- getOrDefault(param): Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getParam(paramName): Gets a param by its name.
- hasDefault(param): Checks whether a param has a default value.
- hasParam(paramName): Tests whether this instance contains a param with a given (string) name.
- isDefined(param): Checks whether a param is explicitly set by the user or has a default value.
- isSet(param): Checks whether a param is explicitly set by the user.
- params: Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
- transform(dataset, params=None): Transforms the input dataset with optional parameters.
  Parameters:
  - dataset – input dataset, which is an instance of pyspark.sql.DataFrame
  - params – an optional param map that overrides embedded params.
  Returns: transformed dataset
  New in version 1.3.0.
class pyspark.ml.Pipeline(stages=None)
A simple pipeline, which acts as an estimator. A Pipeline consists of a sequence of stages, each of which is either an Estimator or a Transformer. When Pipeline.fit() is called, the stages are executed in order. If a stage is an Estimator, its Estimator.fit() method will be called on the input dataset to fit a model. Then the model, which is a transformer, will be used to transform the dataset as the input to the next stage. If a stage is a Transformer, its Transformer.transform() method will be called to produce the dataset for the next stage. The fitted model from a Pipeline is a PipelineModel, which consists of fitted models and transformers corresponding to the pipeline stages. If stages is an empty list, the pipeline acts as an identity transformer.
New in version 1.3.0.
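A hedged usage sketch of the fit/transform flow described above, built from CountVectorizer (covered in this module) and Tokenizer from pyspark.ml.feature; the data and column names are illustrative, and a SparkSession named spark is assumed as in the doctests elsewhere in this module.

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, CountVectorizer

df = spark.createDataFrame([(0, "a b c"), (1, "a b b c a")], ["id", "text"])
tokenizer = Tokenizer(inputCol="text", outputCol="words")
cv = CountVectorizer(inputCol="words", outputCol="features")

pipeline = Pipeline(stages=[tokenizer, cv])   # a Transformer followed by an Estimator
model = pipeline.fit(df)                      # returns a PipelineModel
model.transform(df).select("id", "features").show(truncate=False)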
- copy(extra=None): Creates a copy of this instance.
  Parameters: extra – extra parameters
  Returns: new instance
  New in version 1.4.0.
- explainParam(param): Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams(): Returns the documentation of all params with their optional default values and user-supplied values.
- extractParamMap(extra=None): Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
  Parameters: extra – extra param values
  Returns: merged param map
- fit(dataset, params=None): Fits a model to the input dataset with optional parameters.
  Parameters:
  - dataset – input dataset, which is an instance of pyspark.sql.DataFrame
  - params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
  Returns: fitted model(s)
  New in version 1.3.0.
- getOrDefault(param): Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getParam(paramName): Gets a param by its name.
- hasDefault(param): Checks whether a param has a default value.
- hasParam(paramName): Tests whether this instance contains a param with a given (string) name.
- isDefined(param): Checks whether a param is explicitly set by the user or has a default value.
- isSet(param): Checks whether a param is explicitly set by the user.
- load(path): Reads an ML instance from the input path, a shortcut of read().load(path).
- params: Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
- save(path): Save this ML instance to the given path, a shortcut of write().save(path).
  New in version 2.0.0.
- setStages(value): Set pipeline stages.
  Parameters: value – a list of transformers or estimators
  Returns: the pipeline instance
  New in version 1.3.0.
- stages = Param(parent='undefined', name='stages', doc='a list of pipeline stages')
class pyspark.ml.PipelineModel(stages)
Represents a compiled pipeline with transformers and fitted models.
New in version 1.3.0.
- copy(extra=None): Creates a copy of this instance.
  Parameters: extra – extra parameters
  Returns: new instance
  New in version 1.4.0.
- explainParam(param): Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams(): Returns the documentation of all params with their optional default values and user-supplied values.
- extractParamMap(extra=None): Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
  Parameters: extra – extra param values
  Returns: merged param map
- getOrDefault(param): Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getParam(paramName): Gets a param by its name.
- hasDefault(param): Checks whether a param has a default value.
- hasParam(paramName): Tests whether this instance contains a param with a given (string) name.
- isDefined(param): Checks whether a param is explicitly set by the user or has a default value.
- isSet(param): Checks whether a param is explicitly set by the user.
- load(path): Reads an ML instance from the input path, a shortcut of read().load(path).
- params: Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
- save(path): Save this ML instance to the given path, a shortcut of write().save(path).
  New in version 2.0.0.
- transform(dataset, params=None): Transforms the input dataset with optional parameters.
  Parameters:
  - dataset – input dataset, which is an instance of pyspark.sql.DataFrame
  - params – an optional param map that overrides embedded params.
  Returns: transformed dataset
  New in version 1.3.0.
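A short persistence sketch for PipelineModel using the save()/load() shortcuts listed above; a writable temp_path and a SparkSession named spark are assumed, as in the doctests elsewhere in this module, and the data is illustrative.

from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import Binarizer

df = spark.createDataFrame([(0.5,), (1.5,)], ["values"])
stage = Binarizer(threshold=1.0, inputCol="values", outputCol="features")
model = Pipeline(stages=[stage]).fit(df)      # a PipelineModel (all stages are transformers)

model_path = temp_path + "/pipeline-model"
model.save(model_path)                        # shortcut for write().save(path)
reloaded = PipelineModel.load(model_path)     # shortcut for read().load(path)
reloaded.transform(df).show()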
pyspark.ml.param module
class pyspark.ml.param.Param(parent, name, doc, typeConverter=None)
A param with self-contained documentation.
New in version 1.3.0.
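Params are exposed as class attributes on each pipeline stage and can serve as keys in a param map. A brief sketch using Binarizer from this module; the column names are illustrative.

from pyspark.ml.feature import Binarizer

binarizer = Binarizer(threshold=1.0, inputCol="values", outputCol="features")
print(binarizer.threshold.name)             # 'threshold'
print(binarizer.explainParam("threshold"))  # name, doc, default and current value
override = {binarizer.threshold: 0.5}       # a param map keyed by Param objects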
class pyspark.ml.param.Params
Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.
New in version 1.3.0.
- copy(extra=None): Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient.
  Parameters: extra – Extra parameters to copy to the new instance
  Returns: Copy of this instance
- explainParam(param): Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams(): Returns the documentation of all params with their optional default values and user-supplied values.
- extractParamMap(extra=None): Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
  Parameters: extra – extra param values
  Returns: merged param map
- getOrDefault(param): Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
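A sketch of the merge order described under extractParamMap() above (default values < user-supplied values < extra), using Binarizer for illustration; the printed values are what that ordering implies.

from pyspark.ml.feature import Binarizer

defaulted = Binarizer(inputCol="values", outputCol="features")   # threshold keeps its default 0.0
user_set = Binarizer(threshold=1.0, inputCol="values", outputCol="features")

print(defaulted.extractParamMap()[defaulted.threshold])          # 0.0 (default value)
print(user_set.extractParamMap()[user_set.threshold])            # 1.0 (user-supplied value)
print(user_set.extractParamMap({user_set.threshold: 2.0})[user_set.threshold])  # 2.0 (extra wins)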
pyspark.ml.feature module
class pyspark.ml.feature.Binarizer(threshold=0.0, inputCol=None, outputCol=None)
Binarize a column of continuous features given a threshold.
>>> df = spark.createDataFrame([(0.5,)], ["values"])
>>> binarizer = Binarizer(threshold=1.0, inputCol="values", outputCol="features")
>>> binarizer.transform(df).head().features
0.0
>>> binarizer.setParams(outputCol="freqs").transform(df).head().freqs
0.0
>>> params = {binarizer.threshold: -0.5, binarizer.outputCol: "vector"}
>>> binarizer.transform(df, params).head().vector
1.0
>>> binarizerPath = temp_path + "/binarizer"
>>> binarizer.save(binarizerPath)
>>> loadedBinarizer = Binarizer.load(binarizerPath)
>>> loadedBinarizer.getThreshold() == binarizer.getThreshold()
True
New in version 1.4.0.
- copy(extra=None): Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
  Parameters: extra – Extra parameters to copy to the new instance
  Returns: Copy of this instance
- explainParam(param): Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams(): Returns the documentation of all params with their optional default values and user-supplied values.
- extractParamMap(extra=None): Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
  Parameters: extra – extra param values
  Returns: merged param map
- getInputCol(): Gets the value of inputCol or its default value.
- getOrDefault(param): Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getOutputCol(): Gets the value of outputCol or its default value.
- getParam(paramName): Gets a param by its name.
- hasDefault(param): Checks whether a param has a default value.
- hasParam(paramName): Tests whether this instance contains a param with a given (string) name.
- inputCol = Param(parent='undefined', name='inputCol', doc='input column name.')
- isDefined(param): Checks whether a param is explicitly set by the user or has a default value.
- isSet(param): Checks whether a param is explicitly set by the user.
- load(path): Reads an ML instance from the input path, a shortcut of read().load(path).
- outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
- params: Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
- read(): Returns an MLReader instance for this class.
- save(path): Save this ML instance to the given path, a shortcut of write().save(path).
- setParams(self, threshold=0.0, inputCol=None, outputCol=None): Sets params for this Binarizer.
  New in version 1.4.0.
- threshold = Param(parent='undefined', name='threshold', doc='threshold in binary classification prediction, in range [0, 1]')
- transform(dataset, params=None): Transforms the input dataset with optional parameters.
  Parameters:
  - dataset – input dataset, which is an instance of pyspark.sql.DataFrame
  - params – an optional param map that overrides embedded params.
  Returns: transformed dataset
  New in version 1.3.0.
- write(): Returns an MLWriter instance for this ML instance.
class pyspark.ml.feature.BucketedRandomProjectionLSH(inputCol=None, outputCol=None, seed=None, numHashTables=1, bucketLength=None)
Note: Experimental
LSH class for Euclidean distance metrics. The input is dense or sparse vectors, each of which represents a point in the Euclidean distance space. The output will be vectors of configurable dimension. Hash values in the same dimension are calculated by the same hash function.
See also
>>> from pyspark.ml.linalg import Vectors
>>> from pyspark.sql.functions import col
>>> data = [(0, Vectors.dense([-1.0, -1.0]),),
...         (1, Vectors.dense([-1.0, 1.0]),),
...         (2, Vectors.dense([1.0, -1.0]),),
...         (3, Vectors.dense([1.0, 1.0]),)]
>>> df = spark.createDataFrame(data, ["id", "features"])
>>> brp = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes",
...                                   seed=12345, bucketLength=1.0)
>>> model = brp.fit(df)
>>> model.transform(df).head()
Row(id=0, features=DenseVector([-1.0, -1.0]), hashes=[DenseVector([-1.0])])
>>> data2 = [(4, Vectors.dense([2.0, 2.0]),),
...          (5, Vectors.dense([2.0, 3.0]),),
...          (6, Vectors.dense([3.0, 2.0]),),
...          (7, Vectors.dense([3.0, 3.0]),)]
>>> df2 = spark.createDataFrame(data2, ["id", "features"])
>>> model.approxNearestNeighbors(df2, Vectors.dense([1.0, 2.0]), 1).collect()
[Row(id=4, features=DenseVector([2.0, 2.0]), hashes=[DenseVector([1.0])], distCol=1.0)]
>>> model.approxSimilarityJoin(df, df2, 3.0, distCol="EuclideanDistance").select(
...     col("datasetA.id").alias("idA"),
...     col("datasetB.id").alias("idB"),
...     col("EuclideanDistance")).show()
+---+---+-----------------+
|idA|idB|EuclideanDistance|
+---+---+-----------------+
|  3|  6| 2.23606797749979|
+---+---+-----------------+
...
>>> brpPath = temp_path + "/brp"
>>> brp.save(brpPath)
>>> brp2 = BucketedRandomProjectionLSH.load(brpPath)
>>> brp2.getBucketLength() == brp.getBucketLength()
True
>>> modelPath = temp_path + "/brp-model"
>>> model.save(modelPath)
>>> model2 = BucketedRandomProjectionLSHModel.load(modelPath)
>>> model.transform(df).head().hashes == model2.transform(df).head().hashes
True
New in version 2.2.0.
- bucketLength = Param(parent='undefined', name='bucketLength', doc='the length of each hash bucket, a larger bucket lowers the false negative rate.')
- copy(extra=None): Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
  Parameters: extra – Extra parameters to copy to the new instance
  Returns: Copy of this instance
- explainParam(param): Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams(): Returns the documentation of all params with their optional default values and user-supplied values.
- extractParamMap(extra=None): Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
  Parameters: extra – extra param values
  Returns: merged param map
- fit(dataset, params=None): Fits a model to the input dataset with optional parameters.
  Parameters:
  - dataset – input dataset, which is an instance of pyspark.sql.DataFrame
  - params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
  Returns: fitted model(s)
  New in version 1.3.0.
- getBucketLength(): Gets the value of bucketLength or its default value.
  New in version 2.2.0.
- getInputCol(): Gets the value of inputCol or its default value.
- getNumHashTables(): Gets the value of numHashTables or its default value.
- getOrDefault(param): Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getOutputCol(): Gets the value of outputCol or its default value.
- getParam(paramName): Gets a param by its name.
- getSeed(): Gets the value of seed or its default value.
- hasDefault(param): Checks whether a param has a default value.
- hasParam(paramName): Tests whether this instance contains a param with a given (string) name.
- inputCol = Param(parent='undefined', name='inputCol', doc='input column name.')
- isDefined(param): Checks whether a param is explicitly set by the user or has a default value.
- isSet(param): Checks whether a param is explicitly set by the user.
- load(path): Reads an ML instance from the input path, a shortcut of read().load(path).
- numHashTables = Param(parent='undefined', name='numHashTables', doc='number of hash tables, where increasing number of hash tables lowers the false negative rate, and decreasing it improves the running performance.')
- outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
- params: Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
- read(): Returns an MLReader instance for this class.
- save(path): Save this ML instance to the given path, a shortcut of write().save(path).
- seed = Param(parent='undefined', name='seed', doc='random seed.')
- setBucketLength(value): Sets the value of bucketLength.
  New in version 2.2.0.
- setNumHashTables(value): Sets the value of numHashTables.
- setParams(self, inputCol=None, outputCol=None, seed=None, numHashTables=1, bucketLength=None): Sets params for this BucketedRandomProjectionLSH.
  New in version 2.2.0.
- write(): Returns an MLWriter instance for this ML instance.
class pyspark.ml.feature.BucketedRandomProjectionLSHModel(java_model=None)
Note: Experimental
Model fitted by BucketedRandomProjectionLSH, where multiple random vectors are stored. The vectors are normalized to be unit vectors and each vector is used in a hash function: \(h_i(x) = floor(r_i \cdot x / bucketLength)\), where \(r_i\) is the i-th random unit vector. The number of buckets will be (max L2 norm of input vectors) / bucketLength.
New in version 2.2.0.
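A small numeric illustration (plain Python, not library code) of the hash family described above: project the input onto a random unit vector and floor-divide the projection by bucketLength.

import math

def brp_hash(x, r, bucket_length):
    """h(x) = floor(r . x / bucketLength) for an example unit vector r."""
    dot = sum(xi * ri for xi, ri in zip(x, r))
    return math.floor(dot / bucket_length)

r = [0.6, 0.8]                         # an example unit vector (0.36 + 0.64 = 1)
print(brp_hash([2.0, 2.0], r, 1.0))    # floor(2.8) = 2
print(brp_hash([2.0, 2.0], r, 2.0))    # a larger bucketLength merges nearby points: floor(1.4) = 1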
- approxNearestNeighbors(dataset, key, numNearestNeighbors, distCol='distCol'): Given a large dataset and an item, approximately find at most k items which have the closest distance to the item. If the outputCol is missing, the method will transform the data; if the outputCol exists, it will use that. This allows caching of the transformed data when necessary.
  Note: This method is experimental and will likely change behavior in the next release.
  Parameters:
  - dataset – The dataset to search for nearest neighbors of the key.
  - key – Feature vector representing the item to search for.
  - numNearestNeighbors – The maximum number of nearest neighbors.
  - distCol – Output column for storing the distance between each result row and the key. Use "distCol" as the default value if it's not specified.
  Returns: A dataset containing at most k items closest to the key. A column "distCol" is added to show the distance between each row and the key.
- approxSimilarityJoin(datasetA, datasetB, threshold, distCol='distCol'): Join two datasets to approximately find all pairs of rows whose distance is smaller than the threshold. If the outputCol is missing, the method will transform the data; if the outputCol exists, it will use that. This allows caching of the transformed data when necessary.
  Parameters:
  - datasetA – One of the datasets to join.
  - datasetB – Another dataset to join.
  - threshold – The threshold for the distance of row pairs.
  - distCol – Output column for storing the distance between each pair of rows. Use "distCol" as the default value if it's not specified.
  Returns: A joined dataset containing pairs of rows. The original rows are in columns "datasetA" and "datasetB", and a column "distCol" is added to show the distance between each pair.
- copy(extra=None): Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
  Parameters: extra – Extra parameters to copy to the new instance
  Returns: Copy of this instance
- explainParam(param): Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams(): Returns the documentation of all params with their optional default values and user-supplied values.
- extractParamMap(extra=None): Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
  Parameters: extra – extra param values
  Returns: merged param map
- getOrDefault(param): Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getParam(paramName): Gets a param by its name.
- hasDefault(param): Checks whether a param has a default value.
- hasParam(paramName): Tests whether this instance contains a param with a given (string) name.
- isDefined(param): Checks whether a param is explicitly set by the user or has a default value.
- isSet(param): Checks whether a param is explicitly set by the user.
- load(path): Reads an ML instance from the input path, a shortcut of read().load(path).
- params: Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
- read(): Returns an MLReader instance for this class.
- save(path): Save this ML instance to the given path, a shortcut of write().save(path).
- transform(dataset, params=None): Transforms the input dataset with optional parameters.
  Parameters:
  - dataset – input dataset, which is an instance of pyspark.sql.DataFrame
  - params – an optional param map that overrides embedded params.
  Returns: transformed dataset
  New in version 1.3.0.
- write(): Returns an MLWriter instance for this ML instance.
class pyspark.ml.feature.Bucketizer(splits=None, inputCol=None, outputCol=None, handleInvalid='error')
Maps a column of continuous features to a column of feature buckets.
>>> values = [(0.1,), (0.4,), (1.2,), (1.5,), (float("nan"),), (float("nan"),)]
>>> df = spark.createDataFrame(values, ["values"])
>>> bucketizer = Bucketizer(splits=[-float("inf"), 0.5, 1.4, float("inf")],
...                         inputCol="values", outputCol="buckets")
>>> bucketed = bucketizer.setHandleInvalid("keep").transform(df).collect()
>>> len(bucketed)
6
>>> bucketed[0].buckets
0.0
>>> bucketed[1].buckets
0.0
>>> bucketed[2].buckets
1.0
>>> bucketed[3].buckets
2.0
>>> bucketizer.setParams(outputCol="b").transform(df).head().b
0.0
>>> bucketizerPath = temp_path + "/bucketizer"
>>> bucketizer.save(bucketizerPath)
>>> loadedBucketizer = Bucketizer.load(bucketizerPath)
>>> loadedBucketizer.getSplits() == bucketizer.getSplits()
True
>>> bucketed = bucketizer.setHandleInvalid("skip").transform(df).collect()
>>> len(bucketed)
4
New in version 1.4.0.
- copy(extra=None): Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
  Parameters: extra – Extra parameters to copy to the new instance
  Returns: Copy of this instance
- explainParam(param): Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams(): Returns the documentation of all params with their optional default values and user-supplied values.
- extractParamMap(extra=None): Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
  Parameters: extra – extra param values
  Returns: merged param map
- getHandleInvalid(): Gets the value of handleInvalid or its default value.
  New in version 2.1.0.
- getInputCol(): Gets the value of inputCol or its default value.
- getOrDefault(param): Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getOutputCol(): Gets the value of outputCol or its default value.
- getParam(paramName): Gets a param by its name.
- handleInvalid = Param(parent='undefined', name='handleInvalid', doc="how to handle invalid entries. Options are 'skip' (filter out rows with invalid values), 'error' (throw an error), or 'keep' (keep invalid values in a special additional bucket).")
- hasDefault(param): Checks whether a param has a default value.
- hasParam(paramName): Tests whether this instance contains a param with a given (string) name.
- inputCol = Param(parent='undefined', name='inputCol', doc='input column name.')
- isDefined(param): Checks whether a param is explicitly set by the user or has a default value.
- isSet(param): Checks whether a param is explicitly set by the user.
- load(path): Reads an ML instance from the input path, a shortcut of read().load(path).
- outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
- params: Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
- read(): Returns an MLReader instance for this class.
- save(path): Save this ML instance to the given path, a shortcut of write().save(path).
- setHandleInvalid(value): Sets the value of handleInvalid.
  New in version 2.1.0.
- setParams(self, splits=None, inputCol=None, outputCol=None, handleInvalid="error"): Sets params for this Bucketizer.
  New in version 1.4.0.
- splits = Param(parent='undefined', name='splits', doc='Split points for mapping continuous features into buckets. With n+1 splits, there are n buckets. A bucket defined by splits x,y holds values in the range [x,y) except the last bucket, which also includes y. The splits should be of length >= 3 and strictly increasing. Values at -inf, inf must be explicitly provided to cover all Double values; otherwise, values outside the splits specified will be treated as errors.')
- transform(dataset, params=None): Transforms the input dataset with optional parameters.
  Parameters:
  - dataset – input dataset, which is an instance of pyspark.sql.DataFrame
  - params – an optional param map that overrides embedded params.
  Returns: transformed dataset
  New in version 1.3.0.
- write(): Returns an MLWriter instance for this ML instance.
class pyspark.ml.feature.ChiSqSelector(numTopFeatures=50, featuresCol='features', outputCol=None, labelCol='label', selectorType='numTopFeatures', percentile=0.1, fpr=0.05, fdr=0.05, fwe=0.05)
Note: Experimental
Chi-Squared feature selection, which selects categorical features to use for predicting a categorical label. The selector supports different selection methods: numTopFeatures, percentile, fpr, fdr, fwe.
- numTopFeatures chooses a fixed number of top features according to a chi-squared test.
- percentile is similar but chooses a fraction of all features instead of a fixed number.
- fpr chooses all features whose p-values are below a threshold, thus controlling the false positive rate of selection.
- fdr uses the Benjamini-Hochberg procedure to choose all features whose false discovery rate is below a threshold.
- fwe chooses all features whose p-values are below a threshold. The threshold is scaled by 1/numFeatures, thus controlling the family-wise error rate of selection.
By default, the selection method is numTopFeatures, with the default number of top features set to 50. A sketch of switching the selection method follows the example below.
>>> from pyspark.ml.linalg import Vectors
>>> df = spark.createDataFrame(
...     [(Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0),
...      (Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0),
...      (Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0)],
...     ["features", "label"])
>>> selector = ChiSqSelector(numTopFeatures=1, outputCol="selectedFeatures")
>>> model = selector.fit(df)
>>> model.transform(df).head().selectedFeatures
DenseVector([18.0])
>>> model.selectedFeatures
[2]
>>> chiSqSelectorPath = temp_path + "/chi-sq-selector"
>>> selector.save(chiSqSelectorPath)
>>> loadedSelector = ChiSqSelector.load(chiSqSelectorPath)
>>> loadedSelector.getNumTopFeatures() == selector.getNumTopFeatures()
True
>>> modelPath = temp_path + "/chi-sq-selector-model"
>>> model.save(modelPath)
>>> loadedModel = ChiSqSelectorModel.load(modelPath)
>>> loadedModel.selectedFeatures == model.selectedFeatures
True
New in version 2.0.0.
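A brief sketch of switching the selection method via selectorType, as described in the list above; the data, thresholds, and a SparkSession named spark are illustrative assumptions.

from pyspark.ml.feature import ChiSqSelector
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0),
     (Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0),
     (Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0)],
    ["features", "label"])

# Keep the top half of features by p-value instead of a fixed count.
percent_selector = ChiSqSelector(selectorType="percentile", percentile=0.5,
                                 outputCol="selectedFeatures")
print(percent_selector.fit(df).selectedFeatures)   # indices of the kept features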
- copy(extra=None): Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
  Parameters: extra – Extra parameters to copy to the new instance
  Returns: Copy of this instance
- explainParam(param): Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams(): Returns the documentation of all params with their optional default values and user-supplied values.
- extractParamMap(extra=None): Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
  Parameters: extra – extra param values
  Returns: merged param map
- fdr = Param(parent='undefined', name='fdr', doc='The upper bound of the expected false discovery rate.')
- featuresCol = Param(parent='undefined', name='featuresCol', doc='features column name.')
- fit(dataset, params=None): Fits a model to the input dataset with optional parameters.
  Parameters:
  - dataset – input dataset, which is an instance of pyspark.sql.DataFrame
  - params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
  Returns: fitted model(s)
  New in version 1.3.0.
- fpr = Param(parent='undefined', name='fpr', doc='The highest p-value for features to be kept.')
- fwe = Param(parent='undefined', name='fwe', doc='The upper bound of the expected family-wise error rate.')
- getFeaturesCol(): Gets the value of featuresCol or its default value.
- getLabelCol(): Gets the value of labelCol or its default value.
- getNumTopFeatures(): Gets the value of numTopFeatures or its default value.
  New in version 2.0.0.
- getOrDefault(param): Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getOutputCol(): Gets the value of outputCol or its default value.
- getParam(paramName): Gets a param by its name.
- getSelectorType(): Gets the value of selectorType or its default value.
  New in version 2.1.0.
- hasDefault(param): Checks whether a param has a default value.
- hasParam(paramName): Tests whether this instance contains a param with a given (string) name.
- isDefined(param): Checks whether a param is explicitly set by the user or has a default value.
- isSet(param): Checks whether a param is explicitly set by the user.
- labelCol = Param(parent='undefined', name='labelCol', doc='label column name.')
- load(path): Reads an ML instance from the input path, a shortcut of read().load(path).
- numTopFeatures = Param(parent='undefined', name='numTopFeatures', doc='Number of features that selector will select, ordered by ascending p-value. If the number of features is < numTopFeatures, then this will select all features.')
- outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
- params: Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
- percentile = Param(parent='undefined', name='percentile', doc='Percentile of features that selector will select, ordered by ascending p-value.')
- read(): Returns an MLReader instance for this class.
- save(path): Save this ML instance to the given path, a shortcut of write().save(path).
- selectorType = Param(parent='undefined', name='selectorType', doc='The selector type of the ChiSqSelector. Supported options: numTopFeatures (default), percentile, fpr, fdr, fwe.')
- setFdr(value): Sets the value of fdr. Only applicable when selectorType = "fdr".
  New in version 2.2.0.
- setFeaturesCol(value): Sets the value of featuresCol.
- setFpr(value): Sets the value of fpr. Only applicable when selectorType = "fpr".
  New in version 2.1.0.
- setFwe(value): Sets the value of fwe. Only applicable when selectorType = "fwe".
  New in version 2.2.0.
- setNumTopFeatures(value): Sets the value of numTopFeatures. Only applicable when selectorType = "numTopFeatures".
  New in version 2.0.0.
- setParams(self, numTopFeatures=50, featuresCol="features", outputCol=None, labelCol="labels", selectorType="numTopFeatures", percentile=0.1, fpr=0.05, fdr=0.05, fwe=0.05): Sets params for this ChiSqSelector.
  New in version 2.0.0.
- setPercentile(value): Sets the value of percentile. Only applicable when selectorType = "percentile".
  New in version 2.1.0.
- setSelectorType(value): Sets the value of selectorType.
  New in version 2.1.0.
- write(): Returns an MLWriter instance for this ML instance.
class pyspark.ml.feature.ChiSqSelectorModel(java_model=None)
Note: Experimental
Model fitted by ChiSqSelector.
New in version 2.0.0.
- copy(extra=None): Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
  Parameters: extra – Extra parameters to copy to the new instance
  Returns: Copy of this instance
- explainParam(param): Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams(): Returns the documentation of all params with their optional default values and user-supplied values.
- extractParamMap(extra=None): Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
  Parameters: extra – extra param values
  Returns: merged param map
- getOrDefault(param): Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getParam(paramName): Gets a param by its name.
- hasDefault(param): Checks whether a param has a default value.
- hasParam(paramName): Tests whether this instance contains a param with a given (string) name.
- isDefined(param): Checks whether a param is explicitly set by the user or has a default value.
- isSet(param): Checks whether a param is explicitly set by the user.
- load(path): Reads an ML instance from the input path, a shortcut of read().load(path).
- params: Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
- read(): Returns an MLReader instance for this class.
- save(path): Save this ML instance to the given path, a shortcut of write().save(path).
- selectedFeatures: List of indices to select (filter).
  New in version 2.0.0.
- transform(dataset, params=None): Transforms the input dataset with optional parameters.
  Parameters:
  - dataset – input dataset, which is an instance of pyspark.sql.DataFrame
  - params – an optional param map that overrides embedded params.
  Returns: transformed dataset
  New in version 1.3.0.
- write(): Returns an MLWriter instance for this ML instance.
class pyspark.ml.feature.CountVectorizer(minTF=1.0, minDF=1.0, vocabSize=262144, binary=False, inputCol=None, outputCol=None)
Extracts a vocabulary from document collections and generates a CountVectorizerModel.
>>> df = spark.createDataFrame(
...     [(0, ["a", "b", "c"]), (1, ["a", "b", "b", "c", "a"])],
...     ["label", "raw"])
>>> cv = CountVectorizer(inputCol="raw", outputCol="vectors")
>>> model = cv.fit(df)
>>> model.transform(df).show(truncate=False)
+-----+---------------+-------------------------+
|label|raw            |vectors                  |
+-----+---------------+-------------------------+
|0    |[a, b, c]      |(3,[0,1,2],[1.0,1.0,1.0])|
|1    |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
+-----+---------------+-------------------------+
...
>>> sorted(model.vocabulary) == ['a', 'b', 'c']
True
>>> countVectorizerPath = temp_path + "/count-vectorizer"
>>> cv.save(countVectorizerPath)
>>> loadedCv = CountVectorizer.load(countVectorizerPath)
>>> loadedCv.getMinDF() == cv.getMinDF()
True
>>> loadedCv.getMinTF() == cv.getMinTF()
True
>>> loadedCv.getVocabSize() == cv.getVocabSize()
True
>>> modelPath = temp_path + "/count-vectorizer-model"
>>> model.save(modelPath)
>>> loadedModel = CountVectorizerModel.load(modelPath)
>>> loadedModel.vocabulary == model.vocabulary
True
New in version 1.6.0.
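A short sketch (illustrative data, SparkSession named spark assumed) of the vocabulary-pruning and binary options documented below: minDF drops terms that appear in too few documents, and binary=True caps every nonzero count at 1.0.

from pyspark.ml.feature import CountVectorizer

df = spark.createDataFrame([(["a", "b", "a"],), (["a", "c"],)], ["raw"])
cv = CountVectorizer(inputCol="raw", outputCol="vectors", minDF=2.0, binary=True)
model = cv.fit(df)
print(model.vocabulary)                     # only 'a' appears in >= 2 documents
print(model.transform(df).head().vectors)   # nonzero counts are clamped to 1.0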
- binary = Param(parent='undefined', name='binary', doc='Binary toggle to control the output vector values. If True, all nonzero counts (after minTF filter applied) are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts. Default False')
- copy(extra=None): Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
  Parameters: extra – Extra parameters to copy to the new instance
  Returns: Copy of this instance
- explainParam(param): Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams(): Returns the documentation of all params with their optional default values and user-supplied values.
- extractParamMap(extra=None): Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
  Parameters: extra – extra param values
  Returns: merged param map
- fit(dataset, params=None): Fits a model to the input dataset with optional parameters.
  Parameters:
  - dataset – input dataset, which is an instance of pyspark.sql.DataFrame
  - params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
  Returns: fitted model(s)
  New in version 1.3.0.
- getInputCol(): Gets the value of inputCol or its default value.
- getOrDefault(param): Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getOutputCol(): Gets the value of outputCol or its default value.
- getParam(paramName): Gets a param by its name.
- hasDefault(param): Checks whether a param has a default value.
- hasParam(paramName): Tests whether this instance contains a param with a given (string) name.
- inputCol = Param(parent='undefined', name='inputCol', doc='input column name.')
- isDefined(param): Checks whether a param is explicitly set by the user or has a default value.
- isSet(param): Checks whether a param is explicitly set by the user.
- load(path): Reads an ML instance from the input path, a shortcut of read().load(path).
- minDF = Param(parent='undefined', name='minDF', doc='Specifies the minimum number of different documents a term must appear in to be included in the vocabulary. If this is an integer >= 1, this specifies the number of documents the term must appear in; if this is a double in [0,1), then this specifies the fraction of documents. Default 1.0')
- minTF = Param(parent='undefined', name='minTF', doc="Filter to ignore rare words in a document. For each document, terms with frequency/count less than the given threshold are ignored. If this is an integer >= 1, then this specifies a count (of times the term must appear in the document); if this is a double in [0,1), then this specifies a fraction (out of the document's token count). Note that the parameter is only used in transform of CountVectorizerModel and does not affect fitting. Default 1.0")
- outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
- params: Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
- read(): Returns an MLReader instance for this class.
- save(path): Save this ML instance to the given path, a shortcut of write().save(path).
- setParams(self, minTF=1.0, minDF=1.0, vocabSize=1 << 18, binary=False, inputCol=None, outputCol=None): Set the params for the CountVectorizer.
  New in version 1.6.0.
- vocabSize = Param(parent='undefined', name='vocabSize', doc='max size of the vocabulary. Default 1 << 18.')
- write(): Returns an MLWriter instance for this ML instance.
class pyspark.ml.feature.CountVectorizerModel(java_model=None)
Model fitted by CountVectorizer.
New in version 1.6.0.
- copy(extra=None): Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
  Parameters: extra – Extra parameters to copy to the new instance
  Returns: Copy of this instance
- explainParam(param): Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams(): Returns the documentation of all params with their optional default values and user-supplied values.
- extractParamMap(extra=None): Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
  Parameters: extra – extra param values
  Returns: merged param map
- getOrDefault(param): Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getParam(paramName): Gets a param by its name.
- hasDefault(param): Checks whether a param has a default value.
- hasParam(paramName): Tests whether this instance contains a param with a given (string) name.
- isDefined(param): Checks whether a param is explicitly set by the user or has a default value.
- isSet(param): Checks whether a param is explicitly set by the user.
- load(path): Reads an ML instance from the input path, a shortcut of read().load(path).
- params: Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
- read(): Returns an MLReader instance for this class.
- save(path): Save this ML instance to the given path, a shortcut of write().save(path).
- transform(dataset, params=None): Transforms the input dataset with optional parameters.
  Parameters:
  - dataset – input dataset, which is an instance of pyspark.sql.DataFrame
  - params – an optional param map that overrides embedded params.
  Returns: transformed dataset
  New in version 1.3.0.
- vocabulary: An array of terms in the vocabulary.
  New in version 1.6.0.
- write(): Returns an MLWriter instance for this ML instance.
class pyspark.ml.feature.DCT(inverse=False, inputCol=None, outputCol=None)
A feature transformer that takes the 1D discrete cosine transform of a real vector. No zero padding is performed on the input vector. It returns a real vector of the same length representing the DCT. The return vector is scaled such that the transform matrix is unitary (aka scaled DCT-II).
See also
>>> from pyspark.ml.linalg import Vectors
>>> df1 = spark.createDataFrame([(Vectors.dense([5.0, 8.0, 6.0]),)], ["vec"])
>>> dct = DCT(inverse=False, inputCol="vec", outputCol="resultVec")
>>> df2 = dct.transform(df1)
>>> df2.head().resultVec
DenseVector([10.969..., -0.707..., -2.041...])
>>> df3 = DCT(inverse=True, inputCol="resultVec", outputCol="origVec").transform(df2)
>>> df3.head().origVec
DenseVector([5.0, 8.0, 6.0])
>>> dctPath = temp_path + "/dct"
>>> dct.save(dctPath)
>>> loadedDtc = DCT.load(dctPath)
>>> loadedDtc.getInverse()
False
New in version 1.6.0.
- copy(extra=None): Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
  Parameters: extra – Extra parameters to copy to the new instance
  Returns: Copy of this instance
- explainParam(param): Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams(): Returns the documentation of all params with their optional default values and user-supplied values.
- extractParamMap(extra=None): Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
  Parameters: extra – extra param values
  Returns: merged param map
- getInputCol(): Gets the value of inputCol or its default value.
- getOrDefault(param): Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getOutputCol(): Gets the value of outputCol or its default value.
- getParam(paramName): Gets a param by its name.
- hasDefault(param): Checks whether a param has a default value.
- hasParam(paramName): Tests whether this instance contains a param with a given (string) name.
- inputCol = Param(parent='undefined', name='inputCol', doc='input column name.')
- inverse = Param(parent='undefined', name='inverse', doc='Set transformer to perform inverse DCT, default False.')
- isDefined(param): Checks whether a param is explicitly set by the user or has a default value.
- isSet(param): Checks whether a param is explicitly set by the user.
- load(path): Reads an ML instance from the input path, a shortcut of read().load(path).
- outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
- params: Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
- read(): Returns an MLReader instance for this class.
- save(path): Save this ML instance to the given path, a shortcut of write().save(path).
- setParams(self, inverse=False, inputCol=None, outputCol=None): Sets params for this DCT.
  New in version 1.6.0.
- transform(dataset, params=None): Transforms the input dataset with optional parameters.
  Parameters:
  - dataset – input dataset, which is an instance of pyspark.sql.DataFrame
  - params – an optional param map that overrides embedded params.
  Returns: transformed dataset
  New in version 1.3.0.
- write(): Returns an MLWriter instance for this ML instance.
class pyspark.ml.feature.ElementwiseProduct(scalingVec=None, inputCol=None, outputCol=None)
Outputs the Hadamard product (i.e., the element-wise product) of each input vector with a provided "weight" vector. In other words, it scales each column of the dataset by a scalar multiplier.
>>> from pyspark.ml.linalg import Vectors
>>> df = spark.createDataFrame([(Vectors.dense([2.0, 1.0, 3.0]),)], ["values"])
>>> ep = ElementwiseProduct(scalingVec=Vectors.dense([1.0, 2.0, 3.0]),
...                         inputCol="values", outputCol="eprod")
>>> ep.transform(df).head().eprod
DenseVector([2.0, 2.0, 9.0])
>>> ep.setParams(scalingVec=Vectors.dense([2.0, 3.0, 5.0])).transform(df).head().eprod
DenseVector([4.0, 3.0, 15.0])
>>> elementwiseProductPath = temp_path + "/elementwise-product"
>>> ep.save(elementwiseProductPath)
>>> loadedEp = ElementwiseProduct.load(elementwiseProductPath)
>>> loadedEp.getScalingVec() == ep.getScalingVec()
True
New in version 1.5.0.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then make a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: extra – Extra parameters to copy to the new instance Returns: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: extra – extra param values Returns: merged param map
-
getInputCol
()¶ Gets the value of inputCol or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getOutputCol
()¶ Gets the value of outputCol or its default value.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
inputCol
= Param(parent='undefined', name='inputCol', doc='input column name.')¶
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
load
(path)¶ Reads an ML instance from the input path, a shortcut of read().load(path).
-
outputCol
= Param(parent='undefined', name='outputCol', doc='output column name.')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of typeParam
.
-
read
()¶ Returns an MLReader instance for this class.
-
save
(path)¶ Save this ML instance to the given path, a shortcut of write().save(path).
-
scalingVec
= Param(parent='undefined', name='scalingVec', doc='Vector for hadamard product.')¶
-
setParams
(self, scalingVec=None, inputCol=None, outputCol=None)[source]¶ Sets params for this ElementwiseProduct.
New in version 1.5.0.
-
setScalingVec
(value)[source]¶ Sets the value of
scalingVec
.New in version 2.0.0.
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
New in version 1.3.0.
-
write
()¶ Returns an MLWriter instance for this ML instance.
-
-
class
pyspark.ml.feature.
HashingTF
(numFeatures=262144, binary=False, inputCol=None, outputCol=None)[source]¶ Maps a sequence of terms to their term frequencies using the hashing trick. Currently we use Austin Appleby’s MurmurHash 3 algorithm (MurmurHash3_x86_32) to calculate the hash code value for the term object. Since a simple modulo is used to transform the hash code into a column index, it is advisable to use a power of two as the numFeatures parameter; otherwise the features will not be mapped evenly to the columns.
>>> df = spark.createDataFrame([(["a", "b", "c"],)], ["words"]) >>> hashingTF = HashingTF(numFeatures=10, inputCol="words", outputCol="features") >>> hashingTF.transform(df).head().features SparseVector(10, {0: 1.0, 1: 1.0, 2: 1.0}) >>> hashingTF.setParams(outputCol="freqs").transform(df).head().freqs SparseVector(10, {0: 1.0, 1: 1.0, 2: 1.0}) >>> params = {hashingTF.numFeatures: 5, hashingTF.outputCol: "vector"} >>> hashingTF.transform(df, params).head().vector SparseVector(5, {0: 1.0, 1: 1.0, 2: 1.0}) >>> hashingTFPath = temp_path + "/hashing-tf" >>> hashingTF.save(hashingTFPath) >>> loadedHashingTF = HashingTF.load(hashingTFPath) >>> loadedHashingTF.getNumFeatures() == hashingTF.getNumFeatures() True
New in version 1.3.0.
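The indexing idea behind the hashing trick can be sketched in plain Python. The helper below is illustrative only: it uses Python's built-in hash() rather than the MurmurHash3_x86_32 function Spark applies, so its column indices will not match HashingTF's output.
from collections import Counter

def hashed_term_frequencies(terms, num_features=16, binary=False):
    # bucket each term into one of num_features columns via hash-and-modulo
    counts = Counter(hash(term) % num_features for term in terms)
    # with binary=True every non-zero count is capped at 1.0, mirroring the binary param
    return {index: (1.0 if binary else float(count)) for index, count in counts.items()}

hashed_term_frequencies(["a", "b", "a"], num_features=16)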
-
binary
= Param(parent='undefined', name='binary', doc='If True, all non zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts. Default False.')¶
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: extra – Extra parameters to copy to the new instance Returns: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: extra – extra param values Returns: merged param map
-
getInputCol
()¶ Gets the value of inputCol or its default value.
-
getNumFeatures
()¶ Gets the value of numFeatures or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getOutputCol
()¶ Gets the value of outputCol or its default value.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
inputCol
= Param(parent='undefined', name='inputCol', doc='input column name.')¶
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
load
(path)¶ Reads an ML instance from the input path, a shortcut of read().load(path).
-
numFeatures
= Param(parent='undefined', name='numFeatures', doc='number of features.')¶
-
outputCol
= Param(parent='undefined', name='outputCol', doc='output column name.')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of typeParam
.
-
read
()¶ Returns an MLReader instance for this class.
-
save
(path)¶ Save this ML instance to the given path, a shortcut of write().save(path).
-
setNumFeatures
(value)¶ Sets the value of
numFeatures
.
-
setParams
(self, numFeatures=1 << 18, binary=False, inputCol=None, outputCol=None)[source]¶ Sets params for this HashingTF.
New in version 1.3.0.
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
New in version 1.3.0.
-
write
()¶ Returns an MLWriter instance for this ML instance.
-
-
class
pyspark.ml.feature.
IDF
(minDocFreq=0, inputCol=None, outputCol=None)[source]¶ Compute the Inverse Document Frequency (IDF) given a collection of documents.
>>> from pyspark.ml.linalg import DenseVector >>> df = spark.createDataFrame([(DenseVector([1.0, 2.0]),), ... (DenseVector([0.0, 1.0]),), (DenseVector([3.0, 0.2]),)], ["tf"]) >>> idf = IDF(minDocFreq=3, inputCol="tf", outputCol="idf") >>> model = idf.fit(df) >>> model.idf DenseVector([0.0, 0.0]) >>> model.transform(df).head().idf DenseVector([0.0, 0.0]) >>> idf.setParams(outputCol="freqs").fit(df).transform(df).collect()[1].freqs DenseVector([0.0, 0.0]) >>> params = {idf.minDocFreq: 1, idf.outputCol: "vector"} >>> idf.fit(df, params).transform(df).head().vector DenseVector([0.2877, 0.0]) >>> idfPath = temp_path + "/idf" >>> idf.save(idfPath) >>> loadedIdf = IDF.load(idfPath) >>> loadedIdf.getMinDocFreq() == idf.getMinDocFreq() True >>> modelPath = temp_path + "/idf-model" >>> model.save(modelPath) >>> loadedModel = IDFModel.load(modelPath) >>> loadedModel.transform(df).head().idf == model.transform(df).head().idf True
New in version 1.4.0.
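IDF is typically fitted on the output of a term-frequency transformer such as HashingTF. A minimal TF-IDF sketch, assuming the same spark session as the examples above and arbitrary column names:
from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF, IDF

docs = spark.createDataFrame([(["a", "b", "c"],), (["a", "a", "d"],)], ["words"])
tf = HashingTF(numFeatures=16, inputCol="words", outputCol="tf")
idf = IDF(inputCol="tf", outputCol="tfidf")
tfidf_model = Pipeline(stages=[tf, idf]).fit(docs)
tfidf_model.transform(docs).select("tfidf").show(truncate=False)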
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: extra – Extra parameters to copy to the new instance Returns: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: extra – extra param values Returns: merged param map
-
fit
(dataset, params=None)¶ Fits a model to the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
Returns: fitted model(s)
New in version 1.3.0.
-
getInputCol
()¶ Gets the value of inputCol or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getOutputCol
()¶ Gets the value of outputCol or its default value.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
inputCol
= Param(parent='undefined', name='inputCol', doc='input column name.')¶
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
load
(path)¶ Reads an ML instance from the input path, a shortcut of read().load(path).
-
minDocFreq
= Param(parent='undefined', name='minDocFreq', doc='minimum number of documents in which a term should appear for filtering')¶
-
outputCol
= Param(parent='undefined', name='outputCol', doc='output column name.')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of typeParam
.
-
read
()¶ Returns an MLReader instance for this class.
-
save
(path)¶ Save this ML instance to the given path, a shortcut of write().save(path).
-
setMinDocFreq
(value)[source]¶ Sets the value of
minDocFreq
.New in version 1.4.0.
-
setParams
(self, minDocFreq=0, inputCol=None, outputCol=None)[source]¶ Sets params for this IDF.
New in version 1.4.0.
-
write
()¶ Returns an MLWriter instance for this ML instance.
-
-
class
pyspark.ml.feature.
IDFModel
(java_model=None)[source]¶ Model fitted by
IDF
.New in version 1.4.0.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: extra – Extra parameters to copy to the new instance Returns: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: extra – extra param values Returns: merged param map
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
idf
¶ Returns the IDF vector.
New in version 2.0.0.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
load
(path)¶ Reads an ML instance from the input path, a shortcut of read().load(path).
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of typeParam
.
-
read
()¶ Returns an MLReader instance for this class.
-
save
(path)¶ Save this ML instance to the given path, a shortcut of write().save(path).
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
New in version 1.3.0.
-
write
()¶ Returns an MLWriter instance for this ML instance.
-
-
class
pyspark.ml.feature.
Imputer
(strategy='mean', missingValue=nan, inputCols=None, outputCols=None)[source]¶ Note
Experimental
Imputation estimator for completing missing values, either using the mean or the median of the columns in which the missing values are located. The input columns should be of DoubleType or FloatType. Currently Imputer does not support categorical features and possibly creates incorrect values for a categorical feature.
Note that the mean/median value is computed after filtering out missing values. All Null values in the input columns are treated as missing, and so are also imputed. For computing median,
pyspark.sql.DataFrame.approxQuantile()
is used with a relative error of 0.001.>>> df = spark.createDataFrame([(1.0, float("nan")), (2.0, float("nan")), (float("nan"), 3.0), ... (4.0, 4.0), (5.0, 5.0)], ["a", "b"]) >>> imputer = Imputer(inputCols=["a", "b"], outputCols=["out_a", "out_b"]) >>> model = imputer.fit(df) >>> model.surrogateDF.show() +---+---+ | a| b| +---+---+ |3.0|4.0| +---+---+ ... >>> model.transform(df).show() +---+---+-----+-----+ | a| b|out_a|out_b| +---+---+-----+-----+ |1.0|NaN| 1.0| 4.0| |2.0|NaN| 2.0| 4.0| |NaN|3.0| 3.0| 3.0| ... >>> imputer.setStrategy("median").setMissingValue(1.0).fit(df).transform(df).show() +---+---+-----+-----+ | a| b|out_a|out_b| +---+---+-----+-----+ |1.0|NaN| 4.0| NaN| ... >>> imputerPath = temp_path + "/imputer" >>> imputer.save(imputerPath) >>> loadedImputer = Imputer.load(imputerPath) >>> loadedImputer.getStrategy() == imputer.getStrategy() True >>> loadedImputer.getMissingValue() 1.0 >>> modelPath = temp_path + "/imputer-model" >>> model.save(modelPath) >>> loadedModel = ImputerModel.load(modelPath) >>> loadedModel.transform(df).head().out_a == model.transform(df).head().out_a True
New in version 2.2.0.
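Since Null values are treated as missing as well, a None entry in a DoubleType column is imputed the same way as NaN. A minimal sketch, assuming the same spark session as above:
df_with_null = spark.createDataFrame([(1.0,), (None,), (3.0,)], ["a"])
model = Imputer(inputCols=["a"], outputCols=["out_a"], strategy="mean").fit(df_with_null)
model.transform(df_with_null).show()  # the null row receives the mean of 1.0 and 3.0, i.e. 2.0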
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: extra – Extra parameters to copy to the new instance Returns: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: extra – extra param values Returns: merged param map
-
fit
(dataset, params=None)¶ Fits a model to the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
Returns: fitted model(s)
New in version 1.3.0.
-
getInputCols
()¶ Gets the value of inputCols or its default value.
-
getMissingValue
()[source]¶ Gets the value of
missingValue
or its default value.New in version 2.2.0.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getOutputCols
()[source]¶ Gets the value of
outputCols
or its default value.New in version 2.2.0.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
inputCols
= Param(parent='undefined', name='inputCols', doc='input column names.')¶
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
load
(path)¶ Reads an ML instance from the input path, a shortcut of read().load(path).
-
missingValue
= Param(parent='undefined', name='missingValue', doc='The placeholder for the missing values. All occurrences of missingValue will be imputed.')¶
-
outputCols
= Param(parent='undefined', name='outputCols', doc='output column names.')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of typeParam
.
-
read
()¶ Returns an MLReader instance for this class.
-
save
(path)¶ Save this ML instance to the given path, a shortcut of write().save(path).
-
setMissingValue
(value)[source]¶ Sets the value of
missingValue
.New in version 2.2.0.
-
setOutputCols
(value)[source]¶ Sets the value of
outputCols
.New in version 2.2.0.
-
setParams
(self, strategy="mean", missingValue=float("nan"), inputCols=None, outputCols=None)[source]¶ Sets params for this Imputer.
New in version 2.2.0.
-
strategy
= Param(parent='undefined', name='strategy', doc='strategy for imputation. If mean, then replace missing values using the mean value of the feature. If median, then replace missing values using the median value of the feature.')¶
-
write
()¶ Returns an MLWriter instance for this ML instance.
-
-
class
pyspark.ml.feature.
ImputerModel
(java_model=None)[source]¶ Note
Experimental
Model fitted by
Imputer
.New in version 2.2.0.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: extra – Extra parameters to copy to the new instance Returns: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: extra – extra param values Returns: merged param map
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
load
(path)¶ Reads an ML instance from the input path, a shortcut of read().load(path).
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of typeParam
.
-
read
()¶ Returns an MLReader instance for this class.
-
save
(path)¶ Save this ML instance to the given path, a shortcut of write().save(path).
-
surrogateDF
¶ Returns a DataFrame containing inputCols and their corresponding surrogates, which are used to replace the missing values in the input DataFrame.
New in version 2.2.0.
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
New in version 1.3.0.
-
write
()¶ Returns an MLWriter instance for this ML instance.
-
-
class
pyspark.ml.feature.
IndexToString
(inputCol=None, outputCol=None, labels=None)[source]¶ A
Transformer
that maps a column of indices back to a new column of corresponding string values. The index-string mapping is either from the ML attributes of the input column, or from user-supplied labels (which take precedence over ML attributes). SeeStringIndexer
for converting strings into indices.New in version 1.6.0.
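A minimal sketch with user-supplied labels (hypothetical column names; when labels is set it takes precedence over any ML attributes on the input column):
df = spark.createDataFrame([(0.0,), (1.0,), (2.0,)], ["categoryIndex"])
converter = IndexToString(inputCol="categoryIndex", outputCol="category", labels=["a", "b", "c"])
converter.transform(df).show()  # 0.0 -> a, 1.0 -> b, 2.0 -> c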
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: extra – Extra parameters to copy to the new instance Returns: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: extra – extra param values Returns: merged param map
-
getInputCol
()¶ Gets the value of inputCol or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getOutputCol
()¶ Gets the value of outputCol or its default value.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
inputCol
= Param(parent='undefined', name='inputCol', doc='input column name.')¶
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
labels
= Param(parent='undefined', name='labels', doc='Optional array of labels specifying index-string mapping. If not provided or if empty, then metadata from inputCol is used instead.')¶
-
load
(path)¶ Reads an ML instance from the input path, a shortcut of read().load(path).
-
outputCol
= Param(parent='undefined', name='outputCol', doc='output column name.')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of typeParam
.
-
read
()¶ Returns an MLReader instance for this class.
-
save
(path)¶ Save this ML instance to the given path, a shortcut of write().save(path).
-
setParams
(self, inputCol=None, outputCol=None, labels=None)[source]¶ Sets params for this IndexToString.
New in version 1.6.0.
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
New in version 1.3.0.
-
write
()¶ Returns an MLWriter instance for this ML instance.
-
-
class
pyspark.ml.feature.
MaxAbsScaler
(inputCol=None, outputCol=None)[source]¶ Rescale each feature individually to the range [-1, 1] by dividing by the largest maximum absolute value in each feature. It does not shift/center the data, and thus does not destroy any sparsity.
>>> from pyspark.ml.linalg import Vectors >>> df = spark.createDataFrame([(Vectors.dense([1.0]),), (Vectors.dense([2.0]),)], ["a"]) >>> maScaler = MaxAbsScaler(inputCol="a", outputCol="scaled") >>> model = maScaler.fit(df) >>> model.transform(df).show() +-----+------+ | a|scaled| +-----+------+ |[1.0]| [0.5]| |[2.0]| [1.0]| +-----+------+ ... >>> scalerPath = temp_path + "/max-abs-scaler" >>> maScaler.save(scalerPath) >>> loadedMAScaler = MaxAbsScaler.load(scalerPath) >>> loadedMAScaler.getInputCol() == maScaler.getInputCol() True >>> loadedMAScaler.getOutputCol() == maScaler.getOutputCol() True >>> modelPath = temp_path + "/max-abs-scaler-model" >>> model.save(modelPath) >>> loadedModel = MaxAbsScalerModel.load(modelPath) >>> loadedModel.maxAbs == model.maxAbs True
New in version 2.0.0.
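In the example above the largest absolute value in column a is 2.0, so the values 1.0 and 2.0 are rescaled to 1.0/2.0 = 0.5 and 2.0/2.0 = 1.0.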
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: extra – Extra parameters to copy to the new instance Returns: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: extra – extra param values Returns: merged param map
-
fit
(dataset, params=None)¶ Fits a model to the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
Returns: fitted model(s)
New in version 1.3.0.
-
getInputCol
()¶ Gets the value of inputCol or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getOutputCol
()¶ Gets the value of outputCol or its default value.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
inputCol
= Param(parent='undefined', name='inputCol', doc='input column name.')¶
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
load
(path)¶ Reads an ML instance from the input path, a shortcut of read().load(path).
-
outputCol
= Param(parent='undefined', name='outputCol', doc='output column name.')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of typeParam
.
-
read
()¶ Returns an MLReader instance for this class.
-
save
(path)¶ Save this ML instance to the given path, a shortcut of write().save(path).
-
setParams
(self, inputCol=None, outputCol=None)[source]¶ Sets params for this MaxAbsScaler.
New in version 2.0.0.
-
write
()¶ Returns an MLWriter instance for this ML instance.
-
-
class
pyspark.ml.feature.
MaxAbsScalerModel
(java_model=None)[source]¶ Model fitted by
MaxAbsScaler
.New in version 2.0.0.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: extra – Extra parameters to copy to the new instance Returns: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: extra – extra param values Returns: merged param map
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
load
(path)¶ Reads an ML instance from the input path, a shortcut of read().load(path).
-
maxAbs
¶ Max Abs vector.
New in version 2.0.0.
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of typeParam
.
-
read
()¶ Returns an MLReader instance for this class.
-
save
(path)¶ Save this ML instance to the given path, a shortcut of write().save(path).
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
New in version 1.3.0.
-
write
()¶ Returns an MLWriter instance for this ML instance.
-
-
class
pyspark.ml.feature.
MinHashLSH
(inputCol=None, outputCol=None, seed=None, numHashTables=1)[source]¶ Note
Experimental
LSH class for Jaccard distance. The input can be dense or sparse vectors, but it is more efficient if it is sparse. For example, Vectors.sparse(10, [(2, 1.0), (3, 1.0), (5, 1.0)]) means there are 10 elements in the space. This set contains elements 2, 3, and 5. Also, any input vector must have at least 1 non-zero index, and all non-zero values are treated as binary “1” values.
See also
>>> from pyspark.ml.linalg import Vectors >>> from pyspark.sql.functions import col >>> data = [(0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0]),), ... (1, Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0]),), ... (2, Vectors.sparse(6, [0, 2, 4], [1.0, 1.0, 1.0]),)] >>> df = spark.createDataFrame(data, ["id", "features"]) >>> mh = MinHashLSH(inputCol="features", outputCol="hashes", seed=12345) >>> model = mh.fit(df) >>> model.transform(df).head() Row(id=0, features=SparseVector(6, {0: 1.0, 1: 1.0, 2: 1.0}), hashes=[DenseVector([-1638925... >>> data2 = [(3, Vectors.sparse(6, [1, 3, 5], [1.0, 1.0, 1.0]),), ... (4, Vectors.sparse(6, [2, 3, 5], [1.0, 1.0, 1.0]),), ... (5, Vectors.sparse(6, [1, 2, 4], [1.0, 1.0, 1.0]),)] >>> df2 = spark.createDataFrame(data2, ["id", "features"]) >>> key = Vectors.sparse(6, [1, 2], [1.0, 1.0]) >>> model.approxNearestNeighbors(df2, key, 1).collect() [Row(id=5, features=SparseVector(6, {1: 1.0, 2: 1.0, 4: 1.0}), hashes=[DenseVector([-163892... >>> model.approxSimilarityJoin(df, df2, 0.6, distCol="JaccardDistance").select( ... col("datasetA.id").alias("idA"), ... col("datasetB.id").alias("idB"), ... col("JaccardDistance")).show() +---+---+---------------+ |idA|idB|JaccardDistance| +---+---+---------------+ | 1| 4| 0.5| | 0| 5| 0.5| +---+---+---------------+ ... >>> mhPath = temp_path + "/mh" >>> mh.save(mhPath) >>> mh2 = MinHashLSH.load(mhPath) >>> mh2.getOutputCol() == mh.getOutputCol() True >>> modelPath = temp_path + "/mh-model" >>> model.save(modelPath) >>> model2 = MinHashLSHModel.load(modelPath) >>> model.transform(df).head().hashes == model2.transform(df).head().hashes True
New in version 2.2.0.
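As a concrete reading of the similarity join above: row 1 of df has non-zero indices {2, 3, 4} and row 4 of df2 has {2, 3, 5}; the two sets share 2 of their 4 distinct indices, so their Jaccard distance is 1 - 2/4 = 0.5.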
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: extra – Extra parameters to copy to the new instance Returns: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: extra – extra param values Returns: merged param map
-
fit
(dataset, params=None)¶ Fits a model to the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
Returns: fitted model(s)
New in version 1.3.0.
-
getInputCol
()¶ Gets the value of inputCol or its default value.
-
getNumHashTables
()¶ Gets the value of numHashTables or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getOutputCol
()¶ Gets the value of outputCol or its default value.
-
getParam
(paramName)¶ Gets a param by its name.
-
getSeed
()¶ Gets the value of seed or its default value.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
inputCol
= Param(parent='undefined', name='inputCol', doc='input column name.')¶
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
load
(path)¶ Reads an ML instance from the input path, a shortcut of read().load(path).
-
numHashTables
= Param(parent='undefined', name='numHashTables', doc='number of hash tables, where increasing number of hash tables lowers the false negative rate, and decreasing it improves the running performance.')¶
-
outputCol
= Param(parent='undefined', name='outputCol', doc='output column name.')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of typeParam
.
-
read
()¶ Returns an MLReader instance for this class.
-
save
(path)¶ Save this ML instance to the given path, a shortcut of write().save(path).
-
seed
= Param(parent='undefined', name='seed', doc='random seed.')¶
-
setNumHashTables
(value)¶ Sets the value of
numHashTables
.
-
setParams
(self, inputCol=None, outputCol=None, seed=None, numHashTables=1)[source]¶ Sets params for this MinHashLSH.
New in version 2.2.0.
-
write
()¶ Returns an MLWriter instance for this ML instance.
-
-
class
pyspark.ml.feature.
MinHashLSHModel
(java_model=None)[source]¶ Note
Experimental
Model produced by
MinHashLSH
, where multiple hash functions are stored. Each hash function is picked from the following family of hash functions, where \(a_i\) and \(b_i\) are randomly chosen integers less than prime: \(h_i(x) = ((x \cdot a_i + b_i) \mod prime)\). This hash family is approximately min-wise independent, according to the reference.See also
Tom Bohman, Colin Cooper, and Alan Frieze. “Min-wise independent linear permutations.” Electronic Journal of Combinatorics 7 (2000): R26.
New in version 2.2.0.
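The construction can be sketched in plain Python. The constants below are hypothetical (Spark draws \(a_i\) and \(b_i\) at random below a large prime), and it is assumed, as in the usual MinHash scheme, that a vector's hash value under one function is the minimum of \(h_i(x)\) over its non-zero indices:
prime = 2147483647            # 2^31 - 1, a Mersenne prime, used purely for illustration
a_i, b_i = 12345, 67890       # hypothetical random coefficients of one hash function
non_zero_indices = [0, 1, 2]  # positions of the non-zero entries of an input vector
min_hash = min((x * a_i + b_i) % prime for x in non_zero_indices)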
-
approxNearestNeighbors
(dataset, key, numNearestNeighbors, distCol='distCol')¶ Given a large dataset and an item, approximately find at most k items which have the closest distance to the item. If the
outputCol
is missing, the method will transform the data; if theoutputCol
exists, it will use that. This allows caching of the transformed data when necessary.Note
This method is experimental and will likely change behavior in the next release.
Parameters: - dataset – The dataset to search for nearest neighbors of the key.
- key – Feature vector representing the item to search for.
- numNearestNeighbors – The maximum number of nearest neighbors.
- distCol – Output column for storing the distance between each result row and the key. Use “distCol” as default value if it’s not specified.
Returns: A dataset containing at most k items closest to the key. A column “distCol” is added to show the distance between each row and the key.
-
approxSimilarityJoin
(datasetA, datasetB, threshold, distCol='distCol')¶ Join two datasets to approximately find all pairs of rows whose distance are smaller than the threshold. If the
outputCol
is missing, the method will transform the data; if theoutputCol
exists, it will use that. This allows caching of the transformed data when necessary.Parameters: - datasetA – One of the datasets to join.
- datasetB – Another dataset to join.
- threshold – The threshold for the distance of row pairs.
- distCol – Output column for storing the distance between each pair of rows. Use “distCol” as default value if it’s not specified.
Returns: A joined dataset containing pairs of rows. The original rows are in columns “datasetA” and “datasetB”, and a column “distCol” is added to show the distance between each pair.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: extra – Extra parameters to copy to the new instance Returns: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: extra – extra param values Returns: merged param map
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
load
(path)¶ Reads an ML instance from the input path, a shortcut of read().load(path).
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of typeParam
.
-
read
()¶ Returns an MLReader instance for this class.
-
save
(path)¶ Save this ML instance to the given path, a shortcut of write().save(path).
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
New in version 1.3.0.
-
write
()¶ Returns an MLWriter instance for this ML instance.
-
-
class
pyspark.ml.feature.
MinMaxScaler
(min=0.0, max=1.0, inputCol=None, outputCol=None)[source]¶ Rescale each feature individually to a common range [min, max] linearly using column summary statistics, which is also known as min-max normalization or Rescaling. The rescaled value for feature E is calculated as,
Rescaled(e_i) = (e_i - E_min) / (E_max - E_min) * (max - min) + min
For the case E_max == E_min, Rescaled(e_i) = 0.5 * (max + min)
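For instance, with the defaults min=0.0 and max=1.0 and a column containing the values 0.0 and 2.0 (so E_min = 0.0 and E_max = 2.0), the value 2.0 is rescaled to (2.0 - 0.0) / (2.0 - 0.0) * (1.0 - 0.0) + 0.0 = 1.0, as in the example below.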
Note
Since zero values will probably be transformed to non-zero values, output of the transformer will be DenseVector even for sparse input.
>>> from pyspark.ml.linalg import Vectors >>> df = spark.createDataFrame([(Vectors.dense([0.0]),), (Vectors.dense([2.0]),)], ["a"]) >>> mmScaler = MinMaxScaler(inputCol="a", outputCol="scaled") >>> model = mmScaler.fit(df) >>> model.originalMin DenseVector([0.0]) >>> model.originalMax DenseVector([2.0]) >>> model.transform(df).show() +-----+------+ | a|scaled| +-----+------+ |[0.0]| [0.0]| |[2.0]| [1.0]| +-----+------+ ... >>> minMaxScalerPath = temp_path + "/min-max-scaler" >>> mmScaler.save(minMaxScalerPath) >>> loadedMMScaler = MinMaxScaler.load(minMaxScalerPath) >>> loadedMMScaler.getMin() == mmScaler.getMin() True >>> loadedMMScaler.getMax() == mmScaler.getMax() True >>> modelPath = temp_path + "/min-max-scaler-model" >>> model.save(modelPath) >>> loadedModel = MinMaxScalerModel.load(modelPath) >>> loadedModel.originalMin == model.originalMin True >>> loadedModel.originalMax == model.originalMax True
New in version 1.6.0.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: extra – Extra parameters to copy to the new instance Returns: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: extra – extra param values Returns: merged param map
-
fit
(dataset, params=None)¶ Fits a model to the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
Returns: fitted model(s)
New in version 1.3.0.
-
getInputCol
()¶ Gets the value of inputCol or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getOutputCol
()¶ Gets the value of outputCol or its default value.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
inputCol
= Param(parent='undefined', name='inputCol', doc='input column name.')¶
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
load
(path)¶ Reads an ML instance from the input path, a shortcut of read().load(path).
-
max
= Param(parent='undefined', name='max', doc='Upper bound of the output feature range')¶
-
min
= Param(parent='undefined', name='min', doc='Lower bound of the output feature range')¶
-
outputCol
= Param(parent='undefined', name='outputCol', doc='output column name.')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of typeParam
.
-
read
()¶ Returns an MLReader instance for this class.
-
save
(path)¶ Save this ML instance to the given path, a shortcut of write().save(path).
-
setParams
(self, min=0.0, max=1.0, inputCol=None, outputCol=None)[source]¶ Sets params for this MinMaxScaler.
New in version 1.6.0.
-
write
()¶ Returns an MLWriter instance for this ML instance.
-
-
class
pyspark.ml.feature.
MinMaxScalerModel
(java_model=None)[source]¶ Model fitted by
MinMaxScaler
.New in version 1.6.0.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: extra – Extra parameters to copy to the new instance Returns: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: extra – extra param values Returns: merged param map
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
load
(path)¶ Reads an ML instance from the input path, a shortcut of read().load(path).
-
originalMax
¶ Max value for each original column during fitting.
New in version 2.0.0.
-
originalMin
¶ Min value for each original column during fitting.
New in version 2.0.0.
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of typeParam
.
-
read
()¶ Returns an MLReader instance for this class.
-
save
(path)¶ Save this ML instance to the given path, a shortcut of write().save(path).
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
New in version 1.3.0.
-
write
()¶ Returns an MLWriter instance for this ML instance.
-
-
class
pyspark.ml.feature.
NGram
(n=2, inputCol=None, outputCol=None)[source]¶ A feature transformer that converts the input array of strings into an array of n-grams. Null values in the input array are ignored. It returns an array of n-grams where each n-gram is represented by a space-separated string of words. When the input is empty, an empty array is returned. When the input array length is less than n (number of elements per n-gram), no n-grams are returned.
>>> df = spark.createDataFrame([Row(inputTokens=["a", "b", "c", "d", "e"])]) >>> ngram = NGram(n=2, inputCol="inputTokens", outputCol="nGrams") >>> ngram.transform(df).head() Row(inputTokens=['a', 'b', 'c', 'd', 'e'], nGrams=['a b', 'b c', 'c d', 'd e']) >>> # Change n-gram length >>> ngram.setParams(n=4).transform(df).head() Row(inputTokens=['a', 'b', 'c', 'd', 'e'], nGrams=['a b c d', 'b c d e']) >>> # Temporarily modify output column. >>> ngram.transform(df, {ngram.outputCol: "output"}).head() Row(inputTokens=['a', 'b', 'c', 'd', 'e'], output=['a b c d', 'b c d e']) >>> ngram.transform(df).head() Row(inputTokens=['a', 'b', 'c', 'd', 'e'], nGrams=['a b c d', 'b c d e']) >>> # Must use keyword arguments to specify params. >>> ngram.setParams("text") Traceback (most recent call last): ... TypeError: Method setParams forces keyword arguments. >>> ngramPath = temp_path + "/ngram" >>> ngram.save(ngramPath) >>> loadedNGram = NGram.load(ngramPath) >>> loadedNGram.getN() == ngram.getN() True
New in version 1.5.0.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: extra – Extra parameters to copy to the new instance Returns: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: extra – extra param values Returns: merged param map
-
getInputCol
()¶ Gets the value of inputCol or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getOutputCol
()¶ Gets the value of outputCol or its default value.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
inputCol
= Param(parent='undefined', name='inputCol', doc='input column name.')¶
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
load
(path)¶ Reads an ML instance from the input path, a shortcut of read().load(path).
-
n
= Param(parent='undefined', name='n', doc='number of elements per n-gram (>=1)')¶
-
outputCol
= Param(parent='undefined', name='outputCol', doc='output column name.')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of typeParam
.
-
read
()¶ Returns an MLReader instance for this class.
-
save
(path)¶ Save this ML instance to the given path, a shortcut of write().save(path).
-
setParams
(self, n=2, inputCol=None, outputCol=None)[source]¶ Sets params for this NGram.
New in version 1.5.0.
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
New in version 1.3.0.
-
write
()¶ Returns an MLWriter instance for this ML instance.
-
-
class
pyspark.ml.feature.
Normalizer
(p=2.0, inputCol=None, outputCol=None)[source]¶ Normalize a vector to have unit norm using the given p-norm.
>>> from pyspark.ml.linalg import Vectors >>> svec = Vectors.sparse(4, {1: 4.0, 3: 3.0}) >>> df = spark.createDataFrame([(Vectors.dense([3.0, -4.0]), svec)], ["dense", "sparse"]) >>> normalizer = Normalizer(p=2.0, inputCol="dense", outputCol="features") >>> normalizer.transform(df).head().features DenseVector([0.6, -0.8]) >>> normalizer.setParams(inputCol="sparse", outputCol="freqs").transform(df).head().freqs SparseVector(4, {1: 0.8, 3: 0.6}) >>> params = {normalizer.p: 1.0, normalizer.inputCol: "dense", normalizer.outputCol: "vector"} >>> normalizer.transform(df, params).head().vector DenseVector([0.4286, -0.5714]) >>> normalizerPath = temp_path + "/normalizer" >>> normalizer.save(normalizerPath) >>> loadedNormalizer = Normalizer.load(normalizerPath) >>> loadedNormalizer.getP() == normalizer.getP() True
New in version 1.4.0.
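In the example above the dense vector [3.0, -4.0] has 2-norm sqrt(3.0^2 + 4.0^2) = 5.0, giving [0.6, -0.8]; with p = 1.0 the norm is |3.0| + |-4.0| = 7.0, giving approximately [0.4286, -0.5714].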
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: extra – Extra parameters to copy to the new instance Returns: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: extra – extra param values Returns: merged param map
-
getInputCol
()¶ Gets the value of inputCol or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getOutputCol
()¶ Gets the value of outputCol or its default value.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
inputCol
= Param(parent='undefined', name='inputCol', doc='input column name.')¶
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
load
(path)¶ Reads an ML instance from the input path, a shortcut of read().load(path).
-
outputCol
= Param(parent='undefined', name='outputCol', doc='output column name.')¶
-
p
= Param(parent='undefined', name='p', doc='the p norm value.')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of typeParam
.
-
read
()¶ Returns an MLReader instance for this class.
-
save
(path)¶ Save this ML instance to the given path, a shortcut of write().save(path).
-
setParams
(self, p=2.0, inputCol=None, outputCol=None)[source]¶ Sets params for this Normalizer.
New in version 1.4.0.
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
New in version 1.3.0.
-
write
()¶ Returns an MLWriter instance for this ML instance.
-
-
class
pyspark.ml.feature.
OneHotEncoder
(dropLast=True, inputCol=None, outputCol=None)[source]¶ A one-hot encoder that maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index. For example with 5 categories, an input value of 2.0 would map to an output vector of [0.0, 0.0, 1.0, 0.0]. The last category is not included by default (configurable via
dropLast
) because it makes the vector entries sum up to one, and hence linearly dependent. So an input value of 4.0 maps to [0.0, 0.0, 0.0, 0.0].Note
This is different from scikit-learn’s OneHotEncoder, which keeps all categories. The output vectors are sparse.
See also
StringIndexer
for converting categorical values into category indices>>> stringIndexer = StringIndexer(inputCol="label", outputCol="indexed") >>> model = stringIndexer.fit(stringIndDf) >>> td = model.transform(stringIndDf) >>> encoder = OneHotEncoder(inputCol="indexed", outputCol="features") >>> encoder.transform(td).head().features SparseVector(2, {0: 1.0}) >>> encoder.setParams(outputCol="freqs").transform(td).head().freqs SparseVector(2, {0: 1.0}) >>> params = {encoder.dropLast: False, encoder.outputCol: "test"} >>> encoder.transform(td, params).head().test SparseVector(3, {0: 1.0}) >>> onehotEncoderPath = temp_path + "/onehot-encoder" >>> encoder.save(onehotEncoderPath) >>> loadedEncoder = OneHotEncoder.load(onehotEncoderPath) >>> loadedEncoder.getDropLast() == encoder.getDropLast() True
New in version 1.4.0.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: extra – Extra parameters to copy to the new instance Returns: Copy of this instance
-
dropLast
= Param(parent='undefined', name='dropLast', doc='whether to drop the last category')¶
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: extra – extra param values Returns: merged param map
-
getInputCol
()¶ Gets the value of inputCol or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getOutputCol
()¶ Gets the value of outputCol or its default value.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
inputCol
= Param(parent='undefined', name='inputCol', doc='input column name.')¶
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
load
(path)¶ Reads an ML instance from the input path, a shortcut of read().load(path).
-
outputCol
= Param(parent='undefined', name='outputCol', doc='output column name.')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param.
-
read
()¶ Returns an MLReader instance for this class.
-
save
(path)¶ Save this ML instance to the given path, a shortcut of write().save(path).
-
setParams
(self, dropLast=True, inputCol=None, outputCol=None)[source]¶ Sets params for this OneHotEncoder.
New in version 1.4.0.
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
New in version 1.3.0.
- dataset – input dataset, which is an instance of
-
write
()¶ Returns an MLWriter instance for this ML instance.
-
-
class
pyspark.ml.feature.
PCA
(k=None, inputCol=None, outputCol=None)[source]¶ PCA trains a model to project vectors to a lower dimensional space of the top
k
principal components.
>>> from pyspark.ml.linalg import Vectors
>>> data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),
...     (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
...     (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
>>> df = spark.createDataFrame(data, ["features"])
>>> pca = PCA(k=2, inputCol="features", outputCol="pca_features")
>>> model = pca.fit(df)
>>> model.transform(df).collect()[0].pca_features
DenseVector([1.648..., -4.013...])
>>> model.explainedVariance
DenseVector([0.794..., 0.205...])
>>> pcaPath = temp_path + "/pca"
>>> pca.save(pcaPath)
>>> loadedPca = PCA.load(pcaPath)
>>> loadedPca.getK() == pca.getK()
True
>>> modelPath = temp_path + "/pca-model"
>>> model.save(modelPath)
>>> loadedModel = PCAModel.load(modelPath)
>>> loadedModel.pc == model.pc
True
>>> loadedModel.explainedVariance == model.explainedVariance
True
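As a rough sanity check of what the projection does, the transformed vector can be reproduced by hand. This is an illustrative sketch only, assuming the projection is a plain matrix product of the input features with the model's principal-components matrix (no re-centering); it reuses the df and model from the example above and numpy:

    import numpy as np

    pc = model.pc.toArray()              # shape: (numFeatures, k)
    row = df.head().features.toArray()   # first input vector as a numpy array
    np.allclose(row.dot(pc),
                model.transform(df).head().pca_features.toArray())  # expected: True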
New in version 1.5.0.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then make a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: extra – Extra parameters to copy to the new instance Returns: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: extra – extra param values Returns: merged param map
-
fit
(dataset, params=None)¶ Fits a model to the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
Returns: fitted model(s)
New in version 1.3.0.
- dataset – input dataset, which is an instance of
-
getInputCol
()¶ Gets the value of inputCol or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getOutputCol
()¶ Gets the value of outputCol or its default value.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
inputCol
= Param(parent='undefined', name='inputCol', doc='input column name.')¶
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
k
= Param(parent='undefined', name='k', doc='the number of principal components')¶
-
load
(path)¶ Reads an ML instance from the input path, a shortcut of read().load(path).
-
outputCol
= Param(parent='undefined', name='outputCol', doc='output column name.')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param.
-
read
()¶ Returns an MLReader instance for this class.
-
save
(path)¶ Save this ML instance to the given path, a shortcut of write().save(path).
-
setParams
(self, k=None, inputCol=None, outputCol=None)[source]¶ Set params for this PCA.
New in version 1.5.0.
-
write
()¶ Returns an MLWriter instance for this ML instance.
-
-
class
pyspark.ml.feature.
PCAModel
(java_model=None)[source]¶ Model fitted by
PCA
. Transforms vectors to a lower dimensional space.
New in version 1.5.0.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then make a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: extra – Extra parameters to copy to the new instance Returns: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
explainedVariance
¶ Returns a vector of proportions of variance explained by each principal component.
New in version 2.0.0.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: extra – extra param values Returns: merged param map
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
load
(path)¶ Reads an ML instance from the input path, a shortcut of read().load(path).
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param.
-
pc
¶ Returns a principal components Matrix. Each column is one principal component.
New in version 2.0.0.
-
read
()¶ Returns an MLReader instance for this class.
-
save
(path)¶ Save this ML instance to the given path, a shortcut of write().save(path).
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
New in version 1.3.0.
- dataset – input dataset, which is an instance of
-
write
()¶ Returns an MLWriter instance for this ML instance.
-
-
class
pyspark.ml.feature.
PolynomialExpansion
(degree=2, inputCol=None, outputCol=None)[source]¶ Perform feature expansion in a polynomial space. As described in the Wikipedia article on polynomial expansion, “In mathematics, an expansion of a product of sums expresses it as a sum of products by using the fact that multiplication distributes over addition”. Take a 2-variable feature vector (x, y) as an example: expanding it with degree 2 yields (x, x * x, y, x * y, y * y).
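Spelling out that ordering in plain Python makes the doctest output below easy to verify. This is an illustrative sketch only, not part of the Spark API:

    # Degree-2 expansion of (x, y), in the order the transformer emits it.
    x, y = 0.5, 2.0
    expanded = [x, x * x, y, x * y, y * y]
    print(expanded)  # [0.5, 0.25, 2.0, 1.0, 4.0] -- matches the DenseVector below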
>>> from pyspark.ml.linalg import Vectors
>>> df = spark.createDataFrame([(Vectors.dense([0.5, 2.0]),)], ["dense"])
>>> px = PolynomialExpansion(degree=2, inputCol="dense", outputCol="expanded")
>>> px.transform(df).head().expanded
DenseVector([0.5, 0.25, 2.0, 1.0, 4.0])
>>> px.setParams(outputCol="test").transform(df).head().test
DenseVector([0.5, 0.25, 2.0, 1.0, 4.0])
>>> polyExpansionPath = temp_path + "/poly-expansion"
>>> px.save(polyExpansionPath)
>>> loadedPx = PolynomialExpansion.load(polyExpansionPath)
>>> loadedPx.getDegree() == px.getDegree()
True
New in version 1.4.0.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then make a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: extra – Extra parameters to copy to the new instance Returns: Copy of this instance
-
degree
= Param(parent='undefined', name='degree', doc='the polynomial degree to expand (>= 1)')¶
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: extra – extra param values Returns: merged param map
-
getInputCol
()¶ Gets the value of inputCol or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getOutputCol
()¶ Gets the value of outputCol or its default value.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
inputCol
= Param(parent='undefined', name='inputCol', doc='input column name.')¶
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
load
(path)¶ Reads an ML instance from the input path, a shortcut of read().load(path).
-
outputCol
= Param(parent='undefined', name='outputCol', doc='output column name.')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param.
-
read
()¶ Returns an MLReader instance for this class.
-
save
(path)¶ Save this ML instance to the given path, a shortcut of write().save(path).
-
setParams
(self, degree=2, inputCol=None, outputCol=None)[source]¶ Sets params for this PolynomialExpansion.
New in version 1.4.0.
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
New in version 1.3.0.
- dataset – input dataset, which is an instance of
-
write
()¶ Returns an MLWriter instance for this ML instance.
-
-
class
pyspark.ml.feature.
QuantileDiscretizer
(numBuckets=2, inputCol=None, outputCol=None, relativeError=0.001, handleInvalid='error')[source]¶ Note
Experimental
QuantileDiscretizer takes a column with continuous features and outputs a column with binned categorical features. The number of bins can be set using the numBuckets parameter. It is possible that the number of buckets used will be less than this value, for example, if there are too few distinct values of the input to create enough distinct quantiles.
NaN handling: QuantileDiscretizer will raise an error when it finds NaN values in the dataset, but the user can also choose to either keep or remove NaN values within the dataset by setting the handleInvalid parameter. If the user chooses to keep NaN values, they will be handled specially and placed into their own bucket; for example, if 4 buckets are used, then non-NaN data will be put into buckets[0-3], but NaNs will be counted in a special bucket[4].
Algorithm: The bin ranges are chosen using an approximate algorithm (see the documentation for approxQuantile() for a detailed description). The precision of the approximation can be controlled with the relativeError parameter. The lower and upper bin bounds will be -Infinity and +Infinity, covering all real values.
>>> values = [(0.1,), (0.4,), (1.2,), (1.5,), (float("nan"),), (float("nan"),)]
>>> df = spark.createDataFrame(values, ["values"])
>>> qds = QuantileDiscretizer(numBuckets=2,
...     inputCol="values", outputCol="buckets", relativeError=0.01, handleInvalid="error")
>>> qds.getRelativeError()
0.01
>>> bucketizer = qds.fit(df)
>>> qds.setHandleInvalid("keep").fit(df).transform(df).count()
6
>>> qds.setHandleInvalid("skip").fit(df).transform(df).count()
4
>>> splits = bucketizer.getSplits()
>>> splits[0]
-inf
>>> print("%2.1f" % round(splits[1], 1))
0.4
>>> bucketed = bucketizer.transform(df).head()
>>> bucketed.buckets
0.0
>>> quantileDiscretizerPath = temp_path + "/quantile-discretizer"
>>> qds.save(quantileDiscretizerPath)
>>> loadedQds = QuantileDiscretizer.load(quantileDiscretizerPath)
>>> loadedQds.getNumBuckets() == qds.getNumBuckets()
True
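The extra NaN bucket described above can be observed directly. The following is an illustrative sketch only (it assumes an active SparkSession named spark, as in the example above); the bucket index for NaN rows is expected to be numBuckets, i.e. one past the last regular bucket:

    # With numBuckets=2 and handleInvalid="keep", NaN rows should land in bucket 2.0.
    values = [(0.1,), (0.4,), (1.2,), (1.5,), (float("nan"),)]
    df = spark.createDataFrame(values, ["values"])
    qds = QuantileDiscretizer(numBuckets=2, inputCol="values", outputCol="buckets",
                              handleInvalid="keep")
    qds.fit(df).transform(df).show()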
New in version 2.0.0.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then make a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: extra – Extra parameters to copy to the new instance Returns: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: extra – extra param values Returns: merged param map
-
fit
(dataset, params=None)¶ Fits a model to the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
Returns: fitted model(s)
New in version 1.3.0.
- dataset – input dataset, which is an instance of
-
getHandleInvalid
()[source]¶ Gets the value of
handleInvalid
or its default value.
New in version 2.1.0.
-
getInputCol
()¶ Gets the value of inputCol or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getOutputCol
()¶ Gets the value of outputCol or its default value.
-
getParam
(paramName)¶ Gets a param by its name.
-
getRelativeError
()[source]¶ Gets the value of relativeError or its default value.
New in version 2.0.0.
-
handleInvalid
= Param(parent='undefined', name='handleInvalid', doc='how to handle invalid entries. Options are skip (filter out rows with invalid values), error (throw an error), or keep (keep invalid values in a special additional bucket).')¶
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
inputCol
= Param(parent='undefined', name='inputCol', doc='input column name.')¶
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
load
(path)¶ Reads an ML instance from the input path, a shortcut of read().load(path).
-
numBuckets
= Param(parent='undefined', name='numBuckets', doc='Maximum number of buckets (quantiles, or categories) into which data points are grouped. Must be >= 2.')¶
-
outputCol
= Param(parent='undefined', name='outputCol', doc='output column name.')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param.
-
read
()¶ Returns an MLReader instance for this class.
-
relativeError
= Param(parent='undefined', name='relativeError', doc='The relative target precision for the approximate quantile algorithm used to generate buckets. Must be in the range [0, 1].')¶
-
save
(path)¶ Save this ML instance to the given path, a shortcut of write().save(path).
-
setHandleInvalid
(value)[source]¶ Sets the value of
handleInvalid
.
New in version 2.1.0.
-
setNumBuckets
(value)[source]¶ Sets the value of
numBuckets
.
New in version 2.0.0.
-
setParams
(self, numBuckets=2, inputCol=None, outputCol=None, relativeError=0.001, handleInvalid="error")[source]¶ Set the params for the QuantileDiscretizer
New in version 2.0.0.
-
setRelativeError
(value)[source]¶ Sets the value of
relativeError
.
New in version 2.0.0.
-
write
()¶ Returns an MLWriter instance for this ML instance.
-
-
class
pyspark.ml.feature.
RegexTokenizer
(minTokenLength=1, gaps=True, pattern='\s+', inputCol=None, outputCol=None, toLowercase=True)[source]¶ A regex based tokenizer that extracts tokens either by using the provided regex pattern (in Java dialect) to split the text (default) or repeatedly matching the regex (if gaps is false). Optional parameters also allow filtering tokens using a minimal length. It returns an array of strings that can be empty.
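The doctest below only exercises the default splitting mode. The matching mode mentioned above can be sketched as follows (illustrative only; assumes an active SparkSession named spark):

    # With gaps=False the pattern describes the tokens themselves rather than
    # the separators, so "\\w+" extracts runs of word characters.
    df = spark.createDataFrame([("a,b, c",)], ["text"])
    wordMatcher = RegexTokenizer(inputCol="text", outputCol="words",
                                 pattern="\\w+", gaps=False)
    wordMatcher.transform(df).head().words  # expected: ['a', 'b', 'c']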
>>> df = spark.createDataFrame([("A B c",)], ["text"]) >>> reTokenizer = RegexTokenizer(inputCol="text", outputCol="words") >>> reTokenizer.transform(df).head() Row(text='A B c', words=['a', 'b', 'c']) >>> # Change a parameter. >>> reTokenizer.setParams(outputCol="tokens").transform(df).head() Row(text='A B c', tokens=['a', 'b', 'c']) >>> # Temporarily modify a parameter. >>> reTokenizer.transform(df, {reTokenizer.outputCol: "words"}).head() Row(text='A B c', words=['a', 'b', 'c']) >>> reTokenizer.transform(df).head() Row(text='A B c', tokens=['a', 'b', 'c']) >>> # Must use keyword arguments to specify params. >>> reTokenizer.setParams("text") Traceback (most recent call last): ... TypeError: Method setParams forces keyword arguments. >>> regexTokenizerPath = temp_path + "/regex-tokenizer" >>> reTokenizer.save(regexTokenizerPath) >>> loadedReTokenizer = RegexTokenizer.load(regexTokenizerPath) >>> loadedReTokenizer.getMinTokenLength() == reTokenizer.getMinTokenLength() True >>> loadedReTokenizer.getGaps() == reTokenizer.getGaps() True
New in version 1.4.0.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then make a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: extra – Extra parameters to copy to the new instance Returns: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: extra – extra param values Returns: merged param map
-
gaps
= Param(parent='undefined', name='gaps', doc='whether regex splits on gaps (True) or matches tokens (False)')¶
-
getInputCol
()¶ Gets the value of inputCol or its default value.
-
getMinTokenLength
()[source]¶ Gets the value of minTokenLength or its default value.
New in version 1.4.0.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getOutputCol
()¶ Gets the value of outputCol or its default value.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
inputCol
= Param(parent='undefined', name='inputCol', doc='input column name.')¶
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
load
(path)¶ Reads an ML instance from the input path, a shortcut of read().load(path).
-
minTokenLength
= Param(parent='undefined', name='minTokenLength', doc='minimum token length (>= 0)')¶
-
outputCol
= Param(parent='undefined', name='outputCol', doc='output column name.')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param.
-
pattern
= Param(parent='undefined', name='pattern', doc='regex pattern (Java dialect) used for tokenizing')¶
-
read
()¶ Returns an MLReader instance for this class.
-
save
(path)¶ Save this ML instance to the given path, a shortcut of write().save(path).
-
setMinTokenLength
(value)[source]¶ Sets the value of
minTokenLength
.
New in version 1.4.0.
-
setParams
(self, minTokenLength=1, gaps=True, pattern="\s+", inputCol=None, outputCol=None, toLowercase=True)[source]¶ Sets params for this RegexTokenizer.
New in version 1.4.0.
-
setToLowercase
(value)[source]¶ Sets the value of
toLowercase
.
New in version 2.0.0.
-
toLowercase
= Param(parent='undefined', name='toLowercase', doc='whether to convert all characters to lowercase before tokenizing')¶
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
New in version 1.3.0.
- dataset – input dataset, which is an instance of
-
write
()¶ Returns an MLWriter instance for this ML instance.
-
-
class
pyspark.ml.feature.
RFormula
(formula=None, featuresCol='features', labelCol='label', forceIndexLabel=False)[source]¶ Note
Experimental
Implements the transforms required for fitting a dataset against an R model formula. Currently we support a limited subset of the R operators, including ‘~’, ‘.’, ‘:’, ‘+’, and ‘-‘. Also see the R formula docs.
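Of the operators listed above, only ‘+’, ‘.’ and ‘-’ appear in the example below; the interaction operator ‘:’ can be used in the same way. The following is an illustrative sketch only (assumes an active SparkSession named spark; the exact feature layout is determined by the formula resolution):

    df = spark.createDataFrame(
        [(1.0, 1.0, "a"), (0.0, 2.0, "b"), (0.0, 0.0, "a")], ["y", "x", "s"])
    rfInteraction = RFormula(formula="y ~ x:s")
    # features should contain the interaction terms between the numeric column x
    # and the (indexed/encoded) string column s.
    rfInteraction.fit(df).transform(df).select("features", "label").show()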
>>> df = spark.createDataFrame([
...     (1.0, 1.0, "a"),
...     (0.0, 2.0, "b"),
...     (0.0, 0.0, "a")
... ], ["y", "x", "s"])
>>> rf = RFormula(formula="y ~ x + s")
>>> model = rf.fit(df)
>>> model.transform(df).show()
+---+---+---+---------+-----+
|  y|  x|  s| features|label|
+---+---+---+---------+-----+
|1.0|1.0|  a|[1.0,1.0]|  1.0|
|0.0|2.0|  b|[2.0,0.0]|  0.0|
|0.0|0.0|  a|[0.0,1.0]|  0.0|
+---+---+---+---------+-----+
...
>>> rf.fit(df, {rf.formula: "y ~ . - s"}).transform(df).show()
+---+---+---+--------+-----+
|  y|  x|  s|features|label|
+---+---+---+--------+-----+
|1.0|1.0|  a|   [1.0]|  1.0|
|0.0|2.0|  b|   [2.0]|  0.0|
|0.0|0.0|  a|   [0.0]|  0.0|
+---+---+---+--------+-----+
...
>>> rFormulaPath = temp_path + "/rFormula"
>>> rf.save(rFormulaPath)
>>> loadedRF = RFormula.load(rFormulaPath)
>>> loadedRF.getFormula() == rf.getFormula()
True
>>> loadedRF.getFeaturesCol() == rf.getFeaturesCol()
True
>>> loadedRF.getLabelCol() == rf.getLabelCol()
True
>>> str(loadedRF)
'RFormula(y ~ x + s) (uid=...)'
>>> modelPath = temp_path + "/rFormulaModel"
>>> model.save(modelPath)
>>> loadedModel = RFormulaModel.load(modelPath)
>>> loadedModel.uid == model.uid
True
>>> loadedModel.transform(df).show()
+---+---+---+---------+-----+
|  y|  x|  s| features|label|
+---+---+---+---------+-----+
|1.0|1.0|  a|[1.0,1.0]|  1.0|
|0.0|2.0|  b|[2.0,0.0]|  0.0|
|0.0|0.0|  a|[0.0,1.0]|  0.0|
+---+---+---+---------+-----+
...
>>> str(loadedModel)
'RFormulaModel(ResolvedRFormula(label=y, terms=[x,s], hasIntercept=true)) (uid=...)'
New in version 1.5.0.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then make a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: extra – Extra parameters to copy to the new instance Returns: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: extra – extra param values Returns: merged param map
-
featuresCol
= Param(parent='undefined', name='featuresCol', doc='features column name.')¶
-
fit
(dataset, params=None)¶ Fits a model to the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
Returns: fitted model(s)
New in version 1.3.0.
- dataset – input dataset, which is an instance of
-
forceIndexLabel
= Param(parent='undefined', name='forceIndexLabel', doc='Force to index label whether it is numeric or string')¶
-
formula
= Param(parent='undefined', name='formula', doc='R model formula')¶
-
getFeaturesCol
()¶ Gets the value of featuresCol or its default value.
-
getForceIndexLabel
()[source]¶ Gets the value of
forceIndexLabel
.
New in version 2.1.0.
-
getLabelCol
()¶ Gets the value of labelCol or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
labelCol
= Param(parent='undefined', name='labelCol', doc='label column name.')¶
-
load
(path)¶ Reads an ML instance from the input path, a shortcut of read().load(path).
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param.
-
read
()¶ Returns an MLReader instance for this class.
-
save
(path)¶ Save this ML instance to the given path, a shortcut of write().save(path).
-
setFeaturesCol
(value)¶ Sets the value of
featuresCol
.
-
setForceIndexLabel
(value)[source]¶ Sets the value of
forceIndexLabel
.
New in version 2.1.0.
-
setParams
(self, formula=None, featuresCol="features", labelCol="label", forceIndexLabel=False)[source]¶ Sets params for RFormula.
New in version 1.5.0.
-
write
()¶ Returns an MLWriter instance for this ML instance.
-
-
class
pyspark.ml.feature.
RFormulaModel
(java_model=None)[source]¶ Note
Experimental
Model fitted by
RFormula
. Fitting is required to determine the factor levels of formula terms.
New in version 1.5.0.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then make a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: extra – Extra parameters to copy to the new instance Returns: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: extra – extra param values Returns: merged param map
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
load
(path)¶ Reads an ML instance from the input path, a shortcut of read().load(path).
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param.
-
read
()¶ Returns an MLReader instance for this class.
-
save
(path)¶ Save this ML instance to the given path, a shortcut of write().save(path).
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
New in version 1.3.0.
- dataset – input dataset, which is an instance of
-
write
()¶ Returns an MLWriter instance for this ML instance.
-
-
class
pyspark.ml.feature.
SQLTransformer
(statement=None)[source]¶ Implements the transforms that are defined by a SQL statement. Currently we only support SQL syntax like ‘SELECT … FROM __THIS__’, where ‘__THIS__’ represents the underlying table of the input dataset.
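The statement is not limited to adding columns; any single SELECT over __THIS__ should work, including row filters. An illustrative sketch (assumes an active SparkSession named spark; the data mirrors the example below):

    df = spark.createDataFrame([(0, 1.0, 3.0), (2, 2.0, 5.0)], ["id", "v1", "v2"])
    filterTrans = SQLTransformer(statement="SELECT * FROM __THIS__ WHERE v1 > 1.5")
    filterTrans.transform(df).show()  # expected to keep only the row with id=2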
>>> df = spark.createDataFrame([(0, 1.0, 3.0), (2, 2.0, 5.0)], ["id", "v1", "v2"])
>>> sqlTrans = SQLTransformer(
...     statement="SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__")
>>> sqlTrans.transform(df).head()
Row(id=0, v1=1.0, v2=3.0, v3=4.0, v4=3.0)
>>> sqlTransformerPath = temp_path + "/sql-transformer"
>>> sqlTrans.save(sqlTransformerPath)
>>> loadedSqlTrans = SQLTransformer.load(sqlTransformerPath)
>>> loadedSqlTrans.getStatement() == sqlTrans.getStatement()
True
New in version 1.6.0.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then make a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: extra – Extra parameters to copy to the new instance Returns: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: extra – extra param values Returns: merged param map
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
load
(path)¶ Reads an ML instance from the input path, a shortcut of read().load(path).
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param.
-
read
()¶ Returns an MLReader instance for this class.
-
save
(path)¶ Save this ML instance to the given path, a shortcut of write().save(path).
-
statement
= Param(parent='undefined', name='statement', doc='SQL statement')¶
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
New in version 1.3.0.
- dataset – input dataset, which is an instance of
-
write
()¶ Returns an MLWriter instance for this ML instance.
-
-
class
pyspark.ml.feature.
StandardScaler
(withMean=False, withStd=True, inputCol=None, outputCol=None)[source]¶ Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set.
The “unit std” is computed using the corrected sample standard deviation, which is computed as the square root of the unbiased sample variance.
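For the two one-dimensional samples used in the example below, this works out as follows (a worked check for illustration, not part of the API):

    # Corrected (unbiased) sample standard deviation of the samples 0.0 and 2.0:
    # mean = 1.0, variance = ((0 - 1)**2 + (2 - 1)**2) / (2 - 1) = 2.0
    import math
    samples = [0.0, 2.0]
    mean = sum(samples) / len(samples)
    var = sum((x - mean) ** 2 for x in samples) / (len(samples) - 1)
    print(mean, math.sqrt(var))  # 1.0 1.4142... -- matches model.mean and model.std below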
>>> from pyspark.ml.linalg import Vectors
>>> df = spark.createDataFrame([(Vectors.dense([0.0]),), (Vectors.dense([2.0]),)], ["a"])
>>> standardScaler = StandardScaler(inputCol="a", outputCol="scaled")
>>> model = standardScaler.fit(df)
>>> model.mean
DenseVector([1.0])
>>> model.std
DenseVector([1.4142])
>>> model.transform(df).collect()[1].scaled
DenseVector([1.4142])
>>> standardScalerPath = temp_path + "/standard-scaler"
>>> standardScaler.save(standardScalerPath)
>>> loadedStandardScaler = StandardScaler.load(standardScalerPath)
>>> loadedStandardScaler.getWithMean() == standardScaler.getWithMean()
True
>>> loadedStandardScaler.getWithStd() == standardScaler.getWithStd()
True
>>> modelPath = temp_path + "/standard-scaler-model"
>>> model.save(modelPath)
>>> loadedModel = StandardScalerModel.load(modelPath)
>>> loadedModel.std == model.std
True
>>> loadedModel.mean == model.mean
True
New in version 1.4.0.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then make a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: extra – Extra parameters to copy to the new instance Returns: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: extra – extra param values Returns: merged param map
-
fit
(dataset, params=None)¶ Fits a model to the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
Returns: fitted model(s)
New in version 1.3.0.
- dataset – input dataset, which is an instance of
-
getInputCol
()¶ Gets the value of inputCol or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getOutputCol
()¶ Gets the value of outputCol or its default value.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
inputCol
= Param(parent='undefined', name='inputCol', doc='input column name.')¶
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
load
(path)¶ Reads an ML instance from the input path, a shortcut of read().load(path).
-
outputCol
= Param(parent='undefined', name='outputCol', doc='output column name.')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param.
-
read
()¶ Returns an MLReader instance for this class.
-
save
(path)¶ Save this ML instance to the given path, a shortcut of write().save(path).
-
setParams
(self, withMean=False, withStd=True, inputCol=None, outputCol=None)[source]¶ Sets params for this StandardScaler.
New in version 1.4.0.
-
withMean
= Param(parent='undefined', name='withMean', doc='Center data with mean')¶
-
withStd
= Param(parent='undefined', name='withStd', doc='Scale to unit standard deviation')¶
-
write
()¶ Returns an MLWriter instance for this ML instance.
-
-
class
pyspark.ml.feature.
StandardScalerModel
(java_model=None)[source]¶ Model fitted by
StandardScaler
.
New in version 1.4.0.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then make a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: extra – Extra parameters to copy to the new instance Returns: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: extra – extra param values Returns: merged param map
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
load
(path)¶ Reads an ML instance from the input path, a shortcut of read().load(path).
-
mean
¶ Mean of the StandardScalerModel.
New in version 2.0.0.
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param.
-
read
()¶ Returns an MLReader instance for this class.
-
save
(path)¶ Save this ML instance to the given path, a shortcut of write().save(path).
-
std
¶ Standard deviation of the StandardScalerModel.
New in version 2.0.0.
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
New in version 1.3.0.
- dataset – input dataset, which is an instance of
-
write
()¶ Returns an MLWriter instance for this ML instance.
-
-
class
pyspark.ml.feature.
StopWordsRemover
(inputCol=None, outputCol=None, stopWords=None, caseSensitive=False)[source]¶ A feature transformer that filters out stop words from input.
Note
null values from the input array are preserved unless null is explicitly added to stopWords.
>>> df = spark.createDataFrame([(["a", "b", "c"],)], ["text"]) >>> remover = StopWordsRemover(inputCol="text", outputCol="words", stopWords=["b"]) >>> remover.transform(df).head().words == ['a', 'c'] True >>> stopWordsRemoverPath = temp_path + "/stopwords-remover" >>> remover.save(stopWordsRemoverPath) >>> loadedRemover = StopWordsRemover.load(stopWordsRemoverPath) >>> loadedRemover.getStopWords() == remover.getStopWords() True >>> loadedRemover.getCaseSensitive() == remover.getCaseSensitive() True
New in version 1.6.0.
-
caseSensitive
= Param(parent='undefined', name='caseSensitive', doc='whether to do a case sensitive comparison over the stop words')¶
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then make a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: extra – Extra parameters to copy to the new instance Returns: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: extra – extra param values Returns: merged param map
-
getCaseSensitive
()[source]¶ Gets the value of
caseSensitive
or its default value.
New in version 1.6.0.
-
getInputCol
()¶ Gets the value of inputCol or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getOutputCol
()¶ Gets the value of outputCol or its default value.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
inputCol
= Param(parent='undefined', name='inputCol', doc='input column name.')¶
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
load
(path)¶ Reads an ML instance from the input path, a shortcut of read().load(path).
-
static
loadDefaultStopWords
(language)[source]¶ Loads the default stop words for the given language. Supported languages: danish, dutch, english, finnish, french, german, hungarian, italian, norwegian, portuguese, russian, spanish, swedish, turkish
New in version 2.0.0.
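A typical use is to feed one of these built-in lists straight into the remover. An illustrative sketch only (assumes an active SparkSession named spark):

    english = StopWordsRemover.loadDefaultStopWords("english")
    remover = StopWordsRemover(inputCol="text", outputCol="filtered", stopWords=english)
    df = spark.createDataFrame([(["the", "cat", "sat"],)], ["text"])
    remover.transform(df).head().filtered  # expected: ['cat', 'sat']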
-
outputCol
= Param(parent='undefined', name='outputCol', doc='output column name.')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param.
-
read
()¶ Returns an MLReader instance for this class.
-
save
(path)¶ Save this ML instance to the given path, a shortcut of write().save(path).
-
setCaseSensitive
(value)[source]¶ Sets the value of
caseSensitive
.
New in version 1.6.0.
-
setParams
(self, inputCol=None, outputCol=None, stopWords=None, caseSensitive=False)[source]¶ Sets params for this StopWordsRemover.
New in version 1.6.0.
-
stopWords
= Param(parent='undefined', name='stopWords', doc='The words to be filtered out')¶
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
New in version 1.3.0.
- dataset – input dataset, which is an instance of
-
write
()¶ Returns an MLWriter instance for this ML instance.
-
-
class
pyspark.ml.feature.
StringIndexer
(inputCol=None, outputCol=None, handleInvalid='error')[source]¶ A label indexer that maps a string column of labels to an ML column of label indices. If the input column is numeric, we cast it to string and index the string values. The indices are in [0, numLabels), ordered by label frequencies. So the most frequent label gets index 0.
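The handleInvalid setting matters when the data seen at transform time contains labels that were not present during fitting. The following is an illustrative sketch only (assumes an active SparkSession named spark; the column names and values are made up for the example):

    train = spark.createDataFrame([(0, "a"), (1, "b"), (2, "a")], ["id", "label"])
    test = spark.createDataFrame([(3, "a"), (4, "c")], ["id", "label"])  # "c" is unseen
    model = StringIndexer(inputCol="label", outputCol="indexed",
                          handleInvalid="skip").fit(train)
    # With "skip" the unseen row is expected to be filtered out rather than raising.
    model.transform(test).show()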
>>> stringIndexer = StringIndexer(inputCol="label", outputCol="indexed", handleInvalid='error') >>> model = stringIndexer.fit(stringIndDf) >>> td = model.transform(stringIndDf) >>> sorted(set([(i[0], i[1]) for i in td.select(td.id, td.indexed).collect()]), ... key=lambda x: x[0]) [(0, 0.0), (1, 2.0), (2, 1.0), (3, 0.0), (4, 0.0), (5, 1.0)] >>> inverter = IndexToString(inputCol="indexed", outputCol="label2", labels=model.labels) >>> itd = inverter.transform(td) >>> sorted(set([(i[0], str(i[1])) for i in itd.select(itd.id, itd.label2).collect()]), ... key=lambda x: x[0]) [(0, 'a'), (1, 'b'), (2, 'c'), (3, 'a'), (4, 'a'), (5, 'c')] >>> stringIndexerPath = temp_path + "/string-indexer" >>> stringIndexer.save(stringIndexerPath) >>> loadedIndexer = StringIndexer.load(stringIndexerPath) >>> loadedIndexer.getHandleInvalid() == stringIndexer.getHandleInvalid() True >>> modelPath = temp_path + "/string-indexer-model" >>> model.save(modelPath) >>> loadedModel = StringIndexerModel.load(modelPath) >>> loadedModel.labels == model.labels True >>> indexToStringPath = temp_path + "/index-to-string" >>> inverter.save(indexToStringPath) >>> loadedInverter = IndexToString.load(indexToStringPath) >>> loadedInverter.getLabels() == inverter.getLabels() True
New in version 1.4.0.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then make a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: extra – Extra parameters to copy to the new instance Returns: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: extra – extra param values Returns: merged param map
-
fit
(dataset, params=None)¶ Fits a model to the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
Returns: fitted model(s)
New in version 1.3.0.
- dataset – input dataset, which is an instance of
-
getHandleInvalid
()¶ Gets the value of handleInvalid or its default value.
-
getInputCol
()¶ Gets the value of inputCol or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getOutputCol
()¶ Gets the value of outputCol or its default value.
-
getParam
(paramName)¶ Gets a param by its name.
-
handleInvalid
= Param(parent='undefined', name='handleInvalid', doc='how to handle invalid entries. Options are skip (which will filter out rows with bad values), or error (which will throw an error). More options may be added later.')¶
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
inputCol
= Param(parent='undefined', name='inputCol', doc='input column name.')¶
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
load
(path)¶ Reads an ML instance from the input path, a shortcut of read().load(path).
-
outputCol
= Param(parent='undefined', name='outputCol', doc='output column name.')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param.
-
read
()¶ Returns an MLReader instance for this class.
-
save
(path)¶ Save this ML instance to the given path, a shortcut of write().save(path).
-
setHandleInvalid
(value)¶ Sets the value of
handleInvalid
.
-
setParams
(self, inputCol=None, outputCol=None, handleInvalid="error")[source]¶ Sets params for this StringIndexer.
New in version 1.4.0.
-
write
()¶ Returns an MLWriter instance for this ML instance.
-
-
class
pyspark.ml.feature.
StringIndexerModel
(java_model=None)[source]¶ Model fitted by
StringIndexer
.
New in version 1.4.0.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then make a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: extra – Extra parameters to copy to the new instance Returns: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: extra – extra param values Returns: merged param map
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
labels
¶ Ordered list of labels, corresponding to indices to be assigned.
New in version 1.5.0.
-
load
(path)¶ Reads an ML instance from the input path, a shortcut of read().load(path).
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param.
-
read
()¶ Returns an MLReader instance for this class.
-
save
(path)¶ Save this ML instance to the given path, a shortcut of write().save(path).
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
New in version 1.3.0.
- dataset – input dataset, which is an instance of
-
write
()¶ Returns an MLWriter instance for this ML instance.
-
-
class
pyspark.ml.feature.
Tokenizer
(inputCol=None, outputCol=None)[source]¶ A tokenizer that converts the input string to lowercase and then splits it by white spaces.
>>> df = spark.createDataFrame([("a b c",)], ["text"]) >>> tokenizer = Tokenizer(inputCol="text", outputCol="words") >>> tokenizer.transform(df).head() Row(text='a b c', words=['a', 'b', 'c']) >>> # Change a parameter. >>> tokenizer.setParams(outputCol="tokens").transform(df).head() Row(text='a b c', tokens=['a', 'b', 'c']) >>> # Temporarily modify a parameter. >>> tokenizer.transform(df, {tokenizer.outputCol: "words"}).head() Row(text='a b c', words=['a', 'b', 'c']) >>> tokenizer.transform(df).head() Row(text='a b c', tokens=['a', 'b', 'c']) >>> # Must use keyword arguments to specify params. >>> tokenizer.setParams("text") Traceback (most recent call last): ... TypeError: Method setParams forces keyword arguments. >>> tokenizerPath = temp_path + "/tokenizer" >>> tokenizer.save(tokenizerPath) >>> loadedTokenizer = Tokenizer.load(tokenizerPath) >>> loadedTokenizer.transform(df).head().tokens == tokenizer.transform(df).head().tokens True
New in version 1.3.0.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then make a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: extra – Extra parameters to copy to the new instance Returns: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optionally default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: extra – extra param values Returns: merged param map
-
getInputCol
()¶ Gets the value of inputCol or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getOutputCol
()¶ Gets the value of outputCol or its default value.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
inputCol
= Param(parent='undefined', name='inputCol', doc='input column name.')¶
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
load
(path)¶ Reads an ML instance from the input path, a shortcut of read().load(path).
-
outputCol
= Param(parent='undefined', name='outputCol', doc='output column name.')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param.
-
read
()¶ Returns an MLReader instance for this class.
-
save
(path)¶ Save this ML instance to the given path, a shortcut of write().save(path).
-
setParams
(self, inputCol=None, outputCol=None)[source]¶ Sets params for this Tokenizer.
New in version 1.3.0.
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
New in version 1.3.0.
- dataset – input dataset, which is an instance of
-
write
()¶ Returns an MLWriter instance for this ML instance.
-
-
class
pyspark.ml.feature.
VectorAssembler
(inputCols=None, outputCol=None)[source]¶ A feature transformer that merges multiple columns into a vector column.
>>> df = spark.createDataFrame([(1, 0, 3)], ["a", "b", "c"])
>>> vecAssembler = VectorAssembler(inputCols=["a", "b", "c"], outputCol="features")
>>> vecAssembler.transform(df).head().features
DenseVector([1.0, 0.0, 3.0])
>>> vecAssembler.setParams(outputCol="freqs").transform(df).head().freqs
DenseVector([1.0, 0.0, 3.0])
>>> params = {vecAssembler.inputCols: ["b", "a"], vecAssembler.outputCol: "vector"}
>>> vecAssembler.transform(df, params).head().vector
DenseVector([0.0, 1.0])
>>> vectorAssemblerPath = temp_path + "/vector-assembler"
>>> vecAssembler.save(vectorAssemblerPath)
>>> loadedAssembler = VectorAssembler.load(vectorAssemblerPath)
>>> loadedAssembler.transform(df).head().freqs == vecAssembler.transform(df).head().freqs
True
New in version 1.4.0.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: extra – Extra parameters to copy to the new instance Returns: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optional default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: extra – extra param values Returns: merged param map
-
getInputCols
()¶ Gets the value of inputCols or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getOutputCol
()¶ Gets the value of outputCol or its default value.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
inputCols
= Param(parent='undefined', name='inputCols', doc='input column names.')¶
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
load
(path)¶ Reads an ML instance from the input path, a shortcut of read().load(path).
-
outputCol
= Param(parent='undefined', name='outputCol', doc='output column name.')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param
.
-
read
()¶ Returns an MLReader instance for this class.
-
save
(path)¶ Save this ML instance to the given path, a shortcut of write().save(path).
-
setParams
(self, inputCols=None, outputCol=None)[source]¶ Sets params for this VectorAssembler.
New in version 1.4.0.
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
New in version 1.3.0.
- dataset – input dataset, which is an instance of
-
write
()¶ Returns an MLWriter instance for this ML instance.
-
-
class
pyspark.ml.feature.
VectorIndexer
(maxCategories=20, inputCol=None, outputCol=None)[source]¶ Class for indexing categorical feature columns in a dataset of Vector.
- This has 2 usage modes:
  - Automatically identify categorical features (default behavior)
    - This helps process a dataset of unknown vectors into a dataset with some continuous features and some categorical features. The choice between continuous and categorical is based upon a maxCategories parameter.
    - Set maxCategories to the maximum number of categories any categorical feature should have.
    - E.g.: Feature 0 has unique values {-1.0, 0.0}, and feature 1 values {1.0, 3.0, 5.0}. If maxCategories = 2, then feature 0 will be declared categorical and use indices {0, 1}, and feature 1 will be declared continuous.
  - Index all features, if all features are categorical
    - If maxCategories is set to be very large, then this will build an index of unique values for all features.
    - Warning: This can cause problems if features are continuous since this will collect ALL unique values to the driver.
    - E.g.: Feature 0 has unique values {-1.0, 0.0}, and feature 1 values {1.0, 3.0, 5.0}. If maxCategories >= 3, then both features will be declared categorical.
This returns a model which can transform categorical features to use 0-based indices.
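As a rough sketch of the two usage modes (assuming an active SparkSession named spark, as in the examples in this module, and using the feature values from the E.g. above), a small maxCategories indexes only low-arity features, while a very large maxCategories indexes every feature:

# Sketch: spark is assumed to be an existing SparkSession.
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.linalg import Vectors

data = spark.createDataFrame(
    [(Vectors.dense([-1.0, 1.0]),),
     (Vectors.dense([0.0, 3.0]),),
     (Vectors.dense([0.0, 5.0]),)], ["features"])

# Mode 1: feature 0 has 2 distinct values and becomes categorical;
# feature 1 has 3 distinct values and stays continuous.
m1 = VectorIndexer(maxCategories=2, inputCol="features", outputCol="indexed").fit(data)
print(m1.categoryMaps)   # an entry for feature 0 only

# Mode 2: with a very large maxCategories, both features are indexed,
# which collects every distinct value per feature on the driver.
m2 = VectorIndexer(maxCategories=1000, inputCol="features", outputCol="indexed").fit(data)
print(m2.categoryMaps)   # entries for both features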
- Index stability:
  - This is not guaranteed to choose the same category index across multiple runs.
  - If a categorical feature includes value 0, then this is guaranteed to map value 0 to index 0. This maintains vector sparsity.
  - More stability may be added in the future.
- TODO: Future extensions: The following functionality is planned for the future:
  - Preserve metadata in transform; if a feature’s metadata is already present, do not recompute.
  - Specify certain features to not index, either via a parameter or via existing metadata.
  - Add warning if a categorical feature has only 1 category.
  - Add option for allowing unknown categories.
>>> from pyspark.ml.linalg import Vectors
>>> df = spark.createDataFrame([(Vectors.dense([-1.0, 0.0]),),
...     (Vectors.dense([0.0, 1.0]),), (Vectors.dense([0.0, 2.0]),)], ["a"])
>>> indexer = VectorIndexer(maxCategories=2, inputCol="a", outputCol="indexed")
>>> model = indexer.fit(df)
>>> model.transform(df).head().indexed
DenseVector([1.0, 0.0])
>>> model.numFeatures
2
>>> model.categoryMaps
{0: {0.0: 0, -1.0: 1}}
>>> indexer.setParams(outputCol="test").fit(df).transform(df).collect()[1].test
DenseVector([0.0, 1.0])
>>> params = {indexer.maxCategories: 3, indexer.outputCol: "vector"}
>>> model2 = indexer.fit(df, params)
>>> model2.transform(df).head().vector
DenseVector([1.0, 0.0])
>>> vectorIndexerPath = temp_path + "/vector-indexer"
>>> indexer.save(vectorIndexerPath)
>>> loadedIndexer = VectorIndexer.load(vectorIndexerPath)
>>> loadedIndexer.getMaxCategories() == indexer.getMaxCategories()
True
>>> modelPath = temp_path + "/vector-indexer-model"
>>> model.save(modelPath)
>>> loadedModel = VectorIndexerModel.load(modelPath)
>>> loadedModel.numFeatures == model.numFeatures
True
>>> loadedModel.categoryMaps == model.categoryMaps
True
New in version 1.4.0.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: extra – Extra parameters to copy to the new instance Returns: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optional default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: extra – extra param values Returns: merged param map
-
fit
(dataset, params=None)¶ Fits a model to the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
Returns: fitted model(s)
New in version 1.3.0.
- dataset – input dataset, which is an instance of
-
getInputCol
()¶ Gets the value of inputCol or its default value.
-
getMaxCategories
()[source]¶ Gets the value of maxCategories or its default value.
New in version 1.4.0.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getOutputCol
()¶ Gets the value of outputCol or its default value.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
inputCol
= Param(parent='undefined', name='inputCol', doc='input column name.')¶
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
load
(path)¶ Reads an ML instance from the input path, a shortcut of read().load(path).
-
maxCategories
= Param(parent='undefined', name='maxCategories', doc='Threshold for the number of values a categorical feature can take (>= 2). If a feature is found to have > maxCategories values, then it is declared continuous.')¶
-
outputCol
= Param(parent='undefined', name='outputCol', doc='output column name.')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param
.
-
read
()¶ Returns an MLReader instance for this class.
-
save
(path)¶ Save this ML instance to the given path, a shortcut of write().save(path).
-
setMaxCategories
(value)[source]¶ Sets the value of
maxCategories
.
New in version 1.4.0.
-
setParams
(self, maxCategories=20, inputCol=None, outputCol=None)[source]¶ Sets params for this VectorIndexer.
New in version 1.4.0.
-
write
()¶ Returns an MLWriter instance for this ML instance.
-
class
pyspark.ml.feature.
VectorIndexerModel
(java_model=None)[source]¶ Model fitted by
VectorIndexer
.
- Transform categorical features to use 0-based indices instead of their original values.
  - Categorical features are mapped to indices.
  - Continuous features (columns) are left unchanged.
This also appends metadata to the output column, marking features as Numeric (continuous), Nominal (categorical), or Binary (either continuous or categorical). Non-ML metadata is not carried over from the input to the output column.
This maintains vector sparsity.
New in version 1.4.0.
-
categoryMaps
¶ Feature value index. Keys are categorical feature indices (column indices). Values are maps from original feature values to 0-based category indices. If a feature is not in this map, it is treated as continuous.
New in version 1.4.0.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: extra – Extra parameters to copy to the new instance Returns: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optional default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: extra – extra param values Returns: merged param map
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
load
(path)¶ Reads an ML instance from the input path, a shortcut of read().load(path).
-
numFeatures
¶ Number of features, i.e., length of Vectors which this transforms.
New in version 1.4.0.
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param
.
-
read
()¶ Returns an MLReader instance for this class.
-
save
(path)¶ Save this ML instance to the given path, a shortcut of write().save(path).
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
New in version 1.3.0.
- dataset – input dataset, which is an instance of
-
write
()¶ Returns an MLWriter instance for this ML instance.
-
class
pyspark.ml.feature.
VectorSlicer
(inputCol=None, outputCol=None, indices=None, names=None)[source]¶ This class takes a feature vector and outputs a new feature vector with a subarray of the original features.
The subset of features can be specified with either indices (setIndices()) or names (setNames()). At least one feature must be selected. Duplicate features are not allowed, so there can be no overlap between selected indices and names.
The output vector will order features with the selected indices first (in the order given), followed by the selected names (in the order given).
>>> from pyspark.ml.linalg import Vectors
>>> df = spark.createDataFrame([
...     (Vectors.dense([-2.0, 2.3, 0.0, 0.0, 1.0]),),
...     (Vectors.dense([0.0, 0.0, 0.0, 0.0, 0.0]),),
...     (Vectors.dense([0.6, -1.1, -3.0, 4.5, 3.3]),)], ["features"])
>>> vs = VectorSlicer(inputCol="features", outputCol="sliced", indices=[1, 4])
>>> vs.transform(df).head().sliced
DenseVector([2.3, 1.0])
>>> vectorSlicerPath = temp_path + "/vector-slicer"
>>> vs.save(vectorSlicerPath)
>>> loadedVs = VectorSlicer.load(vectorSlicerPath)
>>> loadedVs.getIndices() == vs.getIndices()
True
>>> loadedVs.getNames() == vs.getNames()
True
New in version 1.6.0.
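The example above selects by indices only; name-based selection additionally requires ML attribute names on the input vector column, which VectorAssembler provides. A minimal sketch (assuming an active SparkSession named spark, as in the example above):

# Sketch: spark is assumed to be an existing SparkSession.
from pyspark.ml.feature import VectorAssembler, VectorSlicer

df = spark.createDataFrame([(1.0, 2.0, 3.0)], ["a", "b", "c"])
# VectorAssembler records its input column names as ML attributes on "features",
# which is what name-based slicing relies on.
assembled = VectorAssembler(inputCols=["a", "b", "c"], outputCol="features").transform(df)

# Selected indices come first, then selected names: the output here is [3.0, 1.0].
vs = VectorSlicer(inputCol="features", outputCol="sliced", indices=[2], names=["a"])
vs.transform(assembled).select("sliced").show(truncate=False)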
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: extra – Extra parameters to copy to the new instance Returns: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optional default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: extra – extra param values Returns: merged param map
-
getInputCol
()¶ Gets the value of inputCol or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getOutputCol
()¶ Gets the value of outputCol or its default value.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
indices
= Param(parent='undefined', name='indices', doc='An array of indices to select features from a vector column. There can be no overlap with names.')¶
-
inputCol
= Param(parent='undefined', name='inputCol', doc='input column name.')¶
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
load
(path)¶ Reads an ML instance from the input path, a shortcut of read().load(path).
-
names
= Param(parent='undefined', name='names', doc='An array of feature names to select features from a vector column. These names must be specified by ML org.apache.spark.ml.attribute.Attribute. There can be no overlap with indices.')¶
-
outputCol
= Param(parent='undefined', name='outputCol', doc='output column name.')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param
.
-
read
()¶ Returns an MLReader instance for this class.
-
save
(path)¶ Save this ML instance to the given path, a shortcut of write().save(path).
-
setParams
(inputCol=None, outputCol=None, indices=None, names=None)[source]¶ Sets params for this VectorSlicer.
New in version 1.6.0.
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
New in version 1.3.0.
- dataset – input dataset, which is an instance of
-
write
()¶ Returns an MLWriter instance for this ML instance.
-
-
class
pyspark.ml.feature.
Word2Vec
(vectorSize=100, minCount=5, numPartitions=1, stepSize=0.025, maxIter=1, seed=None, inputCol=None, outputCol=None, windowSize=5, maxSentenceLength=1000)[source]¶ Word2Vec trains a model of Map(String, Vector), i.e. it transforms a word into a code (vector) for use in further natural language processing or machine learning.
>>> sent = ("a b " * 100 + "a c " * 10).split(" ")
>>> doc = spark.createDataFrame([(sent,), (sent,)], ["sentence"])
>>> word2Vec = Word2Vec(vectorSize=5, seed=42, inputCol="sentence", outputCol="model")
>>> model = word2Vec.fit(doc)
>>> model.getVectors().show()
+----+--------------------+
|word|              vector|
+----+--------------------+
|   a|[0.09461779892444...|
|   b|[1.15474212169647...|
|   c|[-0.3794820010662...|
+----+--------------------+
...
>>> from pyspark.sql.functions import format_number as fmt
>>> model.findSynonyms("a", 2).select("word", fmt("similarity", 5).alias("similarity")).show()
+----+----------+
|word|similarity|
+----+----------+
|   b|   0.25053|
|   c|  -0.69805|
+----+----------+
...
>>> model.transform(doc).head().model
DenseVector([0.5524, -0.4995, -0.3599, 0.0241, 0.3461])
>>> word2vecPath = temp_path + "/word2vec"
>>> word2Vec.save(word2vecPath)
>>> loadedWord2Vec = Word2Vec.load(word2vecPath)
>>> loadedWord2Vec.getVectorSize() == word2Vec.getVectorSize()
True
>>> loadedWord2Vec.getNumPartitions() == word2Vec.getNumPartitions()
True
>>> loadedWord2Vec.getMinCount() == word2Vec.getMinCount()
True
>>> modelPath = temp_path + "/word2vec-model"
>>> model.save(modelPath)
>>> loadedModel = Word2VecModel.load(modelPath)
>>> loadedModel.getVectors().first().word == model.getVectors().first().word
True
>>> loadedModel.getVectors().first().vector == model.getVectors().first().vector
True
New in version 1.4.0.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: extra – Extra parameters to copy to the new instance Returns: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optional default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: extra – extra param values Returns: merged param map
-
fit
(dataset, params=None)¶ Fits a model to the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
Returns: fitted model(s)
New in version 1.3.0.
- dataset – input dataset, which is an instance of
-
getInputCol
()¶ Gets the value of inputCol or its default value.
-
getMaxIter
()¶ Gets the value of maxIter or its default value.
-
getMaxSentenceLength
()[source]¶ Gets the value of maxSentenceLength or its default value.
New in version 2.0.0.
-
getNumPartitions
()[source]¶ Gets the value of numPartitions or its default value.
New in version 1.4.0.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getOutputCol
()¶ Gets the value of outputCol or its default value.
-
getParam
(paramName)¶ Gets a param by its name.
-
getSeed
()¶ Gets the value of seed or its default value.
-
getStepSize
()¶ Gets the value of stepSize or its default value.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
inputCol
= Param(parent='undefined', name='inputCol', doc='input column name.')¶
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
load
(path)¶ Reads an ML instance from the input path, a shortcut of read().load(path).
-
maxIter
= Param(parent='undefined', name='maxIter', doc='max number of iterations (>= 0).')¶
-
maxSentenceLength
= Param(parent='undefined', name='maxSentenceLength', doc='Maximum length (in words) of each sentence in the input data. Any sentence longer than this threshold will be divided into chunks up to the size.')¶
-
minCount
= Param(parent='undefined', name='minCount', doc="the minimum number of times a token must appear to be included in the word2vec model's vocabulary")¶
-
numPartitions
= Param(parent='undefined', name='numPartitions', doc='number of partitions for sentences of words')¶
-
outputCol
= Param(parent='undefined', name='outputCol', doc='output column name.')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param
.
-
read
()¶ Returns an MLReader instance for this class.
-
save
(path)¶ Save this ML instance to the given path, a shortcut of write().save(path).
-
seed
= Param(parent='undefined', name='seed', doc='random seed.')¶
-
setMaxSentenceLength
(value)[source]¶ Sets the value of
maxSentenceLength
.
New in version 2.0.0.
-
setNumPartitions
(value)[source]¶ Sets the value of
numPartitions
.
New in version 1.4.0.
-
setParams
(self, minCount=5, numPartitions=1, stepSize=0.025, maxIter=1, seed=None, inputCol=None, outputCol=None, windowSize=5, maxSentenceLength=1000)[source]¶ Sets params for this Word2Vec.
New in version 1.4.0.
-
setVectorSize
(value)[source]¶ Sets the value of
vectorSize
.
New in version 1.4.0.
-
setWindowSize
(value)[source]¶ Sets the value of
windowSize
.
New in version 2.0.0.
-
stepSize
= Param(parent='undefined', name='stepSize', doc='Step size to be used for each iteration of optimization (>= 0).')¶
-
vectorSize
= Param(parent='undefined', name='vectorSize', doc='the dimension of codes after transforming from words')¶
-
windowSize
= Param(parent='undefined', name='windowSize', doc='the window size (context words from [-window, window]). Default value is 5')¶
-
write
()¶ Returns an MLWriter instance for this ML instance.
-
-
class
pyspark.ml.feature.
Word2VecModel
(java_model=None)[source]¶ Model fitted by
Word2Vec
.
New in version 1.4.0.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: extra – Extra parameters to copy to the new instance Returns: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optional default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: extra – extra param values Returns: merged param map
-
findSynonyms
(word, num)[source]¶ Finds “num” words closest in similarity to “word”, which can be given as a string or as a vector representation. Returns a dataframe with two fields, word and similarity (the cosine similarity).
New in version 1.5.0.
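A minimal sketch of querying by a vector instead of a word (assuming an active SparkSession named spark; the model is fitted the same way as in the Word2Vec example above, except for the output column name):

# Sketch: spark is assumed to be an existing SparkSession.
from pyspark.ml.feature import Word2Vec
from pyspark.ml.linalg import Vectors

sent = ("a b " * 100 + "a c " * 10).split(" ")
doc = spark.createDataFrame([(sent,), (sent,)], ["sentence"])
model = Word2Vec(vectorSize=5, seed=42, inputCol="sentence", outputCol="vec").fit(doc)

model.findSynonyms("a", 2).show()                  # query by word
query = Vectors.dense([0.1, 0.0, -0.2, 0.3, 0.0])  # any vector of the same dimension
model.findSynonyms(query, 2).show()                # query by vector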
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
getVectors
()[source]¶ Returns the vector representation of the words as a dataframe with two fields, word and vector.
New in version 1.5.0.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
load
(path)¶ Reads an ML instance from the input path, a shortcut of read().load(path).
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param
.
-
read
()¶ Returns an MLReader instance for this class.
-
save
(path)¶ Save this ML instance to the given path, a shortcut of write().save(path).
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
New in version 1.3.0.
- dataset – input dataset, which is an instance of
-
write
()¶ Returns an MLWriter instance for this ML instance.
-
pyspark.ml.classification module¶
-
class
pyspark.ml.classification.
LinearSVC
(featuresCol='features', labelCol='label', predictionCol='prediction', maxIter=100, regParam=0.0, tol=1e-06, rawPredictionCol='rawPrediction', fitIntercept=True, standardization=True, threshold=0.0, weightCol=None, aggregationDepth=2)[source]¶ Note
Experimental
This binary classifier optimizes the hinge loss using the OWLQN optimizer. It currently supports only L2 regularization.
>>> from pyspark.sql import Row
>>> from pyspark.ml.linalg import Vectors
>>> df = sc.parallelize([
...     Row(label=1.0, features=Vectors.dense(1.0, 1.0, 1.0)),
...     Row(label=0.0, features=Vectors.dense(1.0, 2.0, 3.0))]).toDF()
>>> svm = LinearSVC(maxIter=5, regParam=0.01)
>>> model = svm.fit(df)
>>> model.coefficients
DenseVector([0.0, -0.2792, -0.1833])
>>> model.intercept
1.0206118982229047
>>> model.numClasses
2
>>> model.numFeatures
3
>>> test0 = sc.parallelize([Row(features=Vectors.dense(-1.0, -1.0, -1.0))]).toDF()
>>> result = model.transform(test0).head()
>>> result.prediction
1.0
>>> result.rawPrediction
DenseVector([-1.4831, 1.4831])
>>> svm_path = temp_path + "/svm"
>>> svm.save(svm_path)
>>> svm2 = LinearSVC.load(svm_path)
>>> svm2.getMaxIter()
5
>>> model_path = temp_path + "/svm_model"
>>> model.save(model_path)
>>> model2 = LinearSVCModel.load(model_path)
>>> model.coefficients[0] == model2.coefficients[0]
True
>>> model.intercept == model2.intercept
True
New in version 2.2.0.
-
aggregationDepth
= Param(parent='undefined', name='aggregationDepth', doc='suggested depth for treeAggregate (>= 2).')¶
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: extra – Extra parameters to copy to the new instance Returns: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optional default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: extra – extra param values Returns: merged param map
-
featuresCol
= Param(parent='undefined', name='featuresCol', doc='features column name.')¶
-
fit
(dataset, params=None)¶ Fits a model to the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
Returns: fitted model(s)
New in version 1.3.0.
- dataset – input dataset, which is an instance of
-
fitIntercept
= Param(parent='undefined', name='fitIntercept', doc='whether to fit an intercept term.')¶
-
getAggregationDepth
()¶ Gets the value of aggregationDepth or its default value.
-
getFeaturesCol
()¶ Gets the value of featuresCol or its default value.
-
getFitIntercept
()¶ Gets the value of fitIntercept or its default value.
-
getLabelCol
()¶ Gets the value of labelCol or its default value.
-
getMaxIter
()¶ Gets the value of maxIter or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
getPredictionCol
()¶ Gets the value of predictionCol or its default value.
-
getRawPredictionCol
()¶ Gets the value of rawPredictionCol or its default value.
-
getRegParam
()¶ Gets the value of regParam or its default value.
-
getStandardization
()¶ Gets the value of standardization or its default value.
-
getTol
()¶ Gets the value of tol or its default value.
-
getWeightCol
()¶ Gets the value of weightCol or its default value.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
labelCol
= Param(parent='undefined', name='labelCol', doc='label column name.')¶
-
load
(path)¶ Reads an ML instance from the input path, a shortcut of read().load(path).
-
maxIter
= Param(parent='undefined', name='maxIter', doc='max number of iterations (>= 0).')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param
.
-
predictionCol
= Param(parent='undefined', name='predictionCol', doc='prediction column name.')¶
-
rawPredictionCol
= Param(parent='undefined', name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column name.')¶
-
read
()¶ Returns an MLReader instance for this class.
-
regParam
= Param(parent='undefined', name='regParam', doc='regularization parameter (>= 0).')¶
-
save
(path)¶ Save this ML instance to the given path, a shortcut of write().save(path).
-
setAggregationDepth
(value)¶ Sets the value of
aggregationDepth
.
-
setFeaturesCol
(value)¶ Sets the value of
featuresCol
.
-
setFitIntercept
(value)¶ Sets the value of
fitIntercept
.
-
setParams
(featuresCol='features', labelCol='label', predictionCol='prediction', maxIter=100, regParam=0.0, tol=1e-06, rawPredictionCol='rawPrediction', fitIntercept=True, standardization=True, threshold=0.0, weightCol=None, aggregationDepth=2)[source]¶ Sets params for Linear SVM Classifier.
New in version 2.2.0.
-
setPredictionCol
(value)¶ Sets the value of
predictionCol
.
-
setRawPredictionCol
(value)¶ Sets the value of
rawPredictionCol
.
-
setStandardization
(value)¶ Sets the value of
standardization
.
-
standardization
= Param(parent='undefined', name='standardization', doc='whether to standardize the training features before fitting the model.')¶
-
threshold
= Param(parent='undefined', name='threshold', doc='The threshold in binary classification applied to the linear model prediction. This threshold can be any real number, where Inf will make all predictions 0.0 and -Inf will make all predictions 1.0.')¶
-
tol
= Param(parent='undefined', name='tol', doc='the convergence tolerance for iterative algorithms (>= 0).')¶
-
weightCol
= Param(parent='undefined', name='weightCol', doc='weight column name. If this is not set or empty, we treat all instance weights as 1.0.')¶
-
write
()¶ Returns an MLWriter instance for this ML instance.
-
-
class
pyspark.ml.classification.
LinearSVCModel
(java_model=None)[source]¶ Note
Experimental
Model fitted by LinearSVC.
New in version 2.2.0.
-
coefficients
¶ Model coefficients of Linear SVM Classifier.
New in version 2.2.0.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: extra – Extra parameters to copy to the new instance Returns: Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optional default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: extra – extra param values Returns: merged param map
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
intercept
¶ Model intercept of Linear SVM Classifier.
New in version 2.2.0.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
load
(path)¶ Reads an ML instance from the input path, a shortcut of read().load(path).
-
numClasses
¶ Number of classes (values which the label can take).
New in version 2.1.0.
-
numFeatures
¶ Returns the number of features the model was trained on. If unknown, returns -1.
New in version 2.1.0.
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param
.
-
read
()¶ Returns an MLReader instance for this class.
-
save
(path)¶ Save this ML instance to the given path, a shortcut of write().save(path).
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
New in version 1.3.0.
- dataset – input dataset, which is an instance of
-
write
()¶ Returns an MLWriter instance for this ML instance.
-
-
class
pyspark.ml.classification.
LogisticRegression
(featuresCol='features', labelCol='label', predictionCol='prediction', maxIter=100, regParam=0.0, elasticNetParam=0.0, tol=1e-06, fitIntercept=True, threshold=0.5, thresholds=None, probabilityCol='probability', rawPredictionCol='rawPrediction', standardization=True, weightCol=None, aggregationDepth=2, family='auto')[source]¶ Logistic regression. This class supports multinomial logistic (softmax) and binomial logistic regression.
>>> from pyspark.sql import Row
>>> from pyspark.ml.linalg import Vectors
>>> bdf = sc.parallelize([
...     Row(label=1.0, weight=1.0, features=Vectors.dense(0.0, 5.0)),
...     Row(label=0.0, weight=2.0, features=Vectors.dense(1.0, 2.0)),
...     Row(label=1.0, weight=3.0, features=Vectors.dense(2.0, 1.0)),
...     Row(label=0.0, weight=4.0, features=Vectors.dense(3.0, 3.0))]).toDF()
>>> blor = LogisticRegression(regParam=0.01, weightCol="weight")
>>> blorModel = blor.fit(bdf)
>>> blorModel.coefficients
DenseVector([-1.080..., -0.646...])
>>> blorModel.intercept
3.112...
>>> data_path = "data/mllib/sample_multiclass_classification_data.txt"
>>> mdf = spark.read.format("libsvm").load(data_path)
>>> mlor = LogisticRegression(regParam=0.1, elasticNetParam=1.0, family="multinomial")
>>> mlorModel = mlor.fit(mdf)
>>> mlorModel.coefficientMatrix
SparseMatrix(3, 4, [0, 1, 2, 3], [3, 2, 1], [1.87..., -2.75..., -0.50...], 1)
>>> mlorModel.interceptVector
DenseVector([0.04..., -0.42..., 0.37...])
>>> test0 = sc.parallelize([Row(features=Vectors.dense(-1.0, 1.0))]).toDF()
>>> result = blorModel.transform(test0).head()
>>> result.prediction
1.0
>>> result.probability
DenseVector([0.02..., 0.97...])
>>> result.rawPrediction
DenseVector([-3.54..., 3.54...])
>>> test1 = sc.parallelize([Row(features=Vectors.sparse(2, [0], [1.0]))]).toDF()
>>> blorModel.transform(test1).head().prediction
1.0
>>> blor.setParams("vector")
Traceback (most recent call last):
    ...
TypeError: Method setParams forces keyword arguments.
>>> lr_path = temp_path + "/lr"
>>> blor.save(lr_path)
>>> lr2 = LogisticRegression.load(lr_path)
>>> lr2.getRegParam()
0.01
>>> model_path = temp_path + "/lr_model"
>>> blorModel.save(model_path)
>>> model2 = LogisticRegressionModel.load(model_path)
>>> blorModel.coefficients[0] == model2.coefficients[0]
True
>>> blorModel.intercept == model2.intercept
True
New in version 1.3.0.
-
aggregationDepth
= Param(parent='undefined', name='aggregationDepth', doc='suggested depth for treeAggregate (>= 2).')¶
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: extra – Extra parameters to copy to the new instance Returns: Copy of this instance
-
elasticNetParam
= Param(parent='undefined', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.')¶
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optional default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: extra – extra param values Returns: merged param map
-
family
= Param(parent='undefined', name='family', doc='The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial')¶
-
featuresCol
= Param(parent='undefined', name='featuresCol', doc='features column name.')¶
-
fit
(dataset, params=None)¶ Fits a model to the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
Returns: fitted model(s)
New in version 1.3.0.
- dataset – input dataset, which is an instance of
-
fitIntercept
= Param(parent='undefined', name='fitIntercept', doc='whether to fit an intercept term.')¶
-
getAggregationDepth
()¶ Gets the value of aggregationDepth or its default value.
-
getElasticNetParam
()¶ Gets the value of elasticNetParam or its default value.
-
getFeaturesCol
()¶ Gets the value of featuresCol or its default value.
-
getFitIntercept
()¶ Gets the value of fitIntercept or its default value.
-
getLabelCol
()¶ Gets the value of labelCol or its default value.
-
getMaxIter
()¶ Gets the value of maxIter or its default value.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
getPredictionCol
()¶ Gets the value of predictionCol or its default value.
-
getProbabilityCol
()¶ Gets the value of probabilityCol or its default value.
-
getRawPredictionCol
()¶ Gets the value of rawPredictionCol or its default value.
-
getRegParam
()¶ Gets the value of regParam or its default value.
-
getStandardization
()¶ Gets the value of standardization or its default value.
-
getThreshold
()[source]¶ Get threshold for binary classification.
If thresholds is set with length 2 (i.e., binary classification), this returns the equivalent threshold: \(\frac{1}{1 + \frac{thresholds(0)}{thresholds(1)}}\). Otherwise, returns threshold if set or its default value if unset.
New in version 1.4.0.
-
getThresholds
()[source]¶ If thresholds is set, return its value. Otherwise, if threshold is set, return the equivalent thresholds for binary classification: (1-threshold, threshold). If neither is set, throw an error.
New in version 1.5.0.
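A small sketch of the equivalence described above (assuming an active SparkSession/SparkContext, since constructing the estimator touches the JVM; exact floating-point output may differ slightly):

from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression()

lr.setThreshold(0.8)
print(lr.getThresholds())   # roughly [0.2, 0.8], i.e. [1 - p, p]

lr.setThresholds([0.2, 0.8])
print(lr.getThreshold())    # 0.8, i.e. 1 / (1 + t0 / t1)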
-
getTol
()¶ Gets the value of tol or its default value.
-
getWeightCol
()¶ Gets the value of weightCol or its default value.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
labelCol
= Param(parent='undefined', name='labelCol', doc='label column name.')¶
-
load
(path)¶ Reads an ML instance from the input path, a shortcut of read().load(path).
-
maxIter
= Param(parent='undefined', name='maxIter', doc='max number of iterations (>= 0).')¶
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param
.
-
predictionCol
= Param(parent='undefined', name='predictionCol', doc='prediction column name.')¶
-
probabilityCol
= Param(parent='undefined', name='probabilityCol', doc='Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities.')¶
-
rawPredictionCol
= Param(parent='undefined', name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column name.')¶
-
read
()¶ Returns an MLReader instance for this class.
-
regParam
= Param(parent='undefined', name='regParam', doc='regularization parameter (>= 0).')¶
-
save
(path)¶ Save this ML instance to the given path, a shortcut of write().save(path).
-
setAggregationDepth
(value)¶ Sets the value of
aggregationDepth
.
-
setElasticNetParam
(value)¶ Sets the value of
elasticNetParam
.
-
setFeaturesCol
(value)¶ Sets the value of
featuresCol
.
-
setFitIntercept
(value)¶ Sets the value of
fitIntercept
.
-
setParams
(self, featuresCol="features", labelCol="label", predictionCol="prediction", maxIter=100, regParam=0.0, elasticNetParam=0.0, tol=1e-6, fitIntercept=True, threshold=0.5, thresholds=None, probabilityCol="probability", rawPredictionCol="rawPrediction", standardization=True, weightCol=None, aggregationDepth=2, family="auto")[source]¶ Sets params for logistic regression. If the threshold and thresholds Params are both set, they must be equivalent.
New in version 1.3.0.
-
setPredictionCol
(value)¶ Sets the value of
predictionCol
.
-
setProbabilityCol
(value)¶ Sets the value of
probabilityCol
.
-
setRawPredictionCol
(value)¶ Sets the value of
rawPredictionCol
.
-
setStandardization
(value)¶ Sets the value of
standardization
.
-
setThreshold
(value)[source]¶ Sets the value of threshold. Clears value of thresholds if it has been set.
New in version 1.4.0.
-
setThresholds
(value)[source]¶ Sets the value of thresholds. Clears value of threshold if it has been set.
New in version 1.5.0.
-
standardization
= Param(parent='undefined', name='standardization', doc='whether to standardize the training features before fitting the model.')¶
-
threshold
= Param(parent='undefined', name='threshold', doc='Threshold in binary classification prediction, in range [0, 1]. If threshold and thresholds are both set, they must match, e.g. if threshold is p, then thresholds must be equal to [1-p, p].')¶
-
thresholds
= Param(parent='undefined', name='thresholds', doc="Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0, excepting that at most one value may be 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold.")¶
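A worked illustration of the p/t rule in plain Python (no Spark required; the probabilities and thresholds below are made up for illustration):

probabilities = [0.30, 0.70]   # hypothetical class-conditional probabilities
thresholds = [0.50, 0.90]      # hypothetical per-class thresholds

ratios = [p / t for p, t in zip(probabilities, thresholds)]
# ratios == [0.6, 0.777...]; the class with the largest p/t wins,
# so class 1 is predicted despite its higher threshold.
prediction = max(range(len(ratios)), key=lambda i: ratios[i])
print(prediction)  # 1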
-
tol
= Param(parent='undefined', name='tol', doc='the convergence tolerance for iterative algorithms (>= 0).')¶
-
weightCol
= Param(parent='undefined', name='weightCol', doc='weight column name. If this is not set or empty, we treat all instance weights as 1.0.')¶
-
write
()¶ Returns an MLWriter instance for this ML instance.
-
-
class
pyspark.ml.classification.
LogisticRegressionModel
(java_model=None)[source]¶ Model fitted by LogisticRegression.
New in version 1.3.0.
-
coefficientMatrix
¶ Model coefficients.
New in version 2.1.0.
-
coefficients
¶ Model coefficients of binomial logistic regression. An exception is thrown in the case of multinomial logistic regression.
New in version 2.0.0.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: extra – Extra parameters to copy to the new instance Returns: Copy of this instance
-
evaluate
(dataset)[source]¶ Evaluates the model on a test dataset.
Parameters: dataset – Test dataset to evaluate model on, where dataset is an instance of pyspark.sql.DataFrame
New in version 2.0.0.
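A minimal sketch (reusing blorModel and bdf from the class example above; any DataFrame with compatible label and features columns would serve as a test set):

test_summary = blorModel.evaluate(bdf)
print(test_summary.areaUnderROC)   # binary summaries expose ROC metrics
test_summary.predictions.show(5)   # per-row predictions backing the summary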
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optional default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: extra – extra param values Returns: merged param map
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
hasSummary
¶ Indicates whether a training summary exists for this model instance.
New in version 2.0.0.
-
intercept
¶ Model intercept of binomial logistic regression. An exception is thrown in the case of multinomial logistic regression.
New in version 1.4.0.
-
interceptVector
¶ Model intercept.
New in version 2.1.0.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
-
load
(path)¶ Reads an ML instance from the input path, a shortcut of read().load(path).
-
numClasses
¶ Number of classes (values which the label can take).
New in version 2.1.0.
-
numFeatures
¶ Returns the number of features the model was trained on. If unknown, returns -1.
New in version 2.1.0.
-
params
¶ Returns all params ordered by name. The default implementation uses
dir()
to get all attributes of type Param
.
-
read
()¶ Returns an MLReader instance for this class.
-
save
(path)¶ Save this ML instance to the given path, a shortcut of write().save(path).
-
summary
¶ Gets summary (e.g. accuracy/precision/recall, objective history, total iterations) of model trained on the training set. An exception is thrown if trainingSummary is None.
New in version 2.0.0.
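A small sketch of guarding the access with hasSummary (reusing blorModel from the class example above; models loaded from disk carry no training summary):

if blorModel.hasSummary:
    s = blorModel.summary
    print(s.totalIterations)    # training iterations until termination
    print(s.objectiveHistory)   # objective value at each iteration
else:
    print("no training summary available")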
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
Parameters: - dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
- params – an optional param map that overrides embedded params.
Returns: transformed dataset
New in version 1.3.0.
- dataset – input dataset, which is an instance of
-
write
()¶ Returns an MLWriter instance for this ML instance.
-
-
class
pyspark.ml.classification.
LogisticRegressionSummary
(java_obj=None)[source]¶ Note
Experimental
Abstraction for Logistic Regression Results for a given model.
New in version 2.0.0.
-
featuresCol
¶ Field in “predictions” which gives the features of each instance as a vector.
New in version 2.0.0.
-
labelCol
¶ Field in “predictions” which gives the true label of each instance.
New in version 2.0.0.
-
predictions
¶ Dataframe outputted by the model’s transform method.
New in version 2.0.0.
-
probabilityCol
¶ Field in “predictions” which gives the probability of each class as a vector.
New in version 2.0.0.
-
-
class
pyspark.ml.classification.
LogisticRegressionTrainingSummary
(java_obj=None)[source]¶ Note
Experimental
Abstraction for multinomial Logistic Regression Training results. Currently, the training summary ignores the training weights except for the objective trace.
New in version 2.0.0.
-
featuresCol
¶ Field in “predictions” which gives the features of each instance as a vector.
New in version 2.0.0.
-
labelCol
¶ Field in “predictions” which gives the true label of each instance.
New in version 2.0.0.
-
objectiveHistory
¶ Objective function (scaled loss + regularization) at each iteration.
New in version 2.0.0.
-
predictions
¶ Dataframe outputted by the model’s transform method.
New in version 2.0.0.
-
probabilityCol
¶ Field in “predictions” which gives the probability of each class as a vector.
New in version 2.0.0.
-
totalIterations
¶ Number of training iterations until termination.
New in version 2.0.0.
-
-
class
pyspark.ml.classification.
BinaryLogisticRegressionSummary
(java_obj=None)[source]¶ Note
Experimental
Binary Logistic regression results for a given model.
New in version 2.0.0.
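A minimal sketch of reading these metrics from a binomial model's training summary (reusing blorModel from the LogisticRegression example above, whose summary is a BinaryLogisticRegressionTrainingSummary and therefore exposes the same fields):

summary = blorModel.summary
print(summary.areaUnderROC)          # scalar area under the ROC curve
summary.roc.show(5)                  # DataFrame of (FPR, TPR)
summary.pr.show(5)                   # DataFrame of (recall, precision)
summary.fMeasureByThreshold.show(5)  # DataFrame of (threshold, F-Measure)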
-
areaUnderROC
¶ Computes the area under the receiver operating characteristic (ROC) curve.
Note
This ignores instance weights (setting all to 1.0) from LogisticRegression.weightCol. This will change in later Spark versions.
New in version 2.0.0.
-
fMeasureByThreshold
¶ Returns the (threshold, F-Measure) curve with beta = 1.0, as a dataframe with two fields.
Note
This ignores instance weights (setting all to 1.0) from LogisticRegression.weightCol. This will change in later Spark versions.
New in version 2.0.0.
-
featuresCol
¶ Field in “predictions” which gives the features of each instance as a vector.
New in version 2.0.0.
-
labelCol
¶ Field in “predictions” which gives the true label of each instance.
New in version 2.0.0.
-
pr
¶ Returns the precision-recall curve, which is a Dataframe containing two fields, recall and precision, with (0.0, 1.0) prepended to it.
Note
This ignores instance weights (setting all to 1.0) from LogisticRegression.weightCol. This will change in later Spark versions.
New in version 2.0.0.
-
precisionByThreshold
¶ Returns the (threshold, precision) curve as a dataframe with two fields. Every possible probability obtained in transforming the dataset is used as a threshold when calculating the precision.
Note
This ignores instance weights (setting all to 1.0) from LogisticRegression.weightCol. This will change in later Spark versions.
New in version 2.0.0.
-
predictions
¶ Dataframe outputted by the model’s transform method.
New in version 2.0.0.
-
probabilityCol
¶ Field in “predictions” which gives the probability of each class as a vector.
New in version 2.0.0.
-
recallByThreshold
¶ Returns the (threshold, recall) curve as a dataframe with two fields. Every possible probability obtained in transforming the dataset is used as a threshold when calculating the recall.
Note
This ignores instance weights (setting all to 1.0) from LogisticRegression.weightCol. This will change in later Spark versions.
New in version 2.0.0.
-
roc
¶ Returns the receiver operating characteristic (ROC) curve, which is a Dataframe having two fields (FPR, TPR) with (0.0, 0.0) prepended and (1.0, 1.0) appended to it.
Note
This ignores instance weights (setting all to 1.0) from LogisticRegression.weightCol. This will change in later Spark versions.
New in version 2.0.0.
-
-
class
pyspark.ml.classification.
BinaryLogisticRegressionTrainingSummary
(java_obj=None)[source]¶ Note
Experimental
Binary Logistic regression training results for a given model.
New in version 2.0.0.
-
areaUnderROC
¶ Computes the area under the receiver operating characteristic (ROC) curve.
Note
This ignores instance weights (setting all to 1.0) from LogisticRegression.weightCol. This will change in later Spark versions.
New in version 2.0.0.
-
fMeasureByThreshold
¶ Returns the (threshold, F-Measure) curve with beta = 1.0, as a dataframe with two fields.
Note
This ignores instance weights (setting all to 1.0) from LogisticRegression.weightCol. This will change in later Spark versions.
New in version 2.0.0.
-
featuresCol
¶ Field in “predictions” which gives the features of each instance as a vector.
New in version 2.0.0.
-
labelCol
¶ Field in “predictions” which gives the true label of each instance.
New in version 2.0.0.
-
objectiveHistory
¶ Objective function (scaled loss + regularization) at each iteration.
New in version 2.0.0.
-
pr
¶ Returns the precision-recall curve, which is a Dataframe containing two fields, recall and precision, with (0.0, 1.0) prepended to it.
Note
This ignores instance weights (setting all to 1.0) from LogisticRegression.weightCol. This will change in later Spark versions.
New in version 2.0.0.
-
precisionByThreshold
¶ Returns the (threshold, precision) curve as a dataframe with two fields. Every possible probability obtained in transforming the dataset is used as a threshold when calculating the precision.
Note
This ignores instance weights (setting all to 1.0) from LogisticRegression.weightCol. This will change in later Spark versions.
New in version 2.0.0.
-
predictions
¶ Dataframe outputted by the model’s transform method.
New in version 2.0.0.
-
probabilityCol
¶ Field in “predictions” which gives the probability of each class as a vector.
New in version 2.0.0.
-
recallByThreshold
¶ Returns the (threshold, recall) curve as a dataframe with two fields. Every possible probability obtained in transforming the dataset is used as a threshold when calculating the recall.
Note
This ignores instance weights (setting all to 1.0) from LogisticRegression.weightCol. This will change in later Spark versions.
New in version 2.0.0.
-
roc
¶ Returns the receiver operating characteristic (ROC) curve, which is a Dataframe having two fields (FPR, TPR) with (0.0, 0.0) prepended and (1.0, 1.0) appended to it.
Note
This ignores instance weights (setting all to 1.0) from LogisticRegression.weightCol. This will change in later Spark versions.
New in version 2.0.0.
-
totalIterations
¶ Number of training iterations until termination.
New in version 2.0.0.
-
-
class
pyspark.ml.classification.
DecisionTreeClassifier
(featuresCol='features', labelCol='label', predictionCol='prediction', probabilityCol='probability', rawPredictionCol='rawPrediction', maxDepth=5, maxBins=32, minInstancesPerNode=1, minInfoGain=0.0, maxMemoryInMB=256, cacheNodeIds=False, checkpointInterval=10, impurity='gini', seed=None)[source]¶ Decision tree learning algorithm for classification. It supports both binary and multiclass labels, as well as both continuous and categorical features.
>>> from pyspark.ml.linalg import Vectors
>>> from pyspark.ml.feature import StringIndexer
>>> df = spark.createDataFrame([
...     (1.0, Vectors.dense(1.0)),
...     (0.0, Vectors.sparse(1, [], []))], ["label", "features"])
>>> stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
>>> si_model = stringIndexer.fit(df)
>>> td = si_model.transform(df)
>>> dt = DecisionTreeClassifier(maxDepth=2, labelCol="indexed")
>>> model = dt.fit(td)
>>> model.numNodes
3
>>> model.depth
1
>>> model.featureImportances
SparseVector(1, {0: 1.0})
>>> model.numFeatures
1
>>> model.numClasses
2
>>> print(model.toDebugString)
DecisionTreeClassificationModel (uid=...) of depth 1 with 3 nodes...
>>> test0 = spark.createDataFrame([(Vectors.dense(-1.0),)], ["features"])
>>> result = model.transform(test0).head()
>>> result.prediction
0.0
>>> result.probability
DenseVector([1.0, 0.0])
>>> result.rawPrediction
DenseVector([1.0, 0.0])
>>> test1 = spark.createDataFrame([(Vectors.sparse(1, [0], [1.0]),)], ["features"])
>>> model.transform(test1).head().prediction
1.0
>>> dtc_path = temp_path + "/dtc"
>>> dt.save(dtc_path)
>>> dt2 = DecisionTreeClassifier.load(dtc_path)
>>> dt2.getMaxDepth()
2
>>> model_path = temp_path + "/dtc_model"
>>> model.save(model_path)
>>> model2 =