StandardScaler

class pyspark.mllib.feature.StandardScaler(withMean=False, withStd=True)
Standardizes features by removing the mean and scaling to unit variance, using column summary statistics computed on the samples in the training set.
New in version 1.2.0.
 Parameters
 withMean : bool, optional
False by default. If True, centers the data by subtracting the mean before scaling. Centering builds a dense output, so take care when applying it to sparse input.
 withStd : bool, optional
True by default. If True, scales the data to unit standard deviation.
Examples
>>> from pyspark.mllib.feature import StandardScaler
>>> from pyspark.mllib.linalg import Vectors
>>> # `sc` is an existing SparkContext, as in the PySpark shell
>>> vs = [Vectors.dense([-2.0, 2.3, 0]), Vectors.dense([3.8, 0.0, 1.9])]
>>> dataset = sc.parallelize(vs)
>>> standardizer = StandardScaler(True, True)
>>> model = standardizer.fit(dataset)
>>> result = model.transform(dataset)
>>> for r in result.collect(): r
DenseVector([-0.7071, 0.7071, -0.7071])
DenseVector([0.7071, -0.7071, 0.7071])
>>> int(model.std[0])
4
>>> int(model.mean[0]*10)
9
>>> model.withStd
True
>>> model.withMean
True
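To make the effect of the two constructor flags concrete, here is a minimal pure-Python sketch of the arithmetic, not the Spark implementation. The helper names `fit_stats` and `scale_row` are hypothetical and exist only for this illustration. The sketch assumes the corrected sample standard deviation (dividing by n - 1), which is consistent with the 0.7071 values a two-row dataset produces, and it assumes that a column with zero standard deviation maps to 0.0.

```python
from math import sqrt


def fit_stats(rows):
    """Return per-column means and corrected sample standard deviations (n - 1)."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [
        sqrt(sum((x - m) ** 2 for x in c) / (n - 1))
        for c, m in zip(cols, means)
    ]
    return means, stds


def scale_row(row, means, stds, with_mean=False, with_std=True):
    """Apply optional centering and optional scaling to a single row."""
    out = []
    for x, m, s in zip(row, means, stds):
        if with_mean:
            x = x - m  # centering turns most zeros into non-zeros
        if with_std:
            # Assumption for this sketch: a constant column is mapped to 0.0.
            x = x / s if s != 0.0 else 0.0
        out.append(x)
    return out


rows = [[-2.0, 2.3, 0.0], [3.8, 0.0, 1.9]]
means, stds = fit_stats(rows)

# Centering and scaling: every entry becomes +/- 0.7071 for two rows.
centered = [scale_row(r, means, stds, with_mean=True) for r in rows]
print([[round(v, 4) for v in r] for r in centered])

# Scaling only: zero entries stay exactly zero, so sparsity is preserved.
scaled_only = [scale_row(r, means, stds) for r in rows]
print([[round(v, 4) for v in r] for r in scaled_only])
```

Because the statistics are computed once and then passed in, the same `means` and `stds` can be reused to scale rows that were not part of the data they were computed from. The scale-only output also shows why `withMean=False` is the safer choice for sparse vectors: entries that are zero on input remain zero on output.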
Methods
fit(dataset)
Computes the mean and variance and stores them as a model to be used for later scaling.
Methods Documentation

fit(dataset)
Computes the mean and variance of the dataset and stores them as a model to be used for later scaling.
New in version 1.2.0.
 Parameters
 dataset : pyspark.RDD
The data used to compute the mean and variance to build the transformation model.
 Returns