org.apache.spark.ml.regression
Class LeastSquaresAggregator

Object
  extended by org.apache.spark.ml.regression.LeastSquaresAggregator
All Implemented Interfaces:
java.io.Serializable

public class LeastSquaresAggregator
extends Object
implements scala.Serializable

LeastSquaresAggregator computes the gradient and loss for a least-squares loss function, as used in linear regression, for samples in dense or sparse vector format, in an online fashion.

Two LeastSquaresAggregators can be merged together to obtain a summary of the loss and gradient over the corresponding joint dataset.

To improve the convergence rate during the optimization process, and to prevent features with very large variances from exerting an overly large influence during model training, packages like R's GLMNET scale the features to unit variance and remove the mean to reduce the condition number, then train the model in the scaled space but return the weights in the original scale. See page 9 of http://cran.r-project.org/web/packages/glmnet/glmnet.pdf

However, we don't want to apply StandardScaler to the training dataset and then cache the standardized dataset, since that would create a lot of overhead. Instead, we perform the scaling implicitly when we compute the objective function. The mathematical derivation follows.

Note that we don't handle the intercept by adding a bias term here, because the intercept can be computed in closed form after the coefficients have converged. See the following discussion for details: http://stats.stackexchange.com/questions/13617/how-is-the-intercept-computed-in-glmnet
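
As a rough illustration of that closed form, a minimal sketch in plain Scala (closedFormIntercept and its parameter names are hypothetical, and the formula assumes the converged coefficients are already expressed in the original feature scale):

 // Hypothetical helper: with centered features and label, the intercept of the
 // least-squares fit reduces to labelMean - sum_i weights(i) * featuresMean(i).
 def closedFormIntercept(
     weights: Array[Double],
     featuresMean: Array[Double],
     labelMean: Double): Double = {
   val dot = weights.zip(featuresMean).map { case (w, m) => w * m }.sum
   labelMean - dot
 }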

The objective function in the scaled space is given by


 L = 1/2N ||\sum_i w_i(x_i - \bar{x_i}) / \hat{x_i} - (y - \bar{y}) / \hat{y}||^2,
 
where \bar{x_i} is the mean of x_i, \hat{x_i} is the standard deviation of x_i, \bar{y} is the mean of the label, and \hat{y} is the standard deviation of the label.

This can be rewritten as


 L = 1/2N ||\sum_i (w_i/\hat{x_i})x_i - \sum_i (w_i/\hat{x_i})\bar{x_i} - y / \hat{y}
     + \bar{y} / \hat{y}||^2
   = 1/2N ||\sum_i w_i^\prime x_i - y / \hat{y} + offset||^2 = 1/2N diff^2
 
where w_i^\prime are the effective weights defined by w_i / \hat{x_i}, offset is

 - \sum_i (w_i/\hat{x_i})\bar{x_i} + \bar{y} / \hat{y},
 
and diff is

 \sum_i w_i^\prime x_i - y / \hat{y} + offset.
 

Note that the effective weights and offset don't depend on the training dataset, so they can be precomputed.
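
As a concrete sketch of that precomputation (plain Scala over arrays; the names effectiveWeights and offset simply mirror w_i^\prime and offset above and are not part of this class's API, and skipping zero-variance features is an assumption made for illustration):

 // Precompute w_i' = w_i / xStd_i and
 // offset = -sum_i (w_i / xStd_i) * xMean_i + yMean / yStd.
 // Neither quantity depends on the individual training samples.
 def precompute(
     weights: Array[Double],
     featuresStd: Array[Double],
     featuresMean: Array[Double],
     labelStd: Double,
     labelMean: Double): (Array[Double], Double) = {
   val effectiveWeights = weights.zip(featuresStd).map {
     // Assumed convention: features with zero variance are skipped.
     case (w, std) => if (std != 0.0) w / std else 0.0
   }
   val offset = labelMean / labelStd -
     effectiveWeights.zip(featuresMean).map { case (wp, m) => wp * m }.sum
   (effectiveWeights, offset)
 }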

Now, the first derivative of the objective function in the scaled space is


 \frac{\partial L}{\partial w_i} = diff/N (x_i - \bar{x_i}) / \hat{x_i}
 
However, (x_i - \bar{x_i}) will densify the computation, so this is not an ideal formula when the training dataset is in sparse format.

This can be addressed by adding the dense \bar{x_i} / \hat{x_i} terms at the end, provided we keep track of the sum of diff. The first derivative of the total objective function over all the samples is


 \frac{\partial L}{\partial w_i} =
     1/N \sum_j diff_j (x_{ij} - \bar{x_i}) / \hat{x_i}
   = 1/N ((\sum_j diff_j x_{ij} / \hat{x_i}) - diffSum \bar{x_i} / \hat{x_i})
   = 1/N ((\sum_j diff_j x_{ij} / \hat{x_i}) + correction_i)
 
where correction_i = - diffSum \bar{x_i} / \hat{x_i}.

Simple math shows that diffSum is actually zero, so we don't even need to add the correction terms at the end. From the definition of diff,


 diffSum = \sum_j (\sum_i w_i(x_{ij} - \bar{x_i}) / \hat{x_i} - (y_j - \bar{y}) / \hat{y})
         = N * (\sum_i w_i(\bar{x_i} - \bar{x_i}) / \hat{x_i} - (\bar{y} - \bar{y}) / \hat{y})
         = 0
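
A quick numerical sanity check of that identity (plain Scala with randomly generated data; diffSum vanishes, up to floating-point error, whenever the means and standard deviations are computed from the same dataset being aggregated):

 // Verify that sum_j diff_j == 0 when the scaling statistics come from the
 // same dataset.
 val rnd = new scala.util.Random(42)
 val n = 1000
 val x = Array.fill(n, 3)(rnd.nextGaussian())   // n samples, 3 features
 val y = Array.fill(n)(rnd.nextGaussian())
 val w = Array(0.5, -1.2, 2.0)                  // arbitrary weights

 def mean(a: Array[Double]): Double = a.sum / a.length
 def std(a: Array[Double]): Double = {
   val m = mean(a)
   math.sqrt(a.map(v => (v - m) * (v - m)).sum / a.length)
 }

 val xMean = (0 until 3).map(i => mean(x.map(_(i)))).toArray
 val xStd  = (0 until 3).map(i => std(x.map(_(i)))).toArray
 val yMean = mean(y)
 val yStd  = std(y)

 val diffSum = (0 until n).map { j =>
   (0 until 3).map(i => w(i) * (x(j)(i) - xMean(i)) / xStd(i)).sum -
     (y(j) - yMean) / yStd
 }.sum
 println(diffSum)   // ~0 up to floating-point error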
 

As a result, the first derivative of the total objective function only depends on a summation over the training dataset, which can easily be computed in a distributed fashion and is friendly to sparse formats.


 \frac{\partial L}{\partial w_i} = 1/N \sum_j diff_j x_{ij} / \hat{x_i}.
 
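
To make the per-sample bookkeeping concrete, here is a minimal sketch of an online update of this kind (plain Scala over arrays; LeastSquaresSketch, gradientSum and lossSum are illustrative names, not the actual implementation of this class):

 // One online update in the scaled space: compute diff for the sample, then
 // accumulate diff^2 / 2 into the loss and diff * x_ij / xStd_i into the
 // gradient. Dividing by the sample count at the end yields the loss
 // 1/2N sum_j diff_j^2 and the gradient formula above.
 class LeastSquaresSketch(
     effectiveWeights: Array[Double],   // w_i / xStd_i, precomputed
     offset: Double,                    // -sum_i w_i' xMean_i + yMean / yStd
     featuresStd: Array[Double],
     labelStd: Double) {

   private val gradientSum = new Array[Double](effectiveWeights.length)
   private var lossSum = 0.0
   private var numSamples = 0L

   def add(label: Double, features: Array[Double]): this.type = {
     var diff = offset - label / labelStd
     var i = 0
     while (i < features.length) {
       diff += effectiveWeights(i) * features(i)
       i += 1
     }
     lossSum += diff * diff / 2.0
     i = 0
     while (i < features.length) {
       if (featuresStd(i) != 0.0) {
         gradientSum(i) += diff * features(i) / featuresStd(i)
       }
       i += 1
     }
     numSamples += 1
     this
   }

   def loss: Double = lossSum / numSamples
   def gradient: Array[Double] = gradientSum.map(_ / numSamples)
 }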

param: weights The weights/coefficients corresponding to the features.
param: labelStd The standard deviation value of the label.
param: labelMean The mean value of the label.
param: featuresStd The standard deviation values of the features.
param: featuresMean The mean values of the features.

See Also:
Serialized Form

Constructor Summary
LeastSquaresAggregator(Vector weights, double labelStd, double labelMean, double[] featuresStd, double[] featuresMean)
           
 
Method Summary
 LeastSquaresAggregator add(double label, Vector data)
          Add a new training data point to this LeastSquaresAggregator, and update the loss and gradient of the objective function.
 long count()
           
 Vector gradient()
           
 double loss()
           
 LeastSquaresAggregator merge(LeastSquaresAggregator other)
          Merge another LeastSquaresAggregator, and update the loss and gradient of the objective function.
 
Methods inherited from class Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

LeastSquaresAggregator

public LeastSquaresAggregator(Vector weights,
                              double labelStd,
                              double labelMean,
                              double[] featuresStd,
                              double[] featuresMean)
Method Detail

add

public LeastSquaresAggregator add(double label,
                                  Vector data)
Add a new training data point to this LeastSquaresAggregator, and update the loss and gradient of the objective function.

Parameters:
label - The label for this data point.
data - The features for one data point in dense/sparse vector format to be added into this aggregator.
Returns:
This LeastSquaresAggregator object.

merge

public LeastSquaresAggregator merge(LeastSquaresAggregator other)
Merge another LeastSquaresAggregator, and update the loss and gradient of the objective function. (Note that this merges in place; as a result, this object will be modified.)

Parameters:
other - The other LeastSquaresAggregator to be merged.
Returns:
This LeastSquaresAggregator object.

count

public long count()

loss

public double loss()

gradient

public Vector gradient()
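
Finally, a hedged usage sketch of the documented constructor and methods (it assumes the class is accessible as documented here and that Vector comes from org.apache.spark.mllib.linalg, which may differ by Spark version; the data and statistics below are made up for illustration):

 import org.apache.spark.ml.regression.LeastSquaresAggregator
 import org.apache.spark.mllib.linalg.Vectors

 // Scaling statistics of the (hypothetical) training data.
 val featuresStd  = Array(1.5, 2.0)
 val featuresMean = Array(0.3, -0.1)
 val labelStd  = 4.0
 val labelMean = 2.5
 val weights = Vectors.dense(0.8, -0.2)

 // Build one aggregator per partition of the data ...
 val aggA = new LeastSquaresAggregator(weights, labelStd, labelMean, featuresStd, featuresMean)
 aggA.add(3.0, Vectors.dense(1.0, 0.5))
 aggA.add(1.0, Vectors.dense(-0.5, 2.0))

 val aggB = new LeastSquaresAggregator(weights, labelStd, labelMean, featuresStd, featuresMean)
 aggB.add(2.0, Vectors.dense(0.0, 1.0))

 // ... then merge them in place to summarize the joint dataset, as one would
 // do inside a treeAggregate-style reduction.
 val total = aggA.merge(aggB)
 println(s"count = ${total.count}, loss = ${total.loss}")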