public class LeastSquaresAggregator
extends Object
implements scala.Serializable
Two LeastSquaresAggregator instances can be merged together to produce a summary of the loss and gradient of the corresponding joint dataset.
To improve the convergence rate during optimization, and to prevent features with very large variances from exerting an overly large influence during model training, packages like R's GLMNET scale the features to unit variance and remove the mean to reduce the condition number, then train the model in the scaled space but return the coefficients in the original scale. See page 9 in http://cran.r-project.org/web/packages/glmnet/glmnet.pdf
However, we don't want to apply the StandardScaler to the training dataset and then cache the standardized dataset, since that would create a lot of overhead. As a result, we perform the scaling implicitly when we compute the objective function. The following is the mathematical derivation.
Note that we don't handle the intercept by adding a bias term here, because the intercept can be computed in closed form after the coefficients have converged. See this discussion for details: http://stats.stackexchange.com/questions/13617/how-is-the-intercept-computed-in-glmnet
When training with the intercept enabled, the objective function in the scaled space is given by
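The closed-form step can be illustrated with a minimal sketch. It assumes the scaled-space model $(y - \bar{y}) / \hat{y} = \sum_i w_i (x_i - \bar{x_i}) / \hat{x_i}$, which rearranges to original-scale coefficients $w_i \hat{y} / \hat{x_i}$ and an intercept of $\bar{y}$ minus their dot product with the feature means. The class and method names below are hypothetical and are not part of this class or of the Spark API.

```java
// Hypothetical helper, not part of the Spark API: maps coefficients learned in
// the scaled space back to the original scale and derives the intercept.
final class InterceptSketch {
    /** Original-scale coefficient: beta_i = w_i * labelStd / featuresStd_i. */
    static double[] toOriginalScale(double[] scaledCoef, double labelStd, double[] featuresStd) {
        double[] beta = new double[scaledCoef.length];
        for (int i = 0; i < scaledCoef.length; i++) {
            // Assumption of this sketch: a constant feature (std == 0) gets coefficient 0.
            beta[i] = featuresStd[i] != 0.0 ? scaledCoef[i] * labelStd / featuresStd[i] : 0.0;
        }
        return beta;
    }

    /** Closed-form intercept: labelMean - sum_i beta_i * featuresMean_i. */
    static double intercept(double[] beta, double labelMean, double[] featuresMean) {
        double dot = 0.0;
        for (int i = 0; i < beta.length; i++) {
            dot += beta[i] * featuresMean[i];
        }
        return labelMean - dot;
    }
}
```

For example, a scaled coefficient of 0.5 with a label standard deviation of 4 and a feature standard deviation of 2 maps to an original-scale coefficient of 1.0; with a label mean of 10 and a feature mean of 3, the intercept is 7.0.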
$$ L = 1/2n ||\sum_i w_i(x_i - \bar{x_i}) / \hat{x_i} - (y - \bar{y}) / \hat{y}||^2, $$
where $\bar{x_i}$ is the mean of $x_i$, $\hat{x_i}$ is the standard deviation of $x_i$, $\bar{y}$ is the mean of the label, and $\hat{y}$ is the standard deviation of the label.
If fitting the intercept is disabled (that is, the model is forced through 0.0), we can use the same equation, except that we set $\bar{y}$ and $\bar{x_i}$ to 0 instead of the respective means.
This can be rewritten as
$$ \begin{align} L &= 1/2n ||\sum_i (w_i/\hat{x_i})x_i - \sum_i (w_i/\hat{x_i})\bar{x_i} - y / \hat{y} + \bar{y} / \hat{y}||^2 \\ &= 1/2n ||\sum_i w_i^\prime x_i - y / \hat{y} + offset||^2 = 1/2n diff^2 \end{align} $$
where $w_i^\prime$ are the effective coefficients defined by $w_i/\hat{x_i}$, offset is
$$ - \sum_i (w_i/\hat{x_i})\bar{x_i} + \bar{y} / \hat{y}. $$
and diff is
$$ \sum_i w_i^\prime x_i - y / \hat{y} + offset $$
Note that the effective coefficients and offset depend only on the coefficients and on the means and standard deviations, not on individual training instances, so they can be precomputed.
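As a rough illustration of this precomputation, the sketch below derives the effective coefficients and the offset once, then evaluates diff for a single instance. The class and method names are hypothetical, features are held in plain arrays, and a zero standard deviation is assumed to map to a zero effective coefficient; none of this is taken from the actual implementation.

```java
// Hypothetical sketch, not part of the Spark API.
final class EffectiveCoefficientsSketch {
    /** w'_i = w_i / featuresStd_i (assumed 0 when the feature is constant). */
    static double[] effectiveCoefficients(double[] w, double[] featuresStd) {
        double[] eff = new double[w.length];
        for (int i = 0; i < w.length; i++) {
            eff[i] = featuresStd[i] != 0.0 ? w[i] / featuresStd[i] : 0.0;
        }
        return eff;
    }

    /** offset = labelMean / labelStd - sum_i w'_i * featuresMean_i, or 0 without intercept. */
    static double offset(double[] eff, double[] featuresMean, double labelMean,
                         double labelStd, boolean fitIntercept) {
        if (!fitIntercept) {
            return 0.0; // the means are treated as 0, so every term vanishes
        }
        double sum = 0.0;
        for (int i = 0; i < eff.length; i++) {
            sum += eff[i] * featuresMean[i];
        }
        return labelMean / labelStd - sum;
    }

    /** diff = sum_i w'_i * x_i - y / labelStd + offset, using the raw (unscaled) features. */
    static double diff(double[] eff, double offset, double[] x, double y, double labelStd) {
        double dot = 0.0;
        for (int i = 0; i < eff.length; i++) {
            dot += eff[i] * x[i];
        }
        return dot - y / labelStd + offset;
    }
}
```

With $w = (2, -1)$, feature standard deviations $(2, 4)$, feature means $(1, 8)$, label mean 3 and label standard deviation 2, the effective coefficients are $(1, -0.25)$ and the offset is 2.5. For the instance $x = (3, 4)$, $y = 5$ this gives a diff of 2.0, the same value obtained by standardizing the instance explicitly.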
Now, the first derivative of the objective function in scaled space is
$$ \frac{\partial L}{\partial w_i} = diff/N (x_i - \bar{x_i}) / \hat{x_i} $$
However, the $(x_i - \bar{x_i})$ term densifies the computation, so this formula is not ideal when the training dataset is in sparse format.
This can be addressed by keeping the sum of diff and adding the dense $\bar{x_i} / \hat{x_i}$ terms at the end. The first derivative of the total objective function over all the samples is
$$ \begin{align} \frac{\partial L}{\partial w_i} &= 1/N \sum_j diff_j (x_{ij} - \bar{x_i}) / \hat{x_i} \\ &= 1/N ((\sum_j diff_j x_{ij} / \hat{x_i}) - diffSum \bar{x_i} / \hat{x_i}) \\ &= 1/N ((\sum_j diff_j x_{ij} / \hat{x_i}) + correction_i) \end{align} $$
where $correction_i = - diffSum \bar{x_i} / \hat{x_i}$
Simple algebra shows that diffSum is actually zero, so we don't even need to add the correction terms at the end. From the definition of diff,
$$ \begin{align} diffSum &= \sum_j (\sum_i w_i(x_{ij} - \bar{x_i}) / \hat{x_i} - (y_j - \bar{y}) / \hat{y}) \\ &= N * (\sum_i w_i(\bar{x_i} - \bar{x_i}) / \hat{x_i} - (\bar{y} - \bar{y}) / \hat{y}) \\ &= 0 \end{align} $$
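The cancellation can also be checked numerically. The sketch below is only an illustration under stated assumptions: the names are hypothetical, the data is held in dense arrays, the standard deviations use the sample (n - 1) denominator, and every feature and the label are assumed to have a nonzero standard deviation. It computes the true means from the data, evaluates diff for each sample in the scaled space, and sums the results.

```java
// Hypothetical numeric check, not part of the Spark API.
final class DiffSumSketch {
    static double mean(double[] v) {
        double s = 0.0;
        for (double x : v) s += x;
        return s / v.length;
    }

    /** Sample standard deviation (n - 1 denominator); an assumption of this sketch. */
    static double std(double[] v, double mean) {
        double s = 0.0;
        for (double x : v) s += (x - mean) * (x - mean);
        return Math.sqrt(s / (v.length - 1));
    }

    static double[] column(double[][] rows, int i) {
        double[] col = new double[rows.length];
        for (int j = 0; j < rows.length; j++) col[j] = rows[j][i];
        return col;
    }

    /** Sum over all samples j of diff_j, with diff_j computed in the scaled space. */
    static double diffSum(double[][] xs, double[] ys, double[] w) {
        int numFeatures = w.length;
        double[] xMean = new double[numFeatures];
        double[] xStd = new double[numFeatures];
        for (int i = 0; i < numFeatures; i++) {
            double[] col = column(xs, i);
            xMean[i] = mean(col);
            xStd[i] = std(col, xMean[i]); // assumed nonzero
        }
        double yMean = mean(ys);
        double yStd = std(ys, yMean); // assumed nonzero
        double sum = 0.0;
        for (int j = 0; j < xs.length; j++) {
            double diff = -(ys[j] - yMean) / yStd;
            for (int i = 0; i < numFeatures; i++) {
                diff += w[i] * (xs[j][i] - xMean[i]) / xStd[i];
            }
            sum += diff;
        }
        return sum;
    }
}
```

The result is zero up to floating-point rounding for any coefficient vector, because each centered column sums to zero regardless of how it is scaled.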
As a result, the first derivative of the total objective function depends only on the training dataset, which can be easily computed in a distributed fashion and is sparse-format friendly.
$$ \frac{\partial L}{\partial w_i} = 1/N \sum_j diff_j x_{ij} / \hat{x_i} $$
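To make the sparse-friendliness concrete, here is a minimal sketch of the accumulation. It is not the actual implementation: the class name, the `(indices, values)` representation of a sparse instance, and the unweighted loss are assumptions made for illustration. Only the nonzero entries of each instance are touched, and the mean is never subtracted.

```java
// Hypothetical sketch, not part of the Spark API: accumulates
// sum_j diff_j * x_ij / featuresStd_i over sparse instances.
final class SparseGradientSketch {
    private final double[] effCoef;     // w_i / featuresStd_i, precomputed
    private final double offset;        // precomputed offset
    private final double[] featuresStd;
    private final double labelStd;
    private final double[] gradSum;
    private double lossSum = 0.0;
    private long count = 0L;

    SparseGradientSketch(double[] effCoef, double offset, double[] featuresStd, double labelStd) {
        this.effCoef = effCoef;
        this.offset = offset;
        this.featuresStd = featuresStd;
        this.labelStd = labelStd;
        this.gradSum = new double[effCoef.length];
    }

    /** Adds one instance given by its nonzero feature indices and values. */
    void add(int[] indices, double[] values, double label) {
        double dot = 0.0;
        for (int k = 0; k < indices.length; k++) {
            dot += effCoef[indices[k]] * values[k];
        }
        double diff = dot - label / labelStd + offset;
        for (int k = 0; k < indices.length; k++) {
            int i = indices[k];
            if (featuresStd[i] != 0.0) {
                gradSum[i] += diff * values[k] / featuresStd[i];
            }
        }
        lossSum += 0.5 * diff * diff;
        count++;
    }

    /** Gradient 1/N * sum_j diff_j * x_ij / featuresStd_i. */
    double[] gradient() {
        double[] g = new double[gradSum.length];
        for (int i = 0; i < g.length; i++) g[i] = gradSum[i] / count;
        return g;
    }

    /** Loss 1/(2N) * sum_j diff_j^2. */
    double loss() {
        return lossSum / count;
    }
}
```

As a worked check, take the three instances $(1, 0)$, $(0, 2)$, $(2, 4)$ with labels 1, 3, 5 and coefficients $w = (1, 1)$. The feature means are $(1, 2)$, the sample standard deviations are $(1, 2)$, and the label has mean 3 and standard deviation 2, so the effective coefficients are $(1, 0.5)$ and the offset is $-0.5$. The diffs are 0, -1 and 1, giving a gradient of $(2/3, 1/3)$ and a loss of $1/3$, which matches the centered formula.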
param: bcCoefficients The broadcast coefficients corresponding to the features.
param: labelStd The standard deviation value of the label.
param: labelMean The mean value of the label.
param: fitIntercept Whether to fit an intercept term.
param: bcFeaturesStd The broadcast standard deviation values of the features.
param: bcFeaturesMean The broadcast mean values of the features.
Constructor and Description |
---|
LeastSquaresAggregator(Broadcast<Vector> bcCoefficients, double labelStd, double labelMean, boolean fitIntercept, Broadcast<double[]> bcFeaturesStd, Broadcast<double[]> bcFeaturesMean) |
Modifier and Type | Method and Description |
---|---|
LeastSquaresAggregator | add(org.apache.spark.ml.feature.Instance instance) Add a new training instance to this LeastSquaresAggregator, and update the loss and gradient of the objective function. |
long | count() |
Vector | gradient() |
double | loss() |
LeastSquaresAggregator | merge(LeastSquaresAggregator other) Merge another LeastSquaresAggregator, and update the loss and gradient of the objective function. |
public LeastSquaresAggregator add(org.apache.spark.ml.feature.Instance instance)
instance - The instance of data point to be added.

public LeastSquaresAggregator merge(LeastSquaresAggregator other)
(this object will be modified.)
other - The other LeastSquaresAggregator to be merged.

public long count()

public double loss()

public Vector gradient()
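
The add and merge contract described above can be sketched with a simplified stand-in. The class below is hypothetical and is not the Spark class: it takes dense arrays instead of `Instance` objects, ignores instance weights, and expects the effective coefficients and offset to be precomputed. It only illustrates that merging per-partition aggregators in place yields the same count, loss, and gradient as a single pass over the joint dataset.

```java
// Hypothetical stand-in for illustration only; not the Spark class.
final class MergeableAggregatorSketch {
    private final double[] effCoef;     // w_i / featuresStd_i, precomputed
    private final double offset;        // precomputed offset
    private final double[] featuresStd;
    private final double labelStd;
    private final double[] gradSum;
    private double lossSum = 0.0;
    private long count = 0L;

    MergeableAggregatorSketch(double[] effCoef, double offset, double[] featuresStd, double labelStd) {
        this.effCoef = effCoef;
        this.offset = offset;
        this.featuresStd = featuresStd;
        this.labelStd = labelStd;
        this.gradSum = new double[effCoef.length];
    }

    /** Adds one (dense) instance and returns this aggregator for chaining. */
    MergeableAggregatorSketch add(double[] features, double label) {
        double diff = offset - label / labelStd;
        for (int i = 0; i < features.length; i++) {
            diff += effCoef[i] * features[i];
        }
        for (int i = 0; i < features.length; i++) {
            if (features[i] != 0.0 && featuresStd[i] != 0.0) {
                gradSum[i] += diff * features[i] / featuresStd[i];
            }
        }
        lossSum += 0.5 * diff * diff;
        count++;
        return this;
    }

    /** Merges other into this aggregator in place; this object is modified. */
    MergeableAggregatorSketch merge(MergeableAggregatorSketch other) {
        if (other.count != 0) {
            for (int i = 0; i < gradSum.length; i++) {
                gradSum[i] += other.gradSum[i];
            }
            lossSum += other.lossSum;
            count += other.count;
        }
        return this;
    }

    long count() {
        return count;
    }

    double loss() {
        return lossSum / count;
    }

    double[] gradient() {
        double[] g = new double[gradSum.length];
        for (int i = 0; i < g.length; i++) g[i] = gradSum[i] / count;
        return g;
    }
}
```

Because the aggregator only keeps running sums and a count, the merge is associative and commutative, which is what makes a distributed reduction over partitions possible. In this sketch, calling `loss()` or `gradient()` before any instance has been added divides by zero and returns NaN.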