LogisticGradient (Spark 2.1.0 JavaDoc)

Object
- org.apache.spark.mllib.optimization.Gradient
- - org.apache.spark.mllib.optimization.LogisticGradient

All Implemented Interfaces:

java.io.Serializable
```
public class LogisticGradient
extends Gradient
```
:: DeveloperApi :: Compute gradient and loss for a multinomial logistic loss function, as used in multi-class classification (it is also used in binary logistic regression).
In The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition by Trevor Hastie, Robert Tibshirani, and Jerome Friedman, which can be downloaded from http://statweb.stanford.edu/~tibs/ElemStatLearn/ , Eq. (4.17) on page 119 gives the formula of multinomial logistic regression model. A simple calculation shows that

$$ P(y=0|x, w) = 1 / (1 + \sum_i^{K-1} \exp(x w_i))\\ P(y=1|x, w) = exp(x w_1) / (1 + \sum_i^{K-1} \exp(x w_i))\\ ...\\ P(y=K-1|x, w) = exp(x w_{K-1}) / (1 + \sum_i^{K-1} \exp(x w_i))\\ $$

for K classes multiclass classification problem.
The model weights $w = (w_1, w_2, ..., w_{K-1})^T$ becomes a matrix which has dimension of (K-1) * (N+1) if the intercepts are added. If the intercepts are not added, the dimension will be (K-1) * N.
As a result, the loss of objective function for a single instance of data can be written as
$$ \begin{align} l(w, x) &= -log P(y|x, w) = -\alpha(y) log P(y=0|x, w) - (1-\alpha(y)) log P(y|x, w) \\ &= log(1 + \sum_i^{K-1}\exp(x w_i)) - (1-\alpha(y)) x w_{y-1} \\ &= log(1 + \sum_i^{K-1}\exp(margins_i)) - (1-\alpha(y)) margins_{y-1} \end{align} $$

where $\alpha(i) = 1$ if $i \ne 0$, and $\alpha(i) = 0$ if $i == 0$, $margins_i = x w_i$.
For optimization, we have to calculate the first derivative of the loss function, and a simple calculation shows that

$$ \begin{align} \frac{\partial l(w, x)}{\partial w_{ij}} &= (\exp(x w_i) / (1 + \sum_k^{K-1} \exp(x w_k)) - (1-\alpha(y)\delta_{y, i+1})) * x_j \\ &= multiplier_i * x_j \end{align} $$

where $\delta_{i, j} = 1$ if $i == j$, $\delta_{i, j} = 0$ if $i != j$, and multiplier = $\exp(margins_i) / (1 + \sum_k^{K-1} \exp(margins_i)) - (1-\alpha(y)\delta_{y, i+1})$
If any of margins is larger than 709.78, the numerical computation of multiplier and loss function will be suffered from arithmetic overflow. This issue occurs when there are outliers in data which are far away from hyperplane, and this will cause the failing of training once infinity / infinity is introduced. Note that this is only a concern when max(margins) > 0.
Fortunately, when max(margins) = maxMargin > 0, the loss function and the multiplier can be easily rewritten into the following equivalent numerically stable formula.

$$ \begin{align} l(w, x) &= log(1 + \sum_i^{K-1}\exp(margins_i)) - (1-\alpha(y)) margins_{y-1} \\ &= log(\exp(-maxMargin) + \sum_i^{K-1}\exp(margins_i - maxMargin)) + maxMargin - (1-\alpha(y)) margins_{y-1} \\ &= log(1 + sum) + maxMargin - (1-\alpha(y)) margins_{y-1} \end{align} $$

where sum = $\exp(-maxMargin) + \sum_i^{K-1}\exp(margins_i - maxMargin) - 1$.
Note that each term, $(margins_i - maxMargin)$ in $\exp$ is smaller than zero; as a result, overflow will not happen with this formula.
For multiplier, similar trick can be applied as the following,

$$ \begin{align} multiplier &= \exp(margins_i) / (1 + \sum_k^{K-1} \exp(margins_i)) - (1-\alpha(y)\delta_{y, i+1}) \\ &= \exp(margins_i - maxMargin) / (1 + sum) - (1-\alpha(y)\delta_{y, i+1}) \end{align} $$

where each term in $\exp$ is also smaller than zero, so overflow is not a concern.
For the detailed mathematical derivation, see the reference at http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297
param: numClasses the number of possible outcomes for k classes classification problem in Multinomial Logistic Regression. By default, it is binary logistic regression so numClasses will be set to 2.

See Also:
Serialized Form

Constructor Summary

Constructors
Constructor and Description

LogisticGradient()

LogisticGradient(int numClasses)

Constructors
Constructor and Description
`LogisticGradient()`
`LogisticGradient(int numClasses)`

Method Summary

Methods
Modifier and Type	Method and Description
`double`	`compute(Vector data, double label, Vector weights, Vector cumGradient)` Compute the gradient and loss given the features of a single data point, add the gradient to a provided vector to avoid creating new objects, and return loss.

Methods inherited from class org.apache.spark.mllib.optimization.Gradient
compute

Methods inherited from class Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Constructor Detail
  - LogisticGradient
```
public LogisticGradient(int numClasses)
```
  - LogisticGradient
```
public LogisticGradient()
```
- Method Detail
  - compute
```
public double compute(Vector data,
             double label,
             Vector weights,
             Vector cumGradient)
```
    Description copied from class: Gradient
    
    Compute the gradient and loss given the features of a single data point, add the gradient to a provided vector to avoid creating new objects, and return loss.
    
    Specified by:
    
    compute in class Gradient
    
    Parameters:
    data - features for one data point
    label - label for this data point
    weights - weights/coefficients corresponding to features
    cumGradient - the computed gradient will be added to this vector
    
    Returns:
    loss

Class LogisticGradient

Constructor Summary

Method Summary

Methods inherited from class org.apache.spark.mllib.optimization.Gradient

Methods inherited from class Object

Constructor Detail

LogisticGradient

LogisticGradient

Method Detail

compute