package optimization
Type Members

abstract
class
Gradient extends Serializable
Class used to compute the gradient for a loss function, given a single data point.

class
GradientDescent extends Optimizer with Logging
Class used to solve an optimization problem using Gradient Descent.

class
HingeGradient extends Gradient
Compute gradient and loss for a Hinge loss function, as used in SVM binary classification. See also the documentation for the precise formulation.
 Note
This assumes that the labels are {0,1}
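As a rough illustration of the {0,1} label convention in the note, the sketch below rescales a label to {-1,+1} and evaluates the hinge loss max(0, 1 - y' * (w . x)) together with a subgradient for a single data point. It uses plain Scala arrays instead of Spark vectors, and the names `HingeSketch` and `compute` are hypothetical, chosen only for this example; it is not the package's implementation.

```scala
object HingeSketch {
  /** Returns (subgradient, loss) of the hinge loss for one point; label is assumed to be 0.0 or 1.0. */
  def compute(x: Array[Double], label: Double, w: Array[Double]): (Array[Double], Double) = {
    require(x.length == w.length, "feature and weight dimensions must match")
    val dot = x.zip(w).map { case (xi, wi) => xi * wi }.sum
    // Rescale the {0,1} label to {-1,+1} so the usual hinge formula applies.
    val y = 2.0 * label - 1.0
    if (1.0 > y * dot) {
      // Margin violated: loss is 1 - y * (w . x), subgradient is -y * x.
      (x.map(xi => -y * xi), 1.0 - y * dot)
    } else {
      // Margin satisfied: zero loss and zero subgradient.
      (Array.fill(x.length)(0.0), 0.0)
    }
  }
}
```

For x = (1.0, 2.0), w = (0.5, 0.25) and label 0.0, the dot product is 1.0, so the sketch returns a loss of 2.0 and the subgradient (1.0, 2.0); with label 1.0 the margin is met and both are zero.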

class
L1Updater extends Updater
Updater for L1 regularized problems. R(w) = ||w||_1. Uses a stepsize decreasing with the square root of the number of iterations.
Instead of the subgradient of the regularizer, the proximal operator for the L1 regularization is applied after the gradient step. This is known to result in better sparsity of the intermediate solution.
The corresponding proximal operator for the L1 norm is the soft-thresholding function. That is, each weight component is shrunk towards 0 by shrinkageVal.
If w is greater than shrinkageVal, set weight component to w - shrinkageVal. If w is less than -shrinkageVal, set weight component to w + shrinkageVal. If w is in (-shrinkageVal, shrinkageVal), set weight component to 0.
Equivalently, set weight component to signum(w) * max(0.0, abs(w) - shrinkageVal)
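The shrinkage rule can be written directly from the closed form signum(w) * max(0.0, abs(w) - shrinkageVal). The sketch below is a minimal stand-alone version on plain Scala arrays; `SoftThresholdSketch`, `shrink` and `shrinkAll` are hypothetical names, and shrinkageVal is assumed to be non-negative.

```scala
object SoftThresholdSketch {
  /** Shrinks a single weight component towards 0 by shrinkageVal (assumed >= 0). */
  def shrink(w: Double, shrinkageVal: Double): Double =
    math.signum(w) * math.max(0.0, math.abs(w) - shrinkageVal)

  /** Applies the soft-thresholding function to every component of a weight vector. */
  def shrinkAll(weights: Array[Double], shrinkageVal: Double): Array[Double] =
    weights.map(shrink(_, shrinkageVal))
}
```

With shrinkageVal = 1.0, the weights (3.0, -3.0, 0.5) become (2.0, -2.0, 0.0): components inside the interval (-shrinkageVal, shrinkageVal) are set to zero, which is where the sparsity comes from.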

class
LBFGS extends Optimizer with Logging
Class used to solve an optimization problem using Limited-memory BFGS. Reference: Wikipedia on Limited-memory BFGS

class
LeastSquaresGradient extends Gradient
Compute gradient and loss for a least-squares loss function, as used in linear regression. This is correct for the averaged least squares loss function (mean squared error) L = 1/(2n) ||A weights - y||^2. See also the documentation for the precise formulation.
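For a single data point, the averaged loss above contributes 1/2 (x . weights - y)^2, whose gradient with respect to the weights is (x . weights - y) * x. The sketch below computes both on plain Scala arrays; `LeastSquaresSketch` and `compute` are hypothetical names used only for illustration, not the class's actual code.

```scala
object LeastSquaresSketch {
  /** Returns (gradient, loss) of 1/2 (x . w - label)^2 for one data point. */
  def compute(x: Array[Double], label: Double, w: Array[Double]): (Array[Double], Double) = {
    require(x.length == w.length, "feature and weight dimensions must match")
    // Residual of the linear prediction for this point.
    val diff = x.zip(w).map { case (xi, wi) => xi * wi }.sum - label
    // Gradient is the residual times the features; loss is half the squared residual.
    (x.map(_ * diff), diff * diff / 2.0)
  }
}
```

Averaging these per-point losses over n points gives the 1/(2n) factor in the formula.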

class
LogisticGradient extends Gradient
Compute gradient and loss for a multinomial logistic loss function, as used in multiclass classification (it is also used in binary logistic regression).
In The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition by Trevor Hastie, Robert Tibshirani, and Jerome Friedman, which can be downloaded from http://statweb.stanford.edu/~tibs/ElemStatLearn/ , Eq. (4.17) on page 119 gives the formula of the multinomial logistic regression model. A simple calculation shows that
$$ P(y=0|x, w) = 1 / (1 + \sum_i^{K-1} \exp(x w_i))\\ P(y=1|x, w) = \exp(x w_1) / (1 + \sum_i^{K-1} \exp(x w_i))\\ ...\\ P(y=K-1|x, w) = \exp(x w_{K-1}) / (1 + \sum_i^{K-1} \exp(x w_i))\\ $$
for a K-class multiclass classification problem.
The model weights \(w = (w_1, w_2, ..., w_{K-1})^T\) become a matrix of dimension (K-1) * (N+1) if the intercepts are added. If the intercepts are not added, the dimension is (K-1) * N.
As a result, the loss function for a single instance of data can be written as
$$ \begin{align} l(w, x) &= -\log P(y|x, w) = -\alpha(y) \log P(y=0|x, w) - (1-\alpha(y)) \log P(y|x, w) \\ &= \log(1 + \sum_i^{K-1}\exp(x w_i)) - (1-\alpha(y)) x w_{y-1} \\ &= \log(1 + \sum_i^{K-1}\exp(margins_i)) - (1-\alpha(y)) margins_{y-1} \end{align} $$
where $\alpha(i) = 1$ if \(i = 0\), $\alpha(i) = 0$ if \(i \ne 0\), and \(margins_i = x w_i\).
For optimization, we have to calculate the first derivative of the loss function, and a simple calculation shows that
$$ \begin{align} \frac{\partial l(w, x)}{\partial w_{ij}} &= (\exp(x w_i) / (1 + \sum_k^{K-1} \exp(x w_k)) - (1-\alpha(y))\delta_{y, i+1}) * x_j \\ &= multiplier_i * x_j \end{align} $$
where $\delta_{i, j} = 1$ if \(i = j\), $\delta_{i, j} = 0$ if \(i \ne j\), and $multiplier_i = \exp(margins_i) / (1 + \sum_k^{K-1} \exp(margins_k)) - (1-\alpha(y))\delta_{y, i+1}$.
If any of the margins is larger than 709.78, the numerical computation of the multiplier and the loss function will suffer from arithmetic overflow, because $\exp$ of such a value exceeds the largest finite double. This issue occurs when there are outliers in the data which are far away from the hyperplane, and it causes training to fail once infinity / infinity is introduced. Note that this is only a concern when max(margins) > 0.
Fortunately, when max(margins) = maxMargin > 0, the loss function and the multiplier can be easily rewritten into the following equivalent, numerically stable formula.
$$ \begin{align} l(w, x) &= \log(1 + \sum_i^{K-1}\exp(margins_i)) - (1-\alpha(y)) margins_{y-1} \\ &= \log(\exp(-maxMargin) + \sum_i^{K-1}\exp(margins_i - maxMargin)) + maxMargin - (1-\alpha(y)) margins_{y-1} \\ &= \log(1 + sum) + maxMargin - (1-\alpha(y)) margins_{y-1} \end{align} $$
where sum = $\exp(-maxMargin) + \sum_i^{K-1}\exp(margins_i - maxMargin) - 1$.
Note that each exponent passed to $\exp$, that is $-maxMargin$ and $(margins_i - maxMargin)$, is at most zero; as a result, overflow will not happen with this formula.
For the multiplier, a similar trick can be applied as follows:
$$ \begin{align} multiplier_i &= \exp(margins_i) / (1 + \sum_k^{K-1} \exp(margins_k)) - (1-\alpha(y))\delta_{y, i+1} \\ &= \exp(margins_i - maxMargin) / (1 + sum) - (1-\alpha(y))\delta_{y, i+1} \end{align} $$
where each exponent passed to $\exp$ is again at most zero, so overflow is not a concern.
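The stable rewriting of the loss and the multiplier can be sketched as follows. This is a minimal stand-alone version, not the class's actual code: it takes the K-1 margins as a zero-indexed array and the label as an integer in 0..K-1, so that `margins(label - 1)` is the margin of the true class and the indicator `label == i + 1` plays the role of the delta term. The names `StableLogisticSketch` and `lossAndMultipliers` are hypothetical, and using `shift = max(maxMargin, 0)` to cover both the shifted and unshifted case in one expression is a choice made for this example.

```scala
object StableLogisticSketch {
  /** Returns (loss, multipliers) for one data point, given margins_i = x w_i for the K-1 non-pivot classes. */
  def lossAndMultipliers(margins: Array[Double], label: Int): (Double, Array[Double]) = {
    require(margins.nonEmpty, "at least one margin (K >= 2) is required")
    require(label >= 0 && label <= margins.length, "label must be in 0..K-1")
    // Shift only when the largest margin is positive; otherwise exp cannot overflow.
    val shift = math.max(margins.max, 0.0)
    // Equals 1 + sum from the text when shift = maxMargin > 0,
    // and the unshifted 1 + sum_i exp(margins_i) when shift = 0.
    val denom = math.exp(-shift) + margins.map(m => math.exp(m - shift)).sum
    // Class 0 is the pivot class and contributes no margin term.
    val labelMargin = if (label > 0) margins(label - 1) else 0.0
    val loss = math.log(denom) + shift - labelMargin
    val multipliers = margins.zipWithIndex.map { case (m, i) =>
      val indicator = if (label == i + 1) 1.0 else 0.0
      math.exp(m - shift) / denom - indicator
    }
    (loss, multipliers)
  }
}
```

With a single margin of 1000.0 and label 0, the unshifted formula would evaluate exp(1000.0), which overflows to infinity, while the sketch returns a finite loss of 1000.0.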
For the detailed mathematical derivation, see the reference at http://www.slideshare.net/dbtsai/20140620mlor36132297

trait
Optimizer extends Serializable
Trait for optimization problem solvers.

class
SimpleUpdater extends Updater
A simple updater for gradient descent *without* any regularization. Uses a stepsize decreasing with the square root of the number of iterations.
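A stepsize "decreasing with the square root of the number of iterations" can be read as stepSize / sqrt(iter). Under that assumption, one unregularized update looks like the sketch below, written on plain Scala arrays; `SimpleUpdateSketch` and `step` are hypothetical names, and `iter` is assumed to be a 1-based iteration counter.

```scala
object SimpleUpdateSketch {
  /** One unregularized gradient step; iter is the 1-based iteration number. */
  def step(weights: Array[Double], gradient: Array[Double],
           stepSize: Double, iter: Int): Array[Double] = {
    require(iter >= 1, "iteration numbers start at 1")
    require(weights.length == gradient.length, "weight and gradient dimensions must match")
    // Effective stepsize shrinks as 1 / sqrt(iter).
    val thisIterStepSize = stepSize / math.sqrt(iter.toDouble)
    weights.zip(gradient).map { case (w, g) => w - thisIterStepSize * g }
  }
}
```

At iteration 4 with stepSize 1.0 the effective stepsize is 0.5, so weights (1.0, 2.0) with gradient (0.5, 1.0) move to (0.75, 1.5).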

class
SquaredL2Updater extends Updater
Updater for L2 regularized problems. R(w) = 1/2 ||w||^2. Uses a stepsize decreasing with the square root of the number of iterations.
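Because the gradient of R(w) = 1/2 ||w||^2 is w itself, one step on L(w) + regParam * R(w) can be written as w' = w - eta * (gradient + regParam * w), with eta = stepSize / sqrt(iter). The sketch below is one way to express that on plain Scala arrays, assuming a 1-based `iter`; `SquaredL2UpdateSketch` and `step` are hypothetical names, and returning the regularization value alongside the weights is a choice made for this example.

```scala
object SquaredL2UpdateSketch {
  /** One L2-regularized step; returns (new weights, regParam * R(new weights)). */
  def step(weights: Array[Double], gradient: Array[Double], stepSize: Double,
           iter: Int, regParam: Double): (Array[Double], Double) = {
    require(iter >= 1, "iteration numbers start at 1")
    require(weights.length == gradient.length, "weight and gradient dimensions must match")
    val eta = stepSize / math.sqrt(iter.toDouble)
    // w' = w * (1 - eta * regParam) - eta * gradient, i.e. w - eta * (gradient + regParam * w).
    val updated = weights.zip(gradient).map { case (w, g) =>
      w * (1.0 - eta * regParam) - eta * g
    }
    // Regularization term regParam * 1/2 ||w'||^2 at the new weights.
    val regValue = 0.5 * regParam * updated.map(v => v * v).sum
    (updated, regValue)
  }
}
```

The factor (1 - eta * regParam) shrinks every weight slightly on each step, which is the usual weight-decay reading of L2 regularization.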

abstract
class
Updater extends Serializable
Class used to perform steps (weight update) using Gradient Descent methods.
For general minimization problems, or for regularized problems of the form min L(w) + regParam * R(w), the compute function performs the actual update step, when given some (e.g. stochastic) gradient direction for the loss L(w), and a desired stepsize (learning rate).
The updater is also responsible for performing the update coming from the regularization term R(w) (if any regularization is used).
Value Members

object
GradientDescent extends Logging with Serializable
Top-level method to run gradient descent.

object
LBFGS extends Logging with Serializable
Top-level method to run L-BFGS.