org.apache.spark.ml.regression

Class LinearRegression

• All Implemented Interfaces:
java.io.Serializable, org.apache.spark.internal.Logging, Params, HasAggregationDepth, HasElasticNetParam, HasFeaturesCol, HasFitIntercept, HasLabelCol, HasLoss, HasMaxBlockSizeInMB, HasMaxIter, HasPredictionCol, HasRegParam, HasSolver, HasStandardization, HasTol, HasWeightCol, PredictorParams, LinearRegressionParams, DefaultParamsWritable, Identifiable, MLWritable

public class LinearRegression
extends Regressor<Vector,LinearRegression,LinearRegressionModel>
implements LinearRegressionParams, DefaultParamsWritable, org.apache.spark.internal.Logging
Linear regression.

The learning objective is to minimize the specified loss function, with regularization. This supports two kinds of loss:
- squaredError (a.k.a. squared loss)
- huber (a hybrid of squared error for relatively small errors and absolute error for relatively large ones; the scale parameter is estimated from the training data)

This supports multiple types of regularization:
- none (a.k.a. ordinary least squares)
- L2 (ridge regression)
- L1 (Lasso)
- L2 + L1 (elastic net)

The squared error objective function is:

\begin{align} \min_{w}\frac{1}{2n}{\sum_{i=1}^n(X_{i}w - y_{i})^{2} + \lambda\left[\frac{1-\alpha}{2}{||w||_{2}}^{2} + \alpha{||w||_{1}}\right]} \end{align}
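
As a reading aid, the squared error objective can be evaluated for a fixed coefficient vector with a few lines of plain Java. This is a minimal sketch, not Spark code: the class and method names are hypothetical, the data is held in dense arrays, and no intercept is modeled, matching the formula as written.

```java
// Sketch only: evaluates the squared error objective above for a fixed w.
// The class and method names are hypothetical and not part of Spark.
final class SquaredErrorObjectiveSketch {

    /**
     * Computes (1 / 2n) * sum_i (x_i . w - y_i)^2
     *        + lambda * ((1 - alpha) / 2 * ||w||_2^2 + alpha * ||w||_1).
     * lambda plays the role of regParam and alpha of elasticNetParam.
     */
    static double objective(double[][] x, double[] y, double[] w,
                            double lambda, double alpha) {
        int n = x.length;
        double sumSquaredResiduals = 0.0;
        for (int i = 0; i < n; i++) {
            double prediction = 0.0;
            for (int j = 0; j < w.length; j++) {
                prediction += x[i][j] * w[j];
            }
            double residual = prediction - y[i];
            sumSquaredResiduals += residual * residual;
        }
        double l1Norm = 0.0;
        double l2NormSquared = 0.0;
        for (double wj : w) {
            l1Norm += Math.abs(wj);
            l2NormSquared += wj * wj;
        }
        double penalty = (1.0 - alpha) / 2.0 * l2NormSquared + alpha * l1Norm;
        return sumSquaredResiduals / (2.0 * n) + lambda * penalty;
    }
}
```

With lambda = 0 the penalty vanishes and the value is the plain half mean squared error, which corresponds to the ordinary least squares case listed above.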

The huber objective function is:

\begin{align} \min_{w, \sigma}\frac{1}{2n}{\sum_{i=1}^n\left(\sigma + H_m\left(\frac{X_{i}w - y_{i}}{\sigma}\right)\sigma\right) + \frac{1}{2}\lambda {||w||_2}^2} \end{align}

where

\begin{align} H_m(z) = \begin{cases} z^2, & \text{if } |z| < \epsilon, \\ 2\epsilon|z| - \epsilon^2, & \text{otherwise} \end{cases} \end{align}
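
The piecewise function H_m and the huber objective can be written out directly. The following is a minimal sketch under the same assumptions as the formulas (dense arrays, no intercept); the class and method names are hypothetical, and sigma is passed in as a fixed value, whereas the objective above minimizes over it together with w.

```java
// Sketch only: evaluates H_m and the huber objective for fixed w and sigma.
// The class and method names are hypothetical and not part of Spark.
final class HuberLossSketch {

    /** H_m(z): quadratic for |z| < epsilon, linear in |z| otherwise. */
    static double huber(double z, double epsilon) {
        double absZ = Math.abs(z);
        return absZ < epsilon
                ? z * z
                : 2.0 * epsilon * absZ - epsilon * epsilon;
    }

    /**
     * Computes (1 / 2n) * sum_i (sigma + H_m((x_i . w - y_i) / sigma) * sigma)
     *        + (lambda / 2) * ||w||_2^2.
     */
    static double objective(double[][] x, double[] y, double[] w,
                            double sigma, double epsilon, double lambda) {
        int n = x.length;
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            double prediction = 0.0;
            for (int j = 0; j < w.length; j++) {
                prediction += x[i][j] * w[j];
            }
            double scaledResidual = (prediction - y[i]) / sigma;
            sum += sigma + huber(scaledResidual, epsilon) * sigma;
        }
        double l2NormSquared = 0.0;
        for (double wj : w) {
            l2NormSquared += wj * wj;
        }
        return sum / (2.0 * n) + 0.5 * lambda * l2NormSquared;
    }
}
```

The two branches of H_m agree at |z| = epsilon (both give epsilon^2), so the loss is continuous where it switches from quadratic to linear growth.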

Since 3.1.0, it supports stacking instances into blocks and using GEMV for better performance. If param maxBlockSizeInMB is left at its default of 0.0, a block size of 1.0 MB is used.

Note: Fitting with huber loss supports only no regularization or L2 regularization.

• Constructor Detail

• LinearRegression

public LinearRegression(String uid)
• LinearRegression

public LinearRegression()
• Method Detail

• MAX_FEATURES_FOR_NORMAL_SOLVER

public static int MAX_FEATURES_FOR_NORMAL_SOLVER()
When using LinearRegression.solver == "normal", the solver must limit the number of features to at most this number. The entire covariance matrix X^T X will be collected to the driver. This limit helps prevent out-of-memory errors.
Returns:
(undocumented)

• loss

public final Param<String> loss()
Description copied from interface: LinearRegressionParams
The loss function to be optimized. Supported options: "squaredError" and "huber". Default: "squaredError"

Specified by:
loss in interface HasLoss
Specified by:
loss in interface LinearRegressionParams
Returns:
(undocumented)
• epsilon

public final DoubleParam epsilon()
Description copied from interface: LinearRegressionParams
The shape parameter to control the amount of robustness. Must be > 1.0. At larger values of epsilon, the huber criterion becomes more similar to least squares regression; for small values of epsilon, the criterion is more similar to L1 regression. Default is 1.35 to get as much robustness as possible while retaining 95% statistical efficiency for normally distributed data. It matches sklearn HuberRegressor and is "M" from "A robust hybrid of lasso and ridge regression". Only valid when "loss" is "huber".

Specified by:
epsilon in interface LinearRegressionParams
Returns:
(undocumented)
• maxBlockSizeInMB

public final DoubleParam maxBlockSizeInMB()
Description copied from interface: HasMaxBlockSizeInMB
Param for the maximum memory, in MB, used when stacking input data into blocks. Data is stacked within partitions. If this value exceeds the size of the data remaining in a partition, it is adjusted to that data size. The default of 0.0 means an optimal value is chosen, which depends on the specific algorithm. Must be >= 0.
Specified by:
maxBlockSizeInMB in interface HasMaxBlockSizeInMB
Returns:
(undocumented)
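
The way the default of 0.0 resolves to a concrete block size can be sketched as follows. This is an illustration only: the class and helper names are hypothetical, the 1.0 MB fallback is the one documented for this class, and the rows-per-block estimate assumes dense rows of 8-byte doubles while ignoring labels, weights, sparse storage, and object overhead.

```java
// Sketch only: resolves maxBlockSizeInMB and estimates how many dense rows fit.
// The class and method names are hypothetical and not part of Spark.
final class BlockSizeSketch {

    /** Returns the block size in MB actually used; 0.0 falls back to 1.0 MB. */
    static double effectiveBlockSizeInMB(double maxBlockSizeInMB) {
        if (maxBlockSizeInMB < 0.0) {
            throw new IllegalArgumentException("maxBlockSizeInMB must be >= 0");
        }
        return maxBlockSizeInMB == 0.0 ? 1.0 : maxBlockSizeInMB;
    }

    /** Rough number of dense rows (8 bytes per feature) that fit in one block. */
    static long approxDenseRowsPerBlock(double maxBlockSizeInMB, int numFeatures) {
        double megabytes = effectiveBlockSizeInMB(maxBlockSizeInMB);
        long blockBytes = (long) Math.ceil(megabytes * 1024L * 1024L);
        long bytesPerRow = 8L * numFeatures;
        // At least one row is always placed in a block.
        return Math.max(1L, blockBytes / bytesPerRow);
    }
}
```

Under these assumptions, a data set with 128 dense features and the default setting would stack roughly 1024 rows per block.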
• weightCol

public final Param<String> weightCol()
Description copied from interface: HasWeightCol
Param for weight column name. If this is not set or empty, we treat all instance weights as 1.0.
Specified by:
weightCol in interface HasWeightCol
Returns:
(undocumented)
• tol

public final DoubleParam tol()
Description copied from interface: HasTol
Param for the convergence tolerance for iterative algorithms (>= 0).
Specified by:
tol in interface HasTol
Returns:
(undocumented)
• maxIter

public final IntParam maxIter()
Description copied from interface: HasMaxIter
Param for maximum number of iterations (>= 0).
Specified by:
maxIter in interface HasMaxIter
Returns:
(undocumented)
• elasticNetParam

public final DoubleParam elasticNetParam()
Description copied from interface: HasElasticNetParam
Param for the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.
Specified by:
elasticNetParam in interface HasElasticNetParam
Returns:
(undocumented)
• regParam

public final DoubleParam regParam()
Description copied from interface: HasRegParam
Param for regularization parameter (>= 0).
Specified by:
regParam in interface HasRegParam
Returns:
(undocumented)
• uid

public String uid()
Description copied from interface: Identifiable
An immutable unique ID for the object and its derivatives.
Specified by:
uid in interface Identifiable
Returns:
(undocumented)
• setRegParam

public LinearRegression setRegParam(double value)
Set the regularization parameter. Default is 0.0.

Parameters:
value - (undocumented)
Returns:
(undocumented)
• setFitIntercept

public LinearRegression setFitIntercept(boolean value)
Set if we should fit the intercept. Default is true.

Parameters:
value - (undocumented)
Returns:
(undocumented)
• setStandardization

public LinearRegression setStandardization(boolean value)
Whether to standardize the training features before fitting the model. The coefficients of the model are always returned on the original scale, so standardization is transparent to users. Default is true.

Parameters:
value - (undocumented)
Returns:
(undocumented)
Note:
With or without standardization, the models should always converge to the same solution when no regularization is applied. In R's GLMNET package, the default behavior is true as well.
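
The note above can be checked on a one-feature example. This is a minimal sketch, not Spark's implementation: it fits an unregularized least squares slope (with intercept) in closed form, once on the raw feature and once on the feature divided by its standard deviation, then maps the second slope back to the original scale. All names are hypothetical.

```java
// Sketch only: one-feature OLS with and without feature scaling.
// The class and method names are hypothetical and not part of Spark.
final class StandardizationSketch {

    static double mean(double[] v) {
        double sum = 0.0;
        for (double value : v) {
            sum += value;
        }
        return sum / v.length;
    }

    /** Sample standard deviation (divides by n - 1). */
    static double stdDev(double[] v) {
        double m = mean(v);
        double sumSquares = 0.0;
        for (double value : v) {
            sumSquares += (value - m) * (value - m);
        }
        return Math.sqrt(sumSquares / (v.length - 1));
    }

    /** Closed-form OLS slope for y = slope * x + intercept. */
    static double slope(double[] x, double[] y) {
        double mx = mean(x);
        double my = mean(y);
        double covariance = 0.0;
        double variance = 0.0;
        for (int i = 0; i < x.length; i++) {
            covariance += (x[i] - mx) * (y[i] - my);
            variance += (x[i] - mx) * (x[i] - mx);
        }
        return covariance / variance;
    }

    /** Fits on x / std, then rescales the slope back to the original scale. */
    static double slopeViaStandardization(double[] x, double[] y) {
        double std = stdDev(x);
        double[] scaled = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            scaled[i] = x[i] / std;
        }
        return slope(scaled, y) / std;
    }
}
```

Both paths give the same slope up to floating-point rounding. With a nonzero penalty the two fits would generally differ, because the penalty is applied to differently scaled coefficients.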

• setElasticNetParam

public LinearRegression setElasticNetParam(double value)
Set the ElasticNet mixing parameter. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. For alpha in (0,1), the penalty is a combination of L1 and L2. Default is 0.0 which is an L2 penalty.

Note: Fitting with huber loss supports only no regularization or L2 regularization, so an exception is thrown if this param is set to a non-zero value.

Parameters:
value - (undocumented)
Returns:
(undocumented)
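
How regParam and elasticNetParam together select one of the regularization types listed in the class description, and how the huber restriction applies, can be sketched as below. The class and method names are hypothetical, and the use of IllegalArgumentException is an assumption; the documentation only states that an exception is thrown.

```java
// Sketch only: maps (regParam, elasticNetParam) to a regularization type.
// The class and method names are hypothetical and not part of Spark.
final class RegularizationSketch {

    /** Returns "none", "L2", "L1", or "L2 + L1". */
    static String kind(double regParam, double elasticNetParam) {
        if (regParam < 0.0) {
            throw new IllegalArgumentException("regParam must be >= 0");
        }
        if (elasticNetParam < 0.0 || elasticNetParam > 1.0) {
            throw new IllegalArgumentException("elasticNetParam must be in [0, 1]");
        }
        if (regParam == 0.0) {
            return "none";      // ordinary least squares
        }
        if (elasticNetParam == 0.0) {
            return "L2";        // ridge regression
        }
        if (elasticNetParam == 1.0) {
            return "L1";        // Lasso
        }
        return "L2 + L1";       // elastic net
    }

    /** Rejects huber loss combined with a non-zero elasticNetParam. */
    static void requireHuberCompatible(String loss, double elasticNetParam) {
        if ("huber".equals(loss) && elasticNetParam != 0.0) {
            // Assumed exception type; the documentation does not name one.
            throw new IllegalArgumentException(
                    "huber loss supports only no regularization or L2 regularization");
        }
    }
}
```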
• setMaxIter

public LinearRegression setMaxIter(int value)
Set the maximum number of iterations. Default is 100.

Parameters:
value - (undocumented)
Returns:
(undocumented)
• setTol

public LinearRegression setTol(double value)
Set the convergence tolerance of iterations. A smaller value leads to higher accuracy at the cost of more iterations. Default is 1E-6.

Parameters:
value - (undocumented)
Returns:
(undocumented)
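
The interplay between tol and maxIter can be illustrated with a toy iterative minimizer. This sketch is not the optimizer or the convergence test that Spark uses: it runs fixed-step gradient descent on f(w) = (w - 3)^2 and stops when the change in the objective drops below tol or when maxIter iterations have run. All names are hypothetical.

```java
// Sketch only: a toy iterative loop showing the roles of tol and maxIter.
// The class and method names are hypothetical and not part of Spark.
final class ToleranceSketch {

    /** Final iterate and the number of iterations actually performed. */
    static final class Result {
        final double w;
        final int iterations;

        Result(double w, int iterations) {
            this.w = w;
            this.iterations = iterations;
        }
    }

    /** Minimizes f(w) = (w - 3)^2 by gradient descent with a fixed step of 0.25. */
    static Result minimize(int maxIter, double tol) {
        double w = 0.0;
        double previous = (w - 3.0) * (w - 3.0);
        int iterations = 0;
        while (iterations < maxIter) {
            w -= 0.25 * 2.0 * (w - 3.0);   // step along the negative gradient
            iterations++;
            double current = (w - 3.0) * (w - 3.0);
            if (Math.abs(previous - current) < tol) {
                break;                      // objective change is below tol
            }
            previous = current;
        }
        return new Result(w, iterations);
    }
}
```

In this toy setting a tighter tolerance yields a result closer to the minimizer at 3 but takes more iterations, and a small maxIter stops the loop before the tolerance is met.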
• setWeightCol

public LinearRegression setWeightCol(String value)
Sets the column whose values weight the training instances, which effectively over- or under-samples them. If the column is not set or is empty, all instances are treated equally (weight 1.0). Default is not set.

Parameters:
value - (undocumented)
Returns:
(undocumented)
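
The over- and under-sampling interpretation of instance weights can be made concrete with a weighted squared error. This is a sketch under an assumption: the loss is normalized by the sum of the weights, which is one common convention and is not confirmed by this page. All names are hypothetical.

```java
// Sketch only: weighted half mean squared error, normalized by the weight sum.
// The class and method names are hypothetical and not part of Spark.
final class WeightedLossSketch {

    /** Computes sum_i w_i * (pred_i - label_i)^2 / (2 * sum_i w_i). */
    static double weightedHalfMse(double[] predictions, double[] labels,
                                  double[] weights) {
        double weightedSquares = 0.0;
        double weightSum = 0.0;
        for (int i = 0; i < predictions.length; i++) {
            double residual = predictions[i] - labels[i];
            weightedSquares += weights[i] * residual * residual;
            weightSum += weights[i];
        }
        return weightedSquares / (2.0 * weightSum);
    }
}
```

Under this convention, giving an instance weight 2.0 produces the same loss as including that instance twice with weight 1.0.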
• setSolver

public LinearRegression setSolver(String value)
Set the solver algorithm used for optimization. In case of linear regression, this can be "l-bfgs", "normal" and "auto".
- "l-bfgs" denotes Limited-memory BFGS which is a limited-memory quasi-Newton optimization method.
- "normal" denotes using Normal Equation as an analytical solution to the linear regression problem. This solver is limited to LinearRegression.MAX_FEATURES_FOR_NORMAL_SOLVER.
- "auto" (default) means that the solver algorithm is selected automatically. The Normal Equations solver will be used when possible, but this will automatically fall back to iterative optimization methods when needed.

Note: Fitting with huber loss doesn't support the normal solver, so an exception is thrown if this param is set to "normal".

Parameters:
value - (undocumented)
Returns:
(undocumented)
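
One plausible reading of the solver rules above is sketched below. It is an approximation rather than Spark's actual selection code: the conditions under which "auto" falls back are assumed to be huber loss or a feature count above the limit, the exception type is an assumption, and the feature limit is passed in as a parameter (in real code it would come from LinearRegression.MAX_FEATURES_FOR_NORMAL_SOLVER()). All names are hypothetical.

```java
// Sketch only: a simplified view of how the solver setting might resolve.
// The class, enum, and method names are hypothetical and not part of Spark.
final class SolverChoiceSketch {

    enum Choice { NORMAL, ITERATIVE }

    static Choice resolve(String solver, String loss,
                          int numFeatures, int maxFeaturesForNormal) {
        boolean huber = "huber".equals(loss);
        switch (solver) {
            case "normal":
                if (huber) {
                    // Documented restriction; the exception type is assumed.
                    throw new IllegalArgumentException(
                            "huber loss does not support the normal solver");
                }
                if (numFeatures > maxFeaturesForNormal) {
                    throw new IllegalArgumentException(
                            "too many features for the normal solver");
                }
                return Choice.NORMAL;
            case "l-bfgs":
                return Choice.ITERATIVE;
            case "auto":
                // Assumed fallback conditions: huber loss or too many features.
                return (!huber && numFeatures <= maxFeaturesForNormal)
                        ? Choice.NORMAL
                        : Choice.ITERATIVE;
            default:
                throw new IllegalArgumentException("unknown solver: " + solver);
        }
    }
}
```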
• setAggregationDepth

public LinearRegression setAggregationDepth(int value)
Suggested depth for treeAggregate (greater than or equal to 2). If the dimensions of features or the number of partitions are large, this param could be adjusted to a larger size. Default is 2.

Parameters:
value - (undocumented)
Returns:
(undocumented)
• setLoss

public LinearRegression setLoss(String value)
Sets the value of param loss. Default is "squaredError".

Parameters:
value - (undocumented)
Returns:
(undocumented)
• setEpsilon

public LinearRegression setEpsilon(double value)
Sets the value of param epsilon. Default is 1.35.

Parameters:
value - (undocumented)
Returns:
(undocumented)
• setMaxBlockSizeInMB

public LinearRegression setMaxBlockSizeInMB(double value)
Sets the value of param maxBlockSizeInMB. Default is 0.0, in which case a block size of 1.0 MB is used.

Parameters:
value - (undocumented)
Returns:
(undocumented)