public class LinearRegression extends Regressor<Vector,LinearRegression,LinearRegressionModel> implements LinearRegressionParams, DefaultParamsWritable, org.apache.spark.internal.Logging
The learning objective is to minimize the specified loss function, with regularization. This supports two kinds of loss:
- squaredError (a.k.a. squared loss)
- huber (a hybrid of squared error for relatively small errors and absolute error for relatively large ones; the scale parameter is estimated from the training data)

This supports multiple types of regularization:
- none (a.k.a. ordinary least squares)
- L2 (ridge regression)
- L1 (Lasso)
- L2 + L1 (elastic net)

The squared error objective function is:
$$ \begin{align} \min_{w}\frac{1}{2n}{\sum_{i=1}^n(X_{i}w - y_{i})^{2} + \lambda\left[\frac{1-\alpha}{2}{||w||_{2}}^{2} + \alpha{||w||_{1}}\right]} \end{align} $$
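To make the terms concrete, the squared error objective can be evaluated directly for a fixed coefficient vector. The following is a minimal sketch in plain Java, not Spark's implementation. The class `SquaredErrorObjective` and its `value` method are hypothetical names. It assumes that `lambda` plays the role of regParam and `alpha` the role of elasticNetParam, and it ignores the intercept, instance weights, and feature standardization.

```java
final class SquaredErrorObjective {
    /**
     * Evaluates (1 / 2n) * sum_i (x_i . w - y_i)^2
     *   + lambda * [ (1 - alpha) / 2 * ||w||_2^2 + alpha * ||w||_1 ].
     * Hypothetical helper for illustration only.
     */
    static double value(double[][] x, double[] y, double[] w,
                        double lambda, double alpha) {
        int n = x.length;
        double sumSquaredResiduals = 0.0;
        for (int i = 0; i < n; i++) {
            // Prediction for row i is the dot product x_i . w (no intercept).
            double prediction = 0.0;
            for (int j = 0; j < w.length; j++) {
                prediction += x[i][j] * w[j];
            }
            double residual = prediction - y[i];
            sumSquaredResiduals += residual * residual;
        }
        double l2Squared = 0.0; // ||w||_2^2
        double l1 = 0.0;        // ||w||_1
        for (double wj : w) {
            l2Squared += wj * wj;
            l1 += Math.abs(wj);
        }
        double penalty = lambda * ((1.0 - alpha) / 2.0 * l2Squared + alpha * l1);
        return sumSquaredResiduals / (2.0 * n) + penalty;
    }
}
```

In this sketch, `alpha = 0` leaves only the L2 (ridge) penalty, `alpha = 1` leaves only the L1 (Lasso) penalty, and `lambda = 0` corresponds to ordinary least squares.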
The huber objective function is:
$$ \begin{align} \min_{w, \sigma}\frac{1}{2n}{\sum_{i=1}^n\left(\sigma + H_m\left(\frac{X_{i}w - y_{i}}{\sigma}\right)\sigma\right) + \frac{1}{2}\lambda {||w||_2}^2} \end{align} $$
where
$$ \begin{align} H_m(z) = \begin{cases} z^2, & \text {if } |z| < \epsilon, \\ 2\epsilon|z| - \epsilon^2, & \text{otherwise} \end{cases} \end{align} $$
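The piecewise function and the huber objective can be sketched the same way. This is an illustrative plain-Java evaluation for fixed `w` and `sigma`, not Spark's optimizer. `HuberObjective`, `huber`, and `value` are hypothetical names, and the intercept, instance weights, and standardization are again ignored.

```java
final class HuberObjective {
    /** H_m(z): quadratic for |z| < epsilon, linear otherwise. */
    static double huber(double z, double epsilon) {
        double absZ = Math.abs(z);
        return absZ < epsilon
                ? z * z
                : 2.0 * epsilon * absZ - epsilon * epsilon;
    }

    /**
     * Evaluates (1 / 2n) * sum_i (sigma + H_m((x_i . w - y_i) / sigma) * sigma)
     *   + (1 / 2) * lambda * ||w||_2^2.
     * Hypothetical helper for illustration only.
     */
    static double value(double[][] x, double[] y, double[] w,
                        double sigma, double epsilon, double lambda) {
        if (sigma <= 0.0) {
            throw new IllegalArgumentException("sigma must be positive");
        }
        int n = x.length;
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            double prediction = 0.0;
            for (int j = 0; j < w.length; j++) {
                prediction += x[i][j] * w[j];
            }
            // Residual scaled by sigma before applying H_m.
            double z = (prediction - y[i]) / sigma;
            sum += sigma + huber(z, epsilon) * sigma;
        }
        double l2Squared = 0.0;
        for (double wj : w) {
            l2Squared += wj * wj;
        }
        return sum / (2.0 * n) + 0.5 * lambda * l2Squared;
    }
}
```

The two branches of `huber` agree at `|z| = epsilon` (both give `epsilon * epsilon`), so the loss is continuous where it switches from quadratic to linear.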
Since 3.1.0, it supports stacking instances into blocks and using GEMV for better performance. If param maxBlockSizeInMB is left at its default of 0.0, a block size of 1.0 MB is used.
Note: Fitting with huber loss only supports none and L2 regularization.
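The restrictions on huber loss documented on this page (no L1 component, and no "normal" solver) can be summarized as a small check. This is a hypothetical sketch in plain Java, not Spark's validation code. The class `HuberConstraints`, the method `check`, and the choice of `IllegalArgumentException` are assumptions made for illustration.

```java
final class HuberConstraints {
    /**
     * Throws if the given combination is one this page documents as
     * unsupported for huber loss. Other losses are not restricted here.
     */
    static void check(String loss, double elasticNetParam, String solver) {
        if (!"huber".equals(loss)) {
            return;
        }
        // Huber supports only none and L2 regularization, so the
        // ElasticNet mixing parameter must stay at 0.
        if (elasticNetParam != 0.0) {
            throw new IllegalArgumentException(
                    "huber loss requires elasticNetParam == 0.0, got " + elasticNetParam);
        }
        // Huber is not supported by the normal solver.
        if ("normal".equals(solver)) {
            throw new IllegalArgumentException(
                    "huber loss does not support solver \"normal\"");
        }
    }
}
```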
Constructor and Description |
---|
LinearRegression() |
LinearRegression(String uid) |
Modifier and Type | Method and Description |
---|---|
IntParam | aggregationDepth(): Param for suggested depth for treeAggregate (>= 2). |
LinearRegression | copy(ParamMap extra): Creates a copy of this instance with the same UID and some extra params. |
DoubleParam | elasticNetParam(): Param for the ElasticNet mixing parameter, in range [0, 1]. |
DoubleParam | epsilon(): The shape parameter to control the amount of robustness. |
BooleanParam | fitIntercept(): Param for whether to fit an intercept term. |
static LinearRegression | load(String path) |
Param<String> | loss(): The loss function to be optimized. |
static int | MAX_FEATURES_FOR_NORMAL_SOLVER(): When using LinearRegression.solver == "normal", the solver must limit the number of features to at most this number. |
DoubleParam | maxBlockSizeInMB(): Param for maximum memory in MB for stacking input data into blocks. |
IntParam | maxIter(): Param for maximum number of iterations (>= 0). |
static MLReader<T> | read() |
DoubleParam | regParam(): Param for regularization parameter (>= 0). |
LinearRegression | setAggregationDepth(int value): Suggested depth for treeAggregate (greater than or equal to 2). |
LinearRegression | setElasticNetParam(double value): Set the ElasticNet mixing parameter. |
LinearRegression | setEpsilon(double value): Sets the value of param epsilon. |
LinearRegression | setFitIntercept(boolean value): Set if we should fit the intercept. |
LinearRegression | setLoss(String value): Sets the value of param loss. |
LinearRegression | setMaxBlockSizeInMB(double value): Sets the value of param maxBlockSizeInMB. |
LinearRegression | setMaxIter(int value): Set the maximum number of iterations. |
LinearRegression | setRegParam(double value): Set the regularization parameter. |
LinearRegression | setSolver(String value): Set the solver algorithm used for optimization. |
LinearRegression | setStandardization(boolean value): Whether to standardize the training features before fitting the model. |
LinearRegression | setTol(double value): Set the convergence tolerance of iterations. |
LinearRegression | setWeightCol(String value): Whether to over-/under-sample training instances according to the given weights in weightCol. |
Param<String> | solver(): The solver algorithm for optimization. |
BooleanParam | standardization(): Param for whether to standardize the training features before fitting the model. |
DoubleParam | tol(): Param for the convergence tolerance for iterative algorithms (>= 0). |
String | uid(): An immutable unique ID for the object and its derivatives. |
Param<String> | weightCol(): Param for weight column name. |
featuresCol, fit, labelCol, predictionCol, setFeaturesCol, setLabelCol, setPredictionCol, transformSchema
params
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
getEpsilon, validateAndTransformSchema
getLabelCol, labelCol
featuresCol, getFeaturesCol
getPredictionCol, predictionCol
clear, copyValues, defaultCopy, defaultParamMap, explainParam, explainParams, extractParamMap, extractParamMap, get, getDefault, getOrDefault, getParam, hasDefault, hasParam, isDefined, isSet, onParamChange, paramMap, params, set, set, set, setDefault, setDefault, shouldOwn
toString
getRegParam
getElasticNetParam
getMaxIter
getFitIntercept
getStandardization
getWeightCol
getAggregationDepth
getMaxBlockSizeInMB
write
save
$init$, initializeForcefully, initializeLogIfNecessary, initializeLogIfNecessary, initializeLogIfNecessary$default$2, initLock, isTraceEnabled, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarning, org$apache$spark$internal$Logging$$log__$eq, org$apache$spark$internal$Logging$$log_, uninitialize
public LinearRegression(String uid)
public LinearRegression()
public static LinearRegression load(String path)
public static int MAX_FEATURES_FOR_NORMAL_SOLVER()
When using LinearRegression.solver == "normal", the solver must limit the number of features to at most this number. The entire covariance matrix X^T X will be collected to the driver. This limit helps prevent memory overflow errors.
public static MLReader<T> read()
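As a rough illustration of why MAX_FEATURES_FOR_NORMAL_SOLVER exists: a dense d x d matrix of doubles needs 8 * d * d bytes, so the memory needed to hold X^T X on the driver grows quadratically with the number of features. The sketch below is a back-of-the-envelope estimate in plain Java. `NormalSolverMemory` and `denseGramBytes` are hypothetical names, and the dense layout is an assumption that may not match how Spark actually stores the matrix.

```java
final class NormalSolverMemory {
    /**
     * Bytes needed for a dense numFeatures x numFeatures matrix of doubles.
     * Uses long arithmetic so large feature counts do not overflow int.
     */
    static long denseGramBytes(int numFeatures) {
        return (long) numFeatures * numFeatures * Double.BYTES;
    }
}
```

For example, 1,000 features need about 8 MB under this estimate, while 100,000 features would need about 80 GB.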
public final Param<String> solver()
LinearRegressionParams
solver in interface HasSolver
solver in interface LinearRegressionParams
public final Param<String> loss()
LinearRegressionParams
loss in interface HasLoss
loss in interface LinearRegressionParams
public final DoubleParam epsilon()
LinearRegressionParams
epsilon in interface LinearRegressionParams
public final DoubleParam maxBlockSizeInMB()
HasMaxBlockSizeInMB
maxBlockSizeInMB in interface HasMaxBlockSizeInMB
public final IntParam aggregationDepth()
HasAggregationDepth
aggregationDepth in interface HasAggregationDepth
public final Param<String> weightCol()
HasWeightCol
weightCol in interface HasWeightCol
public final BooleanParam standardization()
HasStandardization
standardization in interface HasStandardization
public final BooleanParam fitIntercept()
HasFitIntercept
fitIntercept in interface HasFitIntercept
public final DoubleParam tol()
HasTol
public final IntParam maxIter()
HasMaxIter
maxIter in interface HasMaxIter
public final DoubleParam elasticNetParam()
HasElasticNetParam
elasticNetParam in interface HasElasticNetParam
public final DoubleParam regParam()
HasRegParam
regParam in interface HasRegParam
public String uid()
Identifiable
uid in interface Identifiable
public LinearRegression setRegParam(double value)
value - (undocumented)
public LinearRegression setFitIntercept(boolean value)
value - (undocumented)
public LinearRegression setStandardization(boolean value)
value - (undocumented)
public LinearRegression setElasticNetParam(double value)
Note: Fitting with huber loss supports only none and L2 regularization, so an exception is thrown if this param is set to a non-zero value.
value - (undocumented)
public LinearRegression setMaxIter(int value)
value - (undocumented)
public LinearRegression setTol(double value)
value - (undocumented)
public LinearRegression setWeightCol(String value)
value - (undocumented)
public LinearRegression setSolver(String value)
With "normal", the number of features is limited to at most LinearRegression.MAX_FEATURES_FOR_NORMAL_SOLVER.
- "auto" (default) means that the solver algorithm is selected automatically. The Normal Equations solver will be used when possible, but this will automatically fall back to iterative optimization methods when needed.
Note: Fitting with huber loss doesn't support the normal solver, so an exception is thrown if this param is set to "normal".
value - (undocumented)
public LinearRegression setAggregationDepth(int value)
value - (undocumented)
public LinearRegression setLoss(String value)
Sets the value of param loss. Default is "squaredError".
value - (undocumented)
public LinearRegression setEpsilon(double value)
Sets the value of param epsilon. Default is 1.35.
value - (undocumented)
public LinearRegression setMaxBlockSizeInMB(double value)
Sets the value of param maxBlockSizeInMB. Default is 0.0, in which case a block size of 1.0 MB is used.
value - (undocumented)
public LinearRegression copy(ParamMap extra)
Params
Creates a copy of this instance with the same UID and some extra params. See defaultCopy().
copy in interface Params
copy in class Predictor<Vector,LinearRegression,LinearRegressionModel>
extra - (undocumented)