Class LogisticRegression

All Implemented Interfaces:
Serializable, org.apache.spark.internal.Logging, ClassifierParams, LogisticRegressionParams, ProbabilisticClassifierParams, Params, HasAggregationDepth, HasElasticNetParam, HasFeaturesCol, HasFitIntercept, HasLabelCol, HasMaxBlockSizeInMB, HasMaxIter, HasPredictionCol, HasProbabilityCol, HasRawPredictionCol, HasRegParam, HasStandardization, HasThreshold, HasThresholds, HasTol, HasWeightCol, PredictorParams, DefaultParamsWritable, Identifiable, MLWritable

public class LogisticRegression extends ProbabilisticClassifier<Vector,LogisticRegression,LogisticRegressionModel> implements LogisticRegressionParams, DefaultParamsWritable, org.apache.spark.internal.Logging
Logistic regression. Supports:
- Multinomial logistic (softmax) regression.
- Binomial logistic regression.

This class supports fitting a traditional logistic regression model with LBFGS/OWLQN, and a bound (box) constrained logistic regression model with LBFGSB.

Since 3.1.0, it supports stacking instances into blocks and using GEMV/GEMM for better performance. If param maxBlockSizeInMB is left at its default of 0.0, a block size of 1.0 MB is used.

  • Constructor Details

    • LogisticRegression

      public LogisticRegression(String uid)
    • LogisticRegression

      public LogisticRegression()
  • Method Details

    • load

      public static LogisticRegression load(String path)
    • read

      public static MLReader<T> read()
    • family

      public final Param<String> family()
      Description copied from interface: LogisticRegressionParams
      Param for the name of the family, which describes the label distribution to be used in the model. Supported options:
      - "auto": Automatically select the family based on the number of classes. If numClasses == 1 || numClasses == 2, set to "binomial"; otherwise, set to "multinomial".
      - "binomial": Binary logistic regression with pivoting.
      - "multinomial": Multinomial logistic (softmax) regression without pivoting.
      Default is "auto".

      Specified by:
      family in interface LogisticRegressionParams
      Returns:
      (undocumented)
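
      The "auto" rule described above can be sketched in plain Java. This is an illustrative sketch only: FamilySketch and resolveFamily are hypothetical names, not part of the Spark API, and the code mirrors only the selection rule stated in the description.

```java
// Hypothetical helper illustrating the documented "auto" rule; not Spark code.
final class FamilySketch {
    /** Resolves the family param to the concrete family that would be fitted. */
    static String resolveFamily(String family, int numClasses) {
        switch (family) {
            case "auto":
                // One or two classes select "binomial"; anything else selects "multinomial".
                return (numClasses == 1 || numClasses == 2) ? "binomial" : "multinomial";
            case "binomial":
            case "multinomial":
                // Explicit choices are passed through unchanged.
                return family;
            default:
                throw new IllegalArgumentException("Unsupported family: " + family);
        }
    }

    public static void main(String[] args) {
        System.out.println(resolveFamily("auto", 2)); // prints "binomial"
        System.out.println(resolveFamily("auto", 5)); // prints "multinomial"
    }
}
```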
    • lowerBoundsOnCoefficients

      public Param<Matrix> lowerBoundsOnCoefficients()
      Description copied from interface: LogisticRegressionParams
      The lower bounds on coefficients if fitting under bound constrained optimization. The bound matrix must be compatible with the shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression. Otherwise, an exception is thrown. Default is none.

      Specified by:
      lowerBoundsOnCoefficients in interface LogisticRegressionParams
      Returns:
      (undocumented)
    • upperBoundsOnCoefficients

      public Param<Matrix> upperBoundsOnCoefficients()
      Description copied from interface: LogisticRegressionParams
      The upper bounds on coefficients if fitting under bound constrained optimization. The bound matrix must be compatible with the shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression. Otherwise, an exception is thrown. Default is none.

      Specified by:
      upperBoundsOnCoefficients in interface LogisticRegressionParams
      Returns:
      (undocumented)
    • lowerBoundsOnIntercepts

      public Param<Vector> lowerBoundsOnIntercepts()
      Description copied from interface: LogisticRegressionParams
      The lower bounds on intercepts if fitting under bound constrained optimization. The bound vector size must be equal to 1 for binomial regression, or the number of classes for multinomial regression. Otherwise, an exception is thrown. Default is none.

      Specified by:
      lowerBoundsOnIntercepts in interface LogisticRegressionParams
      Returns:
      (undocumented)
    • upperBoundsOnIntercepts

      public Param<Vector> upperBoundsOnIntercepts()
      Description copied from interface: LogisticRegressionParams
      The upper bounds on intercepts if fitting under bound constrained optimization. The bound vector size must be equal to 1 for binomial regression, or the number of classes for multinomial regression. Otherwise, an exception is thrown. Default is none.

      Specified by:
      upperBoundsOnIntercepts in interface LogisticRegressionParams
      Returns:
      (undocumented)
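
      The shape requirements stated for the four bound params can be summarized as a small check. The sketch below is plain Java with hypothetical names (BoundsShapeSketch, coefficientBoundsFit, interceptBoundsFit); it only restates the documented shapes and is not how Spark itself validates them.

```java
// Hypothetical shape checks restating the documented rules; not Spark code.
final class BoundsShapeSketch {
    /** Coefficient bounds: (1, numFeatures) for binomial, (numClasses, numFeatures) for multinomial. */
    static boolean coefficientBoundsFit(int rows, int cols, boolean multinomial,
                                        int numClasses, int numFeatures) {
        int expectedRows = multinomial ? numClasses : 1;
        return rows == expectedRows && cols == numFeatures;
    }

    /** Intercept bounds: size 1 for binomial, numClasses for multinomial. */
    static boolean interceptBoundsFit(int size, boolean multinomial, int numClasses) {
        int expectedSize = multinomial ? numClasses : 1;
        return size == expectedSize;
    }

    public static void main(String[] args) {
        // A 1 x 4 matrix fits a binomial model with 4 features.
        System.out.println(coefficientBoundsFit(1, 4, false, 2, 4)); // prints "true"
        // The same matrix does not fit a 3-class multinomial model.
        System.out.println(coefficientBoundsFit(1, 4, true, 3, 4));  // prints "false"
    }
}
```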
    • maxBlockSizeInMB

      public final DoubleParam maxBlockSizeInMB()
      Description copied from interface: HasMaxBlockSizeInMB
      Param for maximum memory in MB for stacking input data into blocks. Data is stacked within partitions. If the value exceeds the remaining data size in a partition, it is adjusted to the data size. The default of 0.0 means an optimal value is chosen, which depends on the specific algorithm. Must be >= 0.
      Specified by:
      maxBlockSizeInMB in interface HasMaxBlockSizeInMB
      Returns:
      (undocumented)
    • aggregationDepth

      public final IntParam aggregationDepth()
      Description copied from interface: HasAggregationDepth
      Param for suggested depth for treeAggregate (>= 2).
      Specified by:
      aggregationDepth in interface HasAggregationDepth
      Returns:
      (undocumented)
    • threshold

      public DoubleParam threshold()
      Description copied from interface: HasThreshold
      Param for threshold in binary classification prediction, in range [0, 1].
      Specified by:
      threshold in interface HasThreshold
      Returns:
      (undocumented)
    • weightCol

      public final Param<String> weightCol()
      Description copied from interface: HasWeightCol
      Param for weight column name. If this is not set or empty, we treat all instance weights as 1.0.
      Specified by:
      weightCol in interface HasWeightCol
      Returns:
      (undocumented)
    • standardization

      public final BooleanParam standardization()
      Description copied from interface: HasStandardization
      Param for whether to standardize the training features before fitting the model.
      Specified by:
      standardization in interface HasStandardization
      Returns:
      (undocumented)
    • tol

      public final DoubleParam tol()
      Description copied from interface: HasTol
      Param for the convergence tolerance for iterative algorithms (>= 0).
      Specified by:
      tol in interface HasTol
      Returns:
      (undocumented)
    • fitIntercept

      public final BooleanParam fitIntercept()
      Description copied from interface: HasFitIntercept
      Param for whether to fit an intercept term.
      Specified by:
      fitIntercept in interface HasFitIntercept
      Returns:
      (undocumented)
    • maxIter

      public final IntParam maxIter()
      Description copied from interface: HasMaxIter
      Param for maximum number of iterations (>= 0).
      Specified by:
      maxIter in interface HasMaxIter
      Returns:
      (undocumented)
    • elasticNetParam

      public final DoubleParam elasticNetParam()
      Description copied from interface: HasElasticNetParam
      Param for the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.
      Specified by:
      elasticNetParam in interface HasElasticNetParam
      Returns:
      (undocumented)
    • regParam

      public final DoubleParam regParam()
      Description copied from interface: HasRegParam
      Param for regularization parameter (>= 0).
      Specified by:
      regParam in interface HasRegParam
      Returns:
      (undocumented)
    • uid

      public String uid()
      Description copied from interface: Identifiable
      An immutable unique ID for the object and its derivatives.
      Specified by:
      uid in interface Identifiable
      Returns:
      (undocumented)
    • setRegParam

      public LogisticRegression setRegParam(double value)
      Set the regularization parameter. Default is 0.0.

      Parameters:
      value - (undocumented)
      Returns:
      (undocumented)
    • setElasticNetParam

      public LogisticRegression setElasticNetParam(double value)
      Set the ElasticNet mixing parameter. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. For alpha in (0,1), the penalty is a combination of L1 and L2. Default is 0.0 which is an L2 penalty.

      Note: Fitting under bound constrained optimization supports only L2 regularization, so an exception is thrown if this param is set to a non-zero value.

      Parameters:
      value - (undocumented)
      Returns:
      (undocumented)
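
      To make the mixing behaviour concrete, the sketch below computes a penalty for a coefficient array. It assumes the common elastic-net form regParam * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2). The description above states only the alpha = 0 and alpha = 1 endpoints, so the 1/2 factor and the exact combination are assumptions, and ElasticNetSketch is a hypothetical name, not a Spark class.

```java
// Hypothetical illustration of an elastic-net penalty; not Spark code.
final class ElasticNetSketch {
    /**
     * Assumed form: regParam * (alpha * L1 + (1 - alpha) / 2 * L2^2).
     * alpha = 0 gives a pure L2 penalty, alpha = 1 a pure L1 penalty.
     */
    static double penalty(double[] coefficients, double regParam, double alpha) {
        if (alpha < 0.0 || alpha > 1.0) {
            throw new IllegalArgumentException("alpha must be in [0, 1]: " + alpha);
        }
        if (regParam < 0.0) {
            throw new IllegalArgumentException("regParam must be >= 0: " + regParam);
        }
        double l1 = 0.0;        // sum of absolute values
        double l2Squared = 0.0; // sum of squares
        for (double w : coefficients) {
            l1 += Math.abs(w);
            l2Squared += w * w;
        }
        return regParam * (alpha * l1 + (1.0 - alpha) * 0.5 * l2Squared);
    }

    public static void main(String[] args) {
        double[] w = {3.0, -4.0};
        System.out.println("L2 only: " + penalty(w, 0.1, 0.0));
        System.out.println("L1 only: " + penalty(w, 0.1, 1.0));
        System.out.println("Mixed:   " + penalty(w, 0.1, 0.5));
    }
}
```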
    • setMaxIter

      public LogisticRegression setMaxIter(int value)
      Set the maximum number of iterations. Default is 100.

      Parameters:
      value - (undocumented)
      Returns:
      (undocumented)
    • setTol

      public LogisticRegression setTol(double value)
      Set the convergence tolerance of iterations. A smaller value leads to higher accuracy at the cost of more iterations. Default is 1E-6.

      Parameters:
      value - (undocumented)
      Returns:
      (undocumented)
    • setFitIntercept

      public LogisticRegression setFitIntercept(boolean value)
      Whether to fit an intercept term. Default is true.

      Parameters:
      value - (undocumented)
      Returns:
      (undocumented)
    • setFamily

      public LogisticRegression setFamily(String value)
      Sets the value of param family(). Default is "auto".

      Parameters:
      value - (undocumented)
      Returns:
      (undocumented)
    • setStandardization

      public LogisticRegression setStandardization(boolean value)
      Whether to standardize the training features before fitting the model. The coefficients of the model are always returned on the original scale, so standardization is transparent to users. Note that, with or without standardization, the models should always converge to the same solution when no regularization is applied. In R's GLMNET package, the default behavior is true as well. Default is true.

      Parameters:
      value - (undocumented)
      Returns:
      (undocumented)
    • setThreshold

      public LogisticRegression setThreshold(double value)
      Description copied from interface: LogisticRegressionParams
      Set threshold in binary classification, in range [0, 1].

      If the estimated probability of class label 1 is greater than threshold, then predict 1, else 0. A high threshold encourages the model to predict 0 more often; a low threshold encourages the model to predict 1 more often.

      Note: Calling this with threshold p is equivalent to calling setThresholds(Array(1-p, p)). When setThreshold() is called, any user-set value for thresholds will be cleared. If both threshold and thresholds are set in a ParamMap, then they must be equivalent.

      Default is 0.5.

      Specified by:
      setThreshold in interface LogisticRegressionParams
      Parameters:
      value - (undocumented)
      Returns:
      (undocumented)
    • getThreshold

      public double getThreshold()
      Description copied from interface: LogisticRegressionParams
      Get threshold for binary classification.

      If thresholds is set with length 2 (i.e., binary classification), this returns the equivalent threshold:

      1 / (1 + thresholds(0) / thresholds(1))

      Otherwise, returns threshold if set, or its default value if unset. Throws an IllegalArgumentException if thresholds is set to an array of length other than 2.

      Specified by:
      getThreshold in interface HasThreshold
      Specified by:
      getThreshold in interface LogisticRegressionParams
      Returns:
      (undocumented)
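
      The conversion between threshold and a two-element thresholds array can be checked with a few lines of plain Java. GetThresholdSketch and its methods are hypothetical names used only for illustration. The formulas are the ones given in the descriptions of getThreshold and setThreshold.

```java
// Hypothetical illustration of the threshold <-> thresholds conversion; not Spark code.
final class GetThresholdSketch {
    /** Binary thresholds equivalent to a single threshold p: (1 - p, p). */
    static double[] toThresholds(double threshold) {
        return new double[] {1.0 - threshold, threshold};
    }

    /** Single threshold equivalent to binary thresholds: 1 / (1 + t0 / t1). */
    static double toThreshold(double[] thresholds) {
        if (thresholds.length != 2) {
            throw new IllegalArgumentException(
                "Expected exactly 2 thresholds but got " + thresholds.length);
        }
        return 1.0 / (1.0 + thresholds[0] / thresholds[1]);
    }

    public static void main(String[] args) {
        // Equal thresholds correspond to the default threshold of 0.5.
        System.out.println(toThreshold(new double[] {1.0, 1.0})); // prints "0.5"
    }
}
```

      Because only the ratio thresholds(0) / thresholds(1) enters the formula, arrays such as (1, 3) and (0.25, 0.75) map to the same threshold.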
    • setWeightCol

      public LogisticRegression setWeightCol(String value)
      Sets the value of param weightCol(). If this is not set or empty, we treat all instance weights as 1.0. Default is not set, so all instances have weight one.

      Parameters:
      value - (undocumented)
      Returns:
      (undocumented)
    • setThresholds

      public LogisticRegression setThresholds(double[] value)
      Description copied from interface: LogisticRegressionParams
      Set thresholds in multiclass (or binary) classification to adjust the probability of predicting each class. The array must have length equal to the number of classes, with values greater than 0, except that at most one value may be 0. The class with the largest value of p/t is predicted, where p is the original probability of that class and t is the class's threshold.

      Note: When setThresholds() is called, any user-set value for threshold will be cleared. If both threshold and thresholds are set in a ParamMap, then they must be equivalent.

      Specified by:
      setThresholds in interface LogisticRegressionParams
      Overrides:
      setThresholds in class ProbabilisticClassifier<Vector,LogisticRegression,LogisticRegressionModel>
      Parameters:
      value - (undocumented)
      Returns:
      (undocumented)
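
      The p/t rule can be sketched as an argmax over scaled probabilities. The code below is a plain-Java illustration with hypothetical names (ThresholdsSketch, predict). Treating a zero threshold as an infinitely large scaled score is an assumption made here so the single permitted zero always wins. The description above does not spell out that case.

```java
// Hypothetical illustration of the documented p/t prediction rule; not Spark code.
final class ThresholdsSketch {
    /** Returns the index of the class with the largest probability / threshold ratio. */
    static int predict(double[] probabilities, double[] thresholds) {
        if (probabilities.length != thresholds.length) {
            throw new IllegalArgumentException("thresholds must have one entry per class");
        }
        int best = 0;
        double bestScaled = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < probabilities.length; i++) {
            // Assumption: a threshold of 0 is treated as an infinite scaled score.
            double scaled = thresholds[i] == 0.0
                ? Double.POSITIVE_INFINITY
                : probabilities[i] / thresholds[i];
            if (scaled > bestScaled) { // ties keep the lowest class index
                bestScaled = scaled;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] p = {0.2, 0.5, 0.3};
        System.out.println(predict(p, new double[] {1.0, 1.0, 1.0})); // prints "1"
        System.out.println(predict(p, new double[] {0.1, 0.5, 0.5})); // prints "0"
    }
}
```

      For a binary problem, thresholds of (1 - t, t) reduce to the single-threshold rule: class 1 wins exactly when p1 / t > (1 - p1) / (1 - t), that is, when p1 > t.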
    • getThresholds

      public double[] getThresholds()
      Description copied from interface: LogisticRegressionParams
      Get thresholds for binary or multiclass classification.

      If thresholds is set, returns its value. Otherwise, if threshold is set, returns the equivalent thresholds for binary classification: (1-threshold, threshold). If neither is set, an exception is thrown.

      Specified by:
      getThresholds in interface HasThresholds
      Specified by:
      getThresholds in interface LogisticRegressionParams
      Returns:
      (undocumented)
    • setAggregationDepth

      public LogisticRegression setAggregationDepth(int value)
      Suggested depth for treeAggregate (greater than or equal to 2). If the dimensions of features or the number of partitions are large, this param could be adjusted to a larger size. Default is 2.

      Parameters:
      value - (undocumented)
      Returns:
      (undocumented)
    • setLowerBoundsOnCoefficients

      public LogisticRegression setLowerBoundsOnCoefficients(Matrix value)
      Set the lower bounds on coefficients if fitting under bound constrained optimization.

      Parameters:
      value - (undocumented)
      Returns:
      (undocumented)
    • setUpperBoundsOnCoefficients

      public LogisticRegression setUpperBoundsOnCoefficients(Matrix value)
      Set the upper bounds on coefficients if fitting under bound constrained optimization.

      Parameters:
      value - (undocumented)
      Returns:
      (undocumented)
    • setLowerBoundsOnIntercepts

      public LogisticRegression setLowerBoundsOnIntercepts(Vector value)
      Set the lower bounds on intercepts if fitting under bound constrained optimization.

      Parameters:
      value - (undocumented)
      Returns:
      (undocumented)
    • setUpperBoundsOnIntercepts

      public LogisticRegression setUpperBoundsOnIntercepts(Vector value)
      Set the upper bounds on intercepts if fitting under bound constrained optimization.

      Parameters:
      value - (undocumented)
      Returns:
      (undocumented)
    • setMaxBlockSizeInMB

      public LogisticRegression setMaxBlockSizeInMB(double value)
      Sets the value of param maxBlockSizeInMB(). Default is 0.0, in which case a block size of 1.0 MB is chosen.

      Parameters:
      value - (undocumented)
      Returns:
      (undocumented)
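
      The default handling described for maxBlockSizeInMB amounts to a one-line rule, sketched below in plain Java. BlockSizeSketch and effectiveBlockSizeInMB are hypothetical names. The 1.0 MB fallback is the value this page states for LogisticRegression, and other algorithms may choose differently.

```java
// Hypothetical illustration of the documented default handling; not Spark code.
final class BlockSizeSketch {
    /** Returns the block size that would be used for a given param value. */
    static double effectiveBlockSizeInMB(double maxBlockSizeInMB) {
        if (maxBlockSizeInMB < 0.0) {
            throw new IllegalArgumentException("maxBlockSizeInMB must be >= 0");
        }
        // 0.0 means "let the algorithm choose"; this page states 1.0 MB for logistic regression.
        return maxBlockSizeInMB == 0.0 ? 1.0 : maxBlockSizeInMB;
    }

    public static void main(String[] args) {
        System.out.println(effectiveBlockSizeInMB(0.0)); // prints "1.0"
        System.out.println(effectiveBlockSizeInMB(4.0)); // prints "4.0"
    }
}
```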
    • setInitialModel

      public LogisticRegression setInitialModel(LogisticRegressionModel model)
    • copy

      public LogisticRegression copy(ParamMap extra)
      Description copied from interface: Params
      Creates a copy of this instance with the same UID and some extra params. Subclasses should implement this method and set the return type properly. See defaultCopy().
      Specified by:
      copy in interface Params
      Specified by:
      copy in class Predictor<Vector,LogisticRegression,LogisticRegressionModel>
      Parameters:
      extra - (undocumented)
      Returns:
      (undocumented)