Class GradientBoostedTrees

Object
org.apache.spark.ml.tree.impl.GradientBoostedTrees

public class GradientBoostedTrees extends Object
  • Constructor Details

    • GradientBoostedTrees

      public GradientBoostedTrees()
  • Method Details

    • run

      public static scala.Tuple2<DecisionTreeRegressionModel[],double[]> run(RDD<org.apache.spark.ml.feature.Instance> input, BoostingStrategy boostingStrategy, long seed, String featureSubsetStrategy, scala.Option<org.apache.spark.ml.util.Instrumentation> instr)
      Method to train a gradient boosting model.
      Parameters:
      input - Training dataset: RDD of Instance.
      seed - Random seed.
      boostingStrategy - boosting parameters
      featureSubsetStrategy - (undocumented)
      instr - (undocumented)
      Returns:
      tuple of ensemble models and weights: (array of decision tree models, array of model weights)
    • runWithValidation

      public static scala.Tuple2<DecisionTreeRegressionModel[],double[]> runWithValidation(RDD<org.apache.spark.ml.feature.Instance> input, RDD<org.apache.spark.ml.feature.Instance> validationInput, BoostingStrategy boostingStrategy, long seed, String featureSubsetStrategy, scala.Option<org.apache.spark.ml.util.Instrumentation> instr)
      Method to train a gradient boosting model while evaluating it on a validation dataset.
      Parameters:
      input - Training dataset: RDD of Instance.
      validationInput - Validation dataset. This dataset should be different from the training dataset, but it should follow the same distribution. For example, the two datasets could be created by splitting an original dataset with org.apache.spark.rdd.RDD.randomSplit().
      seed - Random seed.
      boostingStrategy - boosting parameters
      featureSubsetStrategy - (undocumented)
      instr - (undocumented)
      Returns:
      tuple of ensemble models and weights: (array of decision tree models, array of model weights)
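      A validation dataset is typically used to decide how many boosting iterations to keep. The sketch below shows one possible rule of that kind in plain Java. The class, the method selectNumTrees, the parameter tol, and the stopping rule itself are assumptions made for illustration; they are not taken from this page, so consult the BoostingStrategy documentation for the criterion actually applied.

```java
// Illustrative sketch only: the stopping rule and all names below are
// assumptions for this example, not part of the documented API.
class ValidationStoppingSketch {

    // validationErrors[m] is the validation error of the ensemble made of the
    // first m + 1 trees. Returns how many trees to keep.
    static int selectNumTrees(double[] validationErrors, double tol) {
        double bestError = validationErrors[0];
        int bestCount = 1;
        for (int m = 1; m < validationErrors.length; m++) {
            double current = validationErrors[m];
            // Stop once the improvement over the best error so far falls
            // below a tolerance relative to the current error.
            if (bestError - current < tol * Math.max(current, 0.01)) {
                break;
            }
            if (current < bestError) {
                bestError = current;
                bestCount = m + 1;
            }
        }
        return bestCount;
    }

    public static void main(String[] args) {
        double[] errors = {1.0, 0.5, 0.3, 0.29, 0.4};
        System.out.println(selectNumTrees(errors, 0.1));
    }
}
```

      With a rule like this, trees trained after the validation error stops improving are discarded rather than returned.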
    • computeInitialPredictionAndError

      public static RDD<scala.Tuple2<Object,Object>> computeInitialPredictionAndError(RDD<org.apache.spark.ml.tree.impl.TreePoint> data, double initTreeWeight, DecisionTreeRegressionModel initTree, Loss loss, Broadcast<Split[][]> bcSplits)
      Compute the initial predictions and errors for a dataset for the first iteration of gradient boosting.
      Parameters:
      data - Training data.
      initTreeWeight - Learning rate assigned to the first tree.
      initTree - First DecisionTreeRegressionModel.
      loss - Evaluation metric.
      bcSplits - (undocumented)
      Returns:
      an RDD in which each element is a (prediction, error) pair for the corresponding sample.
    • updatePredictionError

      public static RDD<scala.Tuple2<Object,Object>> updatePredictionError(RDD<org.apache.spark.ml.tree.impl.TreePoint> data, RDD<scala.Tuple2<Object,Object>> predictionAndError, double treeWeight, DecisionTreeRegressionModel tree, Loss loss, Broadcast<Split[][]> bcSplits)
      Update a zipped predictionError RDD (as obtained with computeInitialPredictionAndError) with the contribution of a new tree.
      Parameters:
      data - Training data.
      predictionAndError - predictionError RDD.
      treeWeight - Learning rate.
      tree - Tree used to update the prediction and error.
      loss - Evaluation metric.
      bcSplits - (undocumented)
      Returns:
      an RDD in which each element is a (prediction, error) pair for the corresponding sample.
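      The relationship between computeInitialPredictionAndError and updatePredictionError can be sketched without Spark. In the plain-Java sketch below, arrays stand in for the RDDs, a lambda stands in for a trained tree, and squared error stands in for the loss argument; all of these, along with the class and method names, are assumptions made for illustration.

```java
import java.util.function.ToDoubleFunction;

// Illustrative sketch only: arrays replace RDDs, lambdas replace trained
// trees, and squared error replaces the Loss argument.
class PredictionErrorSketch {

    static double squaredError(double prediction, double label) {
        double diff = label - prediction;
        return diff * diff;
    }

    // First iteration: each row of the result is {prediction, error}.
    static double[][] computeInitial(double[][] features, double[] labels,
                                     double initTreeWeight,
                                     ToDoubleFunction<double[]> initTree) {
        double[][] out = new double[labels.length][];
        for (int i = 0; i < labels.length; i++) {
            double pred = initTreeWeight * initTree.applyAsDouble(features[i]);
            out[i] = new double[] {pred, squaredError(pred, labels[i])};
        }
        return out;
    }

    // Later iterations: add the weighted output of the new tree to the
    // previous prediction, then recompute the error against the label.
    static double[][] update(double[][] features, double[] labels,
                             double[][] predictionAndError, double treeWeight,
                             ToDoubleFunction<double[]> tree) {
        double[][] out = new double[labels.length][];
        for (int i = 0; i < labels.length; i++) {
            double pred = predictionAndError[i][0]
                    + treeWeight * tree.applyAsDouble(features[i]);
            out[i] = new double[] {pred, squaredError(pred, labels[i])};
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] x = {{0.0}, {1.0}};
        double[] y = {1.0, 3.0};
        double[][] first = computeInitial(x, y, 1.0, f -> 2.0);
        double[][] second = update(x, y, first, 0.5, f -> f[0] < 0.5 ? -1.0 : 1.0);
        for (double[] row : second) {
            System.out.println(row[0] + " " + row[1]);
        }
    }
}
```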
    • updatePrediction

      public static double updatePrediction(org.apache.spark.ml.tree.impl.TreePoint treePoint, double prediction, DecisionTreeRegressionModel tree, double weight, Split[][] splits)
      Add prediction from a new boosting iteration to an existing prediction.

      Parameters:
      treePoint - Binned vector of features representing a single data point.
      prediction - The existing prediction.
      tree - New Decision Tree model.
      weight - Tree weight.
      splits - (undocumented)
      Returns:
      Updated prediction.
    • updatePrediction

      public static double updatePrediction(Vector features, double prediction, DecisionTreeRegressionModel tree, double weight)
      Add prediction from a new boosting iteration to an existing prediction.

      Parameters:
      features - Vector of features representing a single data point.
      prediction - The existing prediction.
      tree - New Decision Tree model.
      weight - Tree weight.
      Returns:
      Updated prediction.
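      Adding a prediction from a new boosting iteration amounts to scaling the new tree's output by its weight and adding it to the existing prediction. A minimal Java sketch of that arithmetic follows; the class name is hypothetical, and a ToDoubleFunction lambda is used as a stand-in for a DecisionTreeRegressionModel, which is an assumption made so the example runs without Spark.

```java
import java.util.function.ToDoubleFunction;

// Illustrative sketch only: a lambda stands in for a trained
// DecisionTreeRegressionModel, and a double[] for a feature Vector.
class UpdatePredictionSketch {

    // Adds the weighted prediction of a new tree to an existing prediction.
    static double updatePrediction(double[] features, double prediction,
                                   ToDoubleFunction<double[]> tree, double weight) {
        return prediction + weight * tree.applyAsDouble(features);
    }

    public static void main(String[] args) {
        // Stand-in "tree": a single split on feature 0.
        ToDoubleFunction<double[]> stump = f -> f[0] <= 0.5 ? -1.0 : 2.0;
        double updated = updatePrediction(new double[] {0.9}, 1.0, stump, 0.1);
        System.out.println(updated);
    }
}
```

      Folding this update over every (tree, weight) pair, starting from 0.0, gives the prediction of the whole ensemble for one data point.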
    • computeWeightedError

      public static double computeWeightedError(RDD<org.apache.spark.ml.feature.Instance> data, DecisionTreeRegressionModel[] trees, double[] treeWeights, Loss loss)
      Method to calculate error of the base learner for the gradient boosting calculation. Note: This method is not used by the gradient boosting algorithm but is useful for debugging purposes.
      Parameters:
      data - Training dataset: RDD of Instance.
      trees - Boosted Decision Tree models
      treeWeights - Learning rates at each boosting iteration.
      loss - evaluation metric.
      Returns:
      Measure of model error on data
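      As a debugging aid, the error of an ensemble can be reproduced by hand. The plain-Java sketch below assumes the error is the mean of the per-instance loss weighted by instance weight, with squared error standing in for the loss argument; the class name, the method names, and both of those choices are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Illustrative sketch only: arrays replace the RDD of Instance, lambdas
// replace trained trees, and squared error replaces the Loss argument.
class WeightedErrorSketch {

    // Ensemble prediction: sum of each tree's output scaled by its weight.
    static double predict(double[] features,
                          List<ToDoubleFunction<double[]>> trees,
                          double[] treeWeights) {
        double prediction = 0.0;
        for (int t = 0; t < trees.size(); t++) {
            prediction += treeWeights[t] * trees.get(t).applyAsDouble(features);
        }
        return prediction;
    }

    // Mean per-instance squared error, weighted by instance weight.
    static double weightedError(double[][] features, double[] labels,
                                double[] instanceWeights,
                                List<ToDoubleFunction<double[]>> trees,
                                double[] treeWeights) {
        double errorSum = 0.0;
        double weightSum = 0.0;
        for (int i = 0; i < labels.length; i++) {
            double diff = labels[i] - predict(features[i], trees, treeWeights);
            errorSum += instanceWeights[i] * diff * diff;
            weightSum += instanceWeights[i];
        }
        return errorSum / weightSum;
    }

    public static void main(String[] args) {
        ToDoubleFunction<double[]> constantTree = f -> 2.0;
        List<ToDoubleFunction<double[]>> trees = Arrays.asList(constantTree);
        double error = weightedError(new double[][] {{0.0}, {1.0}},
                new double[] {1.0, 4.0}, new double[] {1.0, 3.0},
                trees, new double[] {1.0});
        System.out.println(error);
    }
}
```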
    • computeWeightedError

      public static double computeWeightedError(RDD<org.apache.spark.ml.tree.impl.TreePoint> data, RDD<scala.Tuple2<Object,Object>> predError)
      Method to calculate error of the base learner for the gradient boosting calculation.
      Parameters:
      data - Training dataset: RDD of TreePoint.
      predError - Prediction and error.
      Returns:
      Measure of model error on data
    • evaluateEachIteration

      public static double[] evaluateEachIteration(RDD<org.apache.spark.ml.feature.Instance> data, DecisionTreeRegressionModel[] trees, double[] treeWeights, Loss loss, scala.Enumeration.Value algo)
      Method to compute error or loss for every iteration of gradient boosting.

      Parameters:
      data - RDD of Instance
      trees - Boosted Decision Tree models
      treeWeights - Learning rates at each boosting iteration.
      loss - evaluation metric.
      algo - algorithm for the ensemble, either Classification or Regression
      Returns:
      an array with index i having the losses or errors for the ensemble containing the first i+1 trees
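      The shape of the returned array can be illustrated with a plain-Java sketch for the regression case. Arrays stand in for the RDD, lambdas for trained trees, and unweighted mean squared error for the loss argument; instance weights and the classification case are left out for brevity. The class name and these simplifications are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Illustrative sketch only: regression case with unweighted mean squared
// error; arrays replace the RDD and lambdas replace trained trees.
class EvaluateEachIterationSketch {

    // losses[i] is the loss of the ensemble made of the first i + 1 trees.
    static double[] evaluateEachIteration(double[][] features, double[] labels,
                                          List<ToDoubleFunction<double[]>> trees,
                                          double[] treeWeights) {
        int n = labels.length;
        double[] running = new double[n];          // running ensemble predictions
        double[] losses = new double[trees.size()];
        for (int t = 0; t < trees.size(); t++) {
            double sum = 0.0;
            for (int i = 0; i < n; i++) {
                // Add tree t's weighted output, then score the prefix ensemble.
                running[i] += treeWeights[t] * trees.get(t).applyAsDouble(features[i]);
                double diff = labels[i] - running[i];
                sum += diff * diff;
            }
            losses[t] = sum / n;
        }
        return losses;
    }

    public static void main(String[] args) {
        ToDoubleFunction<double[]> first = f -> 2.0;
        ToDoubleFunction<double[]> second = f -> f[0] < 0.5 ? -1.0 : 1.0;
        List<ToDoubleFunction<double[]>> trees = Arrays.asList(first, second);
        double[] losses = evaluateEachIteration(new double[][] {{0.0}, {1.0}},
                new double[] {1.0, 3.0}, trees, new double[] {1.0, 0.5});
        System.out.println(Arrays.toString(losses));
    }
}
```

      Keeping a running prediction per sample means each tree is evaluated once, instead of re-scoring every prefix of the ensemble from scratch.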
    • boost

      public static scala.Tuple2<DecisionTreeRegressionModel[],double[]> boost(RDD<org.apache.spark.ml.feature.Instance> input, RDD<org.apache.spark.ml.feature.Instance> validationInput, BoostingStrategy boostingStrategy, boolean validate, long seed, String featureSubsetStrategy, scala.Option<org.apache.spark.ml.util.Instrumentation> instr)
      Internal method for performing regression using trees as base learners.
      Parameters:
      input - training dataset
      validationInput - validation dataset, ignored if validate is set to false.
      boostingStrategy - boosting parameters
      validate - whether or not to use the validation dataset.
      seed - Random seed.
      featureSubsetStrategy - (undocumented)
      instr - (undocumented)
      Returns:
      tuple of ensemble models and weights: (array of decision tree models, array of model weights)
    • org$apache$spark$internal$Logging$$log_

      public static org.slf4j.Logger org$apache$spark$internal$Logging$$log_()
    • org$apache$spark$internal$Logging$$log__$eq

      public static void org$apache$spark$internal$Logging$$log__$eq(org.slf4j.Logger x$1)
    • LogStringContext

      public static org.apache.spark.internal.Logging.LogStringContext LogStringContext(scala.StringContext sc)