Package org.apache.spark.ml.tree.impl
Class GradientBoostedTrees
Object
org.apache.spark.ml.tree.impl.GradientBoostedTrees
-
Constructor Summary
-
Method Summary
Modifier and Type
Method
Description

static scala.Tuple2<DecisionTreeRegressionModel[],double[]>
boost(RDD<org.apache.spark.ml.feature.Instance> input, RDD<org.apache.spark.ml.feature.Instance> validationInput, BoostingStrategy boostingStrategy, boolean validate, long seed, String featureSubsetStrategy, scala.Option<org.apache.spark.ml.util.Instrumentation> instr)
Internal method for performing regression using trees as base learners.

static RDD<scala.Tuple2<Object,Object>>
computeInitialPredictionAndError(RDD<org.apache.spark.ml.tree.impl.TreePoint> data, double initTreeWeight, DecisionTreeRegressionModel initTree, Loss loss, Broadcast<Split[][]> bcSplits)
Compute the initial predictions and errors for a dataset for the first iteration of gradient boosting.

static double
computeWeightedError(RDD<org.apache.spark.ml.feature.Instance> data, DecisionTreeRegressionModel[] trees, double[] treeWeights, Loss loss)
Method to calculate error of the base learner for the gradient boosting calculation.

static double
computeWeightedError(RDD<org.apache.spark.ml.tree.impl.TreePoint> data, RDD<scala.Tuple2<Object,Object>> predError)
Method to calculate error of the base learner for the gradient boosting calculation.

static double[]
evaluateEachIteration(RDD<org.apache.spark.ml.feature.Instance> data, DecisionTreeRegressionModel[] trees, double[] treeWeights, Loss loss, scala.Enumeration.Value algo)
Method to compute error or loss for every iteration of gradient boosting.

static org.apache.spark.internal.Logging.LogStringContext
LogStringContext(scala.StringContext sc)

static org.slf4j.Logger
org$apache$spark$internal$Logging$$log_()

static void
org$apache$spark$internal$Logging$$log__$eq(org.slf4j.Logger x$1)

static scala.Tuple2<DecisionTreeRegressionModel[],double[]>
run(RDD<org.apache.spark.ml.feature.Instance> input, BoostingStrategy boostingStrategy, long seed, String featureSubsetStrategy, scala.Option<org.apache.spark.ml.util.Instrumentation> instr)
Method to train a gradient boosting model.

static scala.Tuple2<DecisionTreeRegressionModel[],double[]>
runWithValidation(RDD<org.apache.spark.ml.feature.Instance> input, RDD<org.apache.spark.ml.feature.Instance> validationInput, BoostingStrategy boostingStrategy, long seed, String featureSubsetStrategy, scala.Option<org.apache.spark.ml.util.Instrumentation> instr)
Method to validate a gradient boosting model.

static double
updatePrediction(Vector features, double prediction, DecisionTreeRegressionModel tree, double weight)
Add prediction from a new boosting iteration to an existing prediction.

static double
updatePrediction(org.apache.spark.ml.tree.impl.TreePoint treePoint, double prediction, DecisionTreeRegressionModel tree, double weight, Split[][] splits)
Add prediction from a new boosting iteration to an existing prediction.

static RDD<scala.Tuple2<Object,Object>>
updatePredictionError(RDD<org.apache.spark.ml.tree.impl.TreePoint> data, RDD<scala.Tuple2<Object,Object>> predictionAndError, double treeWeight, DecisionTreeRegressionModel tree, Loss loss, Broadcast<Split[][]> bcSplits)
Update a zipped predictionError RDD (as obtained with computeInitialPredictionAndError).
-
Constructor Details
-
GradientBoostedTrees
public GradientBoostedTrees()
-
-
Method Details
-
run
public static scala.Tuple2<DecisionTreeRegressionModel[],double[]> run(RDD<org.apache.spark.ml.feature.Instance> input, BoostingStrategy boostingStrategy, long seed, String featureSubsetStrategy, scala.Option<org.apache.spark.ml.util.Instrumentation> instr)
Method to train a gradient boosting model.
Parameters:
input - Training dataset: RDD of Instance.
seed - Random seed.
boostingStrategy - (undocumented)
featureSubsetStrategy - (undocumented)
instr - (undocumented)
Returns:
tuple of ensemble models and weights: (array of decision tree models, array of model weights)
-
runWithValidation
public static scala.Tuple2<DecisionTreeRegressionModel[],double[]> runWithValidation(RDD<org.apache.spark.ml.feature.Instance> input, RDD<org.apache.spark.ml.feature.Instance> validationInput, BoostingStrategy boostingStrategy, long seed, String featureSubsetStrategy, scala.Option<org.apache.spark.ml.util.Instrumentation> instr)
Method to validate a gradient boosting model.
Parameters:
input - Training dataset: RDD of Instance.
validationInput - Validation dataset. This dataset should be different from the training dataset, but it should follow the same distribution. E.g., these two datasets could be created from an original dataset by using org.apache.spark.rdd.RDD.randomSplit().
seed - Random seed.
boostingStrategy - (undocumented)
featureSubsetStrategy - (undocumented)
instr - (undocumented)
Returns:
tuple of ensemble models and weights: (array of decision tree models, array of model weights)
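The validationInput note above suggests deriving both datasets from one original dataset by a random split. The following is a minimal plain-Java sketch of that idea on an in-memory list, not a call into Spark: the class RandomSplitSketch and its randomSplit helper are hypothetical names introduced here, and the per-element coin flip is an assumption about how such a split can be done, so the resulting sizes only approximate the requested fraction.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Illustrative sketch only; RandomSplitSketch is not part of Spark. */
final class RandomSplitSketch {

    /**
     * Splits data into {training, validation} by an independent seeded
     * coin flip per element. Both parts are drawn from the same original
     * data, so they follow the same distribution and do not overlap.
     */
    static <T> List<List<T>> randomSplit(List<T> data, double trainFraction, long seed) {
        Random rng = new Random(seed);
        List<T> training = new ArrayList<>();
        List<T> validation = new ArrayList<>();
        for (T item : data) {
            if (rng.nextDouble() < trainFraction) {
                training.add(item);
            } else {
                validation.add(item);
            }
        }
        List<List<T>> parts = new ArrayList<>();
        parts.add(training);
        parts.add(validation);
        return parts;
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            data.add(i);
        }
        List<List<Integer>> parts = randomSplit(data, 0.8, 42L);
        System.out.println("training size = " + parts.get(0).size());
        System.out.println("validation size = " + parts.get(1).size());
    }
}
```

Reusing the same seed reproduces the same split, which keeps training and validation runs comparable.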
-
computeInitialPredictionAndError
public static RDD<scala.Tuple2<Object,Object>> computeInitialPredictionAndError(RDD<org.apache.spark.ml.tree.impl.TreePoint> data, double initTreeWeight, DecisionTreeRegressionModel initTree, Loss loss, Broadcast<Split[][]> bcSplits)
Compute the initial predictions and errors for a dataset for the first iteration of gradient boosting.
Parameters:
data - training data.
initTreeWeight - learning rate assigned to the first tree.
initTree - first DecisionTreeModel.
loss - evaluation metric.
bcSplits - (undocumented)
Returns:
an RDD in which each element is a (prediction, error) pair for the corresponding sample.
-
updatePredictionError
public static RDD<scala.Tuple2<Object,Object>> updatePredictionError(RDD<org.apache.spark.ml.tree.impl.TreePoint> data, RDD<scala.Tuple2<Object,Object>> predictionAndError, double treeWeight, DecisionTreeRegressionModel tree, Loss loss, Broadcast<Split[][]> bcSplits)
Update a zipped predictionError RDD (as obtained with computeInitialPredictionAndError).
Parameters:
data - training data.
predictionAndError - predictionError RDD.
treeWeight - learning rate.
tree - tree with which the prediction and error should be updated.
loss - evaluation metric.
bcSplits - (undocumented)
Returns:
an RDD in which each element is a (prediction, error) pair for the corresponding sample.
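As a rough illustration of the update described above, the sketch below keeps a (prediction, error) pair per sample in plain arrays instead of an RDD. It assumes the new prediction is the old one plus the tree output scaled by treeWeight, and that the error is the loss evaluated at the new prediction. The TreeModel and PointLoss interfaces and the squared-error loss are hypothetical stand-ins, not Spark types.

```java
/** Illustrative sketch only; none of these names come from Spark. */
final class PredictionErrorSketch {

    /** Hypothetical stand-in for a fitted regression tree. */
    interface TreeModel {
        double predict(double[] features);
    }

    /** Hypothetical stand-in for the loss parameter. */
    interface PointLoss {
        double computeError(double prediction, double label);
    }

    /** Squared error, chosen here only as an example loss. */
    static final PointLoss SQUARED_ERROR = (p, y) -> (y - p) * (y - p);

    /**
     * Returns new {prediction, error} pairs, one per sample, after adding
     * the weighted output of one more tree to each existing prediction.
     * The input pairs are left unmodified.
     */
    static double[][] updatePredictionError(double[][] features, double[] labels,
                                            double[][] predictionAndError, double treeWeight,
                                            TreeModel tree, PointLoss loss) {
        double[][] updated = new double[features.length][];
        for (int i = 0; i < features.length; i++) {
            double pred = predictionAndError[i][0] + tree.predict(features[i]) * treeWeight;
            updated[i] = new double[] {pred, loss.computeError(pred, labels[i])};
        }
        return updated;
    }

    public static void main(String[] args) {
        double[][] features = {{0.0}, {1.0}};
        double[] labels = {1.0, 3.0};
        double[][] pairs = {{0.0, 1.0}, {0.0, 9.0}};
        TreeModel tree = f -> 2.0 * f[0] + 1.0;
        double[][] next = updatePredictionError(features, labels, pairs, 0.5, tree, SQUARED_ERROR);
        for (double[] pair : next) {
            System.out.println("prediction = " + pair[0] + ", error = " + pair[1]);
        }
    }
}
```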
-
updatePrediction
public static double updatePrediction(org.apache.spark.ml.tree.impl.TreePoint treePoint, double prediction, DecisionTreeRegressionModel tree, double weight, Split[][] splits)
Add prediction from a new boosting iteration to an existing prediction.
Parameters:
treePoint - Binned vector of features representing a single data point.
prediction - The existing prediction.
tree - New Decision Tree model.
weight - Tree weight.
splits - (undocumented)
Returns:
Updated prediction.
-
updatePrediction
public static double updatePrediction(Vector features, double prediction, DecisionTreeRegressionModel tree, double weight)
Add prediction from a new boosting iteration to an existing prediction.
Parameters:
features - Vector of features representing a single data point.
prediction - The existing prediction.
tree - New Decision Tree model.
weight - Tree weight.
Returns:
Updated prediction.
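The arithmetic behind "add prediction from a new boosting iteration" can be sketched in plain Java. This assumes the update is additive, that is, the existing prediction plus the new tree's output scaled by its weight. The TreeModel interface is a hypothetical stand-in for DecisionTreeRegressionModel, and a double[] replaces Vector.

```java
/** Illustrative sketch only; none of these names come from Spark. */
final class UpdatePredictionSketch {

    /** Hypothetical stand-in for a fitted regression tree. */
    interface TreeModel {
        double predict(double[] features);
    }

    /** Adds the new tree's weighted output to the running ensemble prediction. */
    static double updatePrediction(double[] features, double prediction,
                                   TreeModel tree, double weight) {
        return prediction + tree.predict(features) * weight;
    }

    public static void main(String[] args) {
        // A one-split "stump" used purely as example data.
        TreeModel stump = f -> f[0] <= 0.5 ? -1.0 : 2.0;
        double updated = updatePrediction(new double[] {0.9}, 1.0, stump, 0.1);
        System.out.println("updated prediction = " + updated);
    }
}
```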
-
computeWeightedError
public static double computeWeightedError(RDD<org.apache.spark.ml.feature.Instance> data, DecisionTreeRegressionModel[] trees, double[] treeWeights, Loss loss)
Method to calculate error of the base learner for the gradient boosting calculation.
Note: This method is not used by the gradient boosting algorithm but is useful for debugging purposes.
Parameters:
data - Training dataset: RDD of Instance.
trees - Boosted Decision Tree models.
treeWeights - Learning rates at each boosting iteration.
loss - evaluation metric.
Returns:
Measure of model error on data.
-
computeWeightedError
public static double computeWeightedError(RDD<org.apache.spark.ml.tree.impl.TreePoint> data, RDD<scala.Tuple2<Object,Object>> predError)
Method to calculate error of the base learner for the gradient boosting calculation.
Parameters:
data - Training dataset: RDD of TreePoint.
predError - Prediction and error.
Returns:
Measure of model error on data.
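One plausible reading of "weighted error" is an instance-weighted mean of per-sample errors. The sketch below shows that calculation on plain arrays. It is an assumption about the aggregation, not a statement of what this method does internally, and WeightedErrorSketch is a hypothetical name.

```java
/** Illustrative sketch only; WeightedErrorSketch is not part of Spark. */
final class WeightedErrorSketch {

    /**
     * Returns sum(error[i] * weight[i]) / sum(weight[i]).
     * Assumes at least one sample with a positive weight.
     */
    static double weightedError(double[] errors, double[] weights) {
        if (errors.length != weights.length) {
            throw new IllegalArgumentException("errors and weights must have the same length");
        }
        double errorSum = 0.0;
        double weightSum = 0.0;
        for (int i = 0; i < errors.length; i++) {
            errorSum += errors[i] * weights[i];
            weightSum += weights[i];
        }
        return errorSum / weightSum;
    }

    public static void main(String[] args) {
        double[] errors = {1.0, 4.0};
        double[] weights = {1.0, 3.0};
        System.out.println("weighted error = " + weightedError(errors, weights));
    }
}
```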
-
evaluateEachIteration
public static double[] evaluateEachIteration(RDD<org.apache.spark.ml.feature.Instance> data, DecisionTreeRegressionModel[] trees, double[] treeWeights, Loss loss, scala.Enumeration.Value algo)
Method to compute error or loss for every iteration of gradient boosting.
Parameters:
data - RDD of Instance.
trees - Boosted Decision Tree models.
treeWeights - Learning rates at each boosting iteration.
loss - evaluation metric.
algo - algorithm for the ensemble, either Classification or Regression.
Returns:
an array in which index i holds the loss or error of the ensemble containing the first i+1 trees.
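The return contract (index i covers the first i+1 trees) can be illustrated by accumulating predictions tree by tree and recording the loss after each addition. The sketch below assumes a regression setting with mean squared error and ignores instance weights for brevity. TreeModel is a hypothetical stand-in for DecisionTreeRegressionModel, and plain arrays replace the RDD.

```java
/** Illustrative sketch only; none of these names come from Spark. */
final class EvaluateEachIterationSketch {

    /** Hypothetical stand-in for a fitted regression tree. */
    interface TreeModel {
        double predict(double[] features);
    }

    /**
     * Returns losses where losses[t] is the mean squared error of the
     * ensemble made of trees[0..t], each scaled by its tree weight.
     */
    static double[] evaluateEachIteration(double[][] features, double[] labels,
                                          TreeModel[] trees, double[] treeWeights) {
        int n = features.length;
        double[] predictions = new double[n]; // running ensemble prediction per sample
        double[] losses = new double[trees.length];
        for (int t = 0; t < trees.length; t++) {
            double total = 0.0;
            for (int i = 0; i < n; i++) {
                predictions[i] += trees[t].predict(features[i]) * treeWeights[t];
                double diff = labels[i] - predictions[i];
                total += diff * diff;
            }
            losses[t] = total / n;
        }
        return losses;
    }

    public static void main(String[] args) {
        double[][] features = {{0.0}, {1.0}};
        double[] labels = {1.0, 3.0};
        TreeModel[] trees = {
            f -> 2.0,                          // constant first tree
            f -> f[0] == 0.0 ? -2.0 : 2.0      // second tree corrects the residuals
        };
        double[] losses = evaluateEachIteration(features, labels, trees, new double[] {1.0, 0.5});
        for (int t = 0; t < losses.length; t++) {
            System.out.println("trees used = " + (t + 1) + ", loss = " + losses[t]);
        }
    }
}
```

Keeping a running prediction per sample means each tree is evaluated once, rather than re-evaluating the whole prefix of the ensemble for every index.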
-
boost
public static scala.Tuple2<DecisionTreeRegressionModel[],double[]> boost(RDD<org.apache.spark.ml.feature.Instance> input, RDD<org.apache.spark.ml.feature.Instance> validationInput, BoostingStrategy boostingStrategy, boolean validate, long seed, String featureSubsetStrategy, scala.Option<org.apache.spark.ml.util.Instrumentation> instr)
Internal method for performing regression using trees as base learners.
Parameters:
input - training dataset.
validationInput - validation dataset, ignored if validate is set to false.
boostingStrategy - boosting parameters.
validate - whether or not to use the validation dataset.
seed - Random seed.
featureSubsetStrategy - (undocumented)
instr - (undocumented)
Returns:
tuple of ensemble models and weights: (array of decision tree models, array of model weights)
-
org$apache$spark$internal$Logging$$log_
public static org.slf4j.Logger org$apache$spark$internal$Logging$$log_()
-
org$apache$spark$internal$Logging$$log__$eq
public static void org$apache$spark$internal$Logging$$log__$eq(org.slf4j.Logger x$1)
-
LogStringContext
public static org.apache.spark.internal.Logging.LogStringContext LogStringContext(scala.StringContext sc)
-