GradientBoostedTrees

class pyspark.mllib.tree.GradientBoostedTrees[source]

Learning algorithm for a gradient boosted trees model for classification or regression.
New in version 1.3.0.
Methods
trainClassifier(data, categoricalFeaturesInfo)
    Train a gradient-boosted trees model for classification.
trainRegressor(data, categoricalFeaturesInfo)
    Train a gradient-boosted trees model for regression.
Methods Documentation
classmethod trainClassifier(data: pyspark.rdd.RDD[pyspark.mllib.regression.LabeledPoint], categoricalFeaturesInfo: Dict[int, int], loss: str = 'logLoss', numIterations: int = 100, learningRate: float = 0.1, maxDepth: int = 3, maxBins: int = 32) → pyspark.mllib.tree.GradientBoostedTreesModel[source]

Train a gradient-boosted trees model for classification.
New in version 1.3.0.
Parameters

data : pyspark.RDD
    Training dataset: RDD of LabeledPoint. Labels should take values {0, 1}.
categoricalFeaturesInfo : dict
    Map storing arity of categorical features. An entry (n -> k) indicates that feature n is categorical with k categories indexed from 0: {0, 1, …, k-1}.
loss : str, optional
    Loss function used for minimization during gradient boosting. Supported values: “logLoss”, “leastSquaresError”, “leastAbsoluteError”. (default: “logLoss”)
numIterations : int, optional
    Number of iterations of boosting. (default: 100)
learningRate : float, optional
    Learning rate for shrinking the contribution of each estimator. The learning rate should be in the interval (0, 1]. (default: 0.1)
maxDepth : int, optional
    Maximum depth of tree (e.g. depth 0 means 1 leaf node, depth 1 means 1 internal node + 2 leaf nodes). (default: 3)
maxBins : int, optional
    Maximum number of bins used for splitting features. DecisionTree requires maxBins >= max categories. (default: 32)
Returns

GradientBoostedTreesModel
    that can be used for prediction.
Examples
>>> from pyspark.mllib.regression import LabeledPoint
>>> from pyspark.mllib.tree import GradientBoostedTrees
>>>
>>> data = [
...     LabeledPoint(0.0, [0.0]),
...     LabeledPoint(0.0, [1.0]),
...     LabeledPoint(1.0, [2.0]),
...     LabeledPoint(1.0, [3.0])
... ]
>>>
>>> model = GradientBoostedTrees.trainClassifier(sc.parallelize(data), {}, numIterations=10)
>>> model.numTrees()
10
>>> model.totalNumNodes()
30
>>> print(model)  # it already has newline
TreeEnsembleModel classifier with 10 trees
>>> model.predict([2.0])
1.0
>>> model.predict([0.0])
0.0
>>> rdd = sc.parallelize([[2.0], [0.0]])
>>> model.predict(rdd).collect()
[1.0, 0.0]
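A further, minimal sketch (not from the original page) showing the optional arguments used together, assuming the same live SparkContext sc as above; the data points and hyperparameter values here are illustrative only:

>>> points = [
...     LabeledPoint(0.0, [0.0, 1.5]),
...     LabeledPoint(1.0, [1.0, 0.5]),
...     LabeledPoint(0.0, [0.0, 2.5]),
...     LabeledPoint(1.0, [1.0, 0.0])
... ]
>>> model = GradientBoostedTrees.trainClassifier(
...     sc.parallelize(points),
...     categoricalFeaturesInfo={0: 2},  # feature 0 is categorical with 2 categories
...     loss="logLoss",
...     numIterations=5,
...     learningRate=0.05,  # smaller step size shrinks each tree's contribution
...     maxDepth=2,
...     maxBins=32)
>>> model.numTrees()
5

Since gradient boosting fits one tree per iteration, numTrees() always equals numIterations here.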
classmethod trainRegressor(data: pyspark.rdd.RDD[pyspark.mllib.regression.LabeledPoint], categoricalFeaturesInfo: Dict[int, int], loss: str = 'leastSquaresError', numIterations: int = 100, learningRate: float = 0.1, maxDepth: int = 3, maxBins: int = 32) → pyspark.mllib.tree.GradientBoostedTreesModel[source]

Train a gradient-boosted trees model for regression.
New in version 1.3.0.
Parameters

data : pyspark.RDD
    Training dataset: RDD of LabeledPoint. Labels are real numbers.
categoricalFeaturesInfo : dict
    Map storing arity of categorical features. An entry (n -> k) indicates that feature n is categorical with k categories indexed from 0: {0, 1, …, k-1}.
loss : str, optional
    Loss function used for minimization during gradient boosting. Supported values: “logLoss”, “leastSquaresError”, “leastAbsoluteError”. (default: “leastSquaresError”)
numIterations : int, optional
    Number of iterations of boosting. (default: 100)
learningRate : float, optional
    Learning rate for shrinking the contribution of each estimator. The learning rate should be in the interval (0, 1]. (default: 0.1)
maxDepth : int, optional
    Maximum depth of tree (e.g. depth 0 means 1 leaf node, depth 1 means 1 internal node + 2 leaf nodes). (default: 3)
maxBins : int, optional
    Maximum number of bins used for splitting features. DecisionTree requires maxBins >= max categories. (default: 32)
Returns

GradientBoostedTreesModel
    that can be used for prediction.
Examples
>>> from pyspark.mllib.regression import LabeledPoint
>>> from pyspark.mllib.tree import GradientBoostedTrees
>>> from pyspark.mllib.linalg import SparseVector
>>>
>>> sparse_data = [
...     LabeledPoint(0.0, SparseVector(2, {0: 1.0})),
...     LabeledPoint(1.0, SparseVector(2, {1: 1.0})),
...     LabeledPoint(0.0, SparseVector(2, {0: 1.0})),
...     LabeledPoint(1.0, SparseVector(2, {1: 2.0}))
... ]
>>>
>>> data = sc.parallelize(sparse_data)
>>> model = GradientBoostedTrees.trainRegressor(data, {}, numIterations=10)
>>> model.numTrees()
10
>>> model.totalNumNodes()
12
>>> model.predict(SparseVector(2, {1: 1.0}))
1.0
>>> model.predict(SparseVector(2, {0: 1.0}))
0.0
>>> rdd = sc.parallelize([[0.0, 1.0], [1.0, 0.0]])
>>> model.predict(rdd).collect()
[1.0, 0.0]
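A minimal sketch (not from the original page) of the same call with a non-default loss, again assuming a live SparkContext sc; the data and settings are illustrative only:

>>> dense_data = [LabeledPoint(float(x), [float(x)]) for x in range(4)]
>>> model = GradientBoostedTrees.trainRegressor(
...     sc.parallelize(dense_data),
...     categoricalFeaturesInfo={},
...     loss="leastAbsoluteError",  # L1 loss, more robust to outliers than the default leastSquaresError
...     numIterations=5,
...     maxDepth=2)
>>> model.numTrees()
5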