Class RandomForestClassifier

All Implemented Interfaces:
Serializable, org.apache.spark.internal.Logging, ClassifierParams, ProbabilisticClassifierParams, Params, HasCheckpointInterval, HasFeaturesCol, HasLabelCol, HasPredictionCol, HasProbabilityCol, HasRawPredictionCol, HasSeed, HasThresholds, HasWeightCol, PredictorParams, DecisionTreeParams, RandomForestClassifierParams, RandomForestParams, TreeClassifierParams, TreeEnsembleClassifierParams, TreeEnsembleParams, DefaultParamsWritable, Identifiable, MLWritable, scala.Serializable

Random Forest learning algorithm for classification. It supports both binary and multiclass labels, as well as both continuous and categorical features.
  • Constructor Details

    • RandomForestClassifier

      public RandomForestClassifier(String uid)
    • RandomForestClassifier

      public RandomForestClassifier()
  • Method Details

    • supportedImpurities

      public static final String[] supportedImpurities()
      Accessor for supported impurity settings: entropy, gini
    • supportedFeatureSubsetStrategies

      public static final String[] supportedFeatureSubsetStrategies()
      Accessor for supported featureSubsetStrategy settings: auto, all, onethird, sqrt, log2
    • load

      public static RandomForestClassifier load(String path)
    • read

      public static MLReader<T> read()
    • impurity

      public final Param<String> impurity()
      Description copied from interface: TreeClassifierParams
      Criterion used for information gain calculation (case-insensitive). This impurity type is used in DecisionTreeClassifier and RandomForestClassifier. Supported: "entropy" and "gini". (default = gini)
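To make the two supported impurity settings concrete, the following standalone sketch computes both measures from per-class counts. It is not Spark's internal implementation: the class name `ImpuritySketch` and its methods are hypothetical, and the base-2 logarithm used for entropy is an assumption of this example.

```java
import java.util.Locale;

// Illustrative only: a standalone re-derivation of the two impurity measures,
// not Spark's internal code. All names here are hypothetical.
final class ImpuritySketch {

    /** Gini impurity: 1 - sum(p_i^2) over the class proportions p_i. */
    static double gini(double[] classCounts) {
        double total = 0.0;
        for (double c : classCounts) {
            total += c;
        }
        if (total == 0.0) {
            return 0.0; // an empty node is treated as pure in this sketch
        }
        double sumSquares = 0.0;
        for (double c : classCounts) {
            double p = c / total;
            sumSquares += p * p;
        }
        return 1.0 - sumSquares;
    }

    /** Entropy: -sum(p_i * log2(p_i)). The base-2 logarithm is an assumption. */
    static double entropy(double[] classCounts) {
        double total = 0.0;
        for (double c : classCounts) {
            total += c;
        }
        if (total == 0.0) {
            return 0.0;
        }
        double h = 0.0;
        for (double c : classCounts) {
            if (c > 0.0) { // skip empty classes: p * log(p) tends to 0 as p tends to 0
                double p = c / total;
                h -= p * (Math.log(p) / Math.log(2.0));
            }
        }
        return h;
    }

    /** Dispatches on the impurity name, ignoring case as the parameter description states. */
    static double impurity(String name, double[] classCounts) {
        switch (name.toLowerCase(Locale.ROOT)) {
            case "gini":
                return gini(classCounts);
            case "entropy":
                return entropy(classCounts);
            default:
                throw new IllegalArgumentException("Unsupported impurity: " + name);
        }
    }

    public static void main(String[] args) {
        double[] balanced = {5, 5};
        double[] pure = {10, 0};
        System.out.println("gini(balanced)    = " + impurity("Gini", balanced));
        System.out.println("entropy(balanced) = " + impurity("ENTROPY", balanced));
        System.out.println("gini(pure)        = " + impurity("gini", pure));
    }
}
```

A perfectly balanced binary node gives the largest impurity under either measure, and a node containing a single class gives zero.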
      Specified by:
      impurity in interface TreeClassifierParams
      Returns:
      (undocumented)
    • numTrees

      public final IntParam numTrees()
      Description copied from interface: RandomForestParams
      Number of trees to train (at least 1). If 1, then no bootstrapping is used. If greater than 1, then bootstrapping is done. (default = 20) TODO (SPARK-7130): Change to always do bootstrapping (simpler).

      Note: The reason that we cannot add this to both GBT and RF (i.e. in TreeEnsembleParams) is that the param maxIter controls how many trees a GBT has. The semantics in the two algorithms are a bit different.

      Specified by:
      numTrees in interface RandomForestParams
      Returns:
      (undocumented)
    • bootstrap

      public final BooleanParam bootstrap()
      Description copied from interface: RandomForestParams
      Whether bootstrap samples are used when building trees.
      Specified by:
      bootstrap in interface RandomForestParams
      Returns:
      (undocumented)
    • subsamplingRate

      public final DoubleParam subsamplingRate()
      Description copied from interface: TreeEnsembleParams
      Fraction of the training data used for learning each decision tree, in range (0, 1]. (default = 1.0)
      Specified by:
      subsamplingRate in interface TreeEnsembleParams
      Returns:
      (undocumented)
    • featureSubsetStrategy

      public final Param<String> featureSubsetStrategy()
      Description copied from interface: TreeEnsembleParams
      The number of features to consider for splits at each tree node. Supported options:
      - "auto": Choose automatically for task: If numTrees == 1, set to "all". If numTrees greater than 1 (forest), set to "sqrt" for classification and to "onethird" for regression.
      - "all": use all features
      - "onethird": use 1/3 of the features
      - "sqrt": use sqrt(number of features)
      - "log2": use log2(number of features)
      - "n": when n is in the range (0, 1.0], use n * number of features. When n is in the range (1, number of features), use n features.
      (default = "auto")

      These various settings are based on the following references:
      - log2: tested in Breiman (2001)
      - sqrt: recommended by Breiman manual for random forests
      - The defaults of sqrt (classification) and onethird (regression) match the R randomForest package.
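The supported featureSubsetStrategy options described above can be worked through numerically. The sketch below is a standalone illustration for the classification case, not Spark's code: the class `FeatureSubsetSketch` and its method are hypothetical, and rounding fractional results up to the next integer is an assumption of this example.

```java
// Illustrative only: resolves a featureSubsetStrategy string to a feature count
// for classification. Names and the round-up behaviour are assumptions.
final class FeatureSubsetSketch {

    static int numFeaturesPerNode(String strategy, int numFeatures, int numTrees) {
        String s = strategy;
        if (s.equals("auto")) {
            // Classification: a single tree uses all features, a forest uses sqrt.
            s = (numTrees == 1) ? "all" : "sqrt";
        }
        switch (s) {
            case "all":
                return numFeatures;
            case "onethird":
                return (int) Math.ceil(numFeatures / 3.0);
            case "sqrt":
                return (int) Math.ceil(Math.sqrt(numFeatures));
            case "log2":
                return Math.max(1, (int) Math.ceil(Math.log(numFeatures) / Math.log(2.0)));
            default:
                // The "n" form: a fraction in (0, 1.0] or an absolute feature count above 1.
                // parseDouble throws NumberFormatException for unrecognised names.
                double n = Double.parseDouble(s);
                if (n > 0.0 && n <= 1.0) {
                    return (int) Math.ceil(n * numFeatures);
                }
                if (n > 1.0) {
                    return Math.min((int) n, numFeatures);
                }
                throw new IllegalArgumentException("Unsupported featureSubsetStrategy: " + strategy);
        }
    }

    public static void main(String[] args) {
        int numFeatures = 100;
        String[] strategies = {"auto", "all", "onethird", "sqrt", "log2", "0.5", "20"};
        for (String strategy : strategies) {
            System.out.println(strategy + " -> "
                + numFeaturesPerNode(strategy, numFeatures, 20) + " features per node");
        }
    }
}
```

Note that in this sketch "1.0" falls in the fractional range and therefore selects every feature.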

      Specified by:
      featureSubsetStrategy in interface TreeEnsembleParams
      Returns:
      (undocumented)
    • leafCol

      public final Param<String> leafCol()
      Description copied from interface: DecisionTreeParams
      Leaf indices column name. Predicted leaf index of each instance in each tree by preorder. (default = "")
      Specified by:
      leafCol in interface DecisionTreeParams
      Returns:
      (undocumented)
    • maxDepth

      public final IntParam maxDepth()
      Description copied from interface: DecisionTreeParams
      Maximum depth of the tree (nonnegative). E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. (default = 5)
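The depth examples above generalize to a simple bound for a binary tree: a tree of depth d has at most 2^d leaf nodes and 2^d - 1 internal nodes. The following sketch (the class `TreeSizeSketch` is hypothetical and not part of Spark) computes these upper bounds; a trained tree may be smaller because splits can be rejected.

```java
// Illustrative only: upper bounds on node counts for a binary tree of a given depth.
final class TreeSizeSketch {

    /** Maximum number of leaf nodes at the given depth: 2^depth. */
    static long maxLeaves(int maxDepth) {
        checkDepth(maxDepth);
        return 1L << maxDepth;
    }

    /** Maximum number of internal (splitting) nodes: 2^depth - 1. */
    static long maxInternalNodes(int maxDepth) {
        checkDepth(maxDepth);
        return (1L << maxDepth) - 1;
    }

    /** Maximum total node count: leaves plus internal nodes. */
    static long maxNodes(int maxDepth) {
        return maxLeaves(maxDepth) + maxInternalNodes(maxDepth);
    }

    private static void checkDepth(int maxDepth) {
        // The upper limit of 61 only keeps the arithmetic of this sketch within a long.
        if (maxDepth < 0 || maxDepth > 61) {
            throw new IllegalArgumentException("maxDepth must be in [0, 61] for this sketch");
        }
    }

    public static void main(String[] args) {
        for (int depth : new int[] {0, 1, 5}) {
            System.out.println("depth " + depth + ": up to " + maxInternalNodes(depth)
                + " internal + " + maxLeaves(depth) + " leaf nodes");
        }
    }
}
```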
      Specified by:
      maxDepth in interface DecisionTreeParams
      Returns:
      (undocumented)
    • maxBins

      public final IntParam maxBins()
      Description copied from interface: DecisionTreeParams
      Maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node. More bins give higher granularity. Must be at least 2 and at least number of categories in any categorical feature. (default = 32)
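The constraint on maxBins stated above can be checked before training. This is a minimal sketch under the assumption that the number of categories of each categorical feature is known up front; the class `MaxBinsSketch` and its methods are hypothetical helpers, not Spark API.

```java
// Illustrative only: checks a maxBins value against the documented constraint
// (at least 2, and at least the largest category count of any categorical feature).
final class MaxBinsSketch {

    /** Smallest maxBins that satisfies the constraint for the given category counts. */
    static int minimumMaxBins(int[] categoryCounts) {
        int required = 2;
        for (int count : categoryCounts) {
            required = Math.max(required, count);
        }
        return required;
    }

    /** True if maxBins is large enough for every categorical feature. */
    static boolean isValid(int maxBins, int[] categoryCounts) {
        return maxBins >= minimumMaxBins(categoryCounts);
    }

    public static void main(String[] args) {
        int[] categoryCounts = {3, 40}; // hypothetical: two categorical features
        int defaultMaxBins = 32;
        System.out.println("default maxBins valid? " + isValid(defaultMaxBins, categoryCounts));
        System.out.println("minimum maxBins needed: " + minimumMaxBins(categoryCounts));
    }
}
```

In this example a categorical feature with 40 categories would require raising maxBins above its default of 32.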
      Specified by:
      maxBins in interface DecisionTreeParams
      Returns:
      (undocumented)
    • minInstancesPerNode

      public final IntParam minInstancesPerNode()
      Description copied from interface: DecisionTreeParams
      Minimum number of instances each child must have after split. If a split causes the left or right child to have fewer than minInstancesPerNode, the split will be discarded as invalid. Must be at least 1. (default = 1)
      Specified by:
      minInstancesPerNode in interface DecisionTreeParams
      Returns:
      (undocumented)
    • minWeightFractionPerNode

      public final DoubleParam minWeightFractionPerNode()
      Description copied from interface: DecisionTreeParams
      Minimum fraction of the weighted sample count that each child must have after split. If a split causes the fraction of the total weight in the left or right child to be less than minWeightFractionPerNode, the split will be discarded as invalid. Should be in the interval [0.0, 0.5). (default = 0.0)
      Specified by:
      minWeightFractionPerNode in interface DecisionTreeParams
      Returns:
      (undocumented)
    • minInfoGain

      public final DoubleParam minInfoGain()
      Description copied from interface: DecisionTreeParams
      Minimum information gain for a split to be considered at a tree node. Should be at least 0.0. (default = 0.0)
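The three split constraints described above (minInstancesPerNode, minWeightFractionPerNode and minInfoGain) combine into a single validity check on a candidate split. The sketch below is a standalone illustration, not Spark's implementation: `SplitCheckSketch` is a hypothetical class, and interpreting "total weight" as the total weight of the training set is an assumption of this example.

```java
// Illustrative only: decides whether a candidate split passes the three documented
// constraints. Names and the meaning of totalWeight are assumptions.
final class SplitCheckSketch {

    static boolean isValidSplit(long leftCount, long rightCount,
                                double leftWeight, double rightWeight,
                                double totalWeight, double infoGain,
                                int minInstancesPerNode,
                                double minWeightFractionPerNode,
                                double minInfoGain) {
        // Each child must hold at least minInstancesPerNode instances.
        if (leftCount < minInstancesPerNode || rightCount < minInstancesPerNode) {
            return false;
        }
        // Each child must hold at least the given fraction of the total weight.
        if (leftWeight / totalWeight < minWeightFractionPerNode
                || rightWeight / totalWeight < minWeightFractionPerNode) {
            return false;
        }
        // The split must achieve at least the minimum information gain.
        return infoGain >= minInfoGain;
    }

    public static void main(String[] args) {
        // A 3 / 7 split of ten unit-weight instances with gain 0.1.
        System.out.println("defaults:                 "
            + isValidSplit(3, 7, 3.0, 7.0, 10.0, 0.1, 1, 0.0, 0.0));
        System.out.println("minInstancesPerNode = 5:  "
            + isValidSplit(3, 7, 3.0, 7.0, 10.0, 0.1, 5, 0.0, 0.0));
        System.out.println("minWeightFraction = 0.4:  "
            + isValidSplit(3, 7, 3.0, 7.0, 10.0, 0.1, 1, 0.4, 0.0));
        System.out.println("minInfoGain = 0.2:        "
            + isValidSplit(3, 7, 3.0, 7.0, 10.0, 0.1, 1, 0.0, 0.2));
    }
}
```

With the default values (1, 0.0 and 0.0) any split that leaves at least one instance on each side and has non-negative gain passes this check.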
      Specified by:
      minInfoGain in interface DecisionTreeParams
      Returns:
      (undocumented)
    • maxMemoryInMB

      public final IntParam maxMemoryInMB()
      Description copied from interface: DecisionTreeParams
      Maximum memory in MB allocated to histogram aggregation. If too small, then 1 node will be split per iteration, and its aggregates may exceed this size. (default = 256 MB)
      Specified by:
      maxMemoryInMB in interface DecisionTreeParams
      Returns:
      (undocumented)
    • cacheNodeIds

      public final BooleanParam cacheNodeIds()
      Description copied from interface: DecisionTreeParams
      If false, the algorithm will pass trees to executors to match instances with nodes. If true, the algorithm will cache node IDs for each instance. Caching can speed up training of deeper trees. Users can set how often the cache should be checkpointed, or disable checkpointing, by setting checkpointInterval. (default = false)
      Specified by:
      cacheNodeIds in interface DecisionTreeParams
      Returns:
      (undocumented)
    • weightCol

      public final Param<String> weightCol()
      Description copied from interface: HasWeightCol
      Param for weight column name. If this is not set or empty, we treat all instance weights as 1.0.
      Specified by:
      weightCol in interface HasWeightCol
      Returns:
      (undocumented)
    • seed

      public final LongParam seed()
      Description copied from interface: HasSeed
      Param for random seed.
      Specified by:
      seed in interface HasSeed
      Returns:
      (undocumented)
    • checkpointInterval

      public final IntParam checkpointInterval()
      Description copied from interface: HasCheckpointInterval
      Param for setting the checkpoint interval (>= 1) or disabling checkpointing (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations. Note: this setting will be ignored if the checkpoint directory is not set in the SparkContext.
      Specified by:
      checkpointInterval in interface HasCheckpointInterval
      Returns:
      (undocumented)
    • uid

      public String uid()
      Description copied from interface: Identifiable
      An immutable unique ID for the object and its derivatives.
      Specified by:
      uid in interface Identifiable
      Returns:
      (undocumented)
    • setMaxDepth

      public RandomForestClassifier setMaxDepth(int value)
    • setMaxBins

      public RandomForestClassifier setMaxBins(int value)
    • setMinInstancesPerNode

      public RandomForestClassifier setMinInstancesPerNode(int value)
    • setMinWeightFractionPerNode

      public RandomForestClassifier setMinWeightFractionPerNode(double value)
    • setMinInfoGain

      public RandomForestClassifier setMinInfoGain(double value)
    • setMaxMemoryInMB

      public RandomForestClassifier setMaxMemoryInMB(int value)
    • setCacheNodeIds

      public RandomForestClassifier setCacheNodeIds(boolean value)
    • setCheckpointInterval

      public RandomForestClassifier setCheckpointInterval(int value)
      Specifies how often to checkpoint the cached node IDs. E.g. 10 means that the cache will get checkpointed every 10 iterations. This is only used if cacheNodeIds is true and if the checkpoint directory is set in SparkContext. Must be at least 1. (default = 10)
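The conditions under which the node ID cache is checkpointed can be summarized as a small predicate. This is a sketch of the documented behaviour only: `CheckpointSketch` is a hypothetical class, and counting iterations from 1 and checkpointing on exact multiples of the interval are assumptions of this example rather than confirmed implementation details.

```java
// Illustrative only: when would the node ID cache be checkpointed under the
// documented rules? Names and the iteration-counting convention are assumptions.
final class CheckpointSketch {

    static boolean shouldCheckpoint(int iteration, int checkpointInterval,
                                    boolean cacheNodeIds, boolean checkpointDirSet) {
        if (checkpointInterval != -1 && checkpointInterval < 1) {
            throw new IllegalArgumentException("checkpointInterval must be -1 or at least 1");
        }
        // The interval only matters when node IDs are cached and a checkpoint directory exists.
        if (!cacheNodeIds || !checkpointDirSet) {
            return false;
        }
        // -1 disables checkpointing.
        if (checkpointInterval == -1) {
            return false;
        }
        return iteration > 0 && iteration % checkpointInterval == 0;
    }

    public static void main(String[] args) {
        int interval = 10; // the documented default
        for (int iteration = 1; iteration <= 30; iteration++) {
            if (shouldCheckpoint(iteration, interval, true, true)) {
                System.out.println("checkpoint at iteration " + iteration);
            }
        }
    }
}
```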
      Parameters:
      value - (undocumented)
      Returns:
      (undocumented)
    • setImpurity

      public RandomForestClassifier setImpurity(String value)
    • setSubsamplingRate

      public RandomForestClassifier setSubsamplingRate(double value)
    • setSeed

      public RandomForestClassifier setSeed(long value)
    • setNumTrees

      public RandomForestClassifier setNumTrees(int value)
    • setBootstrap

      public RandomForestClassifier setBootstrap(boolean value)
    • setFeatureSubsetStrategy

      public RandomForestClassifier setFeatureSubsetStrategy(String value)
    • setWeightCol

      public RandomForestClassifier setWeightCol(String value)
      Sets the value of param weightCol(). If this is not set or empty, we treat all instance weights as 1.0. By default the weightCol is not set, so all instances have weight 1.0.

      Parameters:
      value - (undocumented)
      Returns:
      (undocumented)
    • copy

      public RandomForestClassifier copy(ParamMap extra)
      Description copied from interface: Params
      Creates a copy of this instance with the same UID and some extra params. Subclasses should implement this method and set the return type properly. See defaultCopy().
      Specified by:
      copy in interface Params
      Specified by:
      copy in class Predictor<Vector,RandomForestClassifier,RandomForestClassificationModel>
      Parameters:
      extra - (undocumented)
      Returns:
      (undocumented)