spark.randomForest {SparkR}    R Documentation

Random Forest Model for Regression and Classification

Description

spark.randomForest fits a Random Forest Regression model or Classification model on a SparkDataFrame. Users can call summary to get a summary of the fitted Random Forest model, predict to make predictions on new data, and write.ml/read.ml to save/load fitted models. For more details, see Random Forest Regression and Random Forest Classification.

Usage

spark.randomForest(data, formula, ...)

## S4 method for signature 'SparkDataFrame,formula'
spark.randomForest(data, formula,
  type = c("regression", "classification"), maxDepth = 5, maxBins = 32,
  numTrees = 20, impurity = NULL, featureSubsetStrategy = "auto",
  seed = NULL, subsamplingRate = 1, minInstancesPerNode = 1,
  minInfoGain = 0, checkpointInterval = 10, maxMemoryInMB = 256,
  cacheNodeIds = FALSE)

## S4 method for signature 'RandomForestRegressionModel'
predict(object, newData)

## S4 method for signature 'RandomForestClassificationModel'
predict(object, newData)

## S4 method for signature 'RandomForestRegressionModel,character'
write.ml(object, path,
  overwrite = FALSE)

## S4 method for signature 'RandomForestClassificationModel,character'
write.ml(object, path,
  overwrite = FALSE)

## S4 method for signature 'RandomForestRegressionModel'
summary(object)

## S4 method for signature 'RandomForestClassificationModel'
summary(object)

## S3 method for class 'summary.RandomForestRegressionModel'
print(x, ...)

## S3 method for class 'summary.RandomForestClassificationModel'
print(x, ...)

Arguments

data

a SparkDataFrame for training.

formula

a symbolic description of the model to be fitted. Currently only a few formula operators are supported, including '~', ':', '+', and '-'.

...

additional arguments passed to the method.

type

Type of model to fit, one of "regression" or "classification".

maxDepth

Maximum depth of the tree (>= 0).

maxBins

Maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node. More bins give higher granularity. Must be >= 2 and >= number of categories in any categorical feature.

numTrees

Number of trees to train (>= 1).

impurity

Criterion used for information gain calculation. For regression, must be "variance". For classification, must be one of "entropy" or "gini"; the default is "gini".

featureSubsetStrategy

The number of features to consider for splits at each tree node. Supported options: "auto", "all", "onethird", "sqrt", "log2", (0.0-1.0], [1-n].

seed

integer seed for random number generation.

subsamplingRate

Fraction of the training data used for learning each decision tree, in range (0, 1].

minInstancesPerNode

Minimum number of instances each child must have after split.

minInfoGain

Minimum information gain for a split to be considered at a tree node.

checkpointInterval

Checkpoint interval (>= 1), or -1 to disable checkpointing.

maxMemoryInMB

Maximum memory in MB allocated to histogram aggregation.

cacheNodeIds

If FALSE, the algorithm will pass trees to executors to match instances with nodes. If TRUE, the algorithm will cache node IDs for each instance. Caching can speed up training of deeper trees. Users can set how often the cache is checkpointed, or disable checkpointing, by setting checkpointInterval.

object

A fitted Random Forest regression model or classification model.

newData

a SparkDataFrame for testing.

path

The directory where the model is saved.

overwrite

Whether to overwrite if the output path already exists. The default is FALSE, which means an exception is thrown if the output path exists.

x

Summary object of a Random Forest regression model or classification model, as returned by summary.
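
As an illustration of how several of the arguments above fit together, the following is a minimal sketch of a classification fit with non-default settings. It assumes a local Spark installation reachable through sparkR.session(); the specific values (10 trees, depth 4, seed 42) and the temporary save path are arbitrary choices made for this example, not recommendations.

```r
library(SparkR)
sparkR.session()  # assumption: a local Spark installation is available

# iris column names contain dots; createDataFrame replaces them with underscores
df <- createDataFrame(iris)

model <- spark.randomForest(
  df, Species ~ Petal_Length + Petal_Width,
  type = "classification",
  numTrees = 10,                   # default is 20
  maxDepth = 4,                    # default is 5
  impurity = "entropy",            # classification only; default is "gini"
  featureSubsetStrategy = "sqrt",  # features considered at each split
  subsamplingRate = 0.8,           # fraction of training data used per tree
  seed = 42                        # fixed seed for random number generation
)

# overwrite = TRUE avoids the exception raised when the path already exists
path <- file.path(tempdir(), "rf-model")  # hypothetical location for this sketch
write.ml(model, path, overwrite = TRUE)
```

Running the last two lines a second time would fail with the default overwrite = FALSE, because the directory would already exist.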

Value

spark.randomForest returns a fitted Random Forest model.

predict returns a SparkDataFrame containing predicted labels in a column named "prediction".
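
A short sketch of inspecting the "prediction" column, assuming a running Spark session and the longley data set used in the Examples section. Predicting on the training data here is only for illustration.

```r
library(SparkR)
sparkR.session()  # assumption: a local Spark installation is available

df <- createDataFrame(longley)
model <- spark.randomForest(df, Employed ~ ., type = "regression", maxBins = 16)

# predict returns a SparkDataFrame; the model output is in "prediction"
predictions <- predict(model, df)

# Compare the observed response with the predicted value for the first rows
head(select(predictions, "Employed", "prediction"))
```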

summary returns summary information of the fitted model as a list. The list components include formula (formula), numFeatures (number of features), features (list of features), featureImportances (feature importances), numTrees (number of trees), and treeWeights (tree weights).
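
Because the summary is a list, its components can be read with the usual `$` accessor. The sketch below assumes a running Spark session; the choice of 5 trees is arbitrary and only keeps the example small.

```r
library(SparkR)
sparkR.session()  # assumption: a local Spark installation is available

df <- createDataFrame(longley)
model <- spark.randomForest(df, Employed ~ ., type = "regression",
                            numTrees = 5, maxBins = 16)

s <- summary(model)

# Individual components named in the Value section
s$numTrees            # number of trees in the fitted forest
s$numFeatures         # number of features used by the model
s$features            # feature names
s$featureImportances  # feature importances as reported by Spark
```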

Note

spark.randomForest since 2.1.0

predict(RandomForestRegressionModel) since 2.1.0

predict(RandomForestClassificationModel) since 2.1.0

write.ml(RandomForestRegressionModel, character) since 2.1.0

write.ml(RandomForestClassificationModel, character) since 2.1.0

summary(RandomForestRegressionModel) since 2.1.0

summary(RandomForestClassificationModel) since 2.1.0

print.summary.RandomForestRegressionModel since 2.1.0

print.summary.RandomForestClassificationModel since 2.1.0

Examples

## Not run: 
# start a Spark session (required before any SparkR call)
sparkR.session()

# fit a Random Forest Regression Model
df <- createDataFrame(longley)
model <- spark.randomForest(df, Employed ~ ., type = "regression", maxDepth = 5, maxBins = 16)

# get the summary of the model
summary(model)

# make predictions
predictions <- predict(model, df)

# save and load the model
path <- "path/to/model"
write.ml(model, path)
savedModel <- read.ml(path)
summary(savedModel)

# fit a Random Forest Classification Model
# (createDataFrame replaces the dots in the iris column names with underscores)
df <- createDataFrame(iris)
model <- spark.randomForest(df, Species ~ Petal_Length + Petal_Width, type = "classification")
## End(Not run)

[Package SparkR version 2.1.1 Index]