spark.kmeans {SparkR}    R Documentation

K-Means Clustering Model

Description

Fits a k-means clustering model against a SparkDataFrame, similarly to R's kmeans(). Users can call summary to print a summary of the fitted model, predict to make predictions on new data, and write.ml/read.ml to save/load fitted models.
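
A minimal sketch of this workflow (a SparkR session is assumed; df, the formula columns, and the save path are illustrative placeholders, not part of the API):

  sparkR.session()
  df <- createDataFrame(iris)                       # a SparkDataFrame for training
  model <- spark.kmeans(df, ~ Sepal_Width + Petal_Length, k = 3)
  summary(model)                                    # cluster centers, sizes, ...
  pred <- predict(model, df)                        # adds a "prediction" column
  write.ml(model, "/tmp/kmeans_model")              # save the fitted model ...
  model2 <- read.ml("/tmp/kmeans_model")            # ... and load it back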

Usage

spark.kmeans(data, formula, ...)

## S4 method for signature 'SparkDataFrame,formula'
spark.kmeans(data, formula, k = 2,
  maxIter = 20, initMode = c("k-means||", "random"), seed = NULL,
  initSteps = 2, tol = 1e-04)

## S4 method for signature 'KMeansModel'
summary(object)

## S4 method for signature 'KMeansModel'
predict(object, newData)

## S4 method for signature 'KMeansModel,character'
write.ml(object, path, overwrite = FALSE)

Arguments

data

a SparkDataFrame for training.

formula

a symbolic description of the model to be fitted. Currently only a few formula operators are supported, including '~', '.', ':', '+', and '-'. Note that the response variable of the formula is left empty in spark.kmeans; a few illustrative calls are sketched after this argument list.

...

additional argument(s) passed to the method.

k

number of centers.

maxIter

maximum number of iterations.

initMode

the initialization algorithm chosen to fit the model: either "k-means||" (the default) or "random".

seed

the random seed for cluster initialization.

initSteps

the number of steps for the k-means|| initialization mode. This is an advanced setting; the default of 2 is almost always sufficient. Must be > 0.

tol

convergence tolerance of iterations.

object

a fitted k-means model.

newData

a SparkDataFrame for testing.

path

the directory where the model is saved.

overwrite

whether to overwrite the output path if it already exists. The default is FALSE, which means an exception is thrown if the output path already exists.
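
A few illustrative calls combining the arguments above (a sketch; df and the columns x, y, and id are hypothetical):

  spark.kmeans(df, ~ ., k = 2)                      # all columns as features
  spark.kmeans(df, ~ . - id, k = 2)                 # all columns except id
  spark.kmeans(df, ~ x + y + x:y, k = 3)            # two features plus their interaction
  spark.kmeans(df, ~ x + y, initMode = "random", seed = 10, maxIter = 50)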

Value

spark.kmeans returns a fitted k-means model.

summary returns summary information of the fitted model, which is a list. The list includes the model's k (the configured number of cluster centers), coefficients (the model's cluster centers), size (the number of data points in each cluster), cluster (the cluster assignment of each row of the training data), is.loaded (whether the model was loaded from a saved file), and clusterSize (the actual number of cluster centers; when initMode = "random" is used, clusterSize may not equal k).

predict returns the predicted values based on a k-means model.
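
The components of the summary list can be accessed by name, for example (a sketch; model is a fitted KMeansModel such as the one built in the Examples below):

  s <- summary(model)
  s$k               # configured number of cluster centers
  s$coefficients    # matrix of cluster centers
  s$size            # number of data points in each cluster
  s$clusterSize     # actual number of cluster centers
  head(s$cluster)   # cluster assignment of the training data (a SparkDataFrame)
  s$is.loaded       # FALSE for a freshly fitted model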

Note

spark.kmeans since 2.0.0

summary(KMeansModel) since 2.0.0

predict(KMeansModel) since 2.0.0

write.ml(KMeansModel, character) since 2.0.0

See Also

predict, read.ml, write.ml

Examples

## Not run: 
##D sparkR.session()
##D data(iris)
##D df <- createDataFrame(iris)
##D model <- spark.kmeans(df, Sepal_Length ~ Sepal_Width, k = 4, initMode = "random")
##D summary(model)
##D 
##D # fitted values on training data
##D fitted <- predict(model, df)
##D head(select(fitted, "Sepal_Length", "prediction"))
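##D 
##D # a sketch (not part of the original example): count rows per predicted cluster
##D head(count(groupBy(fitted, "prediction")))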
##D 
##D # save fitted model to input path
##D path <- "path/to/model"
##D write.ml(model, path)
##D 
##D # can also read back the saved model and print
##D savedModel <- read.ml(path)
##D summary(savedModel)
## End(Not run)

[Package SparkR version 2.1.1 Index]