spark.als {SparkR}    R Documentation

Alternating Least Squares (ALS) for Collaborative Filtering

Description

spark.als learns latent factors in collaborative filtering via alternating least squares. Users can call summary to obtain fitted latent factors, predict to make predictions on new data, and write.ml/read.ml to save/load fitted models.

Usage

spark.als(data, ...)

## S4 method for signature 'SparkDataFrame'
spark.als(
  data,
  ratingCol = "rating",
  userCol = "user",
  itemCol = "item",
  rank = 10,
  regParam = 0.1,
  maxIter = 10,
  nonnegative = FALSE,
  implicitPrefs = FALSE,
  alpha = 1,
  numUserBlocks = 10,
  numItemBlocks = 10,
  checkpointInterval = 10,
  seed = 0
)

## S4 method for signature 'ALSModel'
summary(object)

## S4 method for signature 'ALSModel'
predict(object, newData)

## S4 method for signature 'ALSModel,character'
write.ml(object, path, overwrite = FALSE)

Arguments

data

a SparkDataFrame for training.

...

additional argument(s) passed to the method.

ratingCol

column name for ratings.

userCol

column name for user ids. Ids must be (or can be coerced into) integers.

itemCol

column name for item ids. Ids must be (or can be coerced into) integers.

rank

rank of the matrix factorization (> 0).

regParam

regularization parameter (>= 0).

maxIter

maximum number of iterations (>= 0).

nonnegative

logical value indicating whether to apply nonnegativity constraints.

implicitPrefs

logical value indicating whether to treat the ratings as implicit preference data rather than explicit ratings.

alpha

alpha parameter in the implicit preference formulation (>= 0).

numUserBlocks

number of user blocks used to parallelize computation (> 0).

numItemBlocks

number of item blocks used to parallelize computation (> 0).

checkpointInterval

checkpoint interval (>= 1), or -1 to disable checkpointing. Note: this setting is ignored if the checkpoint directory is not set.

seed

integer seed for random number generation.

object

a fitted ALS model.

newData

a SparkDataFrame for testing.

path

the directory where the model is saved.

overwrite

logical value indicating whether to overwrite the output path if it already exists. The default is FALSE, in which case an exception is thrown if the output path exists.
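
As an illustration of how several of the arguments above fit together, the following minimal sketch fits an implicit-preference model with checkpointing enabled and saves it with overwrite = TRUE. The interaction counts, the temporary directories, and the chosen parameter values are assumptions made for this example, not recommended settings.

```r
library(SparkR)
sparkR.session()

# Hypothetical implicit-feedback data: "rating" holds interaction counts
clicks <- list(list(0, 0, 3), list(0, 1, 1), list(1, 1, 5),
               list(1, 2, 2), list(2, 0, 1), list(2, 2, 4))
df <- createDataFrame(clicks, c("user", "item", "rating"))

# checkpointInterval is ignored unless a checkpoint directory is set
# (the directory below is an assumed location for this sketch)
setCheckpointDir(file.path(tempdir(), "als-checkpoints"))

model <- spark.als(df, ratingCol = "rating", userCol = "user", itemCol = "item",
                   rank = 5, maxIter = 10, implicitPrefs = TRUE, alpha = 1,
                   checkpointInterval = 2, seed = 42)

# overwrite = TRUE replaces an existing model directory instead of failing
modelPath <- file.path(tempdir(), "als-implicit-model")  # assumed location
write.ml(model, modelPath, overwrite = TRUE)
```

Calling write.ml a second time on the same path with the default overwrite = FALSE would be expected to raise an error, as described for the overwrite argument.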

Details

For more details, see MLlib: Collaborative Filtering.

Value

spark.als returns a fitted ALS model.

summary returns summary information of the fitted model as a list. The list includes user (the name of the user column), item (the name of the item column), rating (the name of the rating column), userFactors (the estimated user factors), itemFactors (the estimated item factors), and rank (the rank of the matrix factorization model).

predict returns a SparkDataFrame containing predicted values.
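
The sketch below shows one way to work with the SparkDataFrame returned by predict: collect it into a local data.frame and compute a training-set RMSE. It assumes the predicted values are stored in a column named prediction (the default prediction column of the underlying MLlib ALS model); the ratings and the seed are illustrative only.

```r
library(SparkR)
sparkR.session()

ratings <- list(list(0, 0, 4.0), list(0, 1, 2.0), list(1, 1, 3.0),
                list(1, 2, 4.0), list(2, 1, 1.0), list(2, 2, 5.0))
df <- createDataFrame(ratings, c("user", "item", "rating"))
model <- spark.als(df, "rating", "user", "item", seed = 1)

# predict keeps the input columns and adds the predicted values
predicted <- predict(model, df)
printSchema(predicted)

# Collect to a local data.frame and compute the training RMSE
# (the "prediction" column name is an assumption, see the text above)
localPred <- collect(predicted)
rmse <- sqrt(mean((localPred$rating - localPred$prediction)^2))
print(rmse)
```

Collecting is only practical for small results; for larger data the same error could be computed with SparkDataFrame aggregations instead.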

Note

spark.als since 2.1.0

The input rating SparkDataFrame passed to the ALS implementation should be deterministic. Nondeterministic data can cause failures while fitting the ALS model. For example, an order-sensitive operation such as sampling after a repartition makes the output nondeterministic, as in sample(repartition(df, 2L), FALSE, 0.5, 1618L). Checkpointing the sampled SparkDataFrame or adding a sort before sampling can help make it deterministic.
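
The two remedies mentioned in this note can be sketched as follows. The synthetic ratings, the sort columns, and the checkpoint directory are assumptions chosen for illustration; arrange, checkpoint, and setCheckpointDir are existing SparkR functions.

```r
library(SparkR)
sparkR.session()

# Synthetic ratings on a 10 x 10 user/item grid (illustrative data only)
grid <- expand.grid(user = 0:9, item = 0:9)
grid$rating <- (grid$user * grid$item) %% 5 + 1
df <- createDataFrame(grid)

# Option 1: sort before sampling so the row order seen by sample() is fixed
sorted <- arrange(repartition(df, 2L), "user", "item")
sampledSorted <- sample(sorted, FALSE, 0.5, 1618L)

# Option 2: checkpoint the sampled data so it is materialized once
setCheckpointDir(file.path(tempdir(), "als-input-checkpoints"))  # assumed location
sampledFixed <- checkpoint(sample(repartition(df, 2L), FALSE, 0.5, 1618L))

# Either result can then be passed to spark.als
model <- spark.als(sampledFixed, "rating", "user", "item")
```

Checkpointing writes the sampled rows to the checkpoint directory, so later recomputation reads the same rows back rather than re-running the sample.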

summary(ALSModel) since 2.1.0

predict(ALSModel) since 2.1.0

write.ml(ALSModel, character) since 2.1.0

See Also

read.ml

Examples

## Not run: 
##D ratings <- list(list(0, 0, 4.0), list(0, 1, 2.0), list(1, 1, 3.0), list(1, 2, 4.0),
##D                 list(2, 1, 1.0), list(2, 2, 5.0))
##D df <- createDataFrame(ratings, c("user", "item", "rating"))
##D model <- spark.als(df, "rating", "user", "item")
##D 
##D # extract latent factors
##D stats <- summary(model)
##D userFactors <- stats$userFactors
##D itemFactors <- stats$itemFactors
##D 
##D # make predictions
##D predicted <- predict(model, df)
##D showDF(predicted)
##D 
##D # save and load the model
##D path <- "path/to/model"
##D write.ml(model, path)
##D savedModel <- read.ml(path)
##D summary(savedModel)
##D 
##D # set other arguments
##D modelS <- spark.als(df, "rating", "user", "item", rank = 20,
##D                     regParam = 0.1, nonnegative = TRUE)
##D statsS <- summary(modelS)
## End(Not run)

[Package SparkR version 3.0.0 Index]