Migration Guide: SparkR (R on Spark)
- Upgrading from SparkR 3.1 to 3.2
- Upgrading from SparkR 2.4 to 3.0
- Upgrading from SparkR 2.3 to 2.4
- Upgrading from SparkR 2.3 to 2.3.1 and above
- Upgrading from SparkR 2.2 to 2.3
- Upgrading from SparkR 2.1 to 2.2
- Upgrading from SparkR 2.0 to 2.1
- Upgrading from SparkR 1.6 to 2.0
- Upgrading from SparkR 1.5 to 1.6
Note that this migration guide describes the items specific to SparkR. Many items of the SQL migration can also be applied when migrating SparkR to higher versions. Please refer to Migration Guide: SQL, Datasets and DataFrame.
Upgrading from SparkR 3.1 to 3.2
- Previously, SparkR automatically downloaded and installed the Spark distribution in the user's cache directory to complete the SparkR installation when SparkR ran in a plain R shell or Rscript and the Spark distribution could not be found. Now, it asks whether users want to download and install it or not. To restore the previous behavior, set the `SPARKR_ASK_INSTALLATION` environment variable to `FALSE`.
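A minimal sketch of restoring the old non-interactive behavior, useful for scripted jobs; the environment variable name is taken from the note above:

```r
# Suppress the 3.2+ install prompt: download and install the Spark
# distribution without asking, as in SparkR 3.1 and earlier.
Sys.setenv(SPARKR_ASK_INSTALLATION = "FALSE")

library(SparkR)
sparkR.session()  # triggers the install check if Spark cannot be found
```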
Upgrading from SparkR 2.4 to 3.0
- The deprecated methods `parquetFile`, `saveAsParquetFile`, `jsonFile` and `jsonRDD` have been removed. Use `read.parquet`, `write.parquet` and `read.json` instead.
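A sketch of the replacement read/write calls, assuming a running SparkR session; the file paths are hypothetical:

```r
library(SparkR)
sparkR.session()

# Instead of the removed jsonFile/jsonRDD:
people <- read.json("path/to/people.json")

# Instead of the removed parquetFile / saveAsParquetFile:
events <- read.parquet("path/to/events.parquet")
write.parquet(people, "path/to/people.parquet")
```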
Upgrading from SparkR 2.3 to 2.4
- Previously, we did not check the validity of the size of the last layer in `spark.mlp`. For example, if the training data only has two labels, a `layers` param like `c(1, 3)` did not cause an error previously; now it does.
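The check can be illustrated with the iris data, whose label column has three classes; a sketch assuming a running SparkR session:

```r
df <- createDataFrame(iris)

# OK: the last layer (3) matches the three Species labels.
model <- spark.mlp(df, Species ~ ., layers = c(4, 5, 3))

# Raises an error in 2.4+: a last layer of size 2 cannot
# represent three labels (it was silently accepted before).
# model <- spark.mlp(df, Species ~ ., layers = c(4, 5, 2))
```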
Upgrading from SparkR 2.3 to 2.3.1 and above
- In SparkR 2.3.0 and earlier, the start parameter of the `substr` method was wrongly subtracted by one and considered as 0-based. This can lead to inconsistent substring results and also does not match the behaviour of `substr` in R. In version 2.3.1 and later, it has been fixed so the `substr` method is now 1-based. As an example, `substr(lit('abcdef'), 2, 4)` would result in `abc` in SparkR 2.3.0, and the result would be `bcd` in SparkR 2.3.1.
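The 1-based behavior can be checked directly on a column; a sketch assuming a running SparkR session:

```r
df <- createDataFrame(data.frame(x = "abcdef"))

# 1-based since 2.3.1: characters 2 through 4, matching base R's substr.
head(select(df, substr(df$x, 2, 4)))
# SparkR >= 2.3.1 returns "bcd"; 2.3.0 returned "abc".
```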
Upgrading from SparkR 2.2 to 2.3
- The `stringsAsFactors` parameter was previously ignored with `collect`, for example, in `collect(createDataFrame(iris), stringsAsFactors = TRUE)`. It has been corrected.
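A sketch of the corrected behavior, assuming a running SparkR session:

```r
df <- createDataFrame(iris)

# stringsAsFactors is now honored when pulling data back to R:
local <- collect(df, stringsAsFactors = TRUE)
class(local$Species)  # should now be a factor, as requested
```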
- In `summary`, an option for the statistics to compute has been added. Its output is changed from that of `describe`.
- A warning can be raised if the versions of the SparkR package and the Spark JVM do not match.
Upgrading from SparkR 2.1 to 2.2
- A `numPartitions` parameter has been added to `createDataFrame` and `as.DataFrame`. When splitting the data, the partition position calculation has been made to match the one in Scala.
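A sketch of the new parameter, assuming a running SparkR session; the partition counts are illustrative:

```r
# Request a specific number of partitions when converting
# a local data.frame to a SparkDataFrame.
df  <- createDataFrame(iris, numPartitions = 4)
df2 <- as.DataFrame(iris, numPartitions = 2)
```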
- The method `createExternalTable` has been deprecated to be replaced by `createTable`. Either method can be called to create an external or managed table. Additional catalog methods have also been added.
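A sketch of `createTable` for both cases, assuming a running SparkR session; the table names and path are hypothetical:

```r
# Managed table (no path given):
createTable("people", source = "parquet")

# External table, backed by an existing location (hypothetical path):
createTable("events", path = "path/to/events", source = "parquet")
```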
- By default, derby.log is now saved to `tempdir()`. This will be created when instantiating the SparkSession with `enableHiveSupport` set to `TRUE`.
- `spark.lda` was not setting the optimizer correctly. It has been corrected.
- Several model summary outputs are updated to have `coefficients` as `matrix`. This includes `spark.logit`, `spark.kmeans` and `spark.glm`. Model summary outputs for `spark.gaussianMixture` have added log-likelihood as `loglik`.
Upgrading from SparkR 2.0 to 2.1
- `join` no longer performs a Cartesian Product by default; use `crossJoin` instead.
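A sketch of requesting the Cartesian product explicitly, assuming a running SparkR session:

```r
df1 <- createDataFrame(data.frame(a = 1:3))
df2 <- createDataFrame(data.frame(b = c("x", "y")))

# A join with no condition must now be spelled out as a cross join:
cart <- crossJoin(df1, df2)
count(cart)  # 3 * 2 = 6 rows
```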
Upgrading from SparkR 1.6 to 2.0
- The method `table` has been removed and replaced by `tableToDF`.
- The class `DataFrame` has been renamed to `SparkDataFrame` to avoid name conflicts.
- `SQLContext` and `HiveContext` have been deprecated to be replaced by `SparkSession`. Instead of `sparkR.init()`, call `sparkR.session()` in its place to instantiate the SparkSession. Once that is done, the currently active SparkSession will be used for SparkDataFrame operations.
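A sketch of the new entry point, assuming SparkR is installed locally; the app name is illustrative:

```r
library(SparkR)

# Replaces the old init-style entry points:
sparkR.session(master = "local[*]", appName = "MigrationExample")

# The active session is used implicitly by SparkDataFrame operations:
df <- createDataFrame(faithful)
head(df)

sparkR.session.stop()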
- The parameter `sparkExecutorEnv` is not supported by `sparkR.session`. To set environment variables for the executors, set Spark config properties with the prefix `spark.executorEnv.VAR_NAME`, for example, `spark.executorEnv.PATH`.
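A sketch of setting an executor environment variable through the config prefix, assuming a fresh session launch; the PATH value is illustrative:

```r
# Pass executor environment variables as Spark config properties
# with the spark.executorEnv. prefix, instead of sparkExecutorEnv:
sparkR.session(
  appName = "EnvExample",
  sparkConfig = list(spark.executorEnv.PATH = "/usr/local/bin:/usr/bin")
)
```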
- The `sqlContext` parameter is no longer required for these functions: `createDataFrame`, `as.DataFrame`, `read.json`, `jsonFile`, `read.parquet`, `parquetFile`, `read.text`, `sql`, `tables`, `tableNames`, `cacheTable`, `uncacheTable`, `clearCache`, `dropTempTable`, `read.df`, `loadDF`, `createExternalTable`.
- The method `registerTempTable` has been deprecated to be replaced by `createOrReplaceTempView`.
- The method `dropTempTable` has been deprecated to be replaced by `dropTempView`.
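A sketch of the replacement temp-view API, assuming a running SparkR session:

```r
df <- createDataFrame(faithful)

# Replaces registerTempTable:
createOrReplaceTempView(df, "faithful_view")
head(sql("SELECT * FROM faithful_view LIMIT 3"))

# Replaces dropTempTable:
dropTempView("faithful_view")
```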
- The `sc` SparkContext parameter is no longer required for these functions: `setJobGroup`, `clearJobGroup`, `cancelJobGroup`.
Upgrading from SparkR 1.5 to 1.6
- Before Spark 1.6.0, the default mode for writes was `append`. It was changed in Spark 1.6.0 to `error` to match the Scala API.
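A sketch of stating the write mode explicitly, which is portable across versions; assumes a running SparkR session and a hypothetical output path:

```r
df <- createDataFrame(faithful)

# Relying on the default now errors if the path already exists;
# declare the intended mode instead:
write.df(df, path = "path/to/out", source = "parquet", mode = "append")
```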
- SparkSQL converts `NA` in R to `null` and vice-versa.
- Since 1.6.1, the `withColumn` method in SparkR supports adding a new column to a DataFrame or replacing an existing column of the same name.
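A sketch of both uses of `withColumn`, assuming a running SparkR session:

```r
df <- createDataFrame(faithful)

# Add a new column:
df <- withColumn(df, "eruptions_min", df$eruptions / 60)

# Replace an existing column of the same name (supported since 1.6.1):
df <- withColumn(df, "waiting", df$waiting + 1)
```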