coalesce {SparkR}	R Documentation

Coalesce

Description

For a SparkDataFrame: returns a new SparkDataFrame that has exactly numPartitions partitions, provided numPartitions does not exceed the current number of partitions. This operation results in a narrow dependency: for example, if you go from 1000 partitions to 100 partitions, there is no shuffle; instead, each of the 100 new partitions claims 10 of the current partitions. If a larger number of partitions is requested, the SparkDataFrame stays at its current number of partitions.

For a Column: returns the first column that is not NA, or NA if all inputs are NA.

Usage

coalesce(x, ...)

## S4 method for signature 'SparkDataFrame'
coalesce(x, numPartitions)

## S4 method for signature 'Column'
coalesce(x, ...)

Arguments

x

a Column or a SparkDataFrame.

...

additional argument(s). If x is a Column, additional Columns can be optionally provided.

numPartitions

the number of partitions to use.

Details

If you are doing a drastic coalesce on a SparkDataFrame, e.g. to numPartitions = 1, the computation may take place on fewer nodes than you would like (one node in the case of numPartitions = 1). To avoid this, call repartition instead. This adds a shuffle step, but the current upstream partitions are still executed in parallel (according to the current partitioning).
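
The trade-off described above can be sketched as follows. This is a minimal illustration, not a benchmark: the local master, the built-in faithful dataset, and the starting count of 8 partitions are all assumptions chosen for the example. getNumPartitions and repartition are the SparkR functions listed under See Also.

```r
library(SparkR)

# Assumption: a local session with 4 threads, for illustration only
sparkR.session(master = "local[4]")

# Assumption: start from 8 partitions so the effect is visible
df <- repartition(createDataFrame(faithful), numPartitions = 8L)

narrow <- coalesce(df, 2L)     # merges existing partitions, no shuffle
wide   <- repartition(df, 2L)  # adds a shuffle step to produce 2 partitions
capped <- coalesce(df, 100L)   # requests more than the current 8 partitions

# Inspect the resulting partition counts
n_narrow <- getNumPartitions(narrow)
n_wide   <- getNumPartitions(wide)
n_capped <- getNumPartitions(capped)
```

Under these assumptions, narrow and wide should both report 2 partitions, while capped should remain at the original count, since coalesce does not increase the number of partitions.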

Note

coalesce(SparkDataFrame) since 2.1.1

coalesce(Column) since 2.1.1

See Also

repartition

Other SparkDataFrame functions: SparkDataFrame-class, agg, arrange, as.data.frame, attach,SparkDataFrame-method, cache, checkpoint, collect, colnames, coltypes, createOrReplaceTempView, crossJoin, dapplyCollect, dapply, describe, dim, distinct, dropDuplicates, dropna, drop, dtypes, except, explain, filter, first, gapplyCollect, gapply, getNumPartitions, group_by, head, hint, histogram, insertInto, intersect, isLocal, isStreaming, join, limit, merge, mutate, ncol, nrow, persist, printSchema, randomSplit, rbind, registerTempTable, rename, repartition, sample, saveAsTable, schema, selectExpr, select, showDF, show, storageLevel, str, subset, take, toJSON, union, unpersist, withColumn, with, write.df, write.jdbc, write.json, write.orc, write.parquet, write.stream, write.text

Other normal_funcs: abs, bitwiseNOT, column, expr, from_json, greatest, ifelse, isnan, least, lit, nanvl, negate, randn, rand, struct, to_json, when

Examples

## Not run: 
sparkR.session()
path <- "path/to/file.json"
df <- read.json(path)
newDF <- coalesce(df, 1L)
## End(Not run)
## Not run: coalesce(df$c, df$d, df$e)
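
The Column form can be tried end to end with a small sketch such as the one below. The local data frame, its values, and the output column name first_non_na are hypothetical and exist only for this illustration; the column names c, d, and e follow the example above.

```r
library(SparkR)

# Assumption: a local single-threaded session, for illustration only
sparkR.session(master = "local[1]")

# Hypothetical data: each row has its first non-NA value in a different column
local_df <- data.frame(c = c(1, NA, NA),
                       d = c(NA, 2, NA),
                       e = c(10, 20, NA))
df <- createDataFrame(local_df)

# Pick, per row, the first of c, d, e that is not NA
result <- select(df, alias(coalesce(df$c, df$d, df$e), "first_non_na"))
out <- collect(result)
```

With this data, the three rows should resolve to 1, 2, and NA respectively: the last row has no non-NA input, so the result is NA.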

[Package SparkR version 2.2.1 Index]