pyspark.RDD.reduceByKey¶

RDD.reduceByKey(func: Callable[[V, V], V], numPartitions: Optional[int] = None, partitionFunc: Callable[[K], int] = <function portable_hash>) → pyspark.rdd.RDD[Tuple[K, V]][source]¶

Merge the values for each key using an associative and commutative reduce function.

This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a “combiner” in MapReduce.

Output will be partitioned with numPartitions partitions, or the default parallelism level if numPartitions is not specified. Default partitioner is hash-partition.

New in version 1.6.0.

Parameters

funcfunction: the reduce function
numPartitionsint, optional: the number of partitions in new RDD
partitionFuncfunction, optional, default portable_hash: function to compute the partition index

Returns

RDD: a RDD containing the keys and the aggregated result for each key

See also

RDD.reduceByKeyLocally()
RDD.combineByKey()
RDD.aggregateByKey()
RDD.foldByKey()
RDD.groupByKey()

Examples

>>> from operator import add
>>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
>>> sorted(rdd.reduceByKey(add).collect())
[('a', 2), ('b', 1)]

pyspark.RDD.reduce pyspark.RDD.reduceByKeyLocally