pyspark.RDD.cogroup

RDD.cogroup(other, numPartitions=None)

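For each key k in self or other, return an RDD of (k, (values_self, values_other)) pairs, where values_self and values_other are iterables over the values for k in self and in other respectively. A key present in only one of the two RDDs is paired with an empty iterable for the other.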

Examples

>>> x = sc.parallelize([("a", 1), ("b", 4)])
>>> y = sc.parallelize([("a", 2)])
>>> [(k, tuple(map(list, v))) for k, v in sorted(x.cogroup(y).collect())]
[('a', ([1], [2])), ('b', ([4], []))]
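
The grouping behaviour described above can be sketched in plain Python, without a SparkContext. This is only an illustration of the semantics on local lists, not how Spark implements cogroup; the helper name cogroup_local is hypothetical and is not part of PySpark.

```python
from collections import defaultdict


def cogroup_local(left, right):
    """Illustrate cogroup semantics on two local lists of (key, value) pairs.

    Hypothetical helper for illustration only; not part of PySpark.
    """
    # Every key starts with one empty list per input side.
    groups = defaultdict(lambda: ([], []))
    for key, value in left:
        groups[key][0].append(value)   # values that came from `left`
    for key, value in right:
        groups[key][1].append(value)   # values that came from `right`
    # Sort by key so the result is deterministic.
    return sorted(groups.items())


result = cogroup_local([("a", 1), ("b", 4)], [("a", 2)])
print(result)  # [('a', ([1], [2])), ('b', ([4], []))]
```

Key "b" appears only in the first input, so its second list is empty, which mirrors the ('b', ([4], [])) entry in the doctest output.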