pyspark.RDD.partitionBy

RDD.partitionBy(numPartitions, partitionFunc=<function portable_hash>)[source]

Return a copy of the RDD partitioned using the specified partitioner.

Examples

>>> pairs = sc.parallelize([1, 2, 3, 4, 2, 4, 1]).map(lambda x: (x, x))
>>> sets = pairs.partitionBy(2).glom().collect()
>>> len(set(sets[0]).intersection(set(sets[1])))
0