pyspark.RDD.repartitionAndSortWithinPartitions
RDD.repartitionAndSortWithinPartitions(numPartitions: Optional[int] = None, partitionFunc: Callable[[Any], int] = <function portable_hash>, ascending: bool = True, keyfunc: Callable[[Any], Any] = <function RDD.<lambda>>) → pyspark.rdd.RDD[Tuple[Any, Any]]

Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys.
New in version 1.2.0.
- Parameters
  - numPartitions : int, optional
    the number of partitions in the new RDD
  - partitionFunc : function, optional, default portable_hash
    a function to compute the partition index
  - ascending : bool, optional, default True
    sort the keys in ascending or descending order
  - keyfunc : function, optional, default identity mapping
    a function to compute the key
- Returns
  - RDD
    a new RDD
Examples
>>> rdd = sc.parallelize([(0, 5), (3, 8), (2, 6), (0, 8), (3, 8), (1, 3)])
>>> rdd2 = rdd.repartitionAndSortWithinPartitions(2, lambda x: x % 2, True)
>>> rdd2.glom().collect()
[[(0, 5), (0, 8), (2, 6)], [(1, 3), (3, 8), (3, 8)]]
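The following additional sketch (not from the upstream reference) illustrates the keyfunc and ascending parameters; the input keys are chosen so every sort key is distinct, making the expected output deterministic. Pushing the sort into the shuffle this way is more efficient than calling repartition and then sorting within each partition.

>>> rdd = sc.parallelize([("b", 1), ("A", 2), ("D", 3), ("c", 4)])
>>> # Sort case-insensitively by mapping each key through a custom keyfunc.
>>> rdd.repartitionAndSortWithinPartitions(1, keyfunc=lambda k: k.lower()).glom().collect()
[[('A', 2), ('b', 1), ('c', 4), ('D', 3)]]
>>> rdd2 = sc.parallelize([(0, 'a'), (3, 'b'), (2, 'c'), (1, 'd'), (5, 'e'), (4, 'f')])
>>> # Sort each partition's keys in descending order with ascending=False.
>>> rdd2.repartitionAndSortWithinPartitions(2, lambda x: x % 2, ascending=False).glom().collect()
[[(4, 'f'), (2, 'c'), (0, 'a')], [(5, 'e'), (3, 'b'), (1, 'd')]]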