pyspark.RDD.join
RDD.join(other: pyspark.rdd.RDD[Tuple[K, U]], numPartitions: Optional[int] = None) → pyspark.rdd.RDD[Tuple[K, Tuple[V, U]]]
Return an RDD containing all pairs of elements with matching keys in self and other.

Each pair of elements is returned as a (k, (v1, v2)) tuple, where (k, v1) is in self and (k, v2) is in other. Performs a hash join across the cluster.

New in version 0.7.0.

Parameters
    other : pyspark.RDD
        another RDD of (key, value) pairs to join with
    numPartitions : int, optional
        the number of partitions in the resulting RDD

Returns
    pyspark.RDD
        an RDD containing all pairs of elements with matching keys
See also

Examples

>>> rdd1 = sc.parallelize([("a", 1), ("b", 4)])
>>> rdd2 = sc.parallelize([("a", 2), ("a", 3)])
>>> sorted(rdd1.join(rdd2).collect())
[('a', (1, 2)), ('a', (1, 3))]
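The hash-join semantics described above can be sketched locally in plain Python, without a Spark cluster: build a hash table from one side's pairs, then probe it with the other side's keys. The `hash_join` helper below is a hypothetical illustration of the semantics, not part of the PySpark API.

```python
from collections import defaultdict

def hash_join(left, right):
    # Build phase: hash table mapping each key in `right` to all its values.
    table = defaultdict(list)
    for k, v in right:
        table[k].append(v)
    # Probe phase: pair each left value with every matching right value.
    # Keys present on only one side produce no output (inner-join semantics).
    return [(k, (v1, v2)) for k, v1 in left for v2 in table[k]]

rdd1 = [("a", 1), ("b", 4)]
rdd2 = [("a", 2), ("a", 3)]
print(sorted(hash_join(rdd1, rdd2)))  # [('a', (1, 2)), ('a', (1, 3))]
```

In the distributed version, Spark first co-partitions both RDDs by key (the role of numPartitions) so that matching keys land on the same executor before this per-partition join runs.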