pyspark.SparkContext.union

SparkContext.union(rdds: List[pyspark.rdd.RDD[T]]) → pyspark.rdd.RDD[T][source]

Build the union of a list of RDDs.

This supports unions of RDDs with different serialized formats, although this forces them to be reserialized using the default serializer.

New in version 0.7.0.

See also

RDD.union()

Examples

>>> import os
>>> import tempfile
>>> with tempfile.TemporaryDirectory() as d:
...     # generate a text RDD
...     with open(os.path.join(d, "union-text.txt"), "w") as f:
...         _ = f.write("Hello")
...     text_rdd = sc.textFile(d)
...
...     # generate another RDD
...     parallelized = sc.parallelize(["World!"])
...
...     unioned = sorted(sc.union([text_rdd, parallelized]).collect())
>>> unioned
['Hello', 'World!']