pyspark.RDD.saveAsPickleFile

RDD.saveAsPickleFile(path: str, batchSize: int = 10) → None

Save this RDD as a SequenceFile of serialized objects. The serializer used is pyspark.serializers.CPickleSerializer, and the default batch size is 10.

New in version 1.1.0.

Parameters
path : str

path to pickled file

batchSize : int, optional, default 10

the number of Python objects represented as a single Java object.

Examples

>>> import os
>>> import tempfile
>>> with tempfile.TemporaryDirectory() as d:
...     path = os.path.join(d, "pickle_file")
...
...     # Write a temporary pickled file
...     sc.parallelize(range(10)).saveAsPickleFile(path, 3)
...
...     # Load pickled file as an RDD
...     sorted(sc.pickleFile(path, 3).collect())
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
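
The effect of batchSize can be pictured without a Spark cluster. The following is a minimal pure-Python sketch, not Spark's actual implementation: it assumes that batching means grouping up to batchSize objects into one list and pickling that list as a single blob. The helper name batched_pickle is hypothetical and is not part of the PySpark API.

```python
import pickle
from itertools import islice


def batched_pickle(items, batch_size):
    """Hypothetical helper: yield one pickled blob per group of up to batch_size items."""
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        # Each blob stands in for the "single Java object" holding one batch.
        yield pickle.dumps(batch)


# Ten objects with a batch size of 3 produce batches of 3, 3, 3 and 1.
blobs = list(batched_pickle(range(10), 3))

# Reading back unpickles each blob and flattens the batches again.
restored = [x for blob in blobs for x in pickle.loads(blob)]
```

Under this assumption, a larger batchSize means fewer, larger serialized records, while a smaller one means more records with less data in each.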