
Module serializers


PySpark supports custom serializers for transferring data; this can improve performance.

By default, PySpark uses PickleSerializer, which relies on Python's cPickle module and can serialize nearly any Python object. Other serializers, such as MarshalSerializer, support fewer data types but can be faster.

The serializer is chosen when creating SparkContext:

>>> from pyspark.context import SparkContext
>>> from pyspark.serializers import MarshalSerializer
>>> sc = SparkContext('local', 'test', serializer=MarshalSerializer())
>>> sc.parallelize(list(range(1000))).map(lambda x: 2 * x).take(10)
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
>>> sc.stop()
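The difference between the two built-in serializers is the range of supported types: the marshal format handles only built-in values, while pickle can serialize most Python objects. A minimal sketch using the standard-library marshal and pickle modules directly (outside of Spark) illustrates this; the exact exception text is what current CPython raises and may differ across versions:

>>> import marshal, pickle
>>> payload = {'key': [1, 2, 3]}   # built-in types: both modules handle this
>>> marshal.loads(marshal.dumps(payload)) == pickle.loads(pickle.dumps(payload))
True
>>> marshal.dumps(lambda x: x)     # function objects are not marshallable
Traceback (most recent call last):
    ...
ValueError: unmarshallable object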

By default, PySpark serializes objects in batches; the batch size can be controlled through SparkContext's batchSize parameter (the default is 1024 objects):

>>> sc = SparkContext('local', 'test', batchSize=2)
>>> rdd = sc.parallelize(range(16), 4).map(lambda x: x)

Behind the scenes, this creates a JavaRDD with four partitions, each of which contains two batches of two objects:

>>> rdd.glom().collect()
[[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
>>> rdd._jrdd.count()
8L
>>> sc.stop()
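Internally, batching is implemented by wrapping the chosen serializer in a BatchedSerializer. The following is a rough sketch of what a batched stream looks like, assuming the internal dump_stream/load_stream interface and using an in-memory BytesIO buffer in place of a socket (this is an implementation detail, so the exact API may differ between Spark versions):

>>> import io
>>> from pyspark.serializers import BatchedSerializer, PickleSerializer
>>> ser = BatchedSerializer(PickleSerializer(), 2)   # two objects per batch
>>> buf = io.BytesIO()
>>> ser.dump_stream(range(4), buf)                    # writes two pickled batches
>>> _ = buf.seek(0)
>>> list(ser.load_stream(buf))                        # batches are flattened back into items
[0, 1, 2, 3]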

A batchSize of -1 means an unlimited batch size, and a batchSize of 1 disables batching:

>>> sc = SparkContext('local', 'test', batchSize=1)
>>> rdd = sc.parallelize(range(16), 4).map(lambda x: x)
>>> rdd.glom().collect()
[[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
>>> rdd._jrdd.count()
16L
>>> sc.stop()
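Following the same pattern, a batchSize of -1 should place each partition's entire contents into a single batch, so the underlying JavaRDD would be expected to hold one object per partition. A sketch of the expected behaviour (not captured output):

>>> sc = SparkContext('local', 'test', batchSize=-1)
>>> rdd = sc.parallelize(range(16), 4).map(lambda x: x)
>>> rdd._jrdd.count()
4L
>>> sc.stop()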
Classes
  PickleSerializer
    Serializes objects using Python's cPickle serializer.
  MarshalSerializer
    Serializes objects using Python's marshal serializer.
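
Both classes can also be exercised directly; assuming the dumps/loads round-trip methods they expose, a quick sketch:

>>> from pyspark.serializers import PickleSerializer, MarshalSerializer
>>> ser = PickleSerializer()
>>> ser.loads(ser.dumps({'a': 1}))   # round-trips nearly any Python object
{'a': 1}
>>> MarshalSerializer().loads(MarshalSerializer().dumps([1, 2, 3]))
[1, 2, 3]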