pyspark.SparkContext.binaryRecords

SparkContext.binaryRecords(path: str, recordLength: int) → pyspark.rdd.RDD[bytes]

Load data from a flat binary file, assuming each record is a set of numbers with the specified numerical format (see java.nio.ByteBuffer) and that the number of bytes per record is constant.

New in version 1.3.0.

Parameters
path : str

Directory containing the input data files.

recordLength : int

The length, in bytes, at which to split the records.

Returns
RDD

RDD of records, each represented as a byte array.

Examples

>>> import os
>>> import tempfile
>>> with tempfile.TemporaryDirectory() as d:
...     # Write three 4-byte records to a temporary binary file
...     with open(os.path.join(d, "1.bin"), "wb") as f:
...         for i in range(3):
...             _ = f.write(b"%04d" % i)
...
...     # Write three more 4-byte records to a second file
...     with open(os.path.join(d, "2.bin"), "wb") as f:
...         for i in [-1, -2, -10]:
...             _ = f.write(b"%04d" % i)
...
...     collected = sorted(sc.binaryRecords(d, 4).collect())
>>> collected
[b'-001', b'-002', b'-010', b'0000', b'0001', b'0002']
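
The returned records are raw bytes; decoding them is up to the caller. As a hypothetical sketch of the fixed-format use case the description mentions (the struct format string "<id", the file name, and the variable names are illustrative assumptions, not part of the API), the example below packs each record as an int32 followed by a float64 (12 bytes total) and unpacks the collected byte arrays with struct. It assumes an active SparkContext bound to sc, as above.

>>> import struct
>>> with tempfile.TemporaryDirectory() as d:
...     # Pack three (int32, float64) pairs; struct.calcsize("<id") == 12
...     with open(os.path.join(d, "pairs.bin"), "wb") as f:
...         for i in range(3):
...             _ = f.write(struct.pack("<id", i, i * 0.5))
...
...     # Each 12-byte record comes back as a bytes object
...     pairs = sorted(struct.unpack("<id", r)
...                    for r in sc.binaryRecords(d, 12).collect())
>>> pairs
[(0, 0.0), (1, 0.5), (2, 1.0)]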