pyspark.SparkContext.binaryRecords

SparkContext.binaryRecords(path: str, recordLength: int) → pyspark.rdd.RDD[bytes]

Load data from a flat binary file, assuming each record is a set of numbers with the specified numerical format (see java.nio.ByteBuffer) and that the number of bytes per record is constant.

New in version 1.3.0.

Parameters
path : str

Directory containing the input data files.

recordLength : int

The length, in bytes, at which to split the records.

Returns
RDD

RDD of records, each represented as a byte array.

Examples

>>> import os
>>> import tempfile
>>> with tempfile.TemporaryDirectory() as d:
...     # Write three 4-byte records to a temporary binary file
...     with open(os.path.join(d, "1.bin"), "wb") as f:
...         for i in range(3):
...             _ = f.write(b"%04d" % i)
...
...     # Write three more 4-byte records to a second file
...     with open(os.path.join(d, "2.bin"), "wb") as f:
...         for i in [-1, -2, -10]:
...             _ = f.write(b"%04d" % i)
...
...     collected = sorted(sc.binaryRecords(d, 4).collect())
>>> collected
[b'-001', b'-002', b'-010', b'0000', b'0001', b'0002']
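
The returned records are raw bytes; decoding them is up to the caller. As a hypothetical sketch of the fixed-format use case the description mentions (the struct format string "<id", the file name, and the variable names are illustrative assumptions, not part of the API), the example below packs each record as an int32 followed by a float64 (12 bytes total) and unpacks the collected byte arrays with struct. It assumes an active SparkContext bound to sc, as above.

>>> import struct
>>> with tempfile.TemporaryDirectory() as d:
...     # Pack three (int32, float64) pairs; struct.calcsize("<id") == 12
...     with open(os.path.join(d, "pairs.bin"), "wb") as f:
...         for i in range(3):
...             _ = f.write(struct.pack("<id", i, i * 0.5))
...
...     # Each 12-byte record comes back as a bytes object
...     pairs = sorted(struct.unpack("<id", r)
...                    for r in sc.binaryRecords(d, 12).collect())
>>> pairs
[(0, 0.0), (1, 0.5), (2, 1.0)]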