binaryRecords(path: str, recordLength: int) → pyspark.rdd.RDD[bytes]¶
Load data from a flat binary file, assuming each record is a set of numbers with the specified numerical format (see ByteBuffer), and the number of bytes per record is constant.
New in version 1.3.0.
Parameters
    path : str
        Directory to the input data files
    recordLength : int
        The length at which to split the records

Returns
    RDD[bytes]
        RDD of data with values, represented as byte arrays
Examples

>>> import os
>>> import tempfile
>>> with tempfile.TemporaryDirectory() as d:
...     # Write a temporary file
...     with open(os.path.join(d, "1.bin"), "w") as f:
...         for i in range(3):
...             _ = f.write("%04d" % i)
...
...     # Write another file
...     with open(os.path.join(d, "2.bin"), "w") as f:
...         for i in [-1, -2, -10]:
...             _ = f.write("%04d" % i)
...
...     collected = sorted(sc.binaryRecords(d, 4).collect())
>>> collected
[b'-001', b'-002', b'-010', b'0000', b'0001', b'0002']
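Since each record is returned as a raw byte array, the caller typically derives `recordLength` from a numeric format and decodes each record afterwards. The sketch below (plain Python, no Spark, so the splitting step stands in for what `binaryRecords` does per file) shows how the `struct` module's format strings tie the two together; the format string `"<dd"` is an arbitrary choice for illustration.

```python
import struct

# Assumed record layout: two little-endian float64 values per record.
fmt = "<dd"
record_length = struct.calcsize(fmt)  # 16 bytes per record

# Pack three records into one flat byte string, as a binary file would hold them.
data = b"".join(struct.pack(fmt, float(i), float(i) * 0.5) for i in range(3))

# Mimic binaryRecords: split the stream into constant-length byte chunks.
records = [data[i:i + record_length] for i in range(0, len(data), record_length)]

# Decode each chunk with the same format string.
decoded = [struct.unpack(fmt, r) for r in records]
```

With Spark, the decode step would be applied to the returned RDD instead, e.g. `sc.binaryRecords(path, record_length).map(lambda r: struct.unpack(fmt, r))`.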