pyspark.SparkContext.binaryRecords

SparkContext.binaryRecords(path: str, recordLength: int) → pyspark.rdd.RDD[bytes]

Load data from a flat binary file, assuming each record is a set of numbers with the specified numerical format (see ByteBuffer), and the number of bytes per record is constant.
New in version 1.3.0.
Parameters
    path : str
        Directory containing the input data files
    recordLength : int
        The length, in bytes, of each record

Returns
    RDD
        RDD of records, each represented as a byte array
Examples
>>> import os
>>> import tempfile
>>> with tempfile.TemporaryDirectory() as d:
...     # Write a temporary file
...     with open(os.path.join(d, "1.bin"), "w") as f:
...         for i in range(3):
...             _ = f.write("%04d" % i)
...
...     # Write another file
...     with open(os.path.join(d, "2.bin"), "w") as f:
...         for i in [-1, -2, -10]:
...             _ = f.write("%04d" % i)
...
...     collected = sorted(sc.binaryRecords(d, 4).collect())
>>> collected
[b'-001', b'-002', b'-010', b'0000', b'0001', b'0002']
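
Because each record comes back as a raw byte string, a common next step is to decode it with Python's struct module. The following is a minimal sketch, not part of the original docstring, assuming each 4-byte record holds a little-endian 32-bit integer:

>>> import struct
>>> with tempfile.TemporaryDirectory() as d:
...     # Pack three little-endian 32-bit integers, 4 bytes each
...     with open(os.path.join(d, "ints.bin"), "wb") as f:
...         for i in [7, 8, 9]:
...             _ = f.write(struct.pack("<i", i))
...
...     # Each fixed-length record maps back to one integer
...     numbers = sorted(sc.binaryRecords(d, 4)
...                      .map(lambda b: struct.unpack("<i", b)[0])
...                      .collect())
>>> numbers
[7, 8, 9]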