pyspark.SparkContext.binaryFiles

SparkContext.binaryFiles(path, minPartitions=None)

Read a directory of binary files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned as a key-value pair, where the key is the file's path and the value is the file's content as a byte array.

New in version 1.3.0.

Parameters

path : str
    directory containing the input data files; may also be a comma-separated list of paths
minPartitions : int, optional
    suggested minimum number of partitions for the resulting RDD
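
Because path is a single string, several input directories have to be joined with commas before the call. A minimal sketch of that, assuming an existing SparkContext is passed in as sc; the helper name binary_files_from and its min_partitions argument are hypothetical and not part of PySpark:

```python
def binary_files_from(sc, paths, min_partitions=None):
    """Hypothetical helper: read several input directories in one call."""
    # binaryFiles accepts one string, so join the inputs with commas.
    joined = ",".join(paths)
    # minPartitions is only a suggestion for the resulting RDD.
    return sc.binaryFiles(joined, minPartitions=min_partitions)
```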

Returns

RDD
    RDD of (path, content) pairs, one per file, where path is a str and content is a bytes object.

See also

SparkContext.binaryRecords()

Notes

Small files are preferred. Large files are also allowed, but because each file is read as a single record they may cause poor performance.

Examples

>>> import os
>>> import tempfile
>>> with tempfile.TemporaryDirectory(prefix="binaryFiles") as d:
...     # Write a temporary binary file
...     with open(os.path.join(d, "1.bin"), "wb") as f1:
...         _ = f1.write(b"binary data I")
...
...     # Write another temporary binary file
...     with open(os.path.join(d, "2.bin"), "wb") as f2:
...         _ = f2.write(b"binary data II")
...
...     collected = sorted(sc.binaryFiles(d).collect())

>>> collected
[('.../1.bin', b'binary data I'), ('.../2.bin', b'binary data II')]
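
Each value is an ordinary bytes object, so it can be decoded with standard-library tools. The sketch below assumes a made-up file layout (a 4-byte big-endian record count followed by the payload); parse_header is a hypothetical helper, not part of PySpark.

```python
import struct


def parse_header(content):
    """Hypothetical layout: 4-byte big-endian count, then the payload."""
    # Read the unsigned 32-bit count from the first four bytes.
    (count,) = struct.unpack_from(">I", content, 0)
    # Everything after the header is returned untouched.
    return count, content[4:]


# With an existing SparkContext `sc` and directory `d` (not run here):
# parsed = sc.binaryFiles(d).mapValues(parse_header)
```

Keeping the parser as a plain function lets it be tested without a cluster and then passed to mapValues unchanged.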