binaryFiles(path: str, minPartitions: Optional[int] = None) → pyspark.rdd.RDD[Tuple[str, bytes]]
Read a directory of binary files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI as a byte array. Each file is read as a single record and returned as a key-value pair, where the key is the path of the file and the value is its content.
New in version 1.3.0.
- path : str
directory of the input data files; the path can also be a comma-separated list of input paths
- minPartitions : int, optional
suggested minimum number of partitions for the resulting RDD
RDD representing path-content pairs from the file(s).
Small files are preferred. Large files are also allowed, but may cause poor performance.
>>> import os
>>> import tempfile
>>> with tempfile.TemporaryDirectory() as d:
...     # Write a temporary binary file
...     with open(os.path.join(d, "1.bin"), "wb") as f1:
...         _ = f1.write(b"binary data I")
...
...     # Write another temporary binary file
...     with open(os.path.join(d, "2.bin"), "wb") as f2:
...         _ = f2.write(b"binary data II")
...
...     collected = sorted(sc.binaryFiles(d).collect())
>>> collected
[('.../1.bin', b'binary data I'), ('.../2.bin', b'binary data II')]
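Because each record is a (path, content) pair whose value is a raw bytes object, a typical next step is to decode that value per file. The following is a minimal sketch, not part of the PySpark API: the helper names decode_uint32s and summarize, the sample path, and the assumed file layout (consecutive little-endian unsigned 32-bit integers) are all hypothetical and chosen only for illustration.

```python
import struct
from typing import List, Tuple


def decode_uint32s(content: bytes) -> List[int]:
    """Decode bytes as consecutive little-endian uint32 values (assumed layout)."""
    count, remainder = divmod(len(content), 4)
    if remainder:
        raise ValueError(f"expected a multiple of 4 bytes, got {len(content)}")
    return list(struct.unpack(f"<{count}I", content))


def summarize(pair: Tuple[str, bytes]) -> Tuple[str, int, int]:
    """Map one (path, content) pair to (path, number of values, sum of values)."""
    path, content = pair
    values = decode_uint32s(content)
    return path, len(values), sum(values)


# A hand-built pair with the same shape as one binaryFiles record.
# The path below is a made-up placeholder.
sample = ("file:/tmp/example/1.bin", struct.pack("<3I", 1, 2, 3))
print(summarize(sample))  # ('file:/tmp/example/1.bin', 3, 6)
```

With a live SparkContext, the same function could be applied to every file with sc.binaryFiles(d).map(summarize).collect(). Keeping the decoding in a plain function means it can be checked locally without a cluster.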