pyspark.SparkContext.binaryFiles
SparkContext.binaryFiles(path, minPartitions=None)
Read a directory of binary files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI as a byte array. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file and the value is the content of each file.

New in version 1.3.0.

Parameters
- path : str
  directory of the input data files; the path can be a comma-separated list of paths as multiple inputs (see the sketch after the Returns section)
- minPartitions : int, optional
  suggested minimum number of partitions for the resulting RDD
 
Returns
- RDD
  RDD representing path-content pairs from the file(s).
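
The path argument can name several inputs in one call as a comma-separated string, and minPartitions is only a hint rather than a guarantee. A minimal sketch of both, assuming an active SparkContext sc; the directory names below are hypothetical:

# The input directories are hypothetical; sc is assumed to be an
# existing SparkContext, as in the Examples section below.
rdd = sc.binaryFiles("/data/images,/data/archives", minPartitions=8)

# minPartitions is only a suggested lower bound; the actual partition
# count depends on how the input files are split.
print(rdd.getNumPartitions())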
 
Notes

Small files are preferred; large files are also allowed, but may cause poor performance.

Examples

>>> import os
>>> import tempfile
>>> with tempfile.TemporaryDirectory(prefix="binaryFiles") as d:
...     # Write a temporary binary file
...     with open(os.path.join(d, "1.bin"), "wb") as f1:
...         _ = f1.write(b"binary data I")
...
...     # Write another temporary binary file
...     with open(os.path.join(d, "2.bin"), "wb") as f2:
...         _ = f2.write(b"binary data II")
...
...     collected = sorted(sc.binaryFiles(d).collect())
>>> collected
[('.../1.bin', b'binary data I'), ('.../2.bin', b'binary data II')]
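
Because every file arrives whole as a single (path, bytes) record, per-file processing is usually expressed with pair-RDD operations such as mapValues. A minimal sketch, assuming sc is an active SparkContext and d is a directory containing binary files, as in the example above:

# Derive one record per input file while keeping the file paths as keys.
sizes = sc.binaryFiles(d).mapValues(len)                  # (path, size in bytes)
headers = sc.binaryFiles(d).mapValues(lambda b: b[:4])    # (path, first four bytes)
print(sorted(sizes.collect()))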