pyspark.SparkContext.binaryFiles

SparkContext.binaryFiles(path: str, minPartitions: Optional[int] = None) → pyspark.rdd.RDD[Tuple[str, bytes]][source]

Read a directory of binary files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI as a byte array. Each file is read as a single record and returned as a key-value pair, where the key is the path of the file and the value is its content.

Notes

Small files are preferred; large files are also allowed but may cause poor performance.
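
Examples

A minimal usage sketch, assuming a running SparkContext named sc and write access to a local temporary directory (the file names below are illustrative):

>>> import os
>>> import tempfile
>>> with tempfile.TemporaryDirectory() as d:
...     # write two small binary files into the directory
...     with open(os.path.join(d, "1.bin"), "wb") as f:
...         _ = f.write(b"binary data I")
...     with open(os.path.join(d, "2.bin"), "wb") as f:
...         _ = f.write(b"binary data II")
...     # each file becomes one (path, bytes) record
...     collected = sorted(sc.binaryFiles(d).collect())
>>> [(os.path.basename(path), content) for path, content in collected]
[('1.bin', b'binary data I'), ('2.bin', b'binary data II')]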