pyspark.SparkContext.wholeTextFiles

SparkContext.wholeTextFiles(path, minPartitions=None, use_unicode=True)

Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned as a key-value pair, where the key is the path of the file and the value is its content. The text files must be encoded as UTF-8.

New in version 1.0.0.

For example, if you have the following files:

hdfs://a-hdfs-path/part-00000
hdfs://a-hdfs-path/part-00001
...
hdfs://a-hdfs-path/part-nnnnn

After running rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path"), rdd contains:

(a-hdfs-path/part-00000, its content)
(a-hdfs-path/part-00001, its content)
...
(a-hdfs-path/part-nnnnn, its content)
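
The record shape can be illustrated without a cluster. The following is a plain-Python analogy on a local directory, not Spark's implementation: the hypothetical helpers whole_file_records and per_line_records mimic the one-record-per-file result described here and the one-record-per-line result of SparkContext.textFile().

```python
import os
import tempfile


def whole_file_records(directory):
    """Hypothetical helper: one (path, content) pair per file."""
    records = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        with open(path, encoding="utf-8") as f:
            records.append((path, f.read()))
    return records


def per_line_records(directory):
    """Hypothetical helper: one record per line, with no file path attached."""
    lines = []
    for _, content in whole_file_records(directory):
        lines.extend(content.splitlines())
    return lines


with tempfile.TemporaryDirectory() as d:
    # Two small files standing in for part-00000 and part-00001.
    with open(os.path.join(d, "part-00000"), "w", encoding="utf-8") as f:
        f.write("a\nb\n")
    with open(os.path.join(d, "part-00001"), "w", encoding="utf-8") as f:
        f.write("c\n")
    file_records = whole_file_records(d)
    line_records = per_line_records(d)

file_contents = [content for _, content in file_records]
print(len(file_records), file_contents)  # one record per file
print(len(line_records), line_records)   # one record per line
```

The whole-file form keeps the association between a path and its content, which the per-line form loses.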

Parameters

path (str): directory of the input data files; the path can also be a comma-separated list of input paths
minPartitions (int, optional): suggested minimum number of partitions for the resulting RDD
use_unicode (bool, default True): if use_unicode is False, the strings will be kept as str (encoded as UTF-8), which is faster and smaller than unicode.

New in version 1.2.0.
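
As a minimal sketch of how these parameters might be combined, the snippet below builds a comma-separated path argument from two hypothetical directories. It assumes an active SparkContext named sc for the actual call, which is shown only in a comment because it needs a running Spark session; the value 4 for minPartitions is an arbitrary illustration.

```python
# Hypothetical input directories, used only for illustration.
input_dirs = ["hdfs://a-hdfs-path", "hdfs://b-hdfs-path"]

# The path parameter accepts several inputs joined by commas.
path_arg = ",".join(input_dirs)
print(path_arg)

# With an active SparkContext `sc` (assumed, not created here):
# rdd = sc.wholeTextFiles(path_arg, minPartitions=4, use_unicode=True)
```

Because minPartitions is described as a suggested minimum, the resulting number of partitions may differ from the value passed.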

Returns

RDD: RDD representing path-content pairs from the file(s).

See also

RDD.saveAsTextFile()
SparkContext.textFile()

Notes

Small files are preferred, as each file will be loaded fully in memory.
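
One way to act on this note for local inputs is to check file sizes before reading. The sketch below assumes a local directory; oversized_files is a hypothetical helper, not part of PySpark, and the 100-byte limit is chosen only to keep the demonstration small.

```python
import os
import tempfile


def oversized_files(directory, limit_bytes):
    """Hypothetical helper: names of regular files larger than limit_bytes."""
    too_big = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and os.path.getsize(path) > limit_bytes:
            too_big.append(name)
    return too_big


with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "small.txt"), "w") as f:
        f.write("x" * 10)
    with open(os.path.join(d, "large.txt"), "w") as f:
        f.write("x" * 1000)
    # Flag anything above the illustrative 100-byte limit.
    flagged = oversized_files(d, limit_bytes=100)

print(flagged)
```

Files flagged this way may be better candidates for SparkContext.textFile(), which returns one record per line instead of one record per file.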

Examples

>>> import os
>>> import tempfile
>>> with tempfile.TemporaryDirectory(prefix="wholeTextFiles") as d:
...     # Write a temporary text file
...     with open(os.path.join(d, "1.txt"), "w") as f:
...         _ = f.write("123")
...
...     # Write another temporary text file
...     with open(os.path.join(d, "2.txt"), "w") as f:
...         _ = f.write("xyz")
...
...     collected = sorted(sc.wholeTextFiles(d).collect())
>>> collected
[('.../1.txt', '123'), ('.../2.txt', 'xyz')]
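
A typical next step is to transform each (path, content) pair. The sketch below defines a hypothetical function, name_and_line_count, and applies it to plain tuples shaped like the collected result above; the paths are made up. With an active SparkContext sc (assumed here, not created), the same function could be passed to RDD.map().

```python
import os


def name_and_line_count(pair):
    """Hypothetical helper: map (path, content) to (file name, line count)."""
    path, content = pair
    return os.path.basename(path), len(content.splitlines())


# Stand-in tuples shaped like the collected result; the paths are made up.
pairs = [
    ("file:/tmp/demo/1.txt", "123"),
    ("file:/tmp/demo/2.txt", "x\ny\nz\n"),
]
summary = [name_and_line_count(p) for p in pairs]
print(summary)

# With an active SparkContext `sc` (assumed, not created here):
# sc.wholeTextFiles(d).map(name_and_line_count).collect()
```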