pyspark.SparkContext.textFile

SparkContext.textFile(name, minPartitions=None, use_unicode=True)

Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings. The text files must be encoded as UTF-8.

New in version 0.7.0.

Parameters

name : str
    Path to the input file or directory. Several inputs can be passed as a single string of comma-separated paths.
minPartitions : int, optional
    Suggested minimum number of partitions for the resulting RDD.
use_unicode : bool, default True
    If use_unicode is False, the strings are left as raw UTF-8 encoded data rather than decoded to unicode strings, which is faster and smaller.

    New in version 1.2.0.

Returns

RDD: RDD representing text data from the file(s).

See also

RDD.saveAsTextFile()
SparkContext.wholeTextFiles()

Examples

>>> import os
>>> import tempfile
>>> with tempfile.TemporaryDirectory(prefix="textFile") as d:
...     path1 = os.path.join(d, "text1")
...     path2 = os.path.join(d, "text2")
...
...     # Write a temporary text file
...     sc.parallelize(["x", "y", "z"]).saveAsTextFile(path1)
...
...     # Write another temporary text file
...     sc.parallelize(["aa", "bb", "cc"]).saveAsTextFile(path2)
...
...     # Load text file
...     collected1 = sorted(sc.textFile(path1, 3).collect())
...     collected2 = sorted(sc.textFile(path2, 4).collect())
...
...     # Load two text files together
...     collected3 = sorted(sc.textFile('{},{}'.format(path1, path2), 5).collect())

>>> collected1
['x', 'y', 'z']
>>> collected2
['aa', 'bb', 'cc']
>>> collected3
['aa', 'bb', 'cc', 'x', 'y', 'z']
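
Because name accepts several inputs only as one comma-separated string, a list of paths has to be joined before the call. The following is a minimal sketch of one way to do that, not part of the PySpark API: join_input_paths and read_text_files are hypothetical helper names, and rejecting every path that contains a comma is a deliberate simplification (it also rejects glob patterns such as {a,b}).

```python
from typing import Iterable, Optional


def join_input_paths(paths: Iterable[str]) -> str:
    """Hypothetical helper: build the comma-separated ``name`` argument."""
    cleaned = [str(p) for p in paths]
    if not cleaned:
        raise ValueError("at least one input path is required")
    for p in cleaned:
        # Assumption: a comma inside a single path would be read as a
        # separator, so this sketch rejects it outright.
        if "," in p:
            raise ValueError("path contains a comma: {!r}".format(p))
    return ",".join(cleaned)


def read_text_files(sc, paths: Iterable[str], min_partitions: Optional[int] = None):
    """Hypothetical wrapper: load several text inputs into one RDD.

    ``sc`` is expected to be an existing SparkContext.
    """
    return sc.textFile(join_input_paths(paths), minPartitions=min_partitions)
```

With an active SparkContext sc, read_text_files(sc, [path1, path2], 5) would issue the same call as the '{},{}'.format(path1, path2) example above.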