pyspark.SparkContext.textFile

SparkContext.textFile(name: str, minPartitions: Optional[int] = None, use_unicode: bool = True) → pyspark.rdd.RDD[str]

Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings. The text files must be encoded as UTF-8.

New in version 0.7.0.

Parameters
name : str

path to the input data file or directory; can also be a comma-separated list of paths to read several inputs at once

minPartitions : int, optional

suggested minimum number of partitions for the resulting RDD; since this is only a hint, the actual number of partitions may be larger (see the sketch after this parameter list)

use_unicode : bool, default True

If use_unicode is False, the strings will be kept as str (encoded as UTF-8), which is faster and smaller than unicode; a rough sketch of this is appended after the Examples below.

New in version 1.2.0.
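
A minimal sketch of the minPartitions hint, assuming the same running SparkContext sc used in the Examples below; the value is only a suggested lower bound, and the actual partition count depends on how Hadoop splits the input files:

>>> import os
>>> import tempfile
>>> with tempfile.TemporaryDirectory() as d:
...     path = os.path.join(d, "hint")
...
...     # Write a small temporary text file
...     sc.parallelize(range(100)).saveAsTextFile(path)
...
...     # Ask for at least four partitions when reading it back
...     numPartitions = sc.textFile(path, minPartitions=4).getNumPartitions()
>>> numPartitions >= 4
True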

Returns
RDD

RDD representing text data from the file(s).

Examples

>>> import os
>>> import tempfile
>>> with tempfile.TemporaryDirectory() as d:
...     path1 = os.path.join(d, "text1")
...     path2 = os.path.join(d, "text2")
...
...     # Write a temporary text file
...     sc.parallelize(["x", "y", "z"]).saveAsTextFile(path1)
...
...     # Write another temporary text file
...     sc.parallelize(["aa", "bb", "cc"]).saveAsTextFile(path2)
...
...     # Load text file
...     collected1 = sorted(sc.textFile(path1, 3).collect())
...     collected2 = sorted(sc.textFile(path2, 4).collect())
...
...     # Load two text files together
...     collected3 = sorted(sc.textFile('{},{}'.format(path1, path2), 5).collect())
>>> collected1
['x', 'y', 'z']
>>> collected2
['aa', 'bb', 'cc']
>>> collected3
['aa', 'bb', 'cc', 'x', 'y', 'z']
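
As a rough sketch of use_unicode=False, reusing the temporary-file pattern above and assuming the undecoded lines come back as UTF-8 bytes objects rather than str:

>>> with tempfile.TemporaryDirectory() as d:
...     path3 = os.path.join(d, "text3")
...
...     # Write a temporary text file
...     sc.parallelize(["aa", "bb", "cc"]).saveAsTextFile(path3)
...
...     # Read it back without decoding the lines
...     collected4 = sorted(sc.textFile(path3, use_unicode=False).collect())
>>> collected4
[b'aa', b'bb', b'cc']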