pyspark.RDD.saveAsTextFile
RDD.saveAsTextFile(path, compressionCodecClass=None)
Save this RDD as a text file, using string representations of elements.
New in version 0.7.0.
Parameters
path : str
    path to the text file
compressionCodecClass : str, optional
    fully qualified classname of the compression codec class, e.g. "org.apache.hadoop.io.compress.GzipCodec" (None by default)
Examples
>>> import os
>>> import tempfile
>>> from fileinput import input
>>> from glob import glob
>>> with tempfile.TemporaryDirectory(prefix="saveAsTextFile1") as d1:
...     path1 = os.path.join(d1, "text_file1")
...
...     # Write a temporary text file
...     sc.parallelize(range(10)).saveAsTextFile(path1)
...
...     # Read the saved text back
...     ''.join(sorted(input(glob(path1 + "/part-0000*"))))
'0\n1\n2\n3\n4\n5\n6\n7\n8\n9\n'
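Elements do not have to be strings already; anything else is written via its string representation (str()). A minimal sketch of this behavior, assuming the same session as above (the directory prefix and sample tuples here are illustrative, and the part-file layout may vary with partitioning):

>>> with tempfile.TemporaryDirectory(prefix="saveAsTextFileTuples") as d:
...     path = os.path.join(d, "tuples")
...
...     # Non-string elements (here, tuples) are converted with str()
...     sc.parallelize([(1, 'a'), (2, 'b')]).saveAsTextFile(path)
...
...     # Each element becomes one line of output
...     sorted(input(glob(path + "/part-0000*")))
["(1, 'a')\n", "(2, 'b')\n"]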
Empty lines are tolerated when saving to text files.
>>> with tempfile.TemporaryDirectory(prefix="saveAsTextFile2") as d2: ... path2 = os.path.join(d2, "text2_file2") ... ... # Write another temporary text file ... sc.parallelize(['', 'foo', '', 'bar', '']).saveAsTextFile(path2) ... ... # Load text file as an RDD ... ''.join(sorted(input(glob(path2 + "/part-0000*")))) '\n\n\nbar\nfoo\n'
Using compressionCodecClass
>>> from fileinput import input, hook_compressed
>>> with tempfile.TemporaryDirectory(prefix="saveAsTextFile3") as d3:
...     path3 = os.path.join(d3, "text3")
...     codec = "org.apache.hadoop.io.compress.GzipCodec"
...
...     # Write another temporary text file with the specified codec
...     sc.parallelize(['foo', 'bar']).saveAsTextFile(path3, codec)
...
...     # Read the compressed part files back
...     result = sorted(input(glob(path3 + "/part*.gz"), openhook=hook_compressed))
...     ''.join([r.decode('utf-8') if isinstance(r, bytes) else r for r in result])
'bar\nfoo\n'
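For a round trip, the saved output (compressed or not) can also be loaded back into an RDD with SparkContext.textFile, which recognizes common Hadoop codecs such as Gzip by file extension. A minimal sketch, with illustrative directory names:

>>> with tempfile.TemporaryDirectory(prefix="saveAsTextFileRoundTrip") as d4:
...     path4 = os.path.join(d4, "roundtrip")
...
...     # Write gzip-compressed text output
...     sc.parallelize(['foo', 'bar']).saveAsTextFile(
...         path4, "org.apache.hadoop.io.compress.GzipCodec")
...
...     # textFile decompresses the part files transparently
...     sorted(sc.textFile(path4).collect())
['bar', 'foo']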