pyspark.pandas.Series.to_csv
- Series.to_csv(path=None, sep=',', na_rep='', columns=None, header=True, quotechar='"', date_format=None, escapechar=None, num_files=None, mode='w', partition_cols=None, index_col=None, **options)
- Write object to a comma-separated values (csv) file.
- Note
- pandas-on-Spark to_csv writes files to a path or URI. Unlike pandas, pandas-on-Spark respects HDFS properties such as ‘fs.default.name’.
- Note
- When path is specified, pandas-on-Spark writes CSV files into the directory path, producing multiple part-… files in that directory. This behavior is inherited from Apache Spark. The number of partitions can be controlled by num_files, but that parameter is deprecated; use DataFrame.spark.repartition instead.
- Parameters
- path: str, default None
- File path. If None is provided the result is returned as a string. 
- sep: str, default ‘,’
- String of length 1. Field delimiter for the output file. 
- na_rep: str, default ‘’
- Missing data representation. 
- columns: sequence, optional
- Columns to write. 
- header: bool or list of str, default True
- Write out the column names. If a list of strings is given it is assumed to be aliases for the column names. 
- quotechar: str, default ‘"’
- String of length 1. Character used to quote fields. 
- date_format: str, default None
- Format string for datetime objects. 
- escapechar: str, default None
- String of length 1. Character used to escape sep and quotechar when appropriate. 
- num_files
- The number of partitions to be written in the `path` directory when `path` is specified. This parameter is deprecated; use DataFrame.spark.repartition instead. 
- mode: str, default ‘w’
- Python write mode.
- Note
- mode also accepts the Spark write mode strings ‘append’, ‘overwrite’, ‘ignore’, ‘error’, and ‘errorifexists’:
- ‘append’ (equivalent to ‘a’): Append the new data to existing data. 
- ‘overwrite’ (equivalent to ‘w’): Overwrite existing data. 
- ‘ignore’: Silently ignore this operation if data already exists. 
- ‘error’ or ‘errorifexists’: Throw an exception if data already exists. 
 
- partition_cols: str or list of str, optional, default None
- Names of partitioning columns. 
- index_col: str or list of str, optional, default None
- Column names to be used in Spark to represent pandas-on-Spark’s index. The index name in pandas-on-Spark is ignored. By default, the index is always lost. 
- options: keyword arguments
- Additional options specific to PySpark’s CSV writer. Check the options in PySpark’s API documentation for spark.write.csv(…). These options have higher priority and overwrite all other options. This parameter only works when path is specified. 
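
The equivalences between Python write modes and Spark write modes listed under mode can be summarized as a small lookup. This is only an illustrative sketch: `normalize_mode`, `PYTHON_TO_SPARK_MODE`, and `SPARK_MODES` are hypothetical names, not part of pyspark.pandas, and the table covers only the values named on this page.

```python
# Hypothetical helper, not part of pyspark.pandas. It only encodes the
# equivalences stated in the `mode` note on this page.
PYTHON_TO_SPARK_MODE = {
    "w": "overwrite",  # 'overwrite' is documented as equivalent to 'w'
    "a": "append",     # 'append' is documented as equivalent to 'a'
}
SPARK_MODES = {"append", "overwrite", "ignore", "error", "errorifexists"}


def normalize_mode(mode: str) -> str:
    """Return the Spark write mode name for a Python or Spark mode string."""
    spark_mode = PYTHON_TO_SPARK_MODE.get(mode, mode)
    if spark_mode not in SPARK_MODES:
        raise ValueError(f"unsupported mode: {mode!r}")
    return spark_mode


print(normalize_mode("w"))       # overwrite
print(normalize_mode("ignore"))  # ignore
```

Read this way, the default mode='w' replaces any data already present at path, while ‘error’ or ‘errorifexists’ is the choice when existing output must not be touched.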
 
- Returns
- str or None
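
When path is given, the data is written to a directory of part-… files, as described in the note at the top of this page. The sketch below shows one way to collect such a directory back into a single header and row list using only the Python standard library. It assumes each part file begins with the same header row (header=True); `read_part_files` is a hypothetical helper, not a pyspark.pandas function.

```python
import csv
from pathlib import Path


def read_part_files(directory):
    """Collect rows from every part-* file in `directory`.

    Assumption: each part file starts with the same header row.
    Returns (header, rows); header is None if no rows were found.
    """
    header, rows = None, []
    # Sorting only makes the result deterministic; the row order across
    # part files carries no meaning.
    for part in sorted(Path(directory).glob("part-*")):
        with part.open(newline="") as f:
            reader = csv.reader(f)
            part_header = next(reader, None)
            if part_header is None:
                continue  # skip empty part files
            header = header or part_header
            rows.extend(reader)
    return header, rows
```

Because row order across part files is not meaningful, sort the rows after reading if order matters.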
 
- Examples

>>> df = ps.DataFrame(dict(
...     date=list(pd.date_range('2012-1-1 12:00:00', periods=3, freq='ME')),
...     country=['KR', 'US', 'JP'],
...     code=[1, 2, 3]), columns=['date', 'country', 'code'])
>>> df.sort_values(by="date")
                   date country  code
... 2012-01-31 12:00:00      KR     1
... 2012-02-29 12:00:00      US     2
... 2012-03-31 12:00:00      JP     3

>>> print(df.to_csv())
date,country,code
2012-01-31 12:00:00,KR,1
2012-02-29 12:00:00,US,2
2012-03-31 12:00:00,JP,3

>>> df.cummax().to_csv(path=r'%s/to_csv/foo.csv' % path, num_files=1)
>>> ps.read_csv(
...     path=r'%s/to_csv/foo.csv' % path
... ).sort_values(by="date")
                   date country  code
... 2012-01-31 12:00:00      KR     1
... 2012-02-29 12:00:00      US     2
... 2012-03-31 12:00:00      US     3

In case of Series,

>>> print(df.date.to_csv())
date
2012-01-31 12:00:00
2012-02-29 12:00:00
2012-03-31 12:00:00

>>> df.date.to_csv(path=r'%s/to_csv/foo.csv' % path, num_files=1)
>>> ps.read_csv(
...     path=r'%s/to_csv/foo.csv' % path
... ).sort_values(by="date")
                   date
... 2012-01-31 12:00:00
... 2012-02-29 12:00:00
... 2012-03-31 12:00:00

You can preserve the index in the roundtrip as below.

>>> df.set_index("country", append=True, inplace=True)
>>> df.date.to_csv(
...     path=r'%s/to_csv/bar.csv' % path,
...     num_files=1,
...     index_col=["index1", "index2"])
>>> ps.read_csv(
...     path=r'%s/to_csv/bar.csv' % path, index_col=["index1", "index2"]
... ).sort_values(by="date")
                             date
index1 index2
...    ...    2012-01-31 12:00:00
...    ...    2012-02-29 12:00:00
...    ...    2012-03-31 12:00:00
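
The examples above pass num_files=1, which this page marks as deprecated in favor of DataFrame.spark.repartition. A minimal sketch of that replacement for a Series follows. `to_csv_with_partitions` is a hypothetical helper, not part of pyspark.pandas; it assumes the Series is first converted to a one-column DataFrame with to_frame(), since the repartition method named on this page belongs to DataFrame.

```python
def to_csv_with_partitions(psser, path, num_partitions=1, **to_csv_kwargs):
    """Write a pandas-on-Spark Series as CSV with a chosen partition count.

    Hypothetical helper: converts the Series to a one-column DataFrame,
    repartitions it with DataFrame.spark.repartition, then calls to_csv
    without the deprecated num_files argument.
    """
    psdf = psser.to_frame()
    repartitioned = psdf.spark.repartition(num_partitions)
    return repartitioned.to_csv(path=path, **to_csv_kwargs)


# Usage (requires a running Spark session; `df` and `path` as in the
# examples above):
# to_csv_with_partitions(df.date, r'%s/to_csv/foo.csv' % path)
```

Extra keyword arguments such as index_col or mode are forwarded unchanged to to_csv.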