pyspark.pandas.DataFrame.to_delta¶

DataFrame.to_delta(path: str, mode: str = 'w', partition_cols: Union[str, List[str], None] = None, index_col: Union[str, List[str], None] = None, **options: OptionalPrimitiveType) → None[source]¶

Write the DataFrame out as a Delta Lake table.

Parameters

pathstr, required

Path to write to.

modestr

Python write mode, default ‘w’.

Note

mode can accept the strings for Spark writing mode. Such as ‘append’, ‘overwrite’, ‘ignore’, ‘error’, ‘errorifexists’.

‘append’ (equivalent to ‘a’): Append the new data to existing data.
‘overwrite’ (equivalent to ‘w’): Overwrite existing data.
‘ignore’: Silently ignore this operation if data already exists.
‘error’ or ‘errorifexists’: Throw an exception if data already exists.

partition_colsstr or list of str, optional, default None

Names of partitioning columns

index_col: str or list of str, optional, default: None

Column names to be used in Spark to represent pandas-on-Spark’s index. The index name in pandas-on-Spark is ignored. By default, the index is always lost.

optionsdict

All other options passed directly into Delta Lake.

See also

read_delta
DataFrame.to_parquet
DataFrame.to_table
DataFrame.to_spark_io

Examples

>>> df = ps.DataFrame(dict(
...    date=list(pd.date_range('2012-1-1 12:00:00', periods=3, freq='M')),
...    country=['KR', 'US', 'JP'],
...    code=[1, 2 ,3]), columns=['date', 'country', 'code'])
>>> df
                 date country  code
0 2012-01-31 12:00:00      KR     1
1 2012-02-29 12:00:00      US     2
2 2012-03-31 12:00:00      JP     3

Create a new Delta Lake table, partitioned by one column:

>>> df.to_delta('%s/to_delta/foo' % path, partition_cols='date')  

Partitioned by two columns:

>>> df.to_delta('%s/to_delta/bar' % path,
...             partition_cols=['date', 'country'])  

Overwrite an existing table’s partitions, using the ‘replaceWhere’ capability in Delta:

>>> df.to_delta('%s/to_delta/bar' % path,
...             mode='overwrite', replaceWhere='date >= "2012-01-01"')  

pyspark.pandas.read_delta pyspark.pandas.read_parquet