pyspark.sql.DataFrameWriter.bucketBy

DataFrameWriter.bucketBy(numBuckets, col, *cols)

Buckets the output by the given columns. If specified, the output is laid out on the file system in a way similar to Hive’s bucketing scheme.

New in version 2.3.0.

Parameters
numBuckets : int

the number of buckets to save

col : str, list or tuple

a name of a column, or a list of names.

cols : str

additional names (optional). If col is a list it should be empty.
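
The relationship between col and cols can be sketched in plain Python. This is an illustration only, not PySpark's code: normalize_bucket_columns is a hypothetical helper, and the error message is an assumption. It shows the two accepted call shapes and why cols should be empty when col is already a list or tuple.

```python
def normalize_bucket_columns(col, *cols):
    """Return the bucketing column names as one flat list.

    Hypothetical helper that mirrors the documented contract:
    either pass names one by one, or pass a single list/tuple.
    """
    if isinstance(col, (list, tuple)):
        if cols:
            # Mixing both styles is ambiguous, so it is rejected.
            raise ValueError("col is a list or tuple, so cols must be empty")
        return list(col)
    return [col, *cols]


# Both call shapes describe the same bucketing columns.
print(normalize_bucket_columns("year", "month"))
print(normalize_bucket_columns(["year", "month"]))
```

Under this contract, bucketBy(100, 'year', 'month') and bucketBy(100, ['year', 'month']) name the same columns.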

Notes

Applicable to file-based data sources, and only in combination with DataFrameWriter.saveAsTable().
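
As a rough mental model of what bucketing the output means, each row is assigned to one of numBuckets groups by hashing its bucketing columns and taking the result modulo the bucket count. The sketch below is plain Python and runs without Spark. The bucket_id function, the sample rows, and the use of CRC32 are all assumptions made for illustration; CRC32 is a stand-in and is not the hash function Spark uses, so the bucket numbers it produces will not match a real table.

```python
import zlib
from collections import defaultdict


def bucket_id(values, num_buckets):
    """Map a tuple of column values to a bucket index in [0, num_buckets).

    Stand-in hash only: CRC32 over the joined string form of the values.
    """
    key = "|".join(str(v) for v in values).encode("utf-8")
    return zlib.crc32(key) % num_buckets


# Hypothetical sample rows with the columns used in the example below.
rows = [
    {"year": 2022, "month": 12, "amount": 10.0},
    {"year": 2023, "month": 1, "amount": 7.5},
    {"year": 2023, "month": 1, "amount": 3.0},
    {"year": 2023, "month": 2, "amount": 1.25},
]

num_buckets = 4
buckets = defaultdict(list)
for row in rows:
    # Rows with equal (year, month) always land in the same bucket.
    buckets[bucket_id((row["year"], row["month"]), num_buckets)].append(row)

for index in sorted(buckets):
    print(index, len(buckets[index]))
```

The property that matters is that equal bucketing keys always map to the same bucket, which is what lets a query engine skip a shuffle when joining or aggregating on those columns.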

Examples

>>> (df.write.format('parquet')  
...     .bucketBy(100, 'year', 'month')
...     .mode("overwrite")
...     .saveAsTable('bucketed_table'))