pyspark.sql.DataFrameWriter.bucketBy

DataFrameWriter.bucketBy(numBuckets, col, *cols)

Buckets the output by the given columns. The output is laid out on the file system in a manner similar to Hive’s bucketing scheme, but it uses a different bucket hash function and is therefore not compatible with Hive’s bucketing.

New in version 2.3.0.

Parameters
numBuckets : int

the number of buckets to save

col : str, list or tuple

a column name, or a list of column names.

cols : str

additional column names (optional). If col is a list or tuple, cols should be empty.
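
For instance, the following two calls build equivalent writers (a minimal sketch, assuming a DataFrame df with year and month columns):

>>> writer = df.write.bucketBy(100, 'year', 'month')
>>> writer = df.write.bucketBy(100, ['year', 'month'])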

Notes

Applicable for file-based data sources in combination with DataFrameWriter.saveAsTable().
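
Because the bucketing metadata is recorded in the session catalog, writing through a path-based call such as save() or parquet() is rejected; a minimal sketch of the failure mode (the exact exception text varies by Spark version):

>>> df.write.bucketBy(100, 'year').parquet('/tmp/bucketed')
Traceback (most recent call last):
    ...
AnalysisException: ...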

Examples

>>> (df.write.format('parquet')  
...     .bucketBy(100, 'year', 'month')
...     .mode("overwrite")
...     .saveAsTable('bucketed_table'))
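
Bucketing composes with DataFrameWriter.sortBy() to sort the rows within each bucket; a sketch under the same assumptions as above, plus a day column in df and an active SparkSession named spark:

>>> (df.write.format('parquet')
...     .bucketBy(100, 'year', 'month')
...     .sortBy('day')
...     .mode("overwrite")
...     .saveAsTable('sorted_bucketed_table'))
>>> bucketed = spark.table('sorted_bucketed_table')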