pyspark.sql.DataFrameWriter.bucketBy

DataFrameWriter.bucketBy(numBuckets: int, col: Union[str, List[str], Tuple[str, …]], *cols: Optional[str]) → pyspark.sql.readwriter.DataFrameWriter[source]

Buckets the output by the given columns. If specified, the output is laid out on the file system in a way similar to Hive's bucketing scheme, but it uses a different bucket hash function and is therefore not compatible with Hive's bucketing.

New in version 2.3.0.

Changed in version 3.4.0: Supports Spark Connect.

Parameters
numBuckets : int
    the number of buckets to save.

col : str, list or tuple
    a name of a column, or a list of names.

cols : str
    additional names (optional). If col is a list it should be empty.

Notes

Applicable for file-based data sources in combination with DataFrameWriter.saveAsTable().

Examples

Write a DataFrame into a Parquet file in a bucketed manner, and read it back.

>>> # Write a DataFrame into a Parquet file in a bucketed manner.
... _ = spark.sql("DROP TABLE IF EXISTS bucketed_table")
>>> spark.createDataFrame([
...     (100, "Hyukjin Kwon"), (120, "Hyukjin Kwon"), (140, "Haejoon Lee")],
...     schema=["age", "name"]
... ).write.bucketBy(2, "name").mode("overwrite").saveAsTable("bucketed_table")
>>> # Read the Parquet file as a DataFrame.
... spark.read.table("bucketed_table").sort("age").show()
+---+------------+
|age|        name|
+---+------------+
|100|Hyukjin Kwon|
|120|Hyukjin Kwon|
|140| Haejoon Lee|
+---+------------+
>>> _ = spark.sql("DROP TABLE bucketed_table")
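Since col also accepts a list of names, you can bucket by multiple columns and combine bucketing with DataFrameWriter.sortBy() to sort rows within each bucket. The following is a minimal sketch following the same pattern as above; the table name bucketed_table2 is illustrative.

>>> _ = spark.sql("DROP TABLE IF EXISTS bucketed_table2")
>>> spark.createDataFrame([
...     (100, "Hyukjin Kwon"), (120, "Hyukjin Kwon"), (140, "Haejoon Lee")],
...     schema=["age", "name"]
... ).write.bucketBy(2, ["name", "age"]).sortBy("age").mode("overwrite").saveAsTable("bucketed_table2")
>>> spark.read.table("bucketed_table2").sort("age").show()
+---+------------+
|age|        name|
+---+------------+
|100|Hyukjin Kwon|
|120|Hyukjin Kwon|
|140| Haejoon Lee|
+---+------------+
>>> _ = spark.sql("DROP TABLE bucketed_table2")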