Interface HasPartitionKey

All Superinterfaces:
InputPartition, Serializable

public interface HasPartitionKey extends InputPartition
A mix-in for input partitions whose records are clustered on the same set of partition keys (provided via SupportsReportPartitioning, see below). Data sources can opt-in to implement this interface for the partitions they report to Spark, which will use the information to avoid data shuffling in certain scenarios, such as join, aggregate, etc. Note that Spark requires ALL input partitions to implement this interface, otherwise it can't take advantage of it.

This interface should be used in combination with SupportsReportPartitioning, which allows data sources to report distribution and ordering spec to Spark. In particular, Spark expects data sources to report ClusteredDistribution whenever its input partitions implement this interface. Spark derives partition keys spec (e.g., column names, transforms) from the distribution, and partition values from the input partitions.

It is implementor's responsibility to ensure that when an input partition implements this interface, its records all have the same value for the partition keys. Spark doesn't check this property.

Since:
3.3.0
See Also:
  • Method Summary

    Modifier and Type
    Method
    Description
    org.apache.spark.sql.catalyst.InternalRow
    Returns the value of the partition key(s) associated to this partition.

    Methods inherited from interface org.apache.spark.sql.connector.read.InputPartition

    preferredLocations
  • Method Details

    • partitionKey

      org.apache.spark.sql.catalyst.InternalRow partitionKey()
      Returns the value of the partition key(s) associated to this partition. An input partition implementing this interface needs to ensure that all its records have the same value for the partition keys. Note that the value is after partition transform has been applied, if there is any.