Interface HasPartitionKey
- All Superinterfaces:
InputPartition
,Serializable
A mix-in for input partitions whose records are clustered on the same set of partition keys
(provided via
SupportsReportPartitioning
, see below). Data sources can opt-in to
implement this interface for the partitions they report to Spark, which will use the
information to avoid data shuffling in certain scenarios, such as join, aggregate, etc. Note
that Spark requires ALL input partitions to implement this interface, otherwise it can't take
advantage of it.
This interface should be used in combination with SupportsReportPartitioning
, which
allows data sources to report distribution and ordering spec to Spark. In particular, Spark
expects data sources to report
ClusteredDistribution
whenever its input
partitions implement this interface. Spark derives partition keys spec (e.g., column names,
transforms) from the distribution, and partition values from the input partitions.
It is implementor's responsibility to ensure that when an input partition implements this interface, its records all have the same value for the partition keys. Spark doesn't check this property.
- Since:
- 3.3.0
- See Also:
-
Method Summary
Modifier and TypeMethodDescriptionorg.apache.spark.sql.catalyst.InternalRow
Returns the value of the partition key(s) associated to this partition.Methods inherited from interface org.apache.spark.sql.connector.read.InputPartition
preferredLocations
-
Method Details
-
partitionKey
org.apache.spark.sql.catalyst.InternalRow partitionKey()Returns the value of the partition key(s) associated to this partition. An input partition implementing this interface needs to ensure that all its records have the same value for the partition keys. Note that the value is after partition transform has been applied, if there is any.
-