public interface HasPartitionKey extends InputPartition
SupportsReportPartitioning
, see below). Data sources can opt-in to
implement this interface for the partitions they report to Spark, which will use the
information to avoid data shuffling in certain scenarios, such as join, aggregate, etc. Note
that Spark requires ALL input partitions to implement this interface, otherwise it can't take
advantage of it.
This interface should be used in combination with SupportsReportPartitioning
, which
allows data sources to report distribution and ordering spec to Spark. In particular, Spark
expects data sources to report
ClusteredDistribution
whenever its input
partitions implement this interface. Spark derives partition keys spec (e.g., column names,
transforms) from the distribution, and partition values from the input partitions.
It is implementor's responsibility to ensure that when an input partition implements this interface, its records all have the same value for the partition keys. Spark doesn't check this property.
SupportsReportPartitioning
,
Partitioning
Modifier and Type | Method and Description |
---|---|
org.apache.spark.sql.catalyst.InternalRow |
partitionKey()
Returns the value of the partition key(s) associated to this partition.
|
preferredLocations
org.apache.spark.sql.catalyst.InternalRow partitionKey()