Packages

package read

Package Members

  1. package colstats
  2. package partitioning
  3. package streaming

Type Members

  1. trait Batch extends AnyRef

    A physical representation of a data source scan for batch queries. This interface is used to provide physical information, like how many partitions the scanned data has, and how to read records from the partitions.

    Annotations
    @Evolving()
    Since

    3.0.0
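
    For illustration, a minimal sketch of a Batch implementation, assuming a hypothetical in-memory source holding (id, name) rows; ChunkBatch, ChunkPartition and ChunkReaderFactory are made-up names (the latter two are sketched under InputPartition and PartitionReaderFactory below), not part of the API:

      import org.apache.spark.sql.connector.read.{Batch, InputPartition, PartitionReaderFactory}

      // Hypothetical batch over in-memory (id, name) rows, split into fixed-size chunks.
      class ChunkBatch(data: Array[(Int, String)], chunkSize: Int) extends Batch {

        // One InputPartition per chunk; Spark schedules one task per partition.
        override def planInputPartitions(): Array[InputPartition] =
          data.grouped(chunkSize)
            .map(chunk => ChunkPartition(chunk, "localhost"): InputPartition)
            .toArray

        // Serialized and used on executors to create the actual PartitionReaders.
        override def createReaderFactory(): PartitionReaderFactory = new ChunkReaderFactory()
      }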

  2. trait HasPartitionKey extends InputPartition

    A mix-in for input partitions whose records are clustered on the same set of partition keys (provided via SupportsReportPartitioning, see below). Data sources can opt in to implement this interface for the partitions they report to Spark, which will use the information to avoid data shuffling in certain scenarios, such as join, aggregate, etc. Note that Spark requires ALL input partitions to implement this interface, otherwise it can't take advantage of it.

    This interface should be used in combination with SupportsReportPartitioning, which allows data sources to report a distribution and ordering spec to Spark. In particular, Spark expects data sources to report org.apache.spark.sql.connector.distributions.ClusteredDistribution whenever their input partitions implement this interface. Spark derives the partition key spec (e.g., column names, transforms) from the distribution, and the partition values from the input partitions.

    It is the implementor's responsibility to ensure that when an input partition implements this interface, its records all have the same value for the partition keys. Spark doesn't check this property.

    Since

    3.3.0

    See also

    org.apache.spark.sql.connector.read.SupportsReportPartitioning

    org.apache.spark.sql.connector.read.partitioning.Partitioning
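
    For illustration, a minimal sketch of an input partition that reports a partition key, assuming a hypothetical source clustered by a region column; RegionPartition and its fields are made-up names:

      import org.apache.spark.sql.catalyst.InternalRow
      import org.apache.spark.sql.connector.read.HasPartitionKey
      import org.apache.spark.unsafe.types.UTF8String

      // Hypothetical partition whose records all share a single value of `region`.
      case class RegionPartition(region: String, files: Seq[String]) extends HasPartitionKey {
        // Every record produced by this partition must have exactly this key value;
        // Spark does not verify it.
        override def partitionKey(): InternalRow = InternalRow(UTF8String.fromString(region))
      }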

  3. trait HasPartitionStatistics extends InputPartition

    A mix-in for input partitions whose records are clustered on the same set of partition keys (provided via SupportsReportPartitioning, see below). Data sources can opt in to implement this interface for the partitions they report to Spark, which will use the information to decide whether partition grouping should be applied or not.

    Since

    4.0.0

    See also

    org.apache.spark.sql.connector.read.SupportsReportPartitioning

  4. trait InputPartition extends Serializable

    A serializable representation of an input partition returned by Batch#planInputPartitions() and the corresponding ones in streaming.

    Note that InputPartition will be serialized and sent to executors, then PartitionReader will be created by PartitionReaderFactory#createReader(InputPartition) or PartitionReaderFactory#createColumnarReader(InputPartition) on executors to do the actual reading. So InputPartition must be serializable while PartitionReader doesn't need to be.

    Annotations
    @Evolving()
    Since

    3.0.0
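
    For illustration, a minimal sketch of a serializable input partition for the hypothetical in-memory source used in the Batch example above; ChunkPartition and the locality hint are made-up:

      import org.apache.spark.sql.connector.read.InputPartition

      // Hypothetical partition carrying the rows it covers plus a locality hint.
      // It is a case class, so it is serializable and can be shipped to executors.
      case class ChunkPartition(rows: Array[(Int, String)], host: String) extends InputPartition {
        // Optional: hint which hosts can read this partition most cheaply.
        override def preferredLocations(): Array[String] = Array(host)
      }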

  5. trait LocalScan extends Scan

    A special Scan that runs locally on the driver instead of on executors.

    Annotations
    @Experimental()
    Since

    3.2.0

  6. trait PartitionReader[T] extends Closeable

    A partition reader returned by PartitionReaderFactory#createReader(InputPartition) or PartitionReaderFactory#createColumnarReader(InputPartition). It's responsible for outputting data for an RDD partition.

    Note that currently the type T can only be org.apache.spark.sql.catalyst.InternalRow for normal data sources, or org.apache.spark.sql.vectorized.ColumnarBatch for columnar data sources (whose PartitionReaderFactory#supportColumnarReads(InputPartition) returns true).

    Annotations
    @Evolving()
    Since

    3.0.0
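
    For illustration, a minimal sketch of a row-based PartitionReader over the hypothetical ChunkPartition sketched above; the conversion to InternalRow assumes an (int, string) schema:

      import org.apache.spark.sql.catalyst.InternalRow
      import org.apache.spark.sql.connector.read.PartitionReader
      import org.apache.spark.unsafe.types.UTF8String

      // Hypothetical reader that iterates the rows held by a ChunkPartition.
      class ChunkReader(partition: ChunkPartition) extends PartitionReader[InternalRow] {
        private var index = -1

        override def next(): Boolean = { index += 1; index < partition.rows.length }

        override def get(): InternalRow = {
          val (id, name) = partition.rows(index)
          InternalRow(id, UTF8String.fromString(name))
        }

        override def close(): Unit = ()  // release file handles / connections here
      }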

  7. trait PartitionReaderFactory extends Serializable

    A factory used to create PartitionReader instances.

    If Spark fails to execute any method in the implementations of this interface or in the returned PartitionReader (by throwing an exception), the corresponding Spark task will fail and be retried until the maximum number of retries is reached.

    Annotations
    @Evolving()
    Since

    3.0.0
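
    For illustration, a minimal sketch of a serializable factory that creates the hypothetical ChunkReader above for each ChunkPartition; all names are made-up:

      import org.apache.spark.sql.catalyst.InternalRow
      import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, PartitionReaderFactory}

      // Hypothetical factory; it runs on executors, so it must be serializable.
      class ChunkReaderFactory extends PartitionReaderFactory {
        override def createReader(partition: InputPartition): PartitionReader[InternalRow] =
          partition match {
            case p: ChunkPartition => new ChunkReader(p)
            case other => throw new IllegalArgumentException(s"Unexpected partition: $other")
          }
      }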

  8. trait Scan extends AnyRef

    A logical representation of a data source scan. This interface is used to provide logical information, like what the actual read schema is.

    This logical representation is shared between batch scans, micro-batch streaming scans and continuous streaming scans. Data sources must implement the corresponding methods in this interface to match what the table promises to support. For example, #toBatch() must be implemented if the Table that creates this Scan returns TableCapability#BATCH_READ support in its Table#capabilities().

    Annotations
    @Evolving()
    Since

    3.0.0
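
    For illustration, a minimal sketch of a batch-only Scan for the hypothetical in-memory source above; ChunkScan, ChunkBatch and the (id, name) schema are made-up:

      import org.apache.spark.sql.connector.read.{Batch, Scan}
      import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

      // Hypothetical scan over (id, name) data.
      class ChunkScan(data: Array[(Int, String)]) extends Scan {
        // The schema of the data this scan actually returns (after any pruning).
        override def readSchema(): StructType =
          new StructType().add("id", IntegerType).add("name", StringType)

        // Needed because the owning Table reports TableCapability#BATCH_READ.
        override def toBatch(): Batch = new ChunkBatch(data, 1000)  // 1000-row chunks
      }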

  9. trait ScanBuilder extends AnyRef

    An interface for building the Scan. Implementations can mix in SupportsPushDownXYZ interfaces to do operator push down, and keep the operator push down result in the returned Scan. When pushing down operators, the push down order is: sample -> filter -> aggregate -> limit/top-n (sort + limit) -> offset -> column pruning.

    Annotations
    @Evolving()
    Since

    3.0.0
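
    For illustration, a minimal sketch of a ScanBuilder for the hypothetical source above; real builders typically also mix in SupportsPushDownXYZ interfaces and fold the pushed operators into the returned Scan:

      import org.apache.spark.sql.connector.read.{Scan, ScanBuilder}

      // Hypothetical builder returning the ChunkScan sketched under Scan above.
      class ChunkScanBuilder(data: Array[(Int, String)]) extends ScanBuilder {
        override def build(): Scan = new ChunkScan(data)
      }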

  10. trait Statistics extends AnyRef

    An interface to represent statistics for a data source, which is returned by SupportsReportStatistics#estimateStatistics().

    Annotations
    @Evolving()
    Since

    3.0.0
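
    For illustration, a minimal sketch of a Statistics implementation for a source that knows its size and row count up front; the names and values are made-up:

      import java.util.OptionalLong

      import org.apache.spark.sql.connector.read.Statistics

      // Hypothetical statistics; return OptionalLong.empty() for unknown values.
      case class ChunkStatistics(bytes: Long, rows: Long) extends Statistics {
        override def sizeInBytes(): OptionalLong = OptionalLong.of(bytes)
        override def numRows(): OptionalLong = OptionalLong.of(rows)
      }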

  11. trait SupportsPushDownAggregates extends ScanBuilder

    A mix-in interface for ScanBuilder. Data sources can implement this interface to push down aggregates.

    If the data source can't fully complete the grouping work, then #supportCompletePushDown(Aggregation) should return false, and Spark will group the data source output again. For queries like "SELECT min(value) AS m FROM t GROUP BY key", after pushing down the aggregate to the data source, the data source can still output data with duplicated keys, which is OK as Spark will do GROUP BY key again. The final query plan can be something like this:

      Aggregate [key#1], [min(min_value#2) AS m#3]
        +- RelationV2[key#1, min_value#2]
    
    Similarly, if there is no grouping expression, the data source can still output more than one row.

    When pushing down operators, Spark pushes down filters to the data source first, then pushes down aggregates or applies column pruning. Depending on the data source implementation, aggregates may or may not be able to be pushed down together with filters. If the pushed filters still need to be evaluated after scanning, aggregates can't be pushed down.

    Annotations
    @Evolving()
    Since

    3.2.0
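
    For illustration, a minimal sketch of a builder that accepts aggregate push-down but cannot guarantee fully grouped output, so Spark keeps its own final aggregate on top; the build() body is elided because it depends on the source:

      import org.apache.spark.sql.connector.expressions.aggregate.Aggregation
      import org.apache.spark.sql.connector.read.{Scan, SupportsPushDownAggregates}

      // Hypothetical builder that remembers the pushed aggregation.
      class AggPushDownScanBuilder extends SupportsPushDownAggregates {
        private var pushedAggregation: Option[Aggregation] = None

        // false = the source may emit duplicate groups; Spark re-aggregates them.
        override def supportCompletePushDown(aggregation: Aggregation): Boolean = false

        override def pushAggregation(aggregation: Aggregation): Boolean = {
          pushedAggregation = Some(aggregation)  // reflect it in the Scan built below
          true
        }

        // A real implementation returns a Scan whose schema matches the pushed aggregation.
        override def build(): Scan = ???
      }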

  12. trait SupportsPushDownFilters extends ScanBuilder

    A mix-in interface for ScanBuilder. Data sources can implement this interface to push down filters to the data source and reduce the size of the data to be read.

    Annotations
    @Evolving()
    Since

    3.0.0
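
    For illustration, a minimal sketch of V1 filter push-down, assuming a hypothetical source that can only evaluate IsNotNull; unsupported filters are handed back to Spark:

      import org.apache.spark.sql.connector.read.{Scan, SupportsPushDownFilters}
      import org.apache.spark.sql.sources.{Filter, IsNotNull}

      class NotNullPushDownScanBuilder extends SupportsPushDownFilters {
        private var pushed: Array[Filter] = Array.empty

        override def pushFilters(filters: Array[Filter]): Array[Filter] = {
          val (supported, unsupported) = filters.partition(_.isInstanceOf[IsNotNull])
          pushed = supported
          unsupported  // Spark evaluates these after the scan
        }

        // Filters the source will apply; Spark may still re-check them.
        override def pushedFilters(): Array[Filter] = pushed

        // A real implementation applies `pushed` when planning the Scan.
        override def build(): Scan = ???
      }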

  13. trait SupportsPushDownLimit extends ScanBuilder

    A mix-in interface for ScanBuilder. Data sources can implement this interface to push down LIMIT. We can push down LIMIT with many other operations if they follow the operator order we defined in ScanBuilder's class doc.

    Annotations
    @Evolving()
    Since

    3.3.0
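
    For illustration, a minimal sketch of LIMIT push-down for a hypothetical source that can stop reading after a given number of rows:

      import org.apache.spark.sql.connector.read.{Scan, SupportsPushDownLimit}

      class LimitPushDownScanBuilder extends SupportsPushDownLimit {
        private var pushedLimit: Option[Int] = None

        override def pushLimit(limit: Int): Boolean = {
          pushedLimit = Some(limit)
          true  // true = the source itself will apply the LIMIT
        }

        // A real implementation caps the scan output at `pushedLimit` rows.
        override def build(): Scan = ???
      }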

  14. trait SupportsPushDownOffset extends ScanBuilder

    A mix-in interface for ScanBuilder. Data sources can implement this interface to push down OFFSET. We can push down OFFSET with many other operations if they follow the operator order we defined in ScanBuilder's class doc.

    Annotations
    @Evolving()
    Since

    3.4.0

  15. trait SupportsPushDownRequiredColumns extends ScanBuilder

    A mix-in interface for ScanBuilder. Data sources can implement this interface to push down required columns to the data source and only read these columns during scan to reduce the size of the data to be read.

    Annotations
    @Evolving()
    Since

    3.0.0
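
    For illustration, a minimal sketch of column pruning; the builder records the pruned schema, and the Scan it builds should report exactly that schema from readSchema():

      import org.apache.spark.sql.connector.read.{Scan, SupportsPushDownRequiredColumns}
      import org.apache.spark.sql.types.StructType

      class PruningScanBuilder(fullSchema: StructType) extends SupportsPushDownRequiredColumns {
        private var requiredSchema: StructType = fullSchema

        override def pruneColumns(requiredSchema: StructType): Unit =
          this.requiredSchema = requiredSchema

        // A real implementation reads only the columns in `requiredSchema`.
        override def build(): Scan = ???
      }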

  16. trait SupportsPushDownTableSample extends ScanBuilder

    A mix-in interface for ScanBuilder. Data sources can implement this interface to push down SAMPLE.

    Annotations
    @Evolving()
    Since

    3.3.0

  17. trait SupportsPushDownTopN extends ScanBuilder

    A mix-in interface for ScanBuilder. Data sources can implement this interface to push down top N (queries with ORDER BY ... LIMIT n). We can push down top N with many other operations if they follow the operator order we defined in ScanBuilder's class doc.

    Annotations
    @Evolving()
    Since

    3.3.0
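
    For illustration, a minimal sketch of top-N push-down for a hypothetical source that can sort and truncate its own output:

      import org.apache.spark.sql.connector.expressions.SortOrder
      import org.apache.spark.sql.connector.read.{Scan, SupportsPushDownTopN}

      class TopNPushDownScanBuilder extends SupportsPushDownTopN {
        private var pushedTopN: Option[(Array[SortOrder], Int)] = None

        override def pushTopN(orders: Array[SortOrder], limit: Int): Boolean = {
          pushedTopN = Some((orders, limit))
          true  // true = the source returns at most `limit` rows sorted by `orders`
        }

        // A real implementation honours `pushedTopN` when planning the Scan.
        override def build(): Scan = ???
      }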

  18. trait SupportsPushDownV2Filters extends ScanBuilder

    A mix-in interface for ScanBuilder. Data sources can implement this interface to push down V2 Predicate to the data source and reduce the size of the data to be read. Note that this interface is preferred over SupportsPushDownFilters, which uses V1 org.apache.spark.sql.sources.Filter and is less efficient due to the internal -> external data conversion.

    Annotations
    @Evolving()
    Since

    3.3.0
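
    For illustration, a minimal sketch of V2 predicate push-down, assuming a hypothetical source that only understands equality and IS_NOT_NULL predicates (the predicate-name check is an assumption for the example):

      import org.apache.spark.sql.connector.expressions.filter.Predicate
      import org.apache.spark.sql.connector.read.{Scan, SupportsPushDownV2Filters}

      class V2PredicatePushDownScanBuilder extends SupportsPushDownV2Filters {
        private var pushed: Array[Predicate] = Array.empty

        override def pushPredicates(predicates: Array[Predicate]): Array[Predicate] = {
          val (supported, unsupported) =
            predicates.partition(p => p.name() == "=" || p.name() == "IS_NOT_NULL")
          pushed = supported
          unsupported  // Spark evaluates the rest after the scan
        }

        override def pushedPredicates(): Array[Predicate] = pushed

        // A real implementation applies `pushed` when planning the Scan.
        override def build(): Scan = ???
      }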

  19. trait SupportsReportOrdering extends Scan

    A mix-in interface for Scan. Data sources can implement this interface to report the order of data in each partition to Spark. Global order is part of the partitioning, see SupportsReportPartitioning.

    Spark uses the reported ordering to avoid sorts that would otherwise be required by subsequent operations.

    Annotations
    @Evolving()
    Since

    3.4.0

  20. trait SupportsReportPartitioning extends Scan

    A mix-in interface for Scan. Data sources can implement this interface to report data partitioning and try to avoid shuffles on the Spark side.

    Note that, when a Scan implementation creates exactly one InputPartition, Spark may avoid adding a shuffle even if the reader does not implement this interface.

    Annotations
    @Evolving()
    Since

    3.0.0
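
    For illustration, a minimal sketch of a Scan that reports key-grouped partitioning on a region column; the schema, column name and use of KeyGroupedPartitioning are assumptions made for this example:

      import org.apache.spark.sql.connector.expressions.{Expression, Expressions}
      import org.apache.spark.sql.connector.read.SupportsReportPartitioning
      import org.apache.spark.sql.connector.read.partitioning.{KeyGroupedPartitioning, Partitioning}
      import org.apache.spark.sql.types.{LongType, StringType, StructType}

      class RegionScan(numPartitions: Int) extends SupportsReportPartitioning {
        override def readSchema(): StructType =
          new StructType().add("region", StringType).add("amount", LongType)

        // Input partitions are clustered by `region`; Spark can skip a shuffle
        // for joins/aggregations keyed on that column.
        override def outputPartitioning(): Partitioning =
          new KeyGroupedPartitioning(Array[Expression](Expressions.identity("region")), numPartitions)
      }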

  21. trait SupportsReportStatistics extends Scan

    A mix-in interface for Scan. Data sources can implement this interface to report statistics to Spark.

    As of Spark 3.0, statistics are reported to the optimizer after operators are pushed to the data source. Implementations may return more accurate statistics based on pushed operators which may improve query performance by providing better information to the optimizer.

    Annotations
    @Evolving()
    Since

    3.0.0
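
    For illustration, a minimal sketch of a Scan that reports estimated statistics after push-down; the estimates would normally come from source metadata, which is assumed here:

      import java.util.OptionalLong

      import org.apache.spark.sql.connector.read.{Statistics, SupportsReportStatistics}
      import org.apache.spark.sql.types.{LongType, StringType, StructType}

      class StatsReportingScan(estimatedBytes: Long, estimatedRows: Long) extends SupportsReportStatistics {
        override def readSchema(): StructType =
          new StructType().add("id", LongType).add("name", StringType)

        // Called after operator push-down, so the estimates can reflect pruning/filters.
        override def estimateStatistics(): Statistics = new Statistics {
          override def sizeInBytes(): OptionalLong = OptionalLong.of(estimatedBytes)
          override def numRows(): OptionalLong = OptionalLong.of(estimatedRows)
        }
      }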

  22. trait SupportsRuntimeFiltering extends SupportsRuntimeV2Filtering

    A mix-in interface for Scan. Data sources can implement this interface if they can filter initially planned InputPartitions using predicates Spark infers at runtime.

    Note that Spark will push runtime filters only if they are beneficial.

    Annotations
    @Experimental()
    Since

    3.2.0

  23. trait SupportsRuntimeV2Filtering extends Scan

    A mix-in interface for Scan. Data sources can implement this interface if they can filter initially planned InputPartitions using predicates Spark infers at runtime. This interface is very similar to SupportsRuntimeFiltering except that it uses the data source V2 Predicate instead of the data source V1 Filter. SupportsRuntimeV2Filtering is preferred over SupportsRuntimeFiltering and only one of them should be implemented by a data source.

    Note that Spark will push runtime filters only if they are beneficial.

    Annotations
    @Experimental()
    Since

    3.4.0
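
    For illustration, a minimal sketch of runtime V2 filtering, assuming a hypothetical scan partitioned by a part_id column; the pruning logic itself is elided because it depends on the source:

      import org.apache.spark.sql.connector.expressions.{Expressions, NamedReference}
      import org.apache.spark.sql.connector.expressions.filter.Predicate
      import org.apache.spark.sql.connector.read.SupportsRuntimeV2Filtering
      import org.apache.spark.sql.types.{LongType, StructType}

      class RuntimeFilteredScan extends SupportsRuntimeV2Filtering {
        override def readSchema(): StructType = new StructType().add("part_id", LongType)

        // Columns Spark may build runtime predicates on (e.g. from a join's build side).
        override def filterAttributes(): Array[NamedReference] =
          Array(Expressions.column("part_id"))

        // Called before execution: drop already-planned input partitions that
        // cannot match the given predicates.
        override def filter(predicates: Array[Predicate]): Unit = {
          // evaluate `predicates` against each partition's part_id and prune accordingly
        }
      }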

  24. trait V1Scan extends Scan

    A trait that should be implemented by V1 DataSources that would like to leverage the DataSource V2 read code paths.

    This interface is designed to provide Spark DataSources time to migrate to DataSource V2 and will be removed in a future Spark release.

    Annotations
    @Unstable()
    Since

    3.0.0
