pyspark.sql.datasource.DataSourceReader.pushFilters

DataSourceReader.pushFilters(filters)

Called with the list of filters that can be pushed down to the data source.

The list of filters should be interpreted as the AND of the elements.

Filter pushdown allows the data source to handle a subset of filters. This can improve performance by reducing the amount of data that needs to be processed by Spark.

This method is called once during query planning. By default, it returns all filters, indicating that no filters can be pushed down. Subclasses can override this method to implement filter pushdown.
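
For reference, the default behavior described above amounts to returning the input unchanged, so every filter is left for Spark to evaluate:

>>> def pushFilters(self, filters):
...     # Push nothing down; Spark evaluates every filter itself.
...     return filters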

It’s recommended to implement this method only for data sources that natively support filtering, such as databases and GraphQL APIs.

Parameters
    filters : list of Filter

Returns
    iterable of Filter
        Filters that still need to be evaluated by Spark after the data
        source scan. This includes unsupported filters and partially pushed
        filters. Every returned filter must be one of the input filters by
        reference.
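
As a hedged sketch of partial pushdown and the by-reference rule: a filter the source can only apply approximately (for example, to prune partitions) is both saved for the scan and returned, and the returned object is the one received, not a reconstructed equivalent. EqualTo is the filter class from pyspark.sql.datasource; self.filters is assumed to be initialized in __init__:

>>> from pyspark.sql.datasource import EqualTo
>>>
>>> def pushFilters(self, filters):
...     for f in filters:
...         if isinstance(f, EqualTo):
...             # Partially pushed: used in partitions() to prune partitions,
...             # but Spark must still evaluate it exactly on the scan output.
...             self.filters.append(f)
...         # Yield the original Filter object; Spark matches returned
...         # filters against the input by reference.
...         yield f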

Examples

Example filters and the resulting arguments passed to pushFilters:

Filters                         Pushdown Arguments
a = 1 and b = 2                 [EqualTo(("a",), 1), EqualTo(("b",), 2)]
a = 1 or b = 2                  []
a = 1 or (b = 2 and c = 3)      []
a = 1 and (b = 2 or c = 3)      [EqualTo(("a",), 1)]

Implement pushFilters to support EqualTo filters only:

>>> from pyspark.sql.datasource import EqualTo
>>>
>>> def pushFilters(self, filters):
...     for filter in filters:
...         if isinstance(filter, EqualTo):
...             # Save the supported filter for handling in partitions()
...             # and read(). Assumes self.filters was initialized to a
...             # list in __init__.
...             self.filters.append(filter)
...         else:
...             # Unsupported filter: leave it for Spark to evaluate.
...             yield filter
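
Once saved, the filters must actually be applied during the scan. A minimal sketch of a matching read() for a database-backed source, assuming hypothetical self._table and self._run_query helpers (not part of the Spark API):

>>> def read(self, partition):
...     # Each saved EqualTo stores the column path as a tuple of name
...     # parts and the literal value to compare against.
...     clauses = [
...         f"{'.'.join(f.attribute)} = {f.value!r}" for f in self.filters
...     ]
...     where = f" WHERE {' AND '.join(clauses)}" if clauses else ""
...     # repr() quoting is illustrative only; real code should use
...     # parameterized queries to avoid SQL injection.
...     yield from self._run_query(f"SELECT * FROM {self._table}{where}")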