pyspark.sql.DataFrame.filter

DataFrame.filter(condition)

Filters rows using the given condition.

where() is an alias for filter().

New in version 1.3.0.

Changed in version 3.4.0: Supports Spark Connect.

Parameters
condition : Column or str

A Column of types.BooleanType, or a string containing a SQL expression.

Returns
DataFrame

A new DataFrame with rows that satisfy the condition.

Examples

>>> df = spark.createDataFrame([
...     (2, "Alice", "Math"), (5, "Bob", "Physics"), (7, "Charlie", "Chemistry")],
...     schema=["age", "name", "subject"])

Filter by Column instances.

>>> df.filter(df.age > 3).show()
+---+-------+---------+
|age|   name|  subject|
+---+-------+---------+
|  5|    Bob|  Physics|
|  7|Charlie|Chemistry|
+---+-------+---------+
>>> df.where(df.age == 2).show()
+---+-----+-------+
|age| name|subject|
+---+-----+-------+
|  2|Alice|   Math|
+---+-----+-------+

Filter by SQL expression in a string.

>>> df.filter("age > 3").show()
+---+-------+---------+
|age|   name|  subject|
+---+-------+---------+
|  5|    Bob|  Physics|
|  7|Charlie|Chemistry|
+---+-------+---------+
>>> df.where("age = 2").show()
+---+-----+-------+
|age| name|subject|
+---+-----+-------+
|  2|Alice|   Math|
+---+-----+-------+

Filter by multiple conditions.

>>> df.filter((df.age > 3) & (df.subject == "Physics")).show()
+---+----+-------+
|age|name|subject|
+---+----+-------+
|  5| Bob|Physics|
+---+----+-------+
>>> df.filter((df.age == 2) | (df.subject == "Chemistry")).show()
+---+-------+---------+
|age|   name|  subject|
+---+-------+---------+
|  2|  Alice|     Math|
|  7|Charlie|Chemistry|
+---+-------+---------+

Filter by multiple conditions using SQL expression.

>>> df.filter("age > 3 AND name = 'Bob'").show()
+---+----+-------+
|age|name|subject|
+---+----+-------+
|  5| Bob|Physics|
+---+----+-------+

Filter using the Column.isin() function.

>>> df.filter(df.name.isin("Alice", "Bob")).show()
+---+-----+-------+
|age| name|subject|
+---+-----+-------+
|  2|Alice|   Math|
|  5|  Bob|Physics|
+---+-----+-------+

Filter by a list of values using the Column.isin() function.

>>> df.filter(df.subject.isin(["Math", "Physics"])).show()
+---+-----+-------+
|age| name|subject|
+---+-----+-------+
|  2|Alice|   Math|
|  5|  Bob|Physics|
+---+-----+-------+

Filter using the ~ operator to exclude certain values.

>>> df.filter(~df.name.isin(["Alice", "Charlie"])).show()
+---+----+-------+
|age|name|subject|
+---+----+-------+
|  5| Bob|Physics|
+---+----+-------+

Filter using the Column.isNotNull() function.

>>> df.filter(df.name.isNotNull()).show()
+---+-------+---------+
|age|   name|  subject|
+---+-------+---------+
|  2|  Alice|     Math|
|  5|    Bob|  Physics|
|  7|Charlie|Chemistry|
+---+-------+---------+

Filter using the Column.like() function.

>>> df.filter(df.name.like("Al%")).show()
+---+-----+-------+
|age| name|subject|
+---+-----+-------+
|  2|Alice|   Math|
+---+-----+-------+

Filter using the Column.contains() function.

>>> df.filter(df.name.contains("i")).show()
+---+-------+---------+
|age|   name|  subject|
+---+-------+---------+
|  2|  Alice|     Math|
|  7|Charlie|Chemistry|
+---+-------+---------+

Filter using the Column.between() function.

>>> df.filter(df.age.between(2, 5)).show()
+---+-----+-------+
|age| name|subject|
+---+-----+-------+
|  2|Alice|   Math|
|  5|  Bob|Physics|
+---+-----+-------+