pyspark.sql.DataFrame.filter
- DataFrame.filter(condition)
Filters rows using the given condition.
where() is an alias for filter().
New in version 1.3.0.
Changed in version 3.4.0: Supports Spark Connect.
- Parameters
  - condition : Column or str
    A Column of types.BooleanType, or a string containing a SQL expression.
- Returns
DataFrame
A new DataFrame with rows that satisfy the condition.
Examples
>>> df = spark.createDataFrame([
...     (2, "Alice", "Math"), (5, "Bob", "Physics"), (7, "Charlie", "Chemistry")],
...     schema=["age", "name", "subject"])
Filter by Column instances.
>>> df.filter(df.age > 3).show()
+---+-------+---------+
|age|   name|  subject|
+---+-------+---------+
|  5|    Bob|  Physics|
|  7|Charlie|Chemistry|
+---+-------+---------+
>>> df.where(df.age == 2).show()
+---+-----+-------+
|age| name|subject|
+---+-----+-------+
|  2|Alice|   Math|
+---+-----+-------+
Filter by SQL expression in a string.
>>> df.filter("age > 3").show()
+---+-------+---------+
|age|   name|  subject|
+---+-------+---------+
|  5|    Bob|  Physics|
|  7|Charlie|Chemistry|
+---+-------+---------+
>>> df.where("age = 2").show()
+---+-----+-------+
|age| name|subject|
+---+-----+-------+
|  2|Alice|   Math|
+---+-----+-------+
Filter by multiple conditions.
>>> df.filter((df.age > 3) & (df.subject == "Physics")).show()
+---+----+-------+
|age|name|subject|
+---+----+-------+
|  5| Bob|Physics|
+---+----+-------+
>>> df.filter((df.age == 2) | (df.subject == "Chemistry")).show()
+---+-------+---------+
|age|   name|  subject|
+---+-------+---------+
|  2|  Alice|     Math|
|  7|Charlie|Chemistry|
+---+-------+---------+
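The parentheses around each comparison in the multiple-condition examples are significant. In Python, `&` and `|` bind more tightly than comparison operators such as `>` and `==`, so omitting the parentheses changes how the expression is grouped. The following is a minimal sketch that uses only the standard-library `ast` module to show the grouping. `top_level_shape` is a hypothetical helper introduced for illustration, and no Spark session is involved.

```python
import ast


def top_level_shape(expr: str) -> str:
    """Describe the outermost operation Python sees in `expr` (hypothetical helper)."""
    node = ast.parse(expr, mode="eval").body
    if isinstance(node, ast.BinOp) and isinstance(node.op, ast.BitAnd):
        return "bitwise-and of two operands"
    if isinstance(node, ast.Compare):
        return f"chained comparison with {len(node.ops)} operators"
    return type(node).__name__


# With parentheses: two comparisons combined by `&`, which is what filter() needs.
with_parens = top_level_shape('(df.age > 3) & (df.subject == "Physics")')

# Without parentheses: Python groups this as
#   df.age > (3 & df.subject) == "Physics"
# which is a chained comparison, not a conjunction of two conditions.
without_parens = top_level_shape('df.age > 3 & df.subject == "Physics"')

print(with_parens)
print(without_parens)
```

The expressions are only parsed, never evaluated, so the names `df.age` and `df.subject` do not need to exist for the sketch to run.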
Filter by multiple conditions using SQL expression.
>>> df.filter("age > 3 AND name = 'Bob'").show()
+---+----+-------+
|age|name|subject|
+---+----+-------+
|  5| Bob|Physics|
+---+----+-------+
Filter using the Column.isin() function.
>>> df.filter(df.name.isin("Alice", "Bob")).show()
+---+-----+-------+
|age| name|subject|
+---+-----+-------+
|  2|Alice|   Math|
|  5|  Bob|Physics|
+---+-----+-------+
Filter by a list of values using the Column.isin() function.
>>> df.filter(df.subject.isin(["Math", "Physics"])).show()
+---+-----+-------+
|age| name|subject|
+---+-----+-------+
|  2|Alice|   Math|
|  5|  Bob|Physics|
+---+-----+-------+
Filter using the ~ operator to exclude certain values.
>>> df.filter(~df.name.isin(["Alice", "Charlie"])).show()
+---+----+-------+
|age|name|subject|
+---+----+-------+
|  5| Bob|Physics|
+---+----+-------+
Filter using the Column.isNotNull() function.
>>> df.filter(df.name.isNotNull()).show()
+---+-------+---------+
|age|   name|  subject|
+---+-------+---------+
|  2|  Alice|     Math|
|  5|    Bob|  Physics|
|  7|Charlie|Chemistry|
+---+-------+---------+
Filter using the Column.like() function.
>>> df.filter(df.name.like("Al%")).show()
+---+-----+-------+
|age| name|subject|
+---+-----+-------+
|  2|Alice|   Math|
+---+-----+-------+
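As a rough illustration of what a LIKE pattern such as `"Al%"` means, the sketch below translates a pattern into a regular expression in plain Python. It assumes the standard SQL wildcard rules (`%` matches any run of characters, including none, and `_` matches exactly one character) and ignores escape characters. `sql_like` is a hypothetical helper for illustration, not part of PySpark and not how Spark evaluates the predicate.

```python
import re


def sql_like(value: str, pattern: str) -> bool:
    """Simplified LIKE matcher (hypothetical helper; escape characters unsupported)."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")  # any run of characters, possibly empty
        elif ch == "_":
            parts.append(".")   # exactly one character
        else:
            parts.append(re.escape(ch))  # literal character, matched case-sensitively
    # fullmatch: the pattern has to cover the whole value, not just a prefix
    return re.fullmatch("".join(parts), value, flags=re.DOTALL) is not None


names = ["Alice", "Bob", "Charlie"]
starts_with_al = [n for n in names if sql_like(n, "Al%")]
contains_i = [n for n in names if sql_like(n, "%i%")]
four_then_e = [n for n in names if sql_like(n, "____e")]
lowercase_prefix = [n for n in names if sql_like(n, "al%")]

print(starts_with_al, contains_i, four_then_e, lowercase_prefix)
```

Under these assumptions, `"Al%"` selects only Alice, as in the example output, and `"%i%"` selects the same names as a substring test for "i".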
Filter using the Column.contains() function.
>>> df.filter(df.name.contains("i")).show()
+---+-------+---------+
|age|   name|  subject|
+---+-------+---------+
|  2|  Alice|     Math|
|  7|Charlie|Chemistry|
+---+-------+---------+
Filter using the Column.between() function.
>>> df.filter(df.age.between(2, 5)).show()
+---+-----+-------+
|age| name|subject|
+---+-----+-------+
|  2|Alice|   Math|
|  5|  Bob|Physics|
+---+-----+-------+
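The output of the between() example includes both age 2 and age 5, which shows that the bounds are inclusive. The plain-Python sketch below mirrors that behaviour on the same three rows. The `between` function here is a hypothetical stand-in for illustration and does not call Spark.

```python
rows = [(2, "Alice", "Math"), (5, "Bob", "Physics"), (7, "Charlie", "Chemistry")]


def between(value, lower, upper):
    """Inclusive range test: True when lower <= value <= upper (hypothetical stand-in)."""
    return lower <= value <= upper


# Keep the names whose age lies in the closed interval [2, 5].
kept = [name for age, name, _ in rows if between(age, 2, 5)]

# Shrinking the interval so that it no longer reaches 5 drops Bob.
kept_narrow = [name for age, name, _ in rows if between(age, 2, 4)]

print(kept, kept_narrow)
```

On a DataFrame, the same inclusive range can be written as `(df.age >= 2) & (df.age <= 5)`.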