pyspark.pandas.DataFrame.query
- DataFrame.query(expr, inplace=False)
- Query the columns of a DataFrame with a boolean expression.
- Note
- Internal columns whose names start with a ‘__’ prefix are accessible, but they are not meant to be accessed.
- Note
- This API delegates to Spark SQL, so the expression syntax follows Spark SQL. pandas-specific syntax such as @ is therefore not supported. If you need the pandas syntax, you can work around this with DataFrame.pandas_on_spark.apply_batch(), but be aware that query_func is executed on different nodes in a distributed manner. To use the @ syntax, for example, make sure the variable is serialized by defining it inside the closure, as below.

    >>> df = ps.DataFrame({'A': range(2000), 'B': range(2000)})
    >>> def query_func(pdf):
    ...     num = 1995
    ...     return pdf.query('A > @num')
    >>> df.pandas_on_spark.apply_batch(query_func)
             A     B
    1996  1996  1996
    1997  1997  1997
    1998  1998  1998
    1999  1999  1999

- Parameters
- expr : str
- The query string to evaluate.
- You can refer to column names that contain spaces by surrounding them in backticks.
- For example, if one of your columns is called 'a a' and you want to sum it with 'b', your query should be `a a` + b.
- inplace : bool, default False
- Whether the query should modify the data in place or return a modified copy. 
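The backtick quoting described for expr can be applied programmatically when column names come from data rather than from source code. The following is a minimal sketch, not part of the pyspark.pandas API: quote_column is a hypothetical helper, and it assumes that any name that is not a plain identifier should be wrapped in backticks. Names that themselves contain a backtick are rejected here to keep the sketch simple.

```python
import re

# A "plain" identifier: letters, digits and underscores, not starting with a digit.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def quote_column(name: str) -> str:
    """Hypothetical helper: backtick-quote a column name if it needs quoting."""
    if "`" in name:
        # Simplifying assumption: names containing a backtick are not handled.
        raise ValueError(f"column name contains a backtick: {name!r}")
    if _IDENTIFIER.fullmatch(name):
        return name
    return f"`{name}`"


# Build the expression used in the Examples section from raw column names.
expr = f"{quote_column('B')} == {quote_column('C C')}"
print(expr)  # B == `C C`
```

The resulting string can then be passed as the expr argument, for example df.query(expr).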
 
- Returns
- DataFrame
- DataFrame resulting from the provided query expression. 
 
- Examples

    >>> df = ps.DataFrame({'A': range(1, 6),
    ...                    'B': range(10, 0, -2),
    ...                    'C C': range(10, 5, -1)})
    >>> df
       A   B  C C
    0  1  10   10
    1  2   8    9
    2  3   6    8
    3  4   4    7
    4  5   2    6

    >>> df.query('A > B')
       A  B  C C
    4  5  2    6

- The previous expression is equivalent to

    >>> df[df.A > df.B]
       A  B  C C
    4  5  2    6

- For columns with spaces in their name, you can use backtick quoting.

    >>> df.query('B == `C C`')
       A   B  C C
    0  1  10   10

- The previous expression is equivalent to

    >>> df[df.B == df['C C']]
       A   B  C C
    0  1  10   10
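Because @ references are not supported, a simpler alternative to apply_batch for plain numeric values is to format the local value into the expression string before calling query, so that Spark SQL only sees a literal. The sketch below illustrates that idea under stated assumptions: numeric_filter is a hypothetical helper rather than a library function, it handles only finite int and float values, and it expects a column name that needs no quoting.

```python
import math

# Comparison operators this sketch allows in the generated expression.
_ALLOWED_OPS = {"<", "<=", ">", ">=", "==", "!="}


def numeric_filter(column: str, op: str, value) -> str:
    """Hypothetical helper: build 'column op literal' for a numeric value."""
    if op not in _ALLOWED_OPS:
        raise ValueError(f"unsupported operator: {op!r}")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise TypeError("value must be an int or a float")
    if isinstance(value, float) and not math.isfinite(value):
        raise ValueError("value must be finite")
    return f"{column} {op} {value!r}"


num = 1995
expr = numeric_filter("A", ">", num)
print(expr)  # A > 1995
```

Passing this string to df.query(expr) on the 2000-row DataFrame from the note should select the same rows as the apply_batch example, without shipping a closure to the executors. String values would additionally need quoting and escaping, which this sketch does not attempt.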