pyspark.sql.SparkSession.sql

SparkSession.sql(sqlQuery: str, args: Optional[Dict[str, Any]] = None, **kwargs: Any) → pyspark.sql.dataframe.DataFrame[source]

Returns a DataFrame representing the result of the given query. When kwargs is specified, this method formats the given string by using the Python standard formatter. The method binds named parameters to SQL literals from args.

New in version 2.0.0.

Changed in version 3.4.0: Supports Spark Connect and parameterized SQL.

Parameters
sqlQuery : str

SQL query string.

args : dict, optional

A dictionary of parameter names to Python objects that can be converted to SQL literal expressions. See Supported Data Types (https://spark.apache.org/docs/latest/sql-ref-datatypes.html) for the value types supported in Python. For example: dictionary keys "rank", "name", "birthdate"; dictionary values 1, "Steven", datetime.date(2023, 4, 2). A value in the dictionary can also be a Column of a literal expression, in which case it is taken as is.

New in version 3.4.0.

kwargs : dict

Other variables the user wants to set, which can be referenced in the query.

Changed in version 3.3.0: Added optional argument kwargs to specify the mapping of variables in the query. This feature is experimental and unstable.

Returns
DataFrame

Examples

Executing a SQL query.

>>> spark.sql("SELECT * FROM range(10) where id > 7").show()
+---+
| id|
+---+
|  8|
|  9|
+---+

Executing a SQL query with variables substituted using the Python standard formatter.

>>> spark.sql(
...     "SELECT * FROM range(10) WHERE id > {bound1} AND id < {bound2}", bound1=7, bound2=9
... ).show()
+---+
| id|
+---+
|  8|
+---+
>>> mydf = spark.range(10)
>>> spark.sql(
...     "SELECT {col} FROM {mydf} WHERE id IN {x}",
...     col=mydf.id, mydf=mydf, x=tuple(range(4))).show()
+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
+---+
>>> spark.sql('''
...   SELECT m1.a, m2.b
...   FROM {table1} m1 INNER JOIN {table2} m2
...   ON m1.key = m2.key
...   ORDER BY m1.a, m2.b''',
...   table1=spark.createDataFrame([(1, "a"), (2, "b")], ["a", "key"]),
...   table2=spark.createDataFrame([(3, "a"), (4, "b"), (5, "b")], ["b", "key"])).show()
+---+---+
|  a|  b|
+---+---+
|  1|  3|
|  2|  4|
|  2|  5|
+---+---+

Also, it is possible to query using a Column from a DataFrame.

>>> mydf = spark.createDataFrame([(1, 4), (2, 4), (3, 6)], ["A", "B"])
>>> spark.sql("SELECT {df.A}, {df[B]} FROM {df}", df=mydf).show()
+---+---+
|  A|  B|
+---+---+
|  1|  4|
|  2|  4|
|  3|  6|
+---+---+

And substitute named parameters, written with the : prefix, with SQL literals from args.

>>> spark.sql("SELECT * FROM {df} WHERE {df[B]} > :minB", {"minB" : 5}, df=mydf).show()
+---+---+
|  A|  B|
+---+---+
|  3|  6|
+---+---+
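
A named parameter can also be bound on its own, without any kwargs substitution. This is a minimal sketch assuming the same spark session as above; the parameter name cutoff is only illustrative.

>>> spark.sql("SELECT * FROM range(10) WHERE id > :cutoff", args={"cutoff": 7}).show()
+---+
| id|
+---+
|  8|
|  9|
+---+

And, as noted in the description of args, a value in args may itself be a Column of a literal expression, in which case it is taken as is. This sketch assumes mydf from the example above.

>>> from pyspark.sql.functions import lit
>>> spark.sql("SELECT * FROM {df} WHERE {df[B]} > :minB", {"minB": lit(5)}, df=mydf).show()
+---+---+
|  A|  B|
+---+---+
|  3|  6|
+---+---+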