pyspark.sql.SparkSession
The entry point to programming Spark with the Dataset and DataFrame API.
A SparkSession can be used to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read Parquet files. To create a SparkSession, use the following builder pattern:
Changed in version 3.4.0: Supports Spark Connect.
Examples
Create a Spark session.
>>> spark = (
...     SparkSession.builder
...     .master("local")
...     .appName("Word Count")
...     .config("spark.some.config.option", "some-value")
...     .getOrCreate()
... )
Create a Spark session with Spark Connect.
>>> spark = (
...     SparkSession.builder
...     .remote("sc://localhost")
...     .appName("Word Count")
...     .config("spark.some.config.option", "some-value")
...     .getOrCreate()
... )
Methods
createDataFrame(data[, schema, …])
Creates a DataFrame from an RDD, a list, a pandas.DataFrame or a numpy.ndarray.
getActiveSession()
Returns the active SparkSession for the current thread, as returned by the builder.
newSession()
Returns a new SparkSession that has separate SQLConf, registered temporary views and UDFs, but shares the underlying SparkContext and table cache.
range(start[, end, step, numPartitions])
Create a DataFrame with a single pyspark.sql.types.LongType column named id, containing elements in a range from start to end (exclusive) with step value step.
sql(sqlQuery[, args])
Returns a DataFrame representing the result of the given query.
stop()
Stop the underlying SparkContext.
table(tableName)
Returns the specified table as a DataFrame.
Attributes
catalog
Interface through which the user may create, drop, alter or query underlying databases, tables, functions, etc.
conf
Runtime configuration interface for Spark.
read
Returns a DataFrameReader that can be used to read data in as a DataFrame.
readStream
Returns a DataStreamReader that can be used to read data streams as a streaming DataFrame.
sparkContext
Returns the underlying SparkContext.
streams
Returns a StreamingQueryManager that allows managing all the StreamingQuery instances active on this context.
udf
Returns a UDFRegistration for UDF registration.
version
The version of Spark on which this application is running.