Spark Session

The entry point to programming Spark with the Dataset and DataFrame API. To create a SparkSession, use the SparkSession.builder attribute. See also SparkSession.
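A minimal sketch of the usual pattern (the application name, master URL, and config setting below are placeholders):

>>> from pyspark.sql import SparkSession
>>> spark = (
...     SparkSession.builder
...     .master("local[4]")
...     .appName("example-app")
...     .config("spark.sql.shuffle.partitions", "8")
...     .getOrCreate()
... )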

SparkSession.active()

Returns the active or default SparkSession for the current thread, returned by the builder.

SparkSession.builder.appName(name)

Sets a name for the application, which will be shown in the Spark web UI.

SparkSession.builder.config([key, value, …])

Sets a config option.

SparkSession.builder.enableHiveSupport()

Enables Hive support, including connectivity to a persistent Hive metastore, support for Hive SerDes, and Hive user-defined functions.

SparkSession.builder.getOrCreate()

Gets an existing SparkSession or, if there is no existing one, creates a new one based on the options set in this builder.

SparkSession.builder.master(master)

Sets the Spark master URL to connect to, such as “local” to run locally, “local[4]” to run locally with 4 cores, or “spark://master:7077” to run on a Spark standalone cluster.

SparkSession.builder.remote(url)

Sets the Spark remote URL to connect to, such as “sc://host:port”, to run via a Spark Connect server.
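For example (the host is a placeholder; 15002 is the default Spark Connect port):

>>> spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()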

SparkSession.catalog

Interface through which the user may create, drop, alter or query underlying databases, tables, functions, etc.
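For example, in a fresh session with nothing registered:

>>> spark.catalog.listTables()
[]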

SparkSession.conf

Runtime configuration interface for Spark.
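For example:

>>> spark.conf.set("spark.sql.shuffle.partitions", "16")
>>> spark.conf.get("spark.sql.shuffle.partitions")
'16'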

SparkSession.createDataFrame(data[, schema, …])

Creates a DataFrame from an RDD, a list, a pandas.DataFrame or a numpy.ndarray.
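A minimal sketch using a list of tuples and a DDL schema string:

>>> df = spark.createDataFrame([(1, "a"), (2, "b")], schema="id INT, value STRING")
>>> df.show()
+---+-----+
| id|value|
+---+-----+
|  1|    a|
|  2|    b|
+---+-----+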

SparkSession.getActiveSession()

Returns the active SparkSession for the current thread, returned by the builder.

SparkSession.newSession()

Returns a new SparkSession as a new session that has separate SQLConf, registered temporary views, and UDFs, but a shared SparkContext and table cache.
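A quick sketch of the isolation: a temporary view registered in one session is not visible in a sibling session.

>>> spark.range(1).createOrReplaceTempView("v")
>>> spark.newSession().catalog.tableExists("v")
False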

SparkSession.range(start[, end, step, …])

Creates a DataFrame with a single pyspark.sql.types.LongType column named id, containing elements in a range from start to end (exclusive) with step value step.
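For example:

>>> spark.range(0, 10, 2).show()
+---+
| id|
+---+
|  0|
|  2|
|  4|
|  6|
|  8|
+---+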

SparkSession.read

Returns a DataFrameReader that can be used to read data in as a DataFrame.
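A sketch reading a CSV file (the path is a placeholder):

>>> df = spark.read.csv("/tmp/data.csv", header=True, inferSchema=True)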

SparkSession.readStream

Returns a DataStreamReader that can be used to read data streams as a streaming DataFrame.
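For example, using the built-in rate source, which needs no external input:

>>> stream_df = spark.readStream.format("rate").load()
>>> stream_df.isStreaming
True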

SparkSession.sparkContext

Returns the underlying SparkContext.

SparkSession.sql(sqlQuery[, args])

Returns a DataFrame representing the result of the given query.
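A sketch using a named parameter marker (the args parameter requires Spark 3.4 or later):

>>> spark.sql("SELECT * FROM range(10) WHERE id > :min_id", args={"min_id": 7}).show()
+---+
| id|
+---+
|  8|
|  9|
+---+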

SparkSession.stop()

Stops the underlying SparkContext.

SparkSession.streams

Returns a StreamingQueryManager that allows managing all the StreamingQuery instances active on this context.
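For example, to list the names of the queries currently running:

>>> for query in spark.streams.active:
...     print(query.name)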

SparkSession.table(tableName)

Returns the specified table as a DataFrame.
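For example, after registering a temporary view:

>>> spark.range(3).createOrReplaceTempView("nums")
>>> spark.table("nums").count()
3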

SparkSession.udf

Returns a UDFRegistration for UDF registration.
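A minimal sketch registering a Python function for use in SQL (the function name is arbitrary):

>>> plus_one = spark.udf.register("plus_one", lambda x: x + 1, "int")
>>> spark.sql("SELECT plus_one(41) AS v").show()
+---+
|  v|
+---+
| 42|
+---+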

SparkSession.udtf

Returns a UDTFRegistration for UDTF registration.

SparkSession.version

The version of Spark on which this application is running.

Spark Connect Only

SparkSession.builder.create()

Creates a new SparkSession.

SparkSession.addArtifact(*path[, pyfile, …])

Adds artifact(s) to the client session.

SparkSession.addArtifacts(*path[, pyfile, …])

Adds artifact(s) to the client session.
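A sketch for a Spark Connect session; the module path is a placeholder:

>>> spark.addArtifacts("helper.py", pyfile=True)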

SparkSession.copyFromLocalToFs(local_path, …)

Copies a file from the local file system to the cloud storage file system.

SparkSession.client

Gives access to the Spark Connect client.

SparkSession.interruptAll()

Interrupts all operations of this session currently running on the connected server.

SparkSession.interruptTag(tag)

Interrupts all operations of this session with the given operation tag.

SparkSession.interruptOperation(op_id)

Interrupts an operation of this session with the given operationId.

SparkSession.addTag(tag)

Adds a tag to be assigned to all the operations started by this thread in this session.

SparkSession.removeTag(tag)

Removes a tag previously added to be assigned to all the operations started by this thread in this session.

SparkSession.getTags()

Gets the tags that are currently set to be assigned to all the operations started by this thread.

SparkSession.clearTags()

Clears the current thread’s operation tags.
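A sketch of the tag workflow on a Spark Connect session (the tag name is arbitrary; interruptTag would typically be called from another thread while tagged operations are running):

>>> spark.addTag("nightly-etl")
>>> spark.getTags()
{'nightly-etl'}
>>> interrupted = spark.interruptTag("nightly-etl")  # cancel operations tagged above
>>> spark.clearTags()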