package sql
Allows the execution of relational queries, including those expressed in SQL using Spark.
Package Members
- package api
  Contains API classes that are specific to a single language (i.e. Java).
-  package catalog
-  package catalyst
-  package columnar
-  package connector
-  package expressions
-  package jdbc
-  package pipelines
-  package protobuf
- package sources
  A set of APIs for adding data sources to Spark SQL.
-  package streaming
- package types
  Contains a type system for attributes produced by relations, including complex types like structs, arrays and maps.
-  package util
-  package vectorized
Type Members
- class AnalysisException extends Exception with SparkThrowable with Serializable with WithOrigin
  Thrown when a query fails to analyze, usually because the query itself is invalid.
- Annotations
- @Stable()
- Since
- 1.3.0 
 
- class Column extends Logging with TableValuedFunctionArgument
  A column that will be computed based on the data in a DataFrame.
  A new column can be constructed based on the input columns present in a DataFrame:
      df("columnName")             // On a specific `df` DataFrame.
      col("columnName")            // A generic column not yet associated with a DataFrame.
      col("columnName.field")      // Extracting a struct field.
      col("`a.column.with.dots`")  // Escape `.` in column names.
      $"columnName"                // Scala shorthand for a named column.
  Column objects can be composed to form complex expressions:
      $"a" + 1
      $"a" === $"b"
- Annotations
- @Stable()
- Since
- 1.3.0 
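The composition rules for Column can be tried on a small in-memory DataFrame. The following is a minimal sketch, assuming Spark SQL is on the classpath and a local master is acceptable; the data and the column names `name`, `a`, and `b` are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("column-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data: (name, a, b).
val df = Seq(("x", 1, 1), ("y", 2, 5)).toDF("name", "a", "b")

// Columns are expressions; nothing is computed until an action runs.
val plusOne = $"a" + 1              // arithmetic on a named column
val same    = col("a") === col("b") // equality test, yields a boolean column

val result = df.select($"name", plusOne.as("a_plus_one"), same.as("a_eq_b"))
result.show()
```

Both `$"a"` and `col("a")` refer to the same column here; the `$` form needs `import spark.implicits._`.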
 
- class ColumnName extends Column
  A convenience class used for constructing schemas.
- Annotations
- @Stable()
- Since
- 1.3.0 
 
- trait CreateTableWriter[T] extends WriteConfigMethods[CreateTableWriter[T]]
  Trait to restrict calls to create and replace operations.
- Since
- 3.0.0 
 
-  type DataFrame = Dataset[Row]
- abstract class DataFrameNaFunctions extends AnyRef
  Functionality for working with missing data in DataFrames.
- Annotations
- @Stable()
- Since
- 1.3.1 
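As an illustration of working with missing data, the sketch below fills or drops null values through `df.na`. It assumes Spark SQL is on the classpath and uses a hypothetical two-column DataFrame.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("na-sketch").getOrCreate()
import spark.implicits._

// Hypothetical data: bob's age is missing (None becomes a SQL null).
val df = Seq[(String, Option[Int])](("alice", Some(30)), ("bob", None)).toDF("name", "age")

// Replace nulls in the "age" column with 0.
val filled = df.na.fill(0L, Seq("age"))

// Drop every row that contains at least one null.
val dropped = df.na.drop()

filled.show()
dropped.show()
```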
 
- abstract class DataFrameReader extends AnyRef
  Interface used to load a Dataset from external storage systems (e.g. file systems, key-value stores, etc). Use SparkSession.read to access this.
- Annotations
- @Stable()
- Since
- 1.4.0 
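A minimal sketch of reading through `SparkSession.read`, assuming Spark SQL is on the classpath. The CSV file is written to a temporary location first so the example is self-contained; its contents and the column names are hypothetical.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("reader-sketch").getOrCreate()

// Write a small hypothetical CSV file to read back.
val path = Files.createTempFile("people", ".csv")
Files.write(path, "name,age\nalice,30\nbob,25\n".getBytes(StandardCharsets.UTF_8))

// Reader options are set before the format-specific load method is called.
val people = spark.read
  .option("header", "true")      // first line holds the column names
  .option("inferSchema", "true") // sample the data to pick column types
  .csv(path.toString)

people.printSchema()
people.show()
```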
 
- abstract class DataFrameStatFunctions extends AnyRef
  Statistic functions for DataFrames.
- Annotations
- @Stable()
- Since
- 1.4.0 
 
- abstract class DataFrameWriter[T] extends AnyRef
  Interface used to write an org.apache.spark.sql.Dataset to external storage systems (e.g. file systems, key-value stores, etc). Use Dataset.write to access this.
- Annotations
- @Stable()
- Since
- 1.4.0 
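The sketch below writes a small DataFrame through `Dataset.write` and reads it back. It assumes Spark SQL is on the classpath; the output directory is a temporary, hypothetical location, and the two save modes are used only to show how repeated writes behave.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().master("local[*]").appName("writer-sketch").getOrCreate()
import spark.implicits._

// Hypothetical output directory under a fresh temporary folder.
val out = Files.createTempDirectory("writer-sketch").resolve("people").toString

val df = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")

// Overwrite replaces whatever exists at the target path.
df.write.mode(SaveMode.Overwrite).parquet(out)

// Append adds the rows to the existing data instead of replacing it.
df.write.mode(SaveMode.Append).parquet(out)

val readBack = spark.read.parquet(out)
readBack.show()
```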
 
- abstract class DataFrameWriterV2[T] extends CreateTableWriter[T]
  Interface used to write an org.apache.spark.sql.Dataset to external storage using the v2 API.
- Annotations
- @Experimental()
- Since
- 3.0.0 
 
- abstract class Dataset[T] extends Serializable
  A Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each Dataset also has an untyped view called a DataFrame, which is a Dataset of org.apache.spark.sql.Row.
  Operations available on Datasets are divided into transformations and actions. Transformations are the ones that produce new Datasets, and actions are the ones that trigger computation and return results. Example transformations include map, filter, select, and aggregate (groupBy). Example actions include count, show, and writing data out to file systems.
  Datasets are "lazy", i.e. computations are only triggered when an action is invoked. Internally, a Dataset represents a logical plan that describes the computation required to produce the data. When an action is invoked, Spark's query optimizer optimizes the logical plan and generates a physical plan for efficient execution in a parallel and distributed manner. To explore the logical plan as well as the optimized physical plan, use the explain function.
  To efficiently support domain-specific objects, an org.apache.spark.sql.Encoder is required. The encoder maps the domain-specific type T to Spark's internal type system. For example, given a class Person with two fields, name (string) and age (int), an encoder is used to tell Spark to generate code at runtime to serialize the Person object into a binary structure. This binary structure often has a much lower memory footprint and is optimized for efficiency in data processing (e.g. in a columnar format). To understand the internal binary representation for data, use the schema function.
  There are typically two ways to create a Dataset. The most common way is by pointing Spark to some files on storage systems, using the read function available on a SparkSession.
      val people = spark.read.parquet("...").as[Person]  // Scala
      Dataset<Person> people = spark.read().parquet("...").as(Encoders.bean(Person.class)); // Java
  Datasets can also be created through transformations available on existing Datasets. For example, the following creates a new Dataset by applying a map on the existing one:
      val names = people.map(_.name)  // in Scala; names is a Dataset[String]
      Dataset<String> names = people.map(
        (MapFunction<Person, String>) p -> p.name, Encoders.STRING()); // Java
  Dataset operations can also be untyped, through various domain-specific-language (DSL) functions defined in: Dataset (this class), org.apache.spark.sql.Column, and org.apache.spark.sql.functions. These operations are very similar to the operations available in the data frame abstraction in R or Python.
  To select a column from the Dataset, use the apply method in Scala and col in Java.
      val ageCol = people("age")  // in Scala
      Column ageCol = people.col("age"); // in Java
  Note that the org.apache.spark.sql.Column type can also be manipulated through its various functions.
      // The following creates a new column that increases everybody's age by 10.
      people("age") + 10  // in Scala
      people.col("age").plus(10);  // in Java
  A more concrete example in Scala:
      // To create Dataset[Row] using SparkSession
      val people = spark.read.parquet("...")
      val department = spark.read.parquet("...")
      people.filter("age > 30")
        .join(department, people("deptId") === department("id"))
        .groupBy(department("name"), people("gender"))
        .agg(avg(people("salary")), max(people("age")))
  and in Java:
      // To create Dataset<Row> using SparkSession
      Dataset<Row> people = spark.read().parquet("...");
      Dataset<Row> department = spark.read().parquet("...");
      people.filter(people.col("age").gt(30))
        .join(department, people.col("deptId").equalTo(department.col("id")))
        .groupBy(department.col("name"), people.col("gender"))
        .agg(avg(people.col("salary")), max(people.col("age")));
- Annotations
- @Stable()
- Since
- 1.6.0 
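The split between lazy transformations and eager actions described for Dataset can be observed directly. This is a minimal sketch, assuming Spark SQL is on the classpath; the input numbers are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("dataset-sketch").getOrCreate()
import spark.implicits._

val numbers = Seq(1, 2, 3, 4, 5).toDS()

// Transformations only extend the logical plan; no job runs here.
val evensTimesTen = numbers.filter(n => n % 2 == 0).map(n => n * 10)

// Print the logical and physical plans without executing the query.
evensTimesTen.explain(true)

// collect() is an action, so this line triggers the computation.
val collected = evensTimesTen.collect().sorted
println(collected.mkString(", "))
```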
 
- abstract class DatasetHolder[T] extends AnyRef
  A container for an org.apache.spark.sql.Dataset, used for implicit conversions in Scala.
  To use this, import implicit conversions in SQL:
      val spark: SparkSession = ...
      import spark.implicits._
- Annotations
- @Stable()
- Since
- 1.6.0 
 
- trait Encoder[T] extends Serializable
  Used to convert a JVM object of type T to and from the internal Spark SQL representation.
  Scala
  Encoders are generally created automatically through implicits from a SparkSession, or can be explicitly created by calling static methods on Encoders.
      import spark.implicits._
      val ds = Seq(1, 2, 3).toDS() // implicitly provided (spark.implicits.newIntEncoder)
  Java
  Encoders are specified by calling static methods on Encoders.
      List<String> data = Arrays.asList("abc", "abc", "xyz");
      Dataset<String> ds = context.createDataset(data, Encoders.STRING());
  Encoders can be composed into tuples:
      Encoder<Tuple2<Integer, String>> encoder2 = Encoders.tuple(Encoders.INT(), Encoders.STRING());
      List<Tuple2<Integer, String>> data2 = Arrays.asList(new scala.Tuple2<>(1, "a"));
      Dataset<Tuple2<Integer, String>> ds2 = context.createDataset(data2, encoder2);
  Or constructed from Java Beans:
      Encoders.bean(MyClass.class);
  Implementation
  - Encoders should be thread-safe.
 - Annotations
- @implicitNotFound()
- Since
- 1.6.0 
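The Java tuple example has a Scala counterpart that passes an encoder explicitly instead of relying on implicits. The following is a sketch under the assumption that Spark SQL is on the classpath; the data is hypothetical.

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

val spark = SparkSession.builder().master("local[*]").appName("encoder-sketch").getOrCreate()

// Build a tuple encoder from two primitive encoders, without spark.implicits.
val pairEncoder: Encoder[(Int, String)] =
  Encoders.tuple(Encoders.scalaInt, Encoders.STRING)

// Supply the encoder explicitly in the second (implicit) parameter list.
val ds = spark.createDataset(Seq((1, "a"), (2, "b")))(pairEncoder)

// The schema shows how the tuple maps onto Spark's internal types.
ds.printSchema()
val pairs = ds.collect().toSeq
```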
 
- trait EncoderImplicits extends LowPrioritySQLImplicits with Serializable
  EncoderImplicits used to implicitly generate SQL Encoders. Note that these functions don't rely on or expose SparkSession.
- class ExperimentalMethods extends AnyRef
  :: Experimental :: Holder for experimental methods for the bravest. We make NO guarantee about the stability regarding binary compatibility and source compatibility of methods here.
      spark.experimental.extraStrategies += ...
- Annotations
- @Experimental() @Unstable()
- Since
- 1.3.0 
 
- trait ExtendedExplainGenerator extends AnyRef
  A trait for a session extension to implement that provides additional explain plan information.
- Annotations
- @DeveloperApi() @Since("4.0.0")
 
- abstract class ForeachWriter[T] extends Serializable
  The abstract class for writing custom logic to process data generated by a query. This is often used to write the output of a streaming query to arbitrary storage systems. Any implementation of this base class will be used by Spark in the following way.
  - A single instance of this class is responsible for all the data generated by a single task in a query. In other words, one instance is responsible for processing one partition of the data generated in a distributed manner.
  - Any implementation of this class must be serializable because each task will get a fresh serialized-deserialized copy of the provided object. Hence, it is strongly recommended that any initialization for writing data (e.g. opening a connection or starting a transaction) is done after the open(...) method has been called, which signifies that the task is ready to generate data.
  - The lifecycle of the methods is as follows.
      For each partition with `partitionId`:
        For each batch/epoch of streaming data (if it is a streaming query) with `epochId`:
          Method `open(partitionId, epochId)` is called.
          If `open` returns true:
            For each row in the partition and batch/epoch, method `process(row)` is called.
          Method `close(errorOrNull)` is called with error (if any) seen while processing rows.
  Important points to note:
  - Spark doesn't guarantee the same output for (partitionId, epochId), so deduplication cannot be achieved with (partitionId, epochId). For example, the source may provide a different number of partitions for some reason, or a Spark optimization may change the number of partitions. Refer to SPARK-28650 for more details. If you need deduplication on output, try foreachBatch instead.
  - The close() method will be called if the open() method returns successfully (irrespective of the return value), except if the JVM crashes in the middle.
  Scala example:
      datasetOfString.writeStream.foreach(new ForeachWriter[String] {
        def open(partitionId: Long, version: Long): Boolean = {
          // open connection
          true // returning true lets process() be called for this partition
        }
        def process(record: String) = {
          // write string to connection
        }
        def close(errorOrNull: Throwable): Unit = {
          // close the connection
        }
      })
  Java example:
      datasetOfString.writeStream().foreach(new ForeachWriter<String>() {
        @Override
        public boolean open(long partitionId, long version) {
          // open connection
          return true; // returning true lets process() be called for this partition
        }
        @Override
        public void process(String value) {
          // write string to connection
        }
        @Override
        public void close(Throwable errorOrNull) {
          // close the connection
        }
      });
- Since
- 2.0.0 
 
- abstract class KeyValueGroupedDataset[K, V] extends Serializable
  A Dataset that has been logically grouped by a user-specified grouping key. Users should not construct a KeyValueGroupedDataset directly, but should instead call groupByKey on an existing Dataset.
- Since
- 2.0.0 
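A sketch of obtaining a KeyValueGroupedDataset through `groupByKey`, as the description recommends. It assumes Spark SQL is on the classpath; the words and the choice of the first letter as grouping key are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("kvgd-sketch").getOrCreate()
import spark.implicits._

val words = Seq("apple", "avocado", "banana").toDS()

// Group by a key computed from each element; here, the first letter.
val byInitial = words.groupByKey(w => w.substring(0, 1))

// count() yields a Dataset of (key, number of elements in the group).
val counts = byInitial.count().collect().toMap

// mapGroups gives typed access to the key and an iterator over its values.
val longest = byInitial
  .mapGroups((initial, ws) => (initial, ws.map(_.length).max))
  .collect()
  .toMap
```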
 
- trait LowPrioritySQLImplicits extends AnyRef
  Lower priority implicit methods for converting Scala objects into org.apache.spark.sql.Datasets. Conflicting implicits are placed here to disambiguate resolution.
  Reasons for including specific implicits: newProductEncoder - to disambiguate for Lists, which are both Seq and Product.
- abstract class MergeIntoWriter[T] extends AnyRef
  MergeIntoWriter provides methods to define and execute merge actions based on specified conditions.
  Please note that schema evolution is disabled by default.
- T
- the type of data in the Dataset. 
 - Annotations
- @Experimental()
- Since
- 4.0.0 
 
- class Observation extends AnyRef
  Helper class to simplify usage of Dataset.observe(String, Column, Column*):
      // Observe row count (rows) and highest id (maxid) in the Dataset while writing it
      val observation = Observation("my metrics")
      val observed_ds = ds.observe(observation, count(lit(1)).as("rows"), max($"id").as("maxid"))
      observed_ds.write.parquet("ds.parquet")
      val metrics = observation.get
  This collects the metrics while the first action is executed on the observed dataset. Subsequent actions do not modify the metrics returned by get. Retrieval of the metric via get blocks until the first action has finished and metrics become available.
  This class does not support streaming datasets.
- Since
- 3.3.0 
 
- abstract class RelationalGroupedDataset extends AnyRef
  A set of methods for aggregations on a DataFrame, created by groupBy, cube or rollup (and also pivot).
  The main method is the agg function, which has multiple variants. This class also contains some first-order statistics such as mean and sum for convenience.
- Annotations
- @Stable()
- Since
- 2.0.0 
- Note
- This class was named GroupedData in Spark 1.x.
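To make the role of `agg` concrete, the sketch below groups a small DataFrame and computes two aggregates per group. It assumes Spark SQL is on the classpath; the department data is hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, max}

val spark = SparkSession.builder().master("local[*]").appName("grouped-sketch").getOrCreate()
import spark.implicits._

// Hypothetical data: (dept, name, salary, age).
val people = Seq(
  ("eng", "alice", 100.0, 30),
  ("eng", "bob", 80.0, 41),
  ("ops", "carol", 70.0, 35)
).toDF("dept", "name", "salary", "age")

// groupBy returns a RelationalGroupedDataset; agg turns it back into a DataFrame.
val stats = people
  .groupBy($"dept")
  .agg(avg($"salary").as("avg_salary"), max($"age").as("max_age"))
  .orderBy($"dept")

stats.show()
```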
 
- trait Row extends Serializable
  Represents one row of output from a relational operator. Allows both generic access by ordinal, which will incur boxing overhead for primitives, as well as native primitive access.
  It is invalid to use the native primitive interface to retrieve a value that is null; instead, a user must check isNullAt before attempting to retrieve a value that might be null.
  To create a new Row, use RowFactory.create() in Java or Row.apply() in Scala.
  A Row object can be constructed by providing field values. Example:
      import org.apache.spark.sql._
      // Create a Row from values.
      Row(value1, value2, value3, ...)
      // Create a Row from a Seq of values.
      Row.fromSeq(Seq(value1, value2, ...))
  A value of a row can be accessed through both generic access by ordinal, which will incur boxing overhead for primitives, as well as native primitive access. An example of generic access by ordinal:
      import org.apache.spark.sql._
      val row = Row(1, true, "a string", null)
      // row: Row = [1,true,a string,null]
      val firstValue = row(0)
      // firstValue: Any = 1
      val fourthValue = row(3)
      // fourthValue: Any = null
  For native primitive access, it is invalid to use the native primitive interface to retrieve a value that is null; instead, a user must check isNullAt before attempting to retrieve a value that might be null. An example of native primitive access:
      // using the row from the previous example.
      val firstValue = row.getInt(0)
      // firstValue: Int = 1
      val isNull = row.isNullAt(3)
      // isNull: Boolean = true
  In Scala, fields in a Row object can be extracted in a pattern match. Example:
      import org.apache.spark.sql._
      val pairs = sql("SELECT key, value FROM src").rdd.map {
        case Row(key: Int, value: String) => key -> value
      }
- Annotations
- @Stable()
- Since
- 1.3.0 
 
- class RowFactory extends AnyRef
  A factory class used to construct Row objects.
- Annotations
- @Stable()
- Since
- 1.3.0 
 
- abstract class RuntimeConfig extends AnyRef
  Runtime configuration interface for Spark. To access this, use SparkSession.conf.
  Options set here are automatically propagated to the Hadoop configuration during I/O.
- Annotations
- @Stable()
- Since
- 2.0.0 
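A short sketch of reading and writing settings through `SparkSession.conf`, assuming Spark SQL is on the classpath. The key `spark.some.undefined.key` is a hypothetical, unset option used to show the fallback variants.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("conf-sketch").getOrCreate()

// Set a runtime SQL option and read it back as a string.
spark.conf.set("spark.sql.shuffle.partitions", "8")
val partitions = spark.conf.get("spark.sql.shuffle.partitions")

// For keys that may be unset, pass a default or use getOption.
val withDefault = spark.conf.get("spark.some.undefined.key", "fallback")
val maybeValue  = spark.conf.getOption("spark.some.undefined.key")

println(s"partitions=$partitions withDefault=$withDefault maybeValue=$maybeValue")
```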
 
- abstract class SQLContext extends Logging with Serializable
  The entry point for working with structured data (rows and columns) in Spark 1.x.
  As of Spark 2.0, this is replaced by SparkSession. However, we are keeping the class here for backward compatibility.
- Annotations
- @Stable()
- Since
- 1.0.0 
 
- abstract class SQLImplicits extends EncoderImplicits with Serializable
  A collection of implicit methods for converting common Scala objects into org.apache.spark.sql.Datasets.
- Since
- 1.6.0 
 
- sealed final class SaveMode extends Enum[SaveMode]
  SaveMode is used to specify the expected behavior of saving a DataFrame to a data source.
- Annotations
- @Stable()
- Since
- 1.3.0 
 
- abstract class SparkSession extends Serializable with Closeable
  The entry point to programming Spark with the Dataset and DataFrame API.
  In environments where a session has been created up front (e.g. REPL, notebooks), use the builder to get the existing session:
      SparkSession.builder().getOrCreate()
  The builder can also be used to create a new session:
      SparkSession.builder
        .master("local")
        .appName("Word Count")
        .config("spark.some.config.option", "some-value")
        .getOrCreate()
- class SparkSessionExtensions extends AnyRef
  :: Experimental :: Holder for injection points to the SparkSession. We make NO guarantee about the stability regarding binary compatibility and source compatibility of methods here.
  This currently provides the following extension points:
  - Analyzer Rules.
  - Check Analysis Rules.
  - Cache Plan Normalization Rules.
  - Optimizer Rules.
  - Pre CBO Rules.
  - Planning Strategies.
  - Customized Parser.
  - (External) Catalog listeners.
  - Columnar Rules.
  - Adaptive Query Post Planner Strategy Rules.
  - Adaptive Query Stage Preparation Rules.
  - Adaptive Query Execution Runtime Optimizer Rules.
  - Adaptive Query Stage Optimizer Rules.
  The extensions can be used by calling withExtensions on the SparkSession.Builder, for example:
      SparkSession.builder()
        .master("...")
        .config("...", true)
        .withExtensions { extensions =>
          extensions.injectResolutionRule { session => ... }
          extensions.injectParser { (session, parser) => ... }
        }
        .getOrCreate()
  The extensions can also be used by setting the Spark SQL configuration property spark.sql.extensions. Multiple extensions can be set using a comma-separated list. For example:
      SparkSession.builder()
        .master("...")
        .config("spark.sql.extensions", "org.example.MyExtensions,org.example.YourExtensions")
        .getOrCreate()

      class MyExtensions extends Function1[SparkSessionExtensions, Unit] {
        override def apply(extensions: SparkSessionExtensions): Unit = {
          extensions.injectResolutionRule { session => ... }
          extensions.injectParser { (session, parser) => ... }
        }
      }

      class YourExtensions extends SparkSessionExtensionsProvider {
        override def apply(extensions: SparkSessionExtensions): Unit = {
          extensions.injectResolutionRule { session => ... }
          extensions.injectFunction(...)
        }
      }
  Note that none of the injected builders should assume that the SparkSession is fully initialized and should not touch the session's internals (e.g. the SessionState).
- Annotations
- @DeveloperApi() @Experimental() @Unstable()
 
- trait SparkSessionExtensionsProvider extends (SparkSessionExtensions) => Unit
  Base trait for implementations used by SparkSessionExtensions.
  For example, now we have an external function named Age to register as an extension for SparkSession:
      package org.apache.spark.examples.extensions

      import org.apache.spark.sql.catalyst.expressions.{CurrentDate, Expression, RuntimeReplaceable, SubtractDates}

      case class Age(birthday: Expression, child: Expression) extends RuntimeReplaceable {
        def this(birthday: Expression) = this(birthday, SubtractDates(CurrentDate(), birthday))
        override def exprsReplaced: Seq[Expression] = Seq(birthday)
        override protected def withNewChildInternal(newChild: Expression): Expression = copy(newChild)
      }
  We need to create our extension, which inherits SparkSessionExtensionsProvider. Example:
      package org.apache.spark.examples.extensions

      import org.apache.spark.sql.{SparkSessionExtensions, SparkSessionExtensionsProvider}
      import org.apache.spark.sql.catalyst.FunctionIdentifier
      import org.apache.spark.sql.catalyst.expressions.{Expression, ExpressionInfo}

      class MyExtensions extends SparkSessionExtensionsProvider {
        override def apply(v1: SparkSessionExtensions): Unit = {
          v1.injectFunction(
            (new FunctionIdentifier("age"),
              new ExpressionInfo(classOf[Age].getName, "age"),
              (children: Seq[Expression]) => new Age(children.head)))
        }
      }
  Then, we can inject MyExtensions in three ways:
- withExtensions of SparkSession.Builder
- Config - spark.sql.extensions
- java.util.ServiceLoader - Add to src/main/resources/META-INF/services/org.apache.spark.sql.SparkSessionExtensionsProvider
 - Annotations
- @DeveloperApi() @Since("3.2.0")
- Since
- 3.2.0 
 
- abstract class TableValuedFunction extends AnyRef
  Interface for invoking table-valued functions in Spark SQL.
- Since
- 4.0.0 
 
- class TypedColumn[-T, U] extends Column
  A Column where an Encoder has been given for the expected input and return type. To create a TypedColumn, use the as function on a Column.
- T
- The input type expected for this expression. Can be Any if the expression is type checked by the analyzer instead of the compiler (i.e. expr("sum(...)")).
- U
- The output type of this column. 
 - Annotations
- @Stable()
- Since
- 1.6.0 
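The effect of attaching an Encoder to a Column shows up in the result type of `select`. This is a minimal sketch, assuming Spark SQL is on the classpath; the tuple data is hypothetical, and `_2` is the default column name Spark assigns to the second tuple field.

```scala
import org.apache.spark.sql.{Dataset, SparkSession, TypedColumn}

val spark = SparkSession.builder().master("local[*]").appName("typedcolumn-sketch").getOrCreate()
import spark.implicits._

val ds: Dataset[(String, Int)] = Seq(("alice", 30), ("bob", 25)).toDS()

// as[Int] attaches an Int encoder, turning the Column into a TypedColumn.
val ageCol: TypedColumn[Any, Int] = $"_2".as[Int]

// Selecting a TypedColumn keeps static typing: the result is a Dataset[Int],
// whereas selecting a plain Column would give an untyped DataFrame.
val ages: Dataset[Int] = ds.select(ageCol)
val sortedAges = ages.collect().sorted
```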
 
- abstract class UDFRegistration extends AnyRef
  Functions for registering user-defined functions. Use SparkSession.udf to access this:
      spark.udf
- Since
- 4.0.0 
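A sketch of registering a Scala function through `spark.udf` and calling it from both SQL and the DataFrame API. It assumes Spark SQL is on the classpath; the function name `plus_one` is hypothetical.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("udf-sketch").getOrCreate()
import spark.implicits._

// register makes the function callable by name in SQL and also returns
// a UserDefinedFunction that can be applied to columns.
val plusOne = spark.udf.register("plus_one", (x: Int) => x + 1)

// Call by name from SQL.
val viaSql = spark.sql("SELECT plus_one(41) AS answer").head().getInt(0)

// Call through the returned handle in the DataFrame API.
val viaColumn = Seq(1, 2).toDF("n")
  .select(plusOne($"n").as("n_plus_one"))
  .collect()
  .map(_.getInt(0))
  .toSeq
```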
 
- case class WhenMatched[T] extends Product with Serializable
  A class for defining actions to be taken when matching rows in a DataFrame during a merge operation.
- T
- The type of data in the MergeIntoWriter. 
 
- case class WhenNotMatched[T] extends Product with Serializable
  A class for defining actions to be taken when no matching rows are found in a DataFrame during a merge operation.
- T
- The type of data in the MergeIntoWriter. 
 
- case class WhenNotMatchedBySource[T] extends Product with Serializable
  A class for defining actions to be performed when there is no match by source during a merge operation in a MergeIntoWriter.
- T
- the type parameter for the MergeIntoWriter. 
 
- trait WriteConfigMethods[R] extends AnyRef
  Configuration methods common to create/replace operations and insert/overwrite operations.
- R
- builder type to return 
 - Since
- 3.0.0 
 
Value Members
- object Encoders
  Methods for creating an Encoder.
- Since
- 1.6.0 
 
- object Observation
  (Scala-specific) Create instances of Observation via Scala apply.
- Since
- 3.3.0 
 
- object Row extends Serializable
- Annotations
- @Stable()
- Since
- 1.3.0 
 
-  object SQLContext extends SQLContextCompanion with Serializable
-  object SparkSession extends SparkSessionCompanion with Serializable
- object functions
  Commonly used functions available for DataFrame operations. Using functions defined here provides a little bit more compile-time safety to make sure the function exists.
  You can call the functions defined here in two ways: _FUNC_(...) and functions.expr("_FUNC_(...)").
  As an example, regr_count is a function that is defined here. You can use regr_count(col("yCol"), col("xCol")) to invoke the regr_count function. This way the programming language's compiler ensures regr_count exists and is of the proper form. You can also use the expr("regr_count(yCol, xCol)") function to invoke the same function. In this case, Spark itself will ensure regr_count exists when it analyzes the query.
  You can find the entire list of functions at the SQL API documentation of your Spark version; see also the latest list.
  These function APIs usually have methods with a Column signature only, because it can support not only Column but also other types such as a native string. The other variants currently exist for historical reasons.
- Annotations
- @Stable()
- Since
- 1.3.0
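The two invocation styles described for `functions` can be compared side by side. The sketch below assumes Spark SQL is on the classpath in a version whose `functions` object includes `regr_count`, as the description above states; the data and the column names `yCol` and `xCol` are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, expr, regr_count}

val spark = SparkSession.builder().master("local[*]").appName("functions-sketch").getOrCreate()
import spark.implicits._

// Hypothetical (y, x) pairs with no nulls.
val df = Seq((1.0, 2.0), (2.0, 4.0), (3.0, 6.0)).toDF("yCol", "xCol")

// Style 1: the Scala compiler checks that regr_count exists and takes two columns.
val compiled = df.select(regr_count(col("yCol"), col("xCol")).as("n")).head().getLong(0)

// Style 2: the name is only resolved when Spark analyzes the query.
val parsed = df.select(expr("regr_count(yCol, xCol)").as("n")).head().getLong(0)

println(s"compiled=$compiled parsed=$parsed")
```

A misspelled name fails at compile time in the first style, but only when the query is analyzed in the second.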