package api
Contains API classes that are specific to a single language (i.e. Java).
- Source
- package.scala
Type Members
- abstract class Catalog[DS[U] <: Dataset[U, DS]] extends AnyRef
Catalog interface for Spark.
Catalog interface for Spark. To access this, use
SparkSession.catalog
.- Annotations
- @Stable()
- Since
2.0.0
- abstract class DataFrameNaFunctions[DS[U] <: Dataset[U, DS]] extends AnyRef
Functionality for working with missing data in
DataFrame
s.Functionality for working with missing data in
DataFrame
s.- Annotations
- @Stable()
- Since
1.3.1
- abstract class DataFrameReader[DS[U] <: Dataset[U, DS]] extends AnyRef
Interface used to load a Dataset from external storage systems (e.g.
Interface used to load a Dataset from external storage systems (e.g. file systems, key-value stores, etc). Use
SparkSession.read
to access this.- Annotations
- @Stable()
- Since
1.4.0
- abstract class DataFrameStatFunctions[DS[U] <: Dataset[U, DS]] extends AnyRef
Statistic functions for
DataFrame
s.Statistic functions for
DataFrame
s.- Annotations
- @Stable()
- Since
1.4.0
- abstract class Dataset[T, DS[U] <: Dataset[U, DS]] extends Serializable
A Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations.
A Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each Dataset also has an untyped view called a
DataFrame
, which is a Dataset of org.apache.spark.sql.Row.Operations available on Datasets are divided into transformations and actions. Transformations are the ones that produce new Datasets, and actions are the ones that trigger computation and return results. Example transformations include map, filter, select, and aggregate (
groupBy
). Example actions count, show, or writing data out to file systems.Datasets are "lazy", i.e. computations are only triggered when an action is invoked. Internally, a Dataset represents a logical plan that describes the computation required to produce the data. When an action is invoked, Spark's query optimizer optimizes the logical plan and generates a physical plan for efficient execution in a parallel and distributed manner. To explore the logical plan as well as optimized physical plan, use the
explain
function.To efficiently support domain-specific objects, an org.apache.spark.sql.Encoder is required. The encoder maps the domain specific type
T
to Spark's internal type system. For example, given a classPerson
with two fields,name
(string) andage
(int), an encoder is used to tell Spark to generate code at runtime to serialize thePerson
object into a binary structure. This binary structure often has much lower memory footprint as well as are optimized for efficiency in data processing (e.g. in a columnar format). To understand the internal binary representation for data, use theschema
function.There are typically two ways to create a Dataset. The most common way is by pointing Spark to some files on storage systems, using the
read
function available on aSparkSession
.val people = spark.read.parquet("...").as[Person] // Scala Dataset<Person> people = spark.read().parquet("...").as(Encoders.bean(Person.class)); // Java
Datasets can also be created through transformations available on existing Datasets. For example, the following creates a new Dataset by applying a filter on the existing one:
val names = people.map(_.name) // in Scala; names is a Dataset[String] Dataset<String> names = people.map( (MapFunction<Person, String>) p -> p.name, Encoders.STRING()); // Java
Dataset operations can also be untyped, through various domain-specific-language (DSL) functions defined in: Dataset (this class), org.apache.spark.sql.Column, and org.apache.spark.sql.functions. These operations are very similar to the operations available in the data frame abstraction in R or Python.
To select a column from the Dataset, use
apply
method in Scala andcol
in Java.val ageCol = people("age") // in Scala Column ageCol = people.col("age"); // in Java
Note that the org.apache.spark.sql.Column type can also be manipulated through its various functions.
// The following creates a new column that increases everybody's age by 10. people("age") + 10 // in Scala people.col("age").plus(10); // in Java
A more concrete example in Scala:
// To create Dataset[Row] using SparkSession val people = spark.read.parquet("...") val department = spark.read.parquet("...") people.filter("age > 30") .join(department, people("deptId") === department("id")) .groupBy(department("name"), people("gender")) .agg(avg(people("salary")), max(people("age")))
and in Java:
// To create Dataset<Row> using SparkSession Dataset<Row> people = spark.read().parquet("..."); Dataset<Row> department = spark.read().parquet("..."); people.filter(people.col("age").gt(30)) .join(department, people.col("deptId").equalTo(department.col("id"))) .groupBy(department.col("name"), people.col("gender")) .agg(avg(people.col("salary")), max(people.col("age")));
- Annotations
- @Stable()
- Since
1.6.0
- abstract class KeyValueGroupedDataset[K, V, DS[U] <: Dataset[U, DS]] extends Serializable
A Dataset has been logically grouped by a user specified grouping key.
A Dataset has been logically grouped by a user specified grouping key. Users should not construct a KeyValueGroupedDataset directly, but should instead call
groupByKey
on an existing Dataset.- Since
2.0.0
- abstract class RelationalGroupedDataset[DS[U] <: Dataset[U, DS]] extends AnyRef
A set of methods for aggregations on a
DataFrame
, created by groupBy, cube or rollup (and alsopivot
).A set of methods for aggregations on a
DataFrame
, created by groupBy, cube or rollup (and alsopivot
).The main method is the
agg
function, which has multiple variants. This class also contains some first-order statistics such asmean
,sum
for convenience.- Annotations
- @Stable()
- Since
2.0.0
- Note
This class was named
GroupedData
in Spark 1.x.
- abstract class SparkSession[DS[U] <: Dataset[U, DS]] extends Serializable with Closeable
The entry point to programming Spark with the Dataset and DataFrame API.
The entry point to programming Spark with the Dataset and DataFrame API.
In environments that this has been created upfront (e.g. REPL, notebooks), use the builder to get an existing session:
SparkSession.builder().getOrCreate()
The builder can also be used to create a new session:
SparkSession.builder .master("local") .appName("Word Count") .config("spark.some.config.option", "some-value") .getOrCreate()
- trait StreamingQuery[DS[U] <: Dataset[U, DS]] extends AnyRef
A handle to a query that is executing continuously in the background as new data arrives.
A handle to a query that is executing continuously in the background as new data arrives. All these methods are thread-safe.
- Annotations
- @Evolving()
- Since
2.0.0
- abstract class UDFRegistration extends AnyRef
Functions for registering user-defined functions.
Functions for registering user-defined functions. Use
SparkSession.udf
to access this:spark.udf
- Since
4.0.0