RelationalGroupedDataset

abstract class RelationalGroupedDataset extends AnyRef

A set of methods for aggregations on a DataFrame, created by groupBy, cube or rollup (and also pivot).

The main method is the agg function, which has multiple variants. This class also contains some first-order statistics such as mean, sum for convenience.

Annotations: @Stable()
Source: RelationalGroupedDataset.scala
Since: 2.0.0
Note: This class was named GroupedData in Spark 1.x.

Linear Supertypes

AnyRef, Any

Ordering

Alphabetic
By Inheritance

Inherited

RelationalGroupedDataset
AnyRef
Any

Hide All
Show All

Visibility

Public
Protected

Instance Constructors

new RelationalGroupedDataset()

Abstract Value Members

abstract def as[K, T](implicit arg0: Encoder[K], arg1: Encoder[T]): KeyValueGroupedDataset[K, T]
Returns a KeyValueGroupedDataset where the data is grouped by the grouping expressions of current RelationalGroupedDataset.
Returns a KeyValueGroupedDataset where the data is grouped by the grouping expressions of current RelationalGroupedDataset.
Since
3.0.0
abstract def df: DataFrame
Attributes
protected
abstract def pivot(pivotColumn: Column, values: Seq[Any]): RelationalGroupedDataset
Pivots a column of the current DataFrame and performs the specified aggregation.
Pivots a column of the current DataFrame and performs the specified aggregation. This is an overloaded version of the pivot method with pivotColumn of the String type.
```
// Compute the sum of earnings for each year by course with each course as a separate column
df.groupBy($"year").pivot($"course", Seq("dotNET", "Java")).sum($"earnings")
```
pivotColumn
the column to pivot.
values
List of values that will be translated to columns in the output DataFrame.
Since
2.4.0
See also
org.apache.spark.sql.Dataset.unpivot for the reverse operation, except for the aggregation.
abstract def pivot(pivotColumn: Column): RelationalGroupedDataset
Pivots a column of the current DataFrame and performs the specified aggregation.
Pivots a column of the current DataFrame and performs the specified aggregation.
Spark will eagerly compute the distinct values in pivotColumn so it can determine the resulting schema of the transformation. To avoid any eager computations, provide an explicit list of values via pivot(pivotColumn: Column, values: Seq[Any]).
```
// Compute the sum of earnings for each year by course with each course as a separate column
df.groupBy($"year").pivot($"course").sum($"earnings");
```
pivotColumn
he column to pivot.
Since
2.4.0
See also
org.apache.spark.sql.Dataset.unpivot for the reverse operation, except for the aggregation.
abstract def selectNumericColumns(colNames: Seq[String]): Seq[Column]
Attributes
protected
abstract def toDF(aggCols: Seq[Column]): DataFrame
Create a aggregation based on the grouping column, the grouping type, and the aggregations.
Create a aggregation based on the grouping column, the grouping type, and the aggregations.
Attributes
protected

Concrete Value Members

final def !=(arg0: Any): Boolean
Definition Classes
AnyRef → Any
final def ##: Int
Definition Classes
AnyRef → Any
final def ==(arg0: Any): Boolean
Definition Classes
AnyRef → Any
def agg(expr: Column, exprs: Column*): DataFrame
Compute aggregates by specifying a series of aggregate columns.
Compute aggregates by specifying a series of aggregate columns. Note that this function by default retains the grouping columns in its output. To not retain grouping columns, set spark.sql.retainGroupColumns to false.
The available aggregate methods are defined in org.apache.spark.sql.functions.
```
// Selects the age of the oldest employee and the aggregate expense for each department

// Scala:
import org.apache.spark.sql.functions._
df.groupBy("department").agg(max("age"), sum("expense"))

// Java:
import static org.apache.spark.sql.functions.*;
df.groupBy("department").agg(max("age"), sum("expense"));
```
Note that before Spark 1.4, the default behavior is to NOT retain grouping columns. To change to that behavior, set config variable spark.sql.retainGroupColumns to false.
```
// Scala, 1.3.x:
df.groupBy("department").agg($"department", max("age"), sum("expense"))

// Java, 1.3.x:
df.groupBy("department").agg(col("department"), max("age"), sum("expense"));
```
Annotations
@varargs()
Since
1.3.0
def agg(exprs: Map[String, String]): DataFrame
(Java-specific) Compute aggregates by specifying a map from column name to aggregate methods.
(Java-specific) Compute aggregates by specifying a map from column name to aggregate methods. The resulting DataFrame will also contain the grouping columns.
The available aggregate methods are avg, max, min, sum, count.
```
// Selects the age of the oldest employee and the aggregate expense for each department
import com.google.common.collect.ImmutableMap;
df.groupBy("department").agg(ImmutableMap.of("age", "max", "expense", "sum"));
```
Since
1.3.0
def agg(exprs: Map[String, String]): DataFrame
(Scala-specific) Compute aggregates by specifying a map from column name to aggregate methods.
(Scala-specific) Compute aggregates by specifying a map from column name to aggregate methods. The resulting DataFrame will also contain the grouping columns.
The available aggregate methods are avg, max, min, sum, count.
```
// Selects the age of the oldest employee and the aggregate expense for each department
df.groupBy("department").agg(Map(
  "age" -> "max",
  "expense" -> "sum"
))
```
Since
1.3.0
def agg(aggExpr: (String, String), aggExprs: (String, String)*): DataFrame
(Scala-specific) Compute aggregates by specifying the column names and aggregate methods.
(Scala-specific) Compute aggregates by specifying the column names and aggregate methods. The resulting DataFrame will also contain the grouping columns.
The available aggregate methods are avg, max, min, sum, count.
```
// Selects the age of the oldest employee and the aggregate expense for each department
df.groupBy("department").agg(
  "age" -> "max",
  "expense" -> "sum"
)
```
Since
1.3.0
final def asInstanceOf[T0]: T0
Definition Classes
Any
def avg(colNames: String*): DataFrame
Compute the mean value for each numeric columns for each group.
Compute the mean value for each numeric columns for each group. The resulting DataFrame will also contain the grouping columns. When specified columns are given, only compute the mean values for them.
Annotations
@varargs()
Since
1.3.0
def clone(): AnyRef
Attributes
protected[lang]
Definition Classes
AnyRef
Annotations
@throws(classOf[java.lang.CloneNotSupportedException]) @IntrinsicCandidate() @native()
def count(): DataFrame
Count the number of rows for each group.
Count the number of rows for each group. The resulting DataFrame will also contain the grouping columns.
Since
1.3.0
final def eq(arg0: AnyRef): Boolean
Definition Classes
AnyRef
def equals(arg0: AnyRef): Boolean
Definition Classes
AnyRef → Any
final def getClass(): Class[_ <: AnyRef]
Definition Classes
AnyRef → Any
Annotations
@IntrinsicCandidate() @native()
def hashCode(): Int
Definition Classes
AnyRef → Any
Annotations
@IntrinsicCandidate() @native()
final def isInstanceOf[T0]: Boolean
Definition Classes
Any
def max(colNames: String*): DataFrame
Compute the max value for each numeric columns for each group.
Compute the max value for each numeric columns for each group. The resulting DataFrame will also contain the grouping columns. When specified columns are given, only compute the max values for them.
Annotations
@varargs()
Since
1.3.0
def mean(colNames: String*): DataFrame
Compute the average value for each numeric columns for each group.
Compute the average value for each numeric columns for each group. This is an alias for avg. The resulting DataFrame will also contain the grouping columns. When specified columns are given, only compute the average values for them.
Annotations
@varargs()
Since
1.3.0
def min(colNames: String*): DataFrame
Compute the min value for each numeric column for each group.
Compute the min value for each numeric column for each group. The resulting DataFrame will also contain the grouping columns. When specified columns are given, only compute the min values for them.
Annotations
@varargs()
Since
1.3.0
final def ne(arg0: AnyRef): Boolean
Definition Classes
AnyRef
final def notify(): Unit
Definition Classes
AnyRef
Annotations
@IntrinsicCandidate() @native()
final def notifyAll(): Unit
Definition Classes
AnyRef
Annotations
@IntrinsicCandidate() @native()
def pivot(pivotColumn: Column, values: List[Any]): RelationalGroupedDataset
(Java-specific) Pivots a column of the current DataFrame and performs the specified aggregation.
(Java-specific) Pivots a column of the current DataFrame and performs the specified aggregation. This is an overloaded version of the pivot method with pivotColumn of the String type.
pivotColumn
the column to pivot.
values
List of values that will be translated to columns in the output DataFrame.
Since
2.4.0
See also
org.apache.spark.sql.Dataset.unpivot for the reverse operation, except for the aggregation.
def pivot(pivotColumn: String, values: List[Any]): RelationalGroupedDataset
(Java-specific) Pivots a column of the current DataFrame and performs the specified aggregation.
(Java-specific) Pivots a column of the current DataFrame and performs the specified aggregation.
There are two versions of pivot function: one that requires the caller to specify the list of distinct values to pivot on, and one that does not. The latter is more concise but less efficient, because Spark needs to first compute the list of distinct values internally.
```
// Compute the sum of earnings for each year by course with each course as a separate column
df.groupBy("year").pivot("course", Arrays.<Object>asList("dotNET", "Java")).sum("earnings");

// Or without specifying column values (less efficient)
df.groupBy("year").pivot("course").sum("earnings");
```
pivotColumn
Name of the column to pivot.
values
List of values that will be translated to columns in the output DataFrame.
Since
1.6.0
See also
org.apache.spark.sql.Dataset.unpivot for the reverse operation, except for the aggregation.
def pivot(pivotColumn: String, values: Seq[Any]): RelationalGroupedDataset
Pivots a column of the current DataFrame and performs the specified aggregation.
Pivots a column of the current DataFrame and performs the specified aggregation. There are two versions of pivot function: one that requires the caller to specify the list of distinct values to pivot on, and one that does not. The latter is more concise but less efficient, because Spark needs to first compute the list of distinct values internally.
```
// Compute the sum of earnings for each year by course with each course as a separate column
df.groupBy("year").pivot("course", Seq("dotNET", "Java")).sum("earnings")

// Or without specifying column values (less efficient)
df.groupBy("year").pivot("course").sum("earnings")
```
From Spark 3.0.0, values can be literal columns, for instance, struct. For pivoting by multiple columns, use the struct function to combine the columns and values:
```
df.groupBy("year")
  .pivot("trainingCourse", Seq(struct(lit("java"), lit("Experts"))))
  .agg(sum($"earnings"))
```
pivotColumn
Name of the column to pivot.
values
List of values that will be translated to columns in the output DataFrame.
Since
1.6.0
See also
org.apache.spark.sql.Dataset.unpivot for the reverse operation, except for the aggregation.
def pivot(pivotColumn: String): RelationalGroupedDataset
Pivots a column of the current DataFrame and performs the specified aggregation.
Pivots a column of the current DataFrame and performs the specified aggregation.
Spark will eagerly compute the distinct values in pivotColumn so it can determine the resulting schema of the transformation. To avoid any eager computations, provide an explicit list of values via pivot(pivotColumn: String, values: Seq[Any]).
```
// Compute the sum of earnings for each year by course with each course as a separate column
df.groupBy("year").pivot("course").sum("earnings")
```
pivotColumn
Name of the column to pivot.
Since
1.6.0
See also
org.apache.spark.sql.Dataset.unpivot for the reverse operation, except for the aggregation.
def sum(colNames: String*): DataFrame
Compute the sum for each numeric columns for each group.
Compute the sum for each numeric columns for each group. The resulting DataFrame will also contain the grouping columns. When specified columns are given, only compute the sum for them.
Annotations
@varargs()
Since
1.3.0
final def synchronized[T0](arg0: => T0): T0
Definition Classes
AnyRef
def toString(): String
Definition Classes
AnyRef → Any
final def wait(arg0: Long, arg1: Int): Unit
Definition Classes
AnyRef
Annotations
@throws(classOf[java.lang.InterruptedException])
final def wait(arg0: Long): Unit
Definition Classes
AnyRef
Annotations
@throws(classOf[java.lang.InterruptedException]) @native()
final def wait(): Unit
Definition Classes
AnyRef
Annotations
@throws(classOf[java.lang.InterruptedException])

Deprecated Value Members

def finalize(): Unit
Attributes
protected[lang]
Definition Classes
AnyRef
Annotations
@throws(classOf[java.lang.Throwable]) @Deprecated
Deprecated
(Since version 9)

Packages

RelationalGroupedDataset

abstract class RelationalGroupedDataset extends AnyRef

Instance Constructors

Abstract Value Members

Concrete Value Members

Deprecated Value Members

Inherited from AnyRef

Inherited from Any

Ungrouped

Packages

RelationalGroupedDataset

abstract class RelationalGroupedDataset extends AnyRef

Instance Constructors

Abstract Value Members

Concrete Value Members

Deprecated Value Members

Inherited from AnyRef

Inherited from Any

Ungrouped

RelationalGroupedDataset