org.apache.spark.sql.RelationalGroupedDataset

public class RelationalGroupedDataset extends Object

A set of methods for aggregations on a DataFrame, created by groupBy, cube or rollup (and also pivot).

The main method is the agg function, which has multiple variants. This class also contains some first-order statistics, such as mean and sum, for convenience.

Since: 2.0.0
Note: This class was named GroupedData in Spark 1.x.

Nested Class Summary

Nested Classes

| Modifier and Type | Class | Description |
| --- | --- | --- |
| static class | RelationalGroupedDataset.CubeType$ | To indicate it's the CUBE |
| static class | RelationalGroupedDataset.GroupByType$ | To indicate it's the GroupBy |
| static class | RelationalGroupedDataset.GroupingSetsType$ | |
| static interface | RelationalGroupedDataset.GroupType | The Grouping Type |
| static class | RelationalGroupedDataset.PivotType$ | |
| static class | RelationalGroupedDataset.RollupType$ | To indicate it's the ROLLUP |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| Dataset<Row> | agg(Map<String,String> exprs) | (Java-specific) Compute aggregates by specifying a map from column name to aggregate methods. |
| Dataset<Row> | agg(Column expr, Column... exprs) | Compute aggregates by specifying a series of aggregate columns. |
| Dataset<Row> | agg(Column expr, scala.collection.immutable.Seq<Column> exprs) | Compute aggregates by specifying a series of aggregate columns. |
| Dataset<Row> | agg(scala.collection.immutable.Map<String,String> exprs) | (Scala-specific) Compute aggregates by specifying a map from column name to aggregate methods. |
| Dataset<Row> | agg(scala.Tuple2<String,String> aggExpr, scala.collection.immutable.Seq<scala.Tuple2<String,String>> aggExprs) | (Scala-specific) Compute aggregates by specifying the column names and aggregate methods. |
| static RelationalGroupedDataset | apply(Dataset<Row> df, scala.collection.immutable.Seq<org.apache.spark.sql.catalyst.expressions.Expression> groupingExprs, RelationalGroupedDataset.GroupType groupType) | |
| <K, T> KeyValueGroupedDataset<K,T> | as(Encoder<K> evidence$1, Encoder<T> evidence$2) | Returns a KeyValueGroupedDataset where the data is grouped by the grouping expressions of the current RelationalGroupedDataset. |
| Dataset<Row> | avg(String... colNames) | Compute the mean value for each numeric column for each group. |
| Dataset<Row> | avg(scala.collection.immutable.Seq<String> colNames) | Compute the mean value for each numeric column for each group. |
| Dataset<Row> | count() | Count the number of rows for each group. |
| Dataset<Row> | max(String... colNames) | Compute the max value for each numeric column for each group. |
| Dataset<Row> | max(scala.collection.immutable.Seq<String> colNames) | Compute the max value for each numeric column for each group. |
| Dataset<Row> | mean(String... colNames) | Compute the average value for each numeric column for each group. |
| Dataset<Row> | mean(scala.collection.immutable.Seq<String> colNames) | Compute the average value for each numeric column for each group. |
| Dataset<Row> | min(String... colNames) | Compute the min value for each numeric column for each group. |
| Dataset<Row> | min(scala.collection.immutable.Seq<String> colNames) | Compute the min value for each numeric column for each group. |
| RelationalGroupedDataset | pivot(String pivotColumn) | Pivots a column of the current DataFrame and performs the specified aggregation. |
| RelationalGroupedDataset | pivot(String pivotColumn, List<Object> values) | (Java-specific) Pivots a column of the current DataFrame and performs the specified aggregation. |
| RelationalGroupedDataset | pivot(String pivotColumn, scala.collection.immutable.Seq<Object> values) | Pivots a column of the current DataFrame and performs the specified aggregation. |
| RelationalGroupedDataset | pivot(Column pivotColumn) | Pivots a column of the current DataFrame and performs the specified aggregation. |
| RelationalGroupedDataset | pivot(Column pivotColumn, List<Object> values) | (Java-specific) Pivots a column of the current DataFrame and performs the specified aggregation. |
| RelationalGroupedDataset | pivot(Column pivotColumn, scala.collection.immutable.Seq<Object> values) | Pivots a column of the current DataFrame and performs the specified aggregation. |
| Dataset<Row> | sum(String... colNames) | Compute the sum for each numeric column for each group. |
| Dataset<Row> | sum(scala.collection.immutable.Seq<String> colNames) | Compute the sum for each numeric column for each group. |
| String | toString() | |

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait

Method Details
- apply
  
  public static RelationalGroupedDataset apply(Dataset<Row> df, scala.collection.immutable.Seq<org.apache.spark.sql.catalyst.expressions.Expression> groupingExprs, RelationalGroupedDataset.GroupType groupType)
- agg
  
  public Dataset<Row> agg(Column expr, Column... exprs)
  Compute aggregates by specifying a series of aggregate columns. Note that this function by default retains the grouping columns in its output. To not retain grouping columns, set spark.sql.retainGroupColumns to false.
  The available aggregate methods are defined in functions.
  
      // Selects the age of the oldest employee and the aggregate expense for each department

      // Scala:
      import org.apache.spark.sql.functions._
      df.groupBy("department").agg(max("age"), sum("expense"))

      // Java:
      import static org.apache.spark.sql.functions.*;
      df.groupBy("department").agg(max("age"), sum("expense"));
  
  Before Spark 1.4, the default behavior was to NOT retain grouping columns. To restore that behavior, set the configuration variable spark.sql.retainGroupColumns to false. In Spark 1.3.x, the grouping column is listed explicitly in agg to keep it in the output:

      // Scala, 1.3.x:
      df.groupBy("department").agg($"department", max("age"), sum("expense"))

      // Java, 1.3.x:
      df.groupBy("department").agg(col("department"), max("age"), sum("expense"));

  Parameters:
  
  expr - (undocumented)
  
  exprs - (undocumented)
  
  Returns:
  
  (undocumented)
  
  Since:
  
  1.3.0
- mean
  
  public Dataset<Row> mean(String... colNames)
  
  Compute the average value for each numeric column for each group. This is an alias for avg. The resulting DataFrame will also contain the grouping columns. When column names are specified, only the average values for those columns are computed.
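
  The behavior described above can be illustrated with a minimal Java sketch. It assumes a local SparkSession and a small made-up employee table; the class name, column names, and sample values are hypothetical and are not part of the Spark API.

  ```java
  import java.util.Arrays;
  import java.util.List;
  import java.util.Map;
  import java.util.TreeMap;

  import org.apache.spark.sql.Dataset;
  import org.apache.spark.sql.Row;
  import org.apache.spark.sql.RowFactory;
  import org.apache.spark.sql.SparkSession;
  import org.apache.spark.sql.types.DataTypes;
  import org.apache.spark.sql.types.StructType;

  public class GroupMeanExample {
      // Returns department -> mean age for a small, made-up employee table.
      static Map<String, Double> meanAgeByDepartment() {
          SparkSession spark = SparkSession.builder()
              .master("local[1]")
              .appName("GroupMeanExample")
              .getOrCreate();
          try {
              // Hypothetical schema and rows, used only for illustration.
              StructType schema = new StructType()
                  .add("department", DataTypes.StringType)
                  .add("age", DataTypes.IntegerType)
                  .add("expense", DataTypes.DoubleType);
              List<Row> rows = Arrays.asList(
                  RowFactory.create("Sales", 34, 120.0),
                  RowFactory.create("Sales", 51, 80.0),
                  RowFactory.create("Engineering", 29, 200.0));
              Dataset<Row> df = spark.createDataFrame(rows, schema);

              // Only "age" is averaged because it is the only column named;
              // the grouping column "department" is kept in the output.
              Dataset<Row> means = df.groupBy("department").mean("age");

              Map<String, Double> result = new TreeMap<>();
              for (Row r : means.collectAsList()) {
                  result.put(r.getString(0), r.getDouble(1));
              }
              return result;
          } finally {
              spark.stop();
          }
      }

      public static void main(String[] args) {
          System.out.println(meanAgeByDepartment());
      }
  }
  ```

  Calling mean() with no column names would instead average every numeric column, here both age and expense.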
  
  Parameters:
  
  colNames - (undocumented)
  
  Returns:
  
  (undocumented)
  
  Since:
  
  1.3.0
- max
  
  public Dataset<Row> max(String... colNames)
  
  Compute the max value for each numeric column for each group. The resulting DataFrame will also contain the grouping columns. When column names are specified, only the max values for those columns are computed.
  
  Parameters:
  
  colNames - (undocumented)
  
  Returns:
  
  (undocumented)
  
  Since:
  
  1.3.0
- avg
  
  public Dataset<Row> avg(String... colNames)
  
  Compute the mean value for each numeric column for each group. The resulting DataFrame will also contain the grouping columns. When column names are specified, only the mean values for those columns are computed.
  
  Parameters:
  
  colNames - (undocumented)
  
  Returns:
  
  (undocumented)
  
  Since:
  
  1.3.0
- min
  
  public Dataset<Row> min(String... colNames)
  
  Compute the min value for each numeric column for each group. The resulting DataFrame will also contain the grouping columns. When column names are specified, only the min values for those columns are computed.
  
  Parameters:
  
  colNames - (undocumented)
  
  Returns:
  
  (undocumented)
  
  Since:
  
  1.3.0
- sum
  
  public Dataset<Row> sum(String... colNames)
  
  Compute the sum for each numeric column for each group. The resulting DataFrame will also contain the grouping columns. When column names are specified, only the sums for those columns are computed.
  
  Parameters:
  
  colNames - (undocumented)
  
  Returns:
  
  (undocumented)
  
  Since:
  
  1.3.0
- as
  
  public <K, T> KeyValueGroupedDataset<K,T> as(Encoder<K> evidence$1, Encoder<T> evidence$2)
  
  Returns a KeyValueGroupedDataset where the data is grouped by the grouping expressions of the current RelationalGroupedDataset.
  
  Parameters:
  
  evidence$1 - (undocumented)
  
  evidence$2 - (undocumented)
  
  Returns:
  
  (undocumented)
  
  Since:
  
  3.0.0
- agg
  
  public Dataset<Row> agg(scala.Tuple2<String,String> aggExpr, scala.collection.immutable.Seq<scala.Tuple2<String,String>> aggExprs)
  (Scala-specific) Compute aggregates by specifying the column names and aggregate methods. The resulting DataFrame will also contain the grouping columns.
  The available aggregate methods are avg, max, min, sum, count.

      // Selects the age of the oldest employee and the aggregate expense for each department
      df.groupBy("department").agg(
        "age" -> "max",
        "expense" -> "sum"
      )

  Parameters:
  
  aggExpr - (undocumented)
  
  aggExprs - (undocumented)
  
  Returns:
  
  (undocumented)
  
  Since:
  
  1.3.0
- agg
  
  public Dataset<Row> agg(scala.collection.immutable.Map<String,String> exprs)
  (Scala-specific) Compute aggregates by specifying a map from column name to aggregate methods. The resulting DataFrame will also contain the grouping columns.
  The available aggregate methods are avg, max, min, sum, count.

      // Selects the age of the oldest employee and the aggregate expense for each department
      df.groupBy("department").agg(Map(
        "age" -> "max",
        "expense" -> "sum"
      ))

  Parameters:
  
  exprs - (undocumented)
  
  Returns:
  
  (undocumented)
  
  Since:
  
  1.3.0
- agg
  
  public Dataset<Row> agg(Map<String,String> exprs)
  (Java-specific) Compute aggregates by specifying a map from column name to aggregate methods. The resulting DataFrame will also contain the grouping columns.
  The available aggregate methods are avg, max, min, sum, count.

      // Selects the age of the oldest employee and the aggregate expense for each department
      import com.google.common.collect.ImmutableMap;
      df.groupBy("department").agg(ImmutableMap.of("age", "max", "expense", "sum"));

  Parameters:
  
  exprs - (undocumented)
  
  Returns:
  
  (undocumented)
  
  Since:
  
  1.3.0
- agg
  
  public Dataset<Row> agg(Column expr, scala.collection.immutable.Seq<Column> exprs)
  Compute aggregates by specifying a series of aggregate columns. Note that this function by default retains the grouping columns in its output. To not retain grouping columns, set spark.sql.retainGroupColumns to false.
  The available aggregate methods are defined in functions.
  
      // Selects the age of the oldest employee and the aggregate expense for each department

      // Scala:
      import org.apache.spark.sql.functions._
      df.groupBy("department").agg(max("age"), sum("expense"))

      // Java:
      import static org.apache.spark.sql.functions.*;
      df.groupBy("department").agg(max("age"), sum("expense"));
  
  Before Spark 1.4, the default behavior was to NOT retain grouping columns. To restore that behavior, set the configuration variable spark.sql.retainGroupColumns to false. In Spark 1.3.x, the grouping column is listed explicitly in agg to keep it in the output:

      // Scala, 1.3.x:
      df.groupBy("department").agg($"department", max("age"), sum("expense"))

      // Java, 1.3.x:
      df.groupBy("department").agg(col("department"), max("age"), sum("expense"));

  Parameters:
  
  expr - (undocumented)
  
  exprs - (undocumented)
  
  Returns:
  
  (undocumented)
  
  Since:
  
  1.3.0
- count
  
  public Dataset<Row> count()
  
  Count the number of rows for each group. The resulting DataFrame will also contain the grouping columns.
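
  As a minimal Java sketch of the above, assuming a local SparkSession and a made-up (department, age) table, the following groups by department and counts the rows in each group. The class name and sample data are hypothetical.

  ```java
  import java.util.Arrays;
  import java.util.List;
  import java.util.Map;
  import java.util.TreeMap;

  import org.apache.spark.sql.Dataset;
  import org.apache.spark.sql.Row;
  import org.apache.spark.sql.RowFactory;
  import org.apache.spark.sql.SparkSession;
  import org.apache.spark.sql.types.DataTypes;
  import org.apache.spark.sql.types.StructType;

  public class GroupCountExample {
      // Returns department -> number of rows for a small, made-up table.
      static Map<String, Long> countByDepartment() {
          SparkSession spark = SparkSession.builder()
              .master("local[1]")
              .appName("GroupCountExample")
              .getOrCreate();
          try {
              // Hypothetical schema and rows, used only for illustration.
              StructType schema = new StructType()
                  .add("department", DataTypes.StringType)
                  .add("age", DataTypes.IntegerType);
              List<Row> rows = Arrays.asList(
                  RowFactory.create("Sales", 34),
                  RowFactory.create("Sales", 51),
                  RowFactory.create("Engineering", 29));
              Dataset<Row> df = spark.createDataFrame(rows, schema);

              // One output row per department: the grouping column
              // followed by a long-typed "count" column.
              Dataset<Row> counts = df.groupBy("department").count();

              Map<String, Long> result = new TreeMap<>();
              for (Row r : counts.collectAsList()) {
                  result.put(r.getString(0), r.getLong(1));
              }
              return result;
          } finally {
              spark.stop();
          }
      }

      public static void main(String[] args) {
          System.out.println(countByDepartment());
      }
  }
  ```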
  
  Returns:
  
  (undocumented)
  
  Since:
  
  1.3.0
- mean
  
  public Dataset<Row> mean(scala.collection.immutable.Seq<String> colNames)
  
  Compute the average value for each numeric column for each group. This is an alias for avg. The resulting DataFrame will also contain the grouping columns. When column names are specified, only the average values for those columns are computed.
  
  Parameters:
  
  colNames - (undocumented)
  
  Returns:
  
  (undocumented)
  
  Since:
  
  1.3.0
- max
  
  public Dataset<Row> max(scala.collection.immutable.Seq<String> colNames)
  
  Compute the max value for each numeric column for each group. The resulting DataFrame will also contain the grouping columns. When column names are specified, only the max values for those columns are computed.
  
  Parameters:
  
  colNames - (undocumented)
  
  Returns:
  
  (undocumented)
  
  Since:
  
  1.3.0
- avg
  
  public Dataset<Row> avg(scala.collection.immutable.Seq<String> colNames)
  
  Compute the mean value for each numeric column for each group. The resulting DataFrame will also contain the grouping columns. When column names are specified, only the mean values for those columns are computed.
  
  Parameters:
  
  colNames - (undocumented)
  
  Returns:
  
  (undocumented)
  
  Since:
  
  1.3.0
- min
  
  public Dataset<Row> min(scala.collection.immutable.Seq<String> colNames)
  
  Compute the min value for each numeric column for each group. The resulting DataFrame will also contain the grouping columns. When column names are specified, only the min values for those columns are computed.
  
  Parameters:
  
  colNames - (undocumented)
  
  Returns:
  
  (undocumented)
  
  Since:
  
  1.3.0
- sum
  
  public Dataset<Row> sum(scala.collection.immutable.Seq<String> colNames)
  
  Compute the sum for each numeric column for each group. The resulting DataFrame will also contain the grouping columns. When column names are specified, only the sums for those columns are computed.
  
  Parameters:
  
  colNames - (undocumented)
  
  Returns:
  
  (undocumented)
  
  Since:
  
  1.3.0
- pivot
  
  public RelationalGroupedDataset pivot(String pivotColumn)
  Pivots a column of the current DataFrame and performs the specified aggregation.
  Spark will eagerly compute the distinct values in pivotColumn so it can determine the resulting schema of the transformation. To avoid any eager computations, provide an explicit list of values via pivot(pivotColumn: String, values: Seq[Any]).
  
      // Compute the sum of earnings for each year by course with each course as a separate column
      df.groupBy("year").pivot("course").sum("earnings")

  Parameters:
  
  pivotColumn - Name of the column to pivot.
  
  Returns:
  
  (undocumented)
  
  Since:
  
  1.6.0
  
  See Also:
  
  org.apache.spark.sql.Dataset.unpivot for the reverse operation, except for the aggregation.
- pivot
  
  public RelationalGroupedDataset pivot(String pivotColumn, scala.collection.immutable.Seq<Object> values)
  Pivots a column of the current DataFrame and performs the specified aggregation. There are two versions of the pivot function: one that requires the caller to specify the list of distinct values to pivot on, and one that does not. The latter is more concise but less efficient, because Spark must first compute the list of distinct values internally.

      // Compute the sum of earnings for each year by course with each course as a separate column
      df.groupBy("year").pivot("course", Seq("dotNET", "Java")).sum("earnings")

      // Or without specifying column values (less efficient)
      df.groupBy("year").pivot("course").sum("earnings")
  
  From Spark 3.0.0, values can be literal columns, for instance, struct. For pivoting by multiple columns, use the struct function to combine the columns and values:
  
      df.groupBy("year")
        .pivot("trainingCourse", Seq(struct(lit("java"), lit("Experts"))))
        .agg(sum($"earnings"))

  Parameters:
  
  pivotColumn - Name of the column to pivot.
  
  values - List of values that will be translated to columns in the output DataFrame.
  
  Returns:
  
  (undocumented)
  
  Since:
  
  1.6.0
  
  See Also:
  
  org.apache.spark.sql.Dataset.unpivot for the reverse operation, except for the aggregation.
- pivot
  
  public RelationalGroupedDataset pivot(String pivotColumn, List<Object> values)
  (Java-specific) Pivots a column of the current DataFrame and performs the specified aggregation.
  There are two versions of the pivot function: one that requires the caller to specify the list of distinct values to pivot on, and one that does not. The latter is more concise but less efficient, because Spark must first compute the list of distinct values internally.

      // Compute the sum of earnings for each year by course with each course as a separate column
      df.groupBy("year").pivot("course", Arrays.<Object>asList("dotNET", "Java")).sum("earnings");

      // Or without specifying column values (less efficient)
      df.groupBy("year").pivot("course").sum("earnings");

  Parameters:
  
  pivotColumn - Name of the column to pivot.
  
  values - List of values that will be translated to columns in the output DataFrame.
  
  Returns:
  
  (undocumented)
  
  Since:
  
  1.6.0
  
  See Also:
  
  org.apache.spark.sql.Dataset.unpivot for the reverse operation, except for the aggregation.
- pivot
  
  public RelationalGroupedDataset pivot(Column pivotColumn)
  Pivots a column of the current DataFrame and performs the specified aggregation.
  Spark will eagerly compute the distinct values in pivotColumn so it can determine the resulting schema of the transformation. To avoid any eager computations, provide an explicit list of values via pivot(pivotColumn: Column, values: Seq[Any]).
  
      // Compute the sum of earnings for each year by course with each course as a separate column
      // (sum takes column names as strings, not Column objects)
      df.groupBy($"year").pivot($"course").sum("earnings")

  Parameters:
  
  pivotColumn - the column to pivot.
  
  Returns:
  
  (undocumented)
  
  Since:
  
  2.4.0
  
  See Also:
  
  org.apache.spark.sql.Dataset.unpivot for the reverse operation, except for the aggregation.
- pivot
  
  public RelationalGroupedDataset pivot(Column pivotColumn, scala.collection.immutable.Seq<Object> values)
  Pivots a column of the current DataFrame and performs the specified aggregation. This is an overloaded version of the pivot method with pivotColumn of the String type.
  
      // Compute the sum of earnings for each year by course with each course as a separate column
      // (sum takes column names as strings, not Column objects)
      df.groupBy($"year").pivot($"course", Seq("dotNET", "Java")).sum("earnings")

  Parameters:
  
  pivotColumn - the column to pivot.
  
  values - List of values that will be translated to columns in the output DataFrame.
  
  Returns:
  
  (undocumented)
  
  Since:
  
  2.4.0
  
  See Also:
  
  org.apache.spark.sql.Dataset.unpivot for the reverse operation, except for the aggregation.
- pivot
  
  public RelationalGroupedDataset pivot(Column pivotColumn, List<Object> values)
  
  (Java-specific) Pivots a column of the current DataFrame and performs the specified aggregation. This is an overloaded version of the pivot method with pivotColumn of the String type.
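
  This overload has no example of its own, so here is a minimal Java sketch modeled on the String-based Java example. It assumes a local SparkSession and a made-up (year, course, earnings) table; the class name and sample values are hypothetical.

  ```java
  import static org.apache.spark.sql.functions.col;

  import java.util.Arrays;
  import java.util.List;

  import org.apache.spark.sql.Dataset;
  import org.apache.spark.sql.Row;
  import org.apache.spark.sql.RowFactory;
  import org.apache.spark.sql.SparkSession;
  import org.apache.spark.sql.types.DataTypes;
  import org.apache.spark.sql.types.StructType;

  public class PivotByColumnExample {
      // Returns the column names of the pivoted result.
      static List<String> pivotedColumnNames() {
          SparkSession spark = SparkSession.builder()
              .master("local[1]")
              .appName("PivotByColumnExample")
              .getOrCreate();
          try {
              // Hypothetical schema and rows, used only for illustration.
              StructType schema = new StructType()
                  .add("year", DataTypes.IntegerType)
                  .add("course", DataTypes.StringType)
                  .add("earnings", DataTypes.IntegerType);
              List<Row> rows = Arrays.asList(
                  RowFactory.create(2012, "dotNET", 10000),
                  RowFactory.create(2012, "Java", 20000),
                  RowFactory.create(2013, "dotNET", 5000));
              Dataset<Row> df = spark.createDataFrame(rows, schema);

              // The pivot values are given explicitly, so Spark does not
              // need to compute the distinct values of "course" first.
              Dataset<Row> pivoted = df.groupBy(col("year"))
                  .pivot(col("course"), Arrays.<Object>asList("dotNET", "Java"))
                  .sum("earnings");

              pivoted.show();
              return Arrays.asList(pivoted.columns());
          } finally {
              spark.stop();
          }
      }

      public static void main(String[] args) {
          System.out.println(pivotedColumnNames());
      }
  }
  ```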
  Parameters:
  
  pivotColumn - the column to pivot.
  
  values - List of values that will be translated to columns in the output DataFrame.
  
  Returns:
  
  (undocumented)
  
  Since:
  
  2.4.0
  
  See Also:
  
  org.apache.spark.sql.Dataset.unpivot for the reverse operation, except for the aggregation.
- toString
  
  public String toString()
  
  Overrides:
  
  toString in class Object
