Class RelationalGroupedDataset
A set of methods for aggregations on a DataFrame, created by groupBy, cube or rollup (and also pivot).
The main method is the agg function, which has multiple variants. This class also contains some first-order statistics, such as mean and sum, for convenience.
- Since:
- 2.0.0
- Note:
- This class was named GroupedData in Spark 1.x.
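As a rough illustration of how instances of this class come about, the sketch below builds one with each of groupBy, rollup and cube and then aggregates it. The local session, the `sales` data and all value names are assumptions made for the example. It is written as a script (for instance for spark-shell) and is not taken from the Spark sources.

```scala
import org.apache.spark.sql.{RelationalGroupedDataset, SparkSession}
import org.apache.spark.sql.functions.sum

// Hypothetical local session and sample data, for illustration only.
val spark = SparkSession.builder().master("local[1]").appName("grouping-sketch").getOrCreate()
import spark.implicits._

val sales = Seq(
  ("2012", "dotNET", 10000),
  ("2012", "Java", 20000),
  ("2013", "Java", 30000)
).toDF("year", "course", "earnings")

// Each call returns a RelationalGroupedDataset; no aggregation has run yet.
val grouped: RelationalGroupedDataset = sales.groupBy("year")
val rolledUp: RelationalGroupedDataset = sales.rollup("year", "course")
val cubed: RelationalGroupedDataset = sales.cube("year", "course")

// An aggregation turns the grouped data back into a DataFrame.
val perYear = grouped.agg(sum("earnings"))          // one row per year
val withSubtotals = rolledUp.agg(sum("earnings"))   // adds per-year subtotals and a grand total
val allCombinations = cubed.agg(sum("earnings"))    // adds per-course subtotals as well
```

The three grouped values differ only in which grouping sets they aggregate over; the agg call is the same in each case.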
-
Nested Class Summary
Modifier and Type, Class, Description
static class - To indicate it's the CUBE
static class - To indicate it's the GroupBy
static class
static interface RelationalGroupedDataset.GroupType - The Grouping Type
static class
static class - To indicate it's the ROLLUP
-
Method Summary
Modifier and Type, Method, Description
agg - (Java-specific) Compute aggregates by specifying a map from column name to aggregate methods.
agg - Compute aggregates by specifying a series of aggregate columns.
agg - Compute aggregates by specifying a series of aggregate columns.
agg - (Scala-specific) Compute aggregates by specifying a map from column name to aggregate methods.
agg(scala.Tuple2<String, String> aggExpr, scala.collection.immutable.Seq<scala.Tuple2<String, String>> aggExprs) - (Scala-specific) Compute aggregates by specifying the column names and aggregate methods.
static RelationalGroupedDataset apply(Dataset<Row> df, scala.collection.immutable.Seq<org.apache.spark.sql.catalyst.expressions.Expression> groupingExprs, RelationalGroupedDataset.GroupType groupType)
<K, T> KeyValueGroupedDataset<K, T> as - Returns a KeyValueGroupedDataset where the data is grouped by the grouping expressions of the current RelationalGroupedDataset.
avg - Compute the mean value for each numeric column for each group.
avg - Compute the mean value for each numeric column for each group.
count() - Count the number of rows for each group.
max - Compute the max value for each numeric column for each group.
max - Compute the max value for each numeric column for each group.
mean - Compute the average value for each numeric column for each group.
mean - Compute the average value for each numeric column for each group.
min - Compute the min value for each numeric column for each group.
min - Compute the min value for each numeric column for each group.
pivot - Pivots a column of the current DataFrame and performs the specified aggregation.
pivot - (Java-specific) Pivots a column of the current DataFrame and performs the specified aggregation.
pivot - Pivots a column of the current DataFrame and performs the specified aggregation.
pivot - Pivots a column of the current DataFrame and performs the specified aggregation.
pivot - (Java-specific) Pivots a column of the current DataFrame and performs the specified aggregation.
pivot - Pivots a column of the current DataFrame and performs the specified aggregation.
sum - Compute the sum for each numeric column for each group.
sum - Compute the sum for each numeric column for each group.
toString()
-
Method Details
-
apply
public static RelationalGroupedDataset apply(Dataset<Row> df, scala.collection.immutable.Seq<org.apache.spark.sql.catalyst.expressions.Expression> groupingExprs, RelationalGroupedDataset.GroupType groupType) -
agg
Description copied from class: RelationalGroupedDataset
Compute aggregates by specifying a series of aggregate columns. By default, this function retains the grouping columns in its output. To drop the grouping columns, set spark.sql.retainGroupColumns to false.
The available aggregate methods are defined in functions.
    // Selects the age of the oldest employee and the aggregate expense for each department
    // Scala:
    import org.apache.spark.sql.functions._
    df.groupBy("department").agg(max("age"), sum("expense"))

    // Java:
    import static org.apache.spark.sql.functions.*;
    df.groupBy("department").agg(max("age"), sum("expense"));
Note that before Spark 1.4, the default behavior was to NOT retain grouping columns. To restore that behavior, set the config variable spark.sql.retainGroupColumns to false.
    // Scala, 1.3.x:
    df.groupBy("department").agg($"department", max("age"), sum("expense"))

    // Java, 1.3.x:
    df.groupBy("department").agg(col("department"), max("age"), sum("expense"));
- Overrides:
agg in class RelationalGroupedDataset<Dataset>
- Parameters:
expr - (undocumented)
exprs - (undocumented)
- Returns:
- (undocumented)
- Inheritdoc:
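To make the effect of spark.sql.retainGroupColumns concrete, the following sketch compares the output columns with the flag at its default and with it set to false. The session setup and the sample `df` are assumptions made for the example; only the config key and the agg call come from the description above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{max, sum}

// Hypothetical local session and employee data, for illustration only.
val spark = SparkSession.builder().master("local[1]").appName("agg-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  ("sales", 34, 120.0),
  ("sales", 51, 80.0),
  ("engineering", 45, 200.0)
).toDF("department", "age", "expense")

// Default setting: the grouping column is kept as the first output column.
val retained = df.groupBy("department").agg(max("age"), sum("expense"))

// With the flag off, only the aggregate columns remain.
spark.conf.set("spark.sql.retainGroupColumns", "false")
val dropped = df.groupBy("department").agg(max("age"), sum("expense"))

// Restore the default so later queries in the session are unaffected.
spark.conf.set("spark.sql.retainGroupColumns", "true")
```

Comparing `retained.columns` with `dropped.columns` shows the difference without running the aggregation itself.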
-
mean
Description copied from class: RelationalGroupedDataset
Compute the average value for each numeric column for each group. This is an alias for avg. The resulting DataFrame will also contain the grouping columns. When column names are specified, averages are computed only for those columns.
- Overrides:
mean in class RelationalGroupedDataset<Dataset>
- Parameters:
colNames - (undocumented)
- Returns:
- (undocumented)
- Inheritdoc:
-
max
Description copied from class: RelationalGroupedDataset
Compute the max value for each numeric column for each group. The resulting DataFrame will also contain the grouping columns. When column names are specified, max values are computed only for those columns.
- Overrides:
max in class RelationalGroupedDataset<Dataset>
- Parameters:
colNames - (undocumented)
- Returns:
- (undocumented)
- Inheritdoc:
-
avg
Description copied from class: RelationalGroupedDataset
Compute the mean value for each numeric column for each group. The resulting DataFrame will also contain the grouping columns. When column names are specified, mean values are computed only for those columns.
- Overrides:
avg in class RelationalGroupedDataset<Dataset>
- Parameters:
colNames - (undocumented)
- Returns:
- (undocumented)
- Inheritdoc:
-
min
Description copied from class: RelationalGroupedDataset
Compute the min value for each numeric column for each group. The resulting DataFrame will also contain the grouping columns. When column names are specified, min values are computed only for those columns.
- Overrides:
min in class RelationalGroupedDataset<Dataset>
- Parameters:
colNames - (undocumented)
- Returns:
- (undocumented)
- Inheritdoc:
-
sum
Description copied from class: RelationalGroupedDataset
Compute the sum for each numeric column for each group. The resulting DataFrame will also contain the grouping columns. When column names are specified, sums are computed only for those columns.
- Overrides:
sum in class RelationalGroupedDataset<Dataset>
- Parameters:
colNames - (undocumented)
- Returns:
- (undocumented)
- Inheritdoc:
-
as
Description copied from class: RelationalGroupedDataset
Returns a KeyValueGroupedDataset where the data is grouped by the grouping expressions of the current RelationalGroupedDataset.
- Specified by:
as in class RelationalGroupedDataset<Dataset>
- Parameters:
evidence$1 - (undocumented)
evidence$2 - (undocumented)
- Returns:
- (undocumented)
- Inheritdoc:
-
agg
public Dataset<Row> agg(scala.Tuple2<String, String> aggExpr, scala.collection.immutable.Seq<scala.Tuple2<String, String>> aggExprs)
Description copied from class: RelationalGroupedDataset
(Scala-specific) Compute aggregates by specifying the column names and aggregate methods. The resulting DataFrame will also contain the grouping columns.
The available aggregate methods are avg, max, min, sum, count.
    // Selects the age of the oldest employee and the aggregate expense for each department
    df.groupBy("department").agg(
      "age" -> "max",
      "expense" -> "sum"
    )
- Overrides:
agg in class RelationalGroupedDataset<Dataset>
- Parameters:
aggExpr - (undocumented)
aggExprs - (undocumented)
- Returns:
- (undocumented)
- Inheritdoc:
-
agg
Description copied from class: RelationalGroupedDataset
(Scala-specific) Compute aggregates by specifying a map from column name to aggregate methods. The resulting DataFrame will also contain the grouping columns.
The available aggregate methods are avg, max, min, sum, count.
    // Selects the age of the oldest employee and the aggregate expense for each department
    df.groupBy("department").agg(Map(
      "age" -> "max",
      "expense" -> "sum"
    ))
- Overrides:
agg in class RelationalGroupedDataset<Dataset>
- Parameters:
exprs - (undocumented)
- Returns:
- (undocumented)
- Inheritdoc:
-
agg
Description copied from class: RelationalGroupedDataset
(Java-specific) Compute aggregates by specifying a map from column name to aggregate methods. The resulting DataFrame will also contain the grouping columns.
The available aggregate methods are avg, max, min, sum, count.
    // Selects the age of the oldest employee and the aggregate expense for each department
    import com.google.common.collect.ImmutableMap;
    df.groupBy("department").agg(ImmutableMap.of("age", "max", "expense", "sum"));
- Overrides:
agg in class RelationalGroupedDataset<Dataset>
- Parameters:
exprs - (undocumented)
- Returns:
- (undocumented)
- Inheritdoc:
-
agg
Description copied from class: RelationalGroupedDataset
Compute aggregates by specifying a series of aggregate columns. By default, this function retains the grouping columns in its output. To drop the grouping columns, set spark.sql.retainGroupColumns to false.
The available aggregate methods are defined in functions.
    // Selects the age of the oldest employee and the aggregate expense for each department
    // Scala:
    import org.apache.spark.sql.functions._
    df.groupBy("department").agg(max("age"), sum("expense"))

    // Java:
    import static org.apache.spark.sql.functions.*;
    df.groupBy("department").agg(max("age"), sum("expense"));
Note that before Spark 1.4, the default behavior was to NOT retain grouping columns. To restore that behavior, set the config variable spark.sql.retainGroupColumns to false.
    // Scala, 1.3.x:
    df.groupBy("department").agg($"department", max("age"), sum("expense"))

    // Java, 1.3.x:
    df.groupBy("department").agg(col("department"), max("age"), sum("expense"));
- Overrides:
agg in class RelationalGroupedDataset<Dataset>
- Parameters:
expr - (undocumented)
exprs - (undocumented)
- Returns:
- (undocumented)
- Inheritdoc:
-
count
Description copied from class: RelationalGroupedDataset
Count the number of rows for each group. The resulting DataFrame will also contain the grouping columns.
- Overrides:
count in class RelationalGroupedDataset<Dataset>
- Returns:
- (undocumented)
- Inheritdoc:
-
mean
Description copied from class: RelationalGroupedDataset
Compute the average value for each numeric column for each group. This is an alias for avg. The resulting DataFrame will also contain the grouping columns. When column names are specified, averages are computed only for those columns.
- Overrides:
mean in class RelationalGroupedDataset<Dataset>
- Parameters:
colNames - (undocumented)
- Returns:
- (undocumented)
- Inheritdoc:
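The difference between calling mean with and without column names can be sketched as follows. The session and the sample `df` are assumptions made for the example, and the same pattern should apply to avg, max, min and sum.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local session and employee data, for illustration only.
val spark = SparkSession.builder().master("local[1]").appName("mean-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  ("sales", 30, 100.0),
  ("sales", 50, 300.0),
  ("engineering", 45, 200.0)
).toDF("department", "age", "expense")

// No column names: every numeric column (age and expense) is averaged.
val allMeans = df.groupBy("department").mean()

// Explicit column names: only the listed columns are averaged.
val ageOnly = df.groupBy("department").mean("age")

// Collect the per-department average age into a plain Scala map.
val avgAgeByDept: Map[String, Double] =
  ageOnly.collect().map(row => row.getString(0) -> row.getDouble(1)).toMap
```

Because mean is an alias for avg, the result columns are named with the avg prefix, for example avg(age).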
-
max
Description copied from class: RelationalGroupedDataset
Compute the max value for each numeric column for each group. The resulting DataFrame will also contain the grouping columns. When column names are specified, max values are computed only for those columns.
- Overrides:
max in class RelationalGroupedDataset<Dataset>
- Parameters:
colNames - (undocumented)
- Returns:
- (undocumented)
- Inheritdoc:
-
avg
Description copied from class: RelationalGroupedDataset
Compute the mean value for each numeric column for each group. The resulting DataFrame will also contain the grouping columns. When column names are specified, mean values are computed only for those columns.
- Overrides:
avg in class RelationalGroupedDataset<Dataset>
- Parameters:
colNames - (undocumented)
- Returns:
- (undocumented)
- Inheritdoc:
-
min
Description copied from class: RelationalGroupedDataset
Compute the min value for each numeric column for each group. The resulting DataFrame will also contain the grouping columns. When column names are specified, min values are computed only for those columns.
- Overrides:
min in class RelationalGroupedDataset<Dataset>
- Parameters:
colNames - (undocumented)
- Returns:
- (undocumented)
- Inheritdoc:
-
sum
Description copied from class: RelationalGroupedDataset
Compute the sum for each numeric column for each group. The resulting DataFrame will also contain the grouping columns. When column names are specified, sums are computed only for those columns.
- Overrides:
sum in class RelationalGroupedDataset<Dataset>
- Parameters:
colNames - (undocumented)
- Returns:
- (undocumented)
- Inheritdoc:
-
pivot
Description copied from class: RelationalGroupedDataset
Pivots a column of the current DataFrame and performs the specified aggregation.
Spark will eagerly compute the distinct values in pivotColumn so it can determine the resulting schema of the transformation. To avoid any eager computations, provide an explicit list of values via pivot(pivotColumn: String, values: Seq[Any]).
    // Compute the sum of earnings for each year by course with each course as a separate column
    df.groupBy("year").pivot("course").sum("earnings")
- Overrides:
pivot in class RelationalGroupedDataset<Dataset>
- Parameters:
pivotColumn - Name of the column to pivot.
- Returns:
- (undocumented)
- See Also:
- org.apache.spark.sql.Dataset.unpivot for the reverse operation, except for the aggregation.
- Inheritdoc:
-
pivot
public RelationalGroupedDataset pivot(String pivotColumn, scala.collection.immutable.Seq<Object> values)
Description copied from class: RelationalGroupedDataset
Pivots a column of the current DataFrame and performs the specified aggregation. There are two versions of the pivot function: one that requires the caller to specify the list of distinct values to pivot on, and one that does not. The latter is more concise but less efficient, because Spark needs to first compute the list of distinct values internally.
    // Compute the sum of earnings for each year by course with each course as a separate column
    df.groupBy("year").pivot("course", Seq("dotNET", "Java")).sum("earnings")

    // Or without specifying column values (less efficient)
    df.groupBy("year").pivot("course").sum("earnings")
From Spark 3.0.0, values can be literal columns, for instance, struct. For pivoting by multiple columns, use the struct function to combine the columns and values:
    df.groupBy("year")
      .pivot("trainingCourse", Seq(struct(lit("java"), lit("Experts"))))
      .agg(sum($"earnings"))
- Overrides:
pivot in class RelationalGroupedDataset<Dataset>
- Parameters:
pivotColumn - Name of the column to pivot.
values - List of values that will be translated to columns in the output DataFrame.
- Returns:
- (undocumented)
- See Also:
- org.apache.spark.sql.Dataset.unpivot for the reverse operation, except for the aggregation.
- Inheritdoc:
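For a concrete picture of the output shape, here is a small end-to-end sketch of the explicit-values variant. The session and the `sales` rows are made up for the example; the column names follow the snippet above.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local session and sales data, for illustration only.
val spark = SparkSession.builder().master("local[1]").appName("pivot-sketch").getOrCreate()
import spark.implicits._

val sales = Seq(
  (2012, "dotNET", 10000),
  (2012, "Java", 20000),
  (2012, "dotNET", 5000),
  (2013, "dotNET", 48000),
  (2013, "Java", 30000)
).toDF("year", "course", "earnings")

// Listing the pivot values up front fixes the output schema
// (year, dotNET, Java) without a separate pass over the data.
val pivoted = sales
  .groupBy("year")
  .pivot("course", Seq("dotNET", "Java"))
  .sum("earnings")

// Bring the small result to the driver as (year, dotNET total, Java total).
val totals = pivoted
  .orderBy("year")
  .collect()
  .map(row => (row.getInt(0), row.getLong(1), row.getLong(2)))
  .toSeq
```

A course that is not in the supplied list is left out of the result, so the list also works as a filter on the pivot values.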
-
pivot
Description copied from class: RelationalGroupedDataset
(Java-specific) Pivots a column of the current DataFrame and performs the specified aggregation.
There are two versions of the pivot function: one that requires the caller to specify the list of distinct values to pivot on, and one that does not. The latter is more concise but less efficient, because Spark needs to first compute the list of distinct values internally.
    // Compute the sum of earnings for each year by course with each course as a separate column
    df.groupBy("year").pivot("course", Arrays.<Object>asList("dotNET", "Java")).sum("earnings");

    // Or without specifying column values (less efficient)
    df.groupBy("year").pivot("course").sum("earnings");
- Overrides:
pivot in class RelationalGroupedDataset<Dataset>
- Parameters:
pivotColumn - Name of the column to pivot.
values - List of values that will be translated to columns in the output DataFrame.
- Returns:
- (undocumented)
- See Also:
- org.apache.spark.sql.Dataset.unpivot for the reverse operation, except for the aggregation.
- Inheritdoc:
-
pivot
Description copied from class: RelationalGroupedDataset
(Java-specific) Pivots a column of the current DataFrame and performs the specified aggregation. This is an overloaded version of the pivot method with pivotColumn of the String type.
- Overrides:
pivot in class RelationalGroupedDataset<Dataset>
- Parameters:
pivotColumn - the column to pivot.
values - List of values that will be translated to columns in the output DataFrame.
- Returns:
- (undocumented)
- See Also:
- org.apache.spark.sql.Dataset.unpivot for the reverse operation, except for the aggregation.
- Inheritdoc:
-
pivot
Description copied from class: RelationalGroupedDataset
Pivots a column of the current DataFrame and performs the specified aggregation.
Spark will eagerly compute the distinct values in pivotColumn so it can determine the resulting schema of the transformation. To avoid any eager computations, provide an explicit list of values via pivot(pivotColumn: Column, values: Seq[Any]).
    // Compute the sum of earnings for each year by course with each course as a separate column
    // (sum takes column names, so "earnings" is passed as a String)
    df.groupBy($"year").pivot($"course").sum("earnings")
- Specified by:
pivot in class RelationalGroupedDataset<Dataset>
- Parameters:
pivotColumn - the column to pivot.
- Returns:
- (undocumented)
- See Also:
- org.apache.spark.sql.Dataset.unpivot for the reverse operation, except for the aggregation.
- Inheritdoc:
-
pivot
public RelationalGroupedDataset pivot(Column pivotColumn, scala.collection.immutable.Seq<Object> values)
Description copied from class: RelationalGroupedDataset
Pivots a column of the current DataFrame and performs the specified aggregation. This is an overloaded version of the pivot method with pivotColumn of the String type.
    // Compute the sum of earnings for each year by course with each course as a separate column
    // (sum takes column names, so "earnings" is passed as a String)
    df.groupBy($"year").pivot($"course", Seq("dotNET", "Java")).sum("earnings")
- Specified by:
pivot in class RelationalGroupedDataset<Dataset>
- Parameters:
pivotColumn - the column to pivot.
values - List of values that will be translated to columns in the output DataFrame.
- Returns:
- (undocumented)
- See Also:
- org.apache.spark.sql.Dataset.unpivot for the reverse operation, except for the aggregation.
- Inheritdoc:
-
toString
-