abstract class RelationalGroupedDataset extends AnyRef
A set of methods for aggregations on a DataFrame, created by groupBy,
cube or rollup (and also pivot).
The main method is the agg function, which has multiple variants. This class also contains
some first-order statistics such as mean, sum for convenience.
- Annotations
- @Stable()
- Source
- RelationalGroupedDataset.scala
- Since
- 2.0.0 
- Note
- This class was named - GroupedDatain Spark 1.x.
- Alphabetic
- By Inheritance
- RelationalGroupedDataset
- AnyRef
- Any
- Hide All
- Show All
- Public
- Protected
Instance Constructors
-  new RelationalGroupedDataset()
Abstract Value Members
-   abstract  def as[K, T](implicit arg0: Encoder[K], arg1: Encoder[T]): KeyValueGroupedDataset[K, T]Returns a KeyValueGroupedDatasetwhere the data is grouped by the grouping expressions of currentRelationalGroupedDataset.Returns a KeyValueGroupedDatasetwhere the data is grouped by the grouping expressions of currentRelationalGroupedDataset.- Since
- 3.0.0 
 
-   abstract  def df: DataFrame- Attributes
- protected
 
-   abstract  def pivot(pivotColumn: Column, values: Seq[Any]): RelationalGroupedDatasetPivots a column of the current DataFrameand performs the specified aggregation.Pivots a column of the current DataFrameand performs the specified aggregation. This is an overloaded version of thepivotmethod withpivotColumnof theStringtype.// Compute the sum of earnings for each year by course with each course as a separate column df.groupBy($"year").pivot($"course", Seq("dotNET", "Java")).sum($"earnings") - pivotColumn
- the column to pivot. 
- values
- List of values that will be translated to columns in the output DataFrame. 
 - Since
- 2.4.0 
- See also
- org.apache.spark.sql.Dataset.unpivotfor the reverse operation, except for the aggregation.
 
-   abstract  def pivot(pivotColumn: Column): RelationalGroupedDatasetPivots a column of the current DataFrameand performs the specified aggregation.Pivots a column of the current DataFrameand performs the specified aggregation.Spark will eagerly compute the distinct values in pivotColumnso it can determine the resulting schema of the transformation. To avoid any eager computations, provide an explicit list of values viapivot(pivotColumn: Column, values: Seq[Any]).// Compute the sum of earnings for each year by course with each course as a separate column df.groupBy($"year").pivot($"course").sum($"earnings"); - pivotColumn
- he column to pivot. 
 - Since
- 2.4.0 
- See also
- org.apache.spark.sql.Dataset.unpivotfor the reverse operation, except for the aggregation.
 
-   abstract  def selectNumericColumns(colNames: Seq[String]): Seq[Column]- Attributes
- protected
 
-   abstract  def toDF(aggCols: Seq[Column]): DataFrameCreate a aggregation based on the grouping column, the grouping type, and the aggregations. Create a aggregation based on the grouping column, the grouping type, and the aggregations. - Attributes
- protected
 
Concrete Value Members
-   final  def !=(arg0: Any): Boolean- Definition Classes
- AnyRef → Any
 
-   final  def ##: Int- Definition Classes
- AnyRef → Any
 
-   final  def ==(arg0: Any): Boolean- Definition Classes
- AnyRef → Any
 
-    def agg(expr: Column, exprs: Column*): DataFrameCompute aggregates by specifying a series of aggregate columns. Compute aggregates by specifying a series of aggregate columns. Note that this function by default retains the grouping columns in its output. To not retain grouping columns, set spark.sql.retainGroupColumnsto false.The available aggregate methods are defined in org.apache.spark.sql.functions. // Selects the age of the oldest employee and the aggregate expense for each department // Scala: import org.apache.spark.sql.functions._ df.groupBy("department").agg(max("age"), sum("expense")) // Java: import static org.apache.spark.sql.functions.*; df.groupBy("department").agg(max("age"), sum("expense")); Note that before Spark 1.4, the default behavior is to NOT retain grouping columns. To change to that behavior, set config variable spark.sql.retainGroupColumnstofalse.// Scala, 1.3.x: df.groupBy("department").agg($"department", max("age"), sum("expense")) // Java, 1.3.x: df.groupBy("department").agg(col("department"), max("age"), sum("expense")); - Annotations
- @varargs()
- Since
- 1.3.0 
 
-    def agg(exprs: Map[String, String]): DataFrame(Java-specific) Compute aggregates by specifying a map from column name to aggregate methods. (Java-specific) Compute aggregates by specifying a map from column name to aggregate methods. The resulting DataFramewill also contain the grouping columns.The available aggregate methods are avg,max,min,sum,count.// Selects the age of the oldest employee and the aggregate expense for each department import com.google.common.collect.ImmutableMap; df.groupBy("department").agg(ImmutableMap.of("age", "max", "expense", "sum")); - Since
- 1.3.0 
 
-    def agg(exprs: Map[String, String]): DataFrame(Scala-specific) Compute aggregates by specifying a map from column name to aggregate methods. (Scala-specific) Compute aggregates by specifying a map from column name to aggregate methods. The resulting DataFramewill also contain the grouping columns.The available aggregate methods are avg,max,min,sum,count.// Selects the age of the oldest employee and the aggregate expense for each department df.groupBy("department").agg(Map( "age" -> "max", "expense" -> "sum" )) - Since
- 1.3.0 
 
-    def agg(aggExpr: (String, String), aggExprs: (String, String)*): DataFrame(Scala-specific) Compute aggregates by specifying the column names and aggregate methods. (Scala-specific) Compute aggregates by specifying the column names and aggregate methods. The resulting DataFramewill also contain the grouping columns.The available aggregate methods are avg,max,min,sum,count.// Selects the age of the oldest employee and the aggregate expense for each department df.groupBy("department").agg( "age" -> "max", "expense" -> "sum" ) - Since
- 1.3.0 
 
-   final  def asInstanceOf[T0]: T0- Definition Classes
- Any
 
-    def avg(colNames: String*): DataFrameCompute the mean value for each numeric columns for each group. Compute the mean value for each numeric columns for each group. The resulting DataFramewill also contain the grouping columns. When specified columns are given, only compute the mean values for them.- Annotations
- @varargs()
- Since
- 1.3.0 
 
-    def clone(): AnyRef- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.CloneNotSupportedException]) @IntrinsicCandidate() @native()
 
-    def count(): DataFrameCount the number of rows for each group. Count the number of rows for each group. The resulting DataFramewill also contain the grouping columns.- Since
- 1.3.0 
 
-   final  def eq(arg0: AnyRef): Boolean- Definition Classes
- AnyRef
 
-    def equals(arg0: AnyRef): Boolean- Definition Classes
- AnyRef → Any
 
-   final  def getClass(): Class[_ <: AnyRef]- Definition Classes
- AnyRef → Any
- Annotations
- @IntrinsicCandidate() @native()
 
-    def hashCode(): Int- Definition Classes
- AnyRef → Any
- Annotations
- @IntrinsicCandidate() @native()
 
-   final  def isInstanceOf[T0]: Boolean- Definition Classes
- Any
 
-    def max(colNames: String*): DataFrameCompute the max value for each numeric columns for each group. Compute the max value for each numeric columns for each group. The resulting DataFramewill also contain the grouping columns. When specified columns are given, only compute the max values for them.- Annotations
- @varargs()
- Since
- 1.3.0 
 
-    def mean(colNames: String*): DataFrameCompute the average value for each numeric columns for each group. Compute the average value for each numeric columns for each group. This is an alias for avg. The resultingDataFramewill also contain the grouping columns. When specified columns are given, only compute the average values for them.- Annotations
- @varargs()
- Since
- 1.3.0 
 
-    def min(colNames: String*): DataFrameCompute the min value for each numeric column for each group. Compute the min value for each numeric column for each group. The resulting DataFramewill also contain the grouping columns. When specified columns are given, only compute the min values for them.- Annotations
- @varargs()
- Since
- 1.3.0 
 
-   final  def ne(arg0: AnyRef): Boolean- Definition Classes
- AnyRef
 
-   final  def notify(): Unit- Definition Classes
- AnyRef
- Annotations
- @IntrinsicCandidate() @native()
 
-   final  def notifyAll(): Unit- Definition Classes
- AnyRef
- Annotations
- @IntrinsicCandidate() @native()
 
-    def pivot(pivotColumn: Column, values: List[Any]): RelationalGroupedDataset(Java-specific) Pivots a column of the current DataFrameand performs the specified aggregation.(Java-specific) Pivots a column of the current DataFrameand performs the specified aggregation. This is an overloaded version of thepivotmethod withpivotColumnof theStringtype.- pivotColumn
- the column to pivot. 
- values
- List of values that will be translated to columns in the output DataFrame. 
 - Since
- 2.4.0 
- See also
- org.apache.spark.sql.Dataset.unpivotfor the reverse operation, except for the aggregation.
 
-    def pivot(pivotColumn: String, values: List[Any]): RelationalGroupedDataset(Java-specific) Pivots a column of the current DataFrameand performs the specified aggregation.(Java-specific) Pivots a column of the current DataFrameand performs the specified aggregation.There are two versions of pivot function: one that requires the caller to specify the list of distinct values to pivot on, and one that does not. The latter is more concise but less efficient, because Spark needs to first compute the list of distinct values internally. // Compute the sum of earnings for each year by course with each course as a separate column df.groupBy("year").pivot("course", Arrays.<Object>asList("dotNET", "Java")).sum("earnings"); // Or without specifying column values (less efficient) df.groupBy("year").pivot("course").sum("earnings"); - pivotColumn
- Name of the column to pivot. 
- values
- List of values that will be translated to columns in the output DataFrame. 
 - Since
- 1.6.0 
- See also
- org.apache.spark.sql.Dataset.unpivotfor the reverse operation, except for the aggregation.
 
-    def pivot(pivotColumn: String, values: Seq[Any]): RelationalGroupedDatasetPivots a column of the current DataFrameand performs the specified aggregation.Pivots a column of the current DataFrameand performs the specified aggregation. There are two versions of pivot function: one that requires the caller to specify the list of distinct values to pivot on, and one that does not. The latter is more concise but less efficient, because Spark needs to first compute the list of distinct values internally.// Compute the sum of earnings for each year by course with each course as a separate column df.groupBy("year").pivot("course", Seq("dotNET", "Java")).sum("earnings") // Or without specifying column values (less efficient) df.groupBy("year").pivot("course").sum("earnings") From Spark 3.0.0, values can be literal columns, for instance, struct. For pivoting by multiple columns, use the structfunction to combine the columns and values:df.groupBy("year") .pivot("trainingCourse", Seq(struct(lit("java"), lit("Experts")))) .agg(sum($"earnings")) - pivotColumn
- Name of the column to pivot. 
- values
- List of values that will be translated to columns in the output DataFrame. 
 - Since
- 1.6.0 
- See also
- org.apache.spark.sql.Dataset.unpivotfor the reverse operation, except for the aggregation.
 
-    def pivot(pivotColumn: String): RelationalGroupedDatasetPivots a column of the current DataFrameand performs the specified aggregation.Pivots a column of the current DataFrameand performs the specified aggregation.Spark will eagerly compute the distinct values in pivotColumnso it can determine the resulting schema of the transformation. To avoid any eager computations, provide an explicit list of values viapivot(pivotColumn: String, values: Seq[Any]).// Compute the sum of earnings for each year by course with each course as a separate column df.groupBy("year").pivot("course").sum("earnings") - pivotColumn
- Name of the column to pivot. 
 - Since
- 1.6.0 
- See also
- org.apache.spark.sql.Dataset.unpivotfor the reverse operation, except for the aggregation.
 
-    def sum(colNames: String*): DataFrameCompute the sum for each numeric columns for each group. Compute the sum for each numeric columns for each group. The resulting DataFramewill also contain the grouping columns. When specified columns are given, only compute the sum for them.- Annotations
- @varargs()
- Since
- 1.3.0 
 
-   final  def synchronized[T0](arg0: => T0): T0- Definition Classes
- AnyRef
 
-    def toString(): String- Definition Classes
- AnyRef → Any
 
-   final  def wait(arg0: Long, arg1: Int): Unit- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException])
 
-   final  def wait(arg0: Long): Unit- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException]) @native()
 
-   final  def wait(): Unit- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException])
 
Deprecated Value Members
-    def finalize(): Unit- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.Throwable]) @Deprecated
- Deprecated
- (Since version 9)