org.apache.spark.sql

RelationalGroupedDataset

class RelationalGroupedDataset extends sql.api.RelationalGroupedDataset[Dataset]

A set of methods for aggregations on a DataFrame, created by groupBy, cube or rollup (and also pivot).

The main method is the agg function, which has multiple variants. For convenience, this class also provides some first-order statistics such as mean and sum.

Annotations
@Stable()
Source
RelationalGroupedDataset.scala
Since

2.0.0

Note

This class was named GroupedData in Spark 1.x.

Linear Supertypes
  1. sql.api.RelationalGroupedDataset[Dataset]
  2. AnyRef
  3. Any

Instance Constructors

  1. new RelationalGroupedDataset(df: DataFrame, groupingExprs: Seq[Expression], groupType: GroupType)
    Attributes
    protected[sql]

Value Members

  1. final def !=(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  2. final def ##: Int
    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  4. def agg(expr: Column, exprs: Column*): DataFrame

    Compute aggregates by specifying a series of aggregate columns. By default, this function retains the grouping columns in its output. To drop the grouping columns, set spark.sql.retainGroupColumns to false.

    The available aggregate methods are defined in org.apache.spark.sql.functions.

    // Selects the age of the oldest employee and the aggregate expense for each department
    
    // Scala:
    import org.apache.spark.sql.functions._
    df.groupBy("department").agg(max("age"), sum("expense"))
    
    // Java:
    import static org.apache.spark.sql.functions.*;
    df.groupBy("department").agg(max("age"), sum("expense"));

    Note that before Spark 1.4, the default behavior was to NOT retain grouping columns. To restore that behavior, set the config variable spark.sql.retainGroupColumns to false. In 1.3.x, a grouping column had to be listed explicitly in agg to appear in the output:

    // Scala, 1.3.x:
    df.groupBy("department").agg($"department", max("age"), sum("expense"))
    
    // Java, 1.3.x:
    df.groupBy("department").agg(col("department"), max("age"), sum("expense"));
    Definition Classes
    RelationalGroupedDataset → RelationalGroupedDataset
    Annotations
    @varargs()
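
    The effect of spark.sql.retainGroupColumns can be seen by comparing the output columns under both settings. The following is a minimal sketch, assuming a local SparkSession and a small made-up dataset; the object name, the aggregatedColumns helper, and the sample rows are hypothetical and not part of the API.

    ```scala
    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions.{max, sum}

    object RetainGroupColumnsSketch {
      // Local session for illustration only (master and app name are assumptions).
      val spark: SparkSession = SparkSession.builder()
        .master("local[1]")
        .appName("retain-group-columns-sketch")
        .getOrCreate()
      import spark.implicits._

      // Hypothetical sample data: department, age, expense.
      val df: DataFrame = Seq(
        ("eng", 31, 120.0),
        ("eng", 45, 80.0),
        ("ops", 28, 40.0)
      ).toDF("department", "age", "expense")

      // Returns the output column names of the aggregation under the given setting.
      def aggregatedColumns(retain: Boolean): Seq[String] = {
        spark.conf.set("spark.sql.retainGroupColumns", retain.toString)
        try {
          df.groupBy("department").agg(max("age"), sum("expense")).columns.toSeq
        } finally {
          // Restore the default so later queries are unaffected.
          spark.conf.unset("spark.sql.retainGroupColumns")
        }
      }

      def main(args: Array[String]): Unit = {
        println(aggregatedColumns(retain = true))
        println(aggregatedColumns(retain = false))
        spark.stop()
      }
    }
    ```

    With the setting left at its default, the grouping column department should appear ahead of the aggregate columns; with the setting disabled, only the aggregate columns should remain.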
  5. def agg(exprs: java.util.Map[String, String]): DataFrame

    (Java-specific) Compute aggregates by specifying a map from column name to aggregate methods. The resulting DataFrame will also contain the grouping columns.

    The available aggregate methods are avg, max, min, sum, count.

    // Selects the age of the oldest employee and the aggregate expense for each department
    import com.google.common.collect.ImmutableMap;
    df.groupBy("department").agg(ImmutableMap.of("age", "max", "expense", "sum"));
    Definition Classes
    RelationalGroupedDataset → RelationalGroupedDataset
  6. def agg(exprs: Map[String, String]): DataFrame

    (Scala-specific) Compute aggregates by specifying a map from column name to aggregate methods. The resulting DataFrame will also contain the grouping columns.

    The available aggregate methods are avg, max, min, sum, count.

    // Selects the age of the oldest employee and the aggregate expense for each department
    df.groupBy("department").agg(Map(
      "age" -> "max",
      "expense" -> "sum"
    ))
    Definition Classes
    RelationalGroupedDataset → RelationalGroupedDataset
  7. def agg(aggExpr: (String, String), aggExprs: (String, String)*): DataFrame

    (Scala-specific) Compute aggregates by specifying the column names and aggregate methods. The resulting DataFrame will also contain the grouping columns.

    The available aggregate methods are avg, max, min, sum, count.

    // Selects the age of the oldest employee and the aggregate expense for each department
    df.groupBy("department").agg(
      "age" -> "max",
      "expense" -> "sum"
    )
    Definition Classes
    RelationalGroupedDataset → RelationalGroupedDataset
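
    The name-based variants generate the output column names themselves, whereas the Column-based variant accepts aliases. The sketch below contrasts the two, assuming a local SparkSession; the object name, sample rows, and alias names are hypothetical.

    ```scala
    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions.{max, sum}

    object AggNamingSketch {
      // Local session for illustration only (master and app name are assumptions).
      val spark: SparkSession = SparkSession.builder()
        .master("local[1]")
        .appName("agg-naming-sketch")
        .getOrCreate()
      import spark.implicits._

      // Hypothetical sample data: department, age, expense.
      val df: DataFrame = Seq(
        ("eng", 31, 120.0),
        ("eng", 45, 80.0),
        ("ops", 28, 40.0)
      ).toDF("department", "age", "expense")

      // Name-based variant: output columns are named after the function and its input.
      val byName: DataFrame =
        df.groupBy("department").agg("age" -> "max", "expense" -> "sum")

      // Column-based variant: aliases give the output columns explicit names.
      val byColumn: DataFrame = df.groupBy("department")
        .agg(max("age").as("oldest"), sum("expense").as("total_expense"))

      def main(args: Array[String]): Unit = {
        byName.printSchema()
        byColumn.printSchema()
        spark.stop()
      }
    }
    ```

    When downstream code refers to the aggregate columns by name, the aliased form tends to be easier to work with than generated names such as max(age).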
  8. def as[K, T](implicit arg0: Encoder[K], arg1: Encoder[T]): KeyValueGroupedDataset[K, T]

    Returns a KeyValueGroupedDataset where the data is grouped by the grouping expressions of the current RelationalGroupedDataset.

    Definition Classes
    RelationalGroupedDataset → RelationalGroupedDataset
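
    As an illustration of moving from untyped grouping to typed per-group logic, the sketch below groups a Dataset by one column and then applies mapGroups. It assumes a local SparkSession; the Employee case class, the object name, and the sample rows are hypothetical.

    ```scala
    import org.apache.spark.sql.{Dataset, KeyValueGroupedDataset, SparkSession}

    // Hypothetical row type, defined at the top level so that an encoder can be derived.
    case class Employee(department: String, age: Int, expense: Double)

    object AsKeyValueSketch {
      // Local session for illustration only (master and app name are assumptions).
      val spark: SparkSession = SparkSession.builder()
        .master("local[1]")
        .appName("as-key-value-sketch")
        .getOrCreate()
      import spark.implicits._

      val employees: Dataset[Employee] = Seq(
        Employee("eng", 31, 120.0),
        Employee("eng", 45, 80.0),
        Employee("ops", 28, 40.0)
      ).toDS()

      // K is the type of the grouping key, T the type of each grouped row.
      val grouped: KeyValueGroupedDataset[String, Employee] =
        employees.groupBy($"department").as[String, Employee]

      // Typed per-group logic: the highest age in each department.
      val oldestByDepartment: Map[String, Int] = grouped
        .mapGroups((dept: String, rows: Iterator[Employee]) => (dept, rows.map(_.age).max))
        .collect()
        .toMap

      def main(args: Array[String]): Unit = {
        println(oldestByDepartment)
        spark.stop()
      }
    }
    ```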
  9. final def asInstanceOf[T0]: T0
    Definition Classes
    Any
  10. def avg(colNames: String*): DataFrame

    Compute the mean value of each numeric column for each group. The resulting DataFrame will also contain the grouping columns. If column names are given, the mean is computed only for those columns.

    Definition Classes
    RelationalGroupedDataset → RelationalGroupedDataset
    Annotations
    @varargs()
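
    The difference between calling avg with and without column names can be sketched as follows, assuming a local SparkSession. The object name and the sample rows (including the non-numeric name column) are hypothetical.

    ```scala
    import org.apache.spark.sql.{DataFrame, SparkSession}

    object AvgSketch {
      // Local session for illustration only (master and app name are assumptions).
      val spark: SparkSession = SparkSession.builder()
        .master("local[1]")
        .appName("avg-sketch")
        .getOrCreate()
      import spark.implicits._

      // Hypothetical sample data: department, name, age, expense.
      val df: DataFrame = Seq(
        ("eng", "ann", 31, 120.0),
        ("eng", "bob", 45, 80.0),
        ("ops", "cat", 28, 40.0)
      ).toDF("department", "name", "age", "expense")

      // No arguments: every numeric column is averaged; the string column name is skipped.
      val allNumeric: DataFrame = df.groupBy("department").avg()

      // With arguments: only the listed columns are averaged.
      val ageOnly: DataFrame = df.groupBy("department").avg("age")

      def main(args: Array[String]): Unit = {
        allNumeric.show()
        ageOnly.show()
        spark.stop()
      }
    }
    ```

    The same calling pattern applies to max, mean, min, and sum.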
  11. def clone(): AnyRef
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.CloneNotSupportedException]) @IntrinsicCandidate() @native()
  12. def count(): DataFrame

    Count the number of rows for each group. The resulting DataFrame will also contain the grouping columns.

    Definition Classes
    RelationalGroupedDataset → RelationalGroupedDataset
  13. val df: DataFrame
    Attributes
    protected[sql]
    Definition Classes
    RelationalGroupedDataset → RelationalGroupedDataset
  14. final def eq(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  15. def equals(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef → Any
  16. final def getClass(): Class[_ <: AnyRef]
    Definition Classes
    AnyRef → Any
    Annotations
    @IntrinsicCandidate() @native()
  17. def hashCode(): Int
    Definition Classes
    AnyRef → Any
    Annotations
    @IntrinsicCandidate() @native()
  18. final def isInstanceOf[T0]: Boolean
    Definition Classes
    Any
  19. def max(colNames: String*): DataFrame

    Compute the max value of each numeric column for each group. The resulting DataFrame will also contain the grouping columns. If column names are given, the max is computed only for those columns.

    Definition Classes
    RelationalGroupedDataset → RelationalGroupedDataset
    Annotations
    @varargs()
  20. def mean(colNames: String*): DataFrame

    Compute the average value of each numeric column for each group. This is an alias for avg. The resulting DataFrame will also contain the grouping columns. If column names are given, the average is computed only for those columns.

    Definition Classes
    RelationalGroupedDataset → RelationalGroupedDataset
    Annotations
    @varargs()
  21. def min(colNames: String*): DataFrame

    Compute the min value of each numeric column for each group. The resulting DataFrame will also contain the grouping columns. If column names are given, the min is computed only for those columns.

    Definition Classes
    RelationalGroupedDataset → RelationalGroupedDataset
    Annotations
    @varargs()
  22. final def ne(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  23. final def notify(): Unit
    Definition Classes
    AnyRef
    Annotations
    @IntrinsicCandidate() @native()
  24. final def notifyAll(): Unit
    Definition Classes
    AnyRef
    Annotations
    @IntrinsicCandidate() @native()
  25. def pivot(pivotColumn: Column, values: Seq[Any]): RelationalGroupedDataset

    Pivots a column of the current DataFrame and performs the specified aggregation. This is an overloaded version of the pivot method with pivotColumn of the String type.

    // Compute the sum of earnings for each year by course with each course as a separate column
    df.groupBy($"year").pivot($"course", Seq("dotNET", "Java")).sum("earnings")
    pivotColumn

    the column to pivot.

    values

    List of values that will be translated to columns in the output DataFrame.

    Definition Classes
    RelationalGroupedDataset → RelationalGroupedDataset
  26. def pivot(pivotColumn: Column): RelationalGroupedDataset

    Pivots a column of the current DataFrame and performs the specified aggregation.

    Spark will eagerly compute the distinct values in pivotColumn so it can determine the resulting schema of the transformation. To avoid any eager computations, provide an explicit list of values via pivot(pivotColumn: Column, values: Seq[Any]).

    // Compute the sum of earnings for each year by course with each course as a separate column
    df.groupBy($"year").pivot($"course").sum("earnings")
    pivotColumn

    the column to pivot.

    Definition Classes
    RelationalGroupedDataset → RelationalGroupedDataset
  27. def pivot(pivotColumn: Column, values: java.util.List[Any]): RelationalGroupedDataset

    (Java-specific) Pivots a column of the current DataFrame and performs the specified aggregation. This is an overloaded version of the pivot method with pivotColumn of the String type.

    pivotColumn

    the column to pivot.

    values

    List of values that will be translated to columns in the output DataFrame.

    Definition Classes
    RelationalGroupedDataset → RelationalGroupedDataset
  28. def pivot(pivotColumn: String, values: java.util.List[Any]): RelationalGroupedDataset

    (Java-specific) Pivots a column of the current DataFrame and performs the specified aggregation.

    There are two versions of pivot function: one that requires the caller to specify the list of distinct values to pivot on, and one that does not. The latter is more concise but less efficient, because Spark needs to first compute the list of distinct values internally.

    // Compute the sum of earnings for each year by course with each course as a separate column
    df.groupBy("year").pivot("course", Arrays.<Object>asList("dotNET", "Java")).sum("earnings");
    
    // Or without specifying column values (less efficient)
    df.groupBy("year").pivot("course").sum("earnings");
    pivotColumn

    Name of the column to pivot.

    values

    List of values that will be translated to columns in the output DataFrame.

    Definition Classes
    RelationalGroupedDataset → RelationalGroupedDataset
  29. def pivot(pivotColumn: String, values: Seq[Any]): RelationalGroupedDataset

    Pivots a column of the current DataFrame and performs the specified aggregation. There are two versions of pivot function: one that requires the caller to specify the list of distinct values to pivot on, and one that does not. The latter is more concise but less efficient, because Spark needs to first compute the list of distinct values internally.

    // Compute the sum of earnings for each year by course with each course as a separate column
    df.groupBy("year").pivot("course", Seq("dotNET", "Java")).sum("earnings")
    
    // Or without specifying column values (less efficient)
    df.groupBy("year").pivot("course").sum("earnings")

    From Spark 3.0.0, values can be literal columns, for instance, struct. For pivoting by multiple columns, use the struct function to combine the columns and values:

    df.groupBy("year")
      .pivot("trainingCourse", Seq(struct(lit("java"), lit("Experts"))))
      .agg(sum($"earnings"))
    pivotColumn

    Name of the column to pivot.

    values

    List of values that will be translated to columns in the output DataFrame.

    Definition Classes
    RelationalGroupedDataset → RelationalGroupedDataset
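
    To make the shape of the pivoted output concrete, the sketch below pivots a small dataset with an explicit value list and inspects the resulting columns. It assumes a local SparkSession; the object name and the sample rows are hypothetical.

    ```scala
    import org.apache.spark.sql.{DataFrame, SparkSession}

    object PivotExplicitValuesSketch {
      // Local session for illustration only (master and app name are assumptions).
      val spark: SparkSession = SparkSession.builder()
        .master("local[1]")
        .appName("pivot-explicit-values-sketch")
        .getOrCreate()
      import spark.implicits._

      // Hypothetical sample data: year, course, earnings.
      val df: DataFrame = Seq(
        (2012, "dotNET", 10000),
        (2012, "Java", 20000),
        (2012, "dotNET", 5000),
        (2013, "dotNET", 48000),
        (2013, "Java", 30000)
      ).toDF("year", "course", "earnings")

      // Each listed value becomes one output column, in the order given here.
      val pivoted: DataFrame =
        df.groupBy("year").pivot("course", Seq("Java", "dotNET")).sum("earnings")

      // Sum of dotNET earnings for 2012 (10000 + 5000).
      def dotNet2012: Long =
        pivoted.where($"year" === 2012).select("dotNET").head().getLong(0)

      def main(args: Array[String]): Unit = {
        pivoted.show()
        spark.stop()
      }
    }
    ```

    Because the values are supplied by the caller, Spark does not need to scan the course column for distinct values before building the plan.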
  30. def pivot(pivotColumn: String): RelationalGroupedDataset

    Pivots a column of the current DataFrame and performs the specified aggregation.

    Spark will eagerly compute the distinct values in pivotColumn so it can determine the resulting schema of the transformation. To avoid any eager computations, provide an explicit list of values via pivot(pivotColumn: String, values: Seq[Any]).

    // Compute the sum of earnings for each year by course with each course as a separate column
    df.groupBy("year").pivot("course").sum("earnings")
    pivotColumn

    Name of the column to pivot.

    Definition Classes
    RelationalGroupedDataset → RelationalGroupedDataset
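
    When the distinct values are not known in advance but the same pivot runs more than once, one option is to collect the values once and pass them explicitly afterwards. The sketch below illustrates that idea, assuming a local SparkSession; the object name and sample rows are hypothetical, and collecting the values still runs a job of its own.

    ```scala
    import org.apache.spark.sql.{DataFrame, SparkSession}

    object PivotPrecomputedValuesSketch {
      // Local session for illustration only (master and app name are assumptions).
      val spark: SparkSession = SparkSession.builder()
        .master("local[1]")
        .appName("pivot-precomputed-values-sketch")
        .getOrCreate()
      import spark.implicits._

      // Hypothetical sample data: year, course, earnings.
      val df: DataFrame = Seq(
        (2012, "dotNET", 10000),
        (2012, "Java", 20000),
        (2013, "dotNET", 48000),
        (2013, "Java", 30000)
      ).toDF("year", "course", "earnings")

      // Collect the distinct pivot values once; sorting fixes the output column order.
      val courses: Seq[String] =
        df.select("course").distinct().as[String].collect().sorted.toSeq

      // Reuse the collected list so this call does not compute the values again.
      val pivoted: DataFrame =
        df.groupBy("year").pivot("course", courses).sum("earnings")

      def main(args: Array[String]): Unit = {
        pivoted.show()
        spark.stop()
      }
    }
    ```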
  31. def selectNumericColumns(colNames: Seq[String]): Seq[Column]
    Attributes
    protected
    Definition Classes
    RelationalGroupedDataset → RelationalGroupedDataset
  32. def sum(colNames: String*): DataFrame

    Compute the sum of each numeric column for each group. The resulting DataFrame will also contain the grouping columns. If column names are given, the sum is computed only for those columns.

    Definition Classes
    RelationalGroupedDataset → RelationalGroupedDataset
    Annotations
    @varargs()
  33. final def synchronized[T0](arg0: => T0): T0
    Definition Classes
    AnyRef
  34. def toDF(aggCols: Seq[Column]): DataFrame

    Create an aggregation based on the grouping column, the grouping type, and the aggregations.

    Attributes
    protected
    Definition Classes
    RelationalGroupedDataset → RelationalGroupedDataset
  35. def toString(): String
    Definition Classes
    RelationalGroupedDataset → AnyRef → Any
  36. final def wait(arg0: Long, arg1: Int): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.InterruptedException])
  37. final def wait(arg0: Long): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.InterruptedException]) @native()
  38. final def wait(): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.InterruptedException])

Deprecated Value Members

  1. def finalize(): Unit
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.Throwable]) @Deprecated
    Deprecated

    (Since version 9)
