Class Aggregator<IN,BUF,OUT>

Object
org.apache.spark.sql.expressions.Aggregator<IN,BUF,OUT>
Type Parameters:
IN - The input type for the aggregation.
BUF - The type of the intermediate value of the reduction.
OUT - The type of the final output result.
All Implemented Interfaces:
Serializable
Direct Known Subclasses:
StringIndexerAggregator

public abstract class Aggregator<IN,BUF,OUT> extends Object implements Serializable
A base class for user-defined aggregations, which can be used in Dataset operations to take all of the elements of a group and reduce them to a single value.

For example, the following aggregator extracts an int from a specific class and adds them up:


   case class Data(i: Int)

   val customSummer =  new Aggregator[Data, Int, Int] {
     def zero: Int = 0
     def reduce(b: Int, a: Data): Int = b + a.i
     def merge(b1: Int, b2: Int): Int = b1 + b2
     def finish(r: Int): Int = r
     def bufferEncoder: Encoder[Int] = Encoders.scalaInt
     def outputEncoder: Encoder[Int] = Encoders.scalaInt
   }.toColumn()

   val ds: Dataset[Data] = ...
   val aggregated = ds.select(customSummer)
 

Based loosely on Aggregator from Algebird: https://github.com/twitter/algebird

Since:
1.6.0
See Also:
  • Constructor Details

    • Aggregator

      public Aggregator()
  • Method Details

    • bufferEncoder

      public abstract Encoder<BUF> bufferEncoder()
      Specifies the Encoder for the intermediate value type.
      Returns:
      (undocumented)
      Since:
      2.0.0
    • finish

      public abstract OUT finish(BUF reduction)
      Transform the output of the reduction.
      Parameters:
      reduction - (undocumented)
      Returns:
      (undocumented)
      Since:
      1.6.0
    • merge

      public abstract BUF merge(BUF b1, BUF b2)
      Merge two intermediate values.
      Parameters:
      b1 - (undocumented)
      b2 - (undocumented)
      Returns:
      (undocumented)
      Since:
      1.6.0
    • outputEncoder

      public abstract Encoder<OUT> outputEncoder()
      Specifies the Encoder for the final output value type.
      Returns:
      (undocumented)
      Since:
      2.0.0
    • reduce

      public abstract BUF reduce(BUF b, IN a)
      Combine two values to produce a new value. For performance, the function may modify b and return it instead of constructing new object for b.
      Parameters:
      b - (undocumented)
      a - (undocumented)
      Returns:
      (undocumented)
      Since:
      1.6.0
    • toColumn

      public TypedColumn<IN,OUT> toColumn()
      Returns this Aggregator as a TypedColumn that can be used in Dataset. operations.
      Returns:
      (undocumented)
      Since:
      1.6.0
    • zero

      public abstract BUF zero()
      A zero value for this aggregation. Should satisfy the property that any b + zero = b.
      Returns:
      (undocumented)
      Since:
      1.6.0