Type Parameters:: S - the JVM type for the aggregation's intermediate state; must be Serializable; R - the JVM type of result values

All Superinterfaces:: BoundFunction, Function, Serializable

@Evolving public interface AggregateFunction<S extends Serializable,R> extends BoundFunction

Interface for a function that produces a result value by aggregating over multiple input rows.

For each input row, Spark will call the update(S, org.apache.spark.sql.catalyst.InternalRow) method which should evaluate the row and update the aggregation state. The JVM type of result values produced by produceResult(S) must be the type used by Spark's InternalRow API for the SQL data type returned by BoundFunction.resultType(). Please refer to class documentation of ScalarFunction for the mapping between DataType and the JVM type.

All implementations must support partial aggregation by implementing merge so that Spark can partially aggregate and shuffle intermediate results, instead of shuffling all rows for an aggregate. This reduces the impact of data skew and the amount of data shuffled to produce the result.

Intermediate aggregation state must be Serializable so that state produced by parallel tasks can be serialized, shuffled, and then merged to produce a final result.

Since:: 3.2.0

Method Summary

Modifier and Type

Method

Description

S

merge(S leftState, S rightState)

Merge two partial aggregation states.

S

newAggregationState()

Initialize state for an aggregation.

R

produceResult(S state)

Produce the aggregation result based on intermediate state.

S

update(S state, org.apache.spark.sql.catalyst.InternalRow input)

Update the aggregation state with a new row.

Methods inherited from interface org.apache.spark.sql.connector.catalog.functions.BoundFunction
canonicalName, inputTypes, isDeterministic, isResultNullable, resultType

Methods inherited from interface org.apache.spark.sql.connector.catalog.functions.Function
name

Method Details
- newAggregationState
  
  S newAggregationState()
  
  Initialize state for an aggregation.
  This method is called one or more times for every group of values to initialize intermediate aggregation state. More than one intermediate aggregation state variable may be used when the aggregation is run in parallel tasks.
  Implementations that return null must support null state passed into all other methods.
  
  Returns:
  
  a state instance or null
- update
  
  S update(S state, org.apache.spark.sql.catalyst.InternalRow input)
  
  Update the aggregation state with a new row.
  This is called for each row in a group to update an intermediate aggregation state.
  
  Parameters:
  
  state - intermediate aggregation state
  
  input - an input row
  
  Returns:
  
  updated aggregation state
- merge
  
  S merge(S leftState, S rightState)
  
  Merge two partial aggregation states.
  This is called to merge intermediate aggregation states that were produced by parallel tasks.
  
  Parameters:
  
  leftState - intermediate aggregation state
  
  rightState - intermediate aggregation state
  
  Returns:
  
  combined aggregation state
- produceResult
  
  R produceResult(S state)
  
  Produce the aggregation result based on intermediate state.
  
  Parameters:
  
  state - intermediate aggregation state
  
  Returns:
  
  a result value

Interface AggregateFunction<S extends Serializable,R>

Method Summary

Methods inherited from interface org.apache.spark.sql.connector.catalog.functions.BoundFunction

Methods inherited from interface org.apache.spark.sql.connector.catalog.functions.Function

Method Details

newAggregationState

update

merge

produceResult