S - the JVM type for the aggregation's intermediate state; must be Serializable
R - the JVM type of result values
@Evolving public interface AggregateFunction<S extends java.io.Serializable,R> extends BoundFunction
For each input row, Spark will call the update(S, org.apache.spark.sql.catalyst.InternalRow) method, which should evaluate the row and update the aggregation state. The JVM type of result values produced by produceResult(S) must be the type used by Spark's InternalRow API for the SQL data type returned by resultType(). Please refer to the class documentation of ScalarFunction for the mapping between DataType and the JVM type.
All implementations must support partial aggregation by implementing merge(S, S) so that Spark can partially aggregate and shuffle intermediate results, instead of shuffling all rows for an aggregate. This reduces the impact of data skew and the amount of data shuffled to produce the result.
Intermediate aggregation state must be
Serializable so that state produced by parallel
tasks can be serialized, shuffled, and then merged to produce a final result.
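The lifecycle described above (initialize state per task, update it per row, merge partial states from parallel tasks, then produce the final result) can be sketched without a Spark dependency. In this sketch, the Row interface is a hypothetical stand-in for org.apache.spark.sql.catalyst.InternalRow, and LongSum and State are illustrative names, not part of the Spark API:

```java
import java.io.Serializable;
import java.util.Arrays;

public class Main {

  // Hypothetical stand-in for InternalRow; only the two accessors
  // this sketch needs.
  interface Row {
    boolean isNullAt(int ordinal);
    long getLong(int ordinal);
  }

  // Mirrors the four aggregate methods of AggregateFunction<S, R>,
  // here summing one LONG column.
  static class LongSum {
    static class State implements Serializable {
      long sum;
    }

    State newAggregationState() { return new State(); }

    State update(State state, Row input) {
      if (!input.isNullAt(0)) state.sum += input.getLong(0);
      return state;
    }

    State merge(State left, State right) {
      left.sum += right.sum;
      return left;
    }

    Long produceResult(State state) { return state.sum; }
  }

  static Row rowOf(Long v) {
    return new Row() {
      public boolean isNullAt(int ordinal) { return v == null; }
      public long getLong(int ordinal) { return v; }
    };
  }

  public static void main(String[] args) {
    LongSum agg = new LongSum();
    // Two "parallel tasks" each build a partial state from their rows...
    LongSum.State s1 = agg.newAggregationState();
    for (Long v : Arrays.asList(1L, 2L, null)) s1 = agg.update(s1, rowOf(v));
    LongSum.State s2 = agg.newAggregationState();
    for (Long v : Arrays.asList(10L, 20L)) s2 = agg.update(s2, rowOf(v));
    // ...then the partial states are merged and the result produced.
    System.out.println(agg.produceResult(agg.merge(s1, s2))); // prints 33
  }
}
```

A real implementation would also implement the BoundFunction methods (inputTypes, resultType, and so on) and receive its rows from Spark rather than from a driver loop.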
|Modifier and Type|Method and Description|
|S|merge(S leftState, S rightState): Merge two partial aggregation states.|
|S|newAggregationState(): Initialize state for an aggregation.|
|R|produceResult(S state): Produce the aggregation result based on intermediate state.|
|S|update(S state, InternalRow input): Update the aggregation state with a new row.|
Methods inherited from interface BoundFunction: canonicalName, inputTypes, isDeterministic, isResultNullable, resultType
newAggregationState()
This method is called one or more times for every group of values to initialize intermediate aggregation state. More than one intermediate aggregation state variable may be used when the aggregation is run in parallel tasks.
Implementations that return null must support null state passed into all other methods.
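That null-state contract can be sketched as follows. This is a standalone illustration of a hypothetical count aggregate whose initial state is null, so every other method must tolerate null; the method and State names mirror the interface but are not the Spark API itself:

```java
public class Main {

  static class State implements java.io.Serializable {
    long count;
  }

  // Returning null here is permitted, but then update, merge, and
  // produceResult must all accept a null state.
  static State newAggregationState() { return null; }

  static State update(State s, long value) {
    if (s == null) s = new State();   // allocate lazily on the first row
    s.count++;
    return s;
  }

  static State merge(State left, State right) {
    if (left == null) return right;   // either side may still be null
    if (right == null) return left;
    left.count += right.count;
    return left;
  }

  static long produceResult(State s) {
    return s == null ? 0L : s.count;  // empty group produces 0
  }

  public static void main(String[] args) {
    State a = newAggregationState();        // null
    a = update(update(a, 5L), 7L);          // count becomes 2
    State b = newAggregationState();        // never updated, stays null
    System.out.println(produceResult(merge(a, b))); // prints 2
  }
}
```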
update(S state, InternalRow input)
This is called for each row in a group to update an intermediate aggregation state.
Parameters:
state - intermediate aggregation state
input - an input row
merge(S leftState, S rightState)
This is called to merge intermediate aggregation states that were produced by parallel tasks.
Parameters:
leftState - intermediate aggregation state
rightState - intermediate aggregation state