trait ScalarFunction[R] extends BoundFunction
Interface for a function that produces a result value for each input row.
To evaluate each input row, Spark will first try to look up and use a "magic method" (described
below) through Java reflection. If the method is not found, Spark will call
#produceResult(InternalRow) as a fallback.
The JVM type of result values produced by this function must be the type used by Spark's
InternalRow API for the SQL data type returned by #resultType().
The mapping between DataType and the corresponding JVM type is defined below.
Magic method
IMPORTANT: the default implementation of #produceResult throws
UnsupportedOperationException. Users must either override this method or
implement a magic method named #MAGIC_METHOD_NAME, which takes individual parameters
instead of an InternalRow. The magic method approach is generally recommended because it
offers better performance than the default #produceResult, due to optimizations such
as whole-stage codegen and the elimination of Java boxing.
The parameter types of the magic method must correspond to the data types returned from
BoundFunction#inputTypes(), following the mapping defined below; otherwise, Spark will not be able to find the magic method.
In addition, for stateless Java functions, users can optionally define the
#MAGIC_METHOD_NAME as a static method, which further avoids certain runtime costs such
as Java dynamic dispatch.
For example, a scalar UDF for adding two integers can be defined as follows with the magic method approach:
public class IntegerAdd implements ScalarFunction<Integer> {
  public DataType[] inputTypes() {
    return new DataType[] { DataTypes.IntegerType, DataTypes.IntegerType };
  }
  public int invoke(int left, int right) {
    return left + right;
  }
}
In the above, since #MAGIC_METHOD_NAME is defined with matching parameter types and return
type, Spark will use it to evaluate inputs. As another example, consider the following:
public class IntegerAdd implements ScalarFunction<Integer> {
  public DataType[] inputTypes() {
    return new DataType[] { DataTypes.IntegerType, DataTypes.IntegerType };
  }
  public static int invoke(int left, int right) {
    return left + right;
  }
  public Integer produceResult(InternalRow input) {
    return input.getInt(0) + input.getInt(1);
  }
}
the class defines both the magic method and #produceResult, and Spark will use
#MAGIC_METHOD_NAME over #produceResult(InternalRow) because it takes higher
precedence. Also note that the magic method is declared as a static method in this case.
Resolution of the magic method is done during query analysis: Spark first converts the actual input SQL data types to their corresponding Java types, following the mapping defined below, and then searches the declared methods of the UDF class for one that matches by name and Java parameter types. A minimal sketch of this lookup follows.
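The following is only an illustration of the idea, not Spark's actual implementation; javaTypeFor is a hypothetical helper that applies the type mapping defined below:

import java.lang.reflect.Method;
import org.apache.spark.sql.types.DataType;

// Sketch only: locate a magic method via reflection.
// javaTypeFor is a hypothetical helper mapping SQL types to Java types
// (e.g. IntegerType -> int.class, StringType -> UTF8String.class).
static Method findMagicMethod(Class<?> udfClass, DataType[] inputTypes) {
  Class<?>[] javaTypes = new Class<?>[inputTypes.length];
  for (int i = 0; i < inputTypes.length; i++) {
    javaTypes[i] = javaTypeFor(inputTypes[i]);
  }
  try {
    // MAGIC_METHOD_NAME is "invoke"; match by name and exact parameter types.
    return udfClass.getDeclaredMethod("invoke", javaTypes);
  } catch (NoSuchMethodException e) {
    return null; // caller falls back to produceResult(InternalRow)
  }
}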
Handling of nullable primitive arguments
The handling of null primitive arguments differs between the magic method approach and
the #produceResult approach. With the former, whenever a method argument meets all of
the following conditions:
- the argument is of primitive type
- the argument is nullable
- the value of the argument is null
Spark will return null directly instead of calling the magic method. On the other hand, Spark
will pass null primitive arguments to #produceResult, and it is the user's responsibility to
handle them in the function implementation.
Because of this difference, Spark users who want special handling of nulls for
nullable primitive arguments should override the #produceResult method instead
of using the magic method approach, as sketched below.
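A minimal sketch of such an override, assuming an illustrative NullSafeAdd function that treats null integer arguments as zero:

import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.connector.catalog.functions.ScalarFunction;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;

public class NullSafeAdd implements ScalarFunction<Integer> {
  @Override
  public String name() { return "null_safe_add"; }

  @Override
  public DataType[] inputTypes() {
    return new DataType[] { DataTypes.IntegerType, DataTypes.IntegerType };
  }

  @Override
  public DataType resultType() { return DataTypes.IntegerType; }

  @Override
  public Integer produceResult(InternalRow input) {
    // Null primitives reach produceResult (unlike the magic method path),
    // so check explicitly and substitute zero.
    int left = input.isNullAt(0) ? 0 : input.getInt(0);
    int right = input.isNullAt(1) ? 0 : input.getInt(1);
    return left + right;
  }
}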
Spark data type to Java type mapping
The following is the mapping from SQL data types to Java types, which Spark uses
to infer the parameter types of magic methods as well as the return value type of
#produceResult:
- org.apache.spark.sql.types.BooleanType: boolean
- org.apache.spark.sql.types.ByteType: byte
- org.apache.spark.sql.types.ShortType: short
- org.apache.spark.sql.types.IntegerType: int
- org.apache.spark.sql.types.LongType: long
- org.apache.spark.sql.types.FloatType: float
- org.apache.spark.sql.types.DoubleType: double
- org.apache.spark.sql.types.StringType: org.apache.spark.unsafe.types.UTF8String
- org.apache.spark.sql.types.DateType: int
- org.apache.spark.sql.types.TimestampType: long
- org.apache.spark.sql.types.BinaryType: byte[]
- org.apache.spark.sql.types.DayTimeIntervalType: long
- org.apache.spark.sql.types.YearMonthIntervalType: int
- org.apache.spark.sql.types.DecimalType: org.apache.spark.sql.types.Decimal
- org.apache.spark.sql.types.StructType: InternalRow
- org.apache.spark.sql.types.ArrayType: org.apache.spark.sql.catalyst.util.ArrayData
- org.apache.spark.sql.types.MapType: org.apache.spark.sql.catalyst.util.MapData
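To illustrate the mapping, here is a sketch of an upper-casing UDF: since StringType maps to UTF8String, the magic method must declare UTF8String parameters rather than java.lang.String (the Upper class itself is illustrative):

import org.apache.spark.sql.connector.catalog.functions.ScalarFunction;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.unsafe.types.UTF8String;

public class Upper implements ScalarFunction<UTF8String> {
  public String name() { return "upper"; }

  public DataType[] inputTypes() {
    return new DataType[] { DataTypes.StringType };
  }

  public DataType resultType() { return DataTypes.StringType; }

  // StringType arrives as UTF8String; declaring java.lang.String here
  // would make the magic-method lookup fail.
  public UTF8String invoke(UTF8String str) {
    return str.toUpperCase();
  }
}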
- Annotations
- @Evolving()
- Source
- ScalarFunction.java
- Since
3.2.0
Linear Supertypes
- BoundFunction
- Function
- Serializable
- AnyRef
- Any
Abstract Value Members
- abstract def inputTypes(): Array[DataType]
Returns the required data types of the input values to this function.
If the types returned differ from the types passed to UnboundFunction#bind(StructType), Spark will cast input values to the required data types. This allows implementations to delegate input value casting to Spark (a sketch follows this entry).
- returns
an array of input value data types
- Definition Classes
- BoundFunction
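For instance, a sketch of declaring wider input types in order to delegate casting to Spark (assuming the function's arithmetic is defined over longs):

// Sketch: declaring LongType inputs means Spark casts narrower integral
// arguments (e.g. IntegerType) to long before invoking the function.
@Override
public DataType[] inputTypes() {
  return new DataType[] { DataTypes.LongType, DataTypes.LongType };
}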
- abstract def name(): String
A name to identify this function. Implementations should provide a meaningful name, like the database and function name from the catalog.
- Definition Classes
- Function
- abstract def resultType(): DataType
Returns the data type of values produced by this function.
For example, a "plus" function may return IntegerType when it is bound to arguments that are also IntegerType (a sketch follows this entry).
- returns
a data type for values produced by this function
- Definition Classes
- BoundFunction
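As a sketch, an implementation might report a result type captured when the function was bound (boundType here is a hypothetical field set by the implementation's UnboundFunction#bind):

// boundType is a hypothetical field recorded at bind time, so that a
// "plus" bound to IntegerType arguments reports IntegerType.
private final DataType boundType;

@Override
public DataType resultType() { return boundType; }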
Concrete Value Members
- final def !=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- final def ##: Int
- Definition Classes
- AnyRef → Any
- final def ==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- final def asInstanceOf[T0]: T0
- Definition Classes
- Any
- def canonicalName(): String
Returns the canonical name of this function, used to determine if functions are equivalent.
The canonical name is used to determine whether two functions are the same when loaded by different catalogs. For example, the same catalog implementation may be used by two environments, "prod" and "test". Functions produced by the catalogs may be equivalent, but loaded using different names, like "test.func_name" and "prod.func_name".
Names returned by this function should be unique and unlikely to conflict with similar functions in other catalogs. For example, many catalogs may define a "bucket" function with a different implementation. Adding context, like "com.mycompany.bucket(string)", is recommended to avoid unintentional collisions (a sketch follows this entry).
- returns
a canonical name for this function
- Definition Classes
- BoundFunction
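A sketch following the recommendation above, adding company context to the name (the value is illustrative):

@Override
public String canonicalName() {
  // A package-style prefix avoids clashing with other catalogs'
  // similarly named functions.
  return "com.mycompany.bucket(string)";
}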
- def clone(): AnyRef
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.CloneNotSupportedException]) @IntrinsicCandidate() @native()
- final def eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- def equals(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef → Any
- final def getClass(): Class[_ <: AnyRef]
- Definition Classes
- AnyRef → Any
- Annotations
- @IntrinsicCandidate() @native()
- def hashCode(): Int
- Definition Classes
- AnyRef → Any
- Annotations
- @IntrinsicCandidate() @native()
- def isDeterministic(): Boolean
Returns whether this function result is deterministic.
By default, functions are assumed to be deterministic. Functions that are not deterministic should override this method so that Spark can ensure the function runs only once for a given input (a sketch follows this entry).
- returns
true if this function is deterministic, false otherwise
- Definition Classes
- BoundFunction
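For example, a non-deterministic function (say, one drawing random values) could override it as sketched:

@Override
public boolean isDeterministic() {
  // Results vary across invocations with the same input; report false
  // so Spark ensures the function runs only once for a given input.
  return false;
}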
- final def isInstanceOf[T0]: Boolean
- Definition Classes
- Any
- def isResultNullable(): Boolean
Returns whether the values produced by this function may be null.
For example, a "plus" function may return false when it is bound to arguments that are always non-null, but true when either argument may be null (a sketch follows this entry).
- returns
true if values produced by this function may be null, false otherwise
- Definition Classes
- BoundFunction
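For example, a "plus" function bound to two non-nullable integer arguments could report a non-nullable result:

@Override
public boolean isResultNullable() {
  // Bound to arguments that are always non-null, this function never
  // produces a null result.
  return false;
}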
- final def ne(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- final def notify(): Unit
- Definition Classes
- AnyRef
- Annotations
- @IntrinsicCandidate() @native()
- final def notifyAll(): Unit
- Definition Classes
- AnyRef
- Annotations
- @IntrinsicCandidate() @native()
- def produceResult(input: InternalRow): R
Applies the function to an input row to produce a value.
- input
an input row
- returns
a result value
- final def synchronized[T0](arg0: => T0): T0
- Definition Classes
- AnyRef
- def toString(): String
- Definition Classes
- AnyRef → Any
- final def wait(arg0: Long, arg1: Int): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException])
- final def wait(arg0: Long): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException]) @native()
- final def wait(): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException])
Deprecated Value Members
- def finalize(): Unit
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.Throwable]) @Deprecated
- Deprecated
(Since version 9)