object functions
Commonly used functions available for DataFrame operations. Using functions defined here provides a little bit more compile-time safety to make sure the function exists.
You can call the functions defined here by two ways: _FUNC_(...) and
functions.expr("_FUNC_(...)").
As an example, regr_count is a function that is defined here. You can use
regr_count(col("yCol", col("xCol"))) to invoke the regr_count function. This way the
programming language's compiler ensures regr_count exists and is of the proper form. You can
also use expr("regr_count(yCol, xCol)") function to invoke the same function. In this case,
Spark itself will ensure regr_count exists when it analyzes the query.
You can find the entire list of functions at SQL API documentation of your Spark version, see also the latest list
This function APIs usually have methods with Column signature only because it can support not
only Column but also other types such as a native string. The other variants currently exist
for historical reasons.
- Annotations
- @Stable()
- Source
- functions.scala
- Since
- 1.3.0 
- Grouped
- Alphabetic
- By Inheritance
- functions
- AnyRef
- Any
- Hide All
- Show All
- Public
- Protected
Value Members
-   final  def !=(arg0: Any): Boolean- Definition Classes
- AnyRef → Any
 
-   final  def ##: Int- Definition Classes
- AnyRef → Any
 
-   final  def ==(arg0: Any): Boolean- Definition Classes
- AnyRef → Any
 
-    def abs(e: Column): ColumnComputes the absolute value of a numeric value. Computes the absolute value of a numeric value. - Since
- 1.3.0 
 
-    def acos(columnName: String): Column- returns
- inverse cosine of - columnName, as if computed by- java.lang.Math.acos
 - Since
- 1.4.0 
 
-    def acos(e: Column): Column- returns
- inverse cosine of - ein radians, as if computed by- java.lang.Math.acos
 - Since
- 1.4.0 
 
-    def acosh(columnName: String): Column- returns
- inverse hyperbolic cosine of - columnName
 - Since
- 3.1.0 
 
-    def acosh(e: Column): Column- returns
- inverse hyperbolic cosine of - e
 - Since
- 3.1.0 
 
-    def add_months(startDate: Column, numMonths: Column): ColumnReturns the date that is numMonthsafterstartDate.Returns the date that is numMonthsafterstartDate.- startDate
- A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as - yyyy-MM-ddor- yyyy-MM-dd HH:mm:ss.SSSS
- numMonths
- A column of the number of months to add to - startDate, can be negative to subtract months
- returns
- A date, or null if - startDatewas a string that could not be cast to a date
 - Since
- 3.0.0 
 
-    def add_months(startDate: Column, numMonths: Int): ColumnReturns the date that is numMonthsafterstartDate.Returns the date that is numMonthsafterstartDate.- startDate
- A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as - yyyy-MM-ddor- yyyy-MM-dd HH:mm:ss.SSSS
- numMonths
- The number of months to add to - startDate, can be negative to subtract months
- returns
- A date, or null if - startDatewas a string that could not be cast to a date
 - Since
- 1.5.0 
 
-    def aes_decrypt(input: Column, key: Column): ColumnReturns a decrypted value of input.Returns a decrypted value of input.- Since
- 3.5.0 
- See also
- org.apache.spark.sql.functions.aes_decrypt(Column, Column, Column, Column, Column)
 
-    def aes_decrypt(input: Column, key: Column, mode: Column): ColumnReturns a decrypted value of input.Returns a decrypted value of input.- Since
- 3.5.0 
- See also
- org.apache.spark.sql.functions.aes_decrypt(Column, Column, Column, Column, Column)
 
-    def aes_decrypt(input: Column, key: Column, mode: Column, padding: Column): ColumnReturns a decrypted value of input.Returns a decrypted value of input.- Since
- 3.5.0 
- See also
- org.apache.spark.sql.functions.aes_decrypt(Column, Column, Column, Column, Column)
 
-    def aes_decrypt(input: Column, key: Column, mode: Column, padding: Column, aad: Column): ColumnReturns a decrypted value of inputusing AES inmodewithpadding.Returns a decrypted value of inputusing AES inmodewithpadding. Key lengths of 16, 24 and 32 bits are supported. Supported combinations of (mode,padding) are ('ECB', 'PKCS'), ('GCM', 'NONE') and ('CBC', 'PKCS'). Optional additional authenticated data (AAD) is only supported for GCM. If provided for encryption, the identical AAD value must be provided for decryption. The default mode is GCM.- input
- The binary value to decrypt. 
- key
- The passphrase to use to decrypt the data. 
- mode
- Specifies which block cipher mode should be used to decrypt messages. Valid modes: ECB, GCM, CBC. 
- padding
- Specifies how to pad messages whose length is not a multiple of the block size. Valid values: PKCS, NONE, DEFAULT. The DEFAULT padding means PKCS for ECB, NONE for GCM and PKCS for CBC. 
- aad
- Optional additional authenticated data. Only supported for GCM mode. This can be any free-form input and must be provided for both encryption and decryption. 
 - Since
- 3.5.0 
 
-    def aes_encrypt(input: Column, key: Column): ColumnReturns an encrypted value of input.Returns an encrypted value of input.- Since
- 3.5.0 
- See also
- org.apache.spark.sql.functions.aes_encrypt(Column, Column, Column, Column, Column, Column)
 
-    def aes_encrypt(input: Column, key: Column, mode: Column): ColumnReturns an encrypted value of input.Returns an encrypted value of input.- Since
- 3.5.0 
- See also
- org.apache.spark.sql.functions.aes_encrypt(Column, Column, Column, Column, Column, Column)
 
-    def aes_encrypt(input: Column, key: Column, mode: Column, padding: Column): ColumnReturns an encrypted value of input.Returns an encrypted value of input.- Since
- 3.5.0 
- See also
- org.apache.spark.sql.functions.aes_encrypt(Column, Column, Column, Column, Column, Column)
 
-    def aes_encrypt(input: Column, key: Column, mode: Column, padding: Column, iv: Column): ColumnReturns an encrypted value of input.Returns an encrypted value of input.- Since
- 3.5.0 
- See also
- org.apache.spark.sql.functions.aes_encrypt(Column, Column, Column, Column, Column, Column)
 
-    def aes_encrypt(input: Column, key: Column, mode: Column, padding: Column, iv: Column, aad: Column): ColumnReturns an encrypted value of inputusing AES in givenmodewith the specifiedpadding.Returns an encrypted value of inputusing AES in givenmodewith the specifiedpadding. Key lengths of 16, 24 and 32 bits are supported. Supported combinations of (mode,padding) are ('ECB', 'PKCS'), ('GCM', 'NONE') and ('CBC', 'PKCS'). Optional initialization vectors (IVs) are only supported for CBC and GCM modes. These must be 16 bytes for CBC and 12 bytes for GCM. If not provided, a random vector will be generated and prepended to the output. Optional additional authenticated data (AAD) is only supported for GCM. If provided for encryption, the identical AAD value must be provided for decryption. The default mode is GCM.- input
- The binary value to encrypt. 
- key
- The passphrase to use to encrypt the data. 
- mode
- Specifies which block cipher mode should be used to encrypt messages. Valid modes: ECB, GCM, CBC. 
- padding
- Specifies how to pad messages whose length is not a multiple of the block size. Valid values: PKCS, NONE, DEFAULT. The DEFAULT padding means PKCS for ECB, NONE for GCM and PKCS for CBC. 
- iv
- Optional initialization vector. Only supported for CBC and GCM modes. Valid values: None or "". 16-byte array for CBC mode. 12-byte array for GCM mode. 
- aad
- Optional additional authenticated data. Only supported for GCM mode. This can be any free-form input and must be provided for both encryption and decryption. 
 - Since
- 3.5.0 
 
-    def aggregate(expr: Column, initialValue: Column, merge: (Column, Column) => Column): ColumnApplies a binary operator to an initial state and all elements in the array, and reduces this to a single state. Applies a binary operator to an initial state and all elements in the array, and reduces this to a single state. df.select(aggregate(col("i"), lit(0), (acc, x) => acc + x)) - expr
- the input array column 
- initialValue
- the initial value 
- merge
- (combined_value, input_value) => combined_value, the merge function to merge an input value to the combined_value 
 - Since
- 3.0.0 
 
-    def aggregate(expr: Column, initialValue: Column, merge: (Column, Column) => Column, finish: (Column) => Column): ColumnApplies a binary operator to an initial state and all elements in the array, and reduces this to a single state. Applies a binary operator to an initial state and all elements in the array, and reduces this to a single state. The final state is converted into the final result by applying a finish function. df.select(aggregate(col("i"), lit(0), (acc, x) => acc + x, _ * 10)) - expr
- the input array column 
- initialValue
- the initial value 
- merge
- (combined_value, input_value) => combined_value, the merge function to merge an input value to the combined_value 
- finish
- combined_value => final_value, the lambda function to convert the combined value of all inputs to final result 
 - Since
- 3.0.0 
 
-    def any(e: Column): ColumnAggregate function: returns true if at least one value of eis true.Aggregate function: returns true if at least one value of eis true.- Since
- 3.5.0 
 
-    def any_value(e: Column, ignoreNulls: Column): ColumnAggregate function: returns some value of efor a group of rows.Aggregate function: returns some value of efor a group of rows. IfisIgnoreNullis true, returns only non-null values.- Since
- 3.5.0 
 
-    def any_value(e: Column): ColumnAggregate function: returns some value of efor a group of rows.Aggregate function: returns some value of efor a group of rows.- Since
- 3.5.0 
 
-    def approx_count_distinct(columnName: String, rsd: Double): ColumnAggregate function: returns the approximate number of distinct items in a group. Aggregate function: returns the approximate number of distinct items in a group. - rsd
- maximum relative standard deviation allowed (default = 0.05) 
 - Since
- 2.1.0 
 
-    def approx_count_distinct(e: Column, rsd: Double): ColumnAggregate function: returns the approximate number of distinct items in a group. Aggregate function: returns the approximate number of distinct items in a group. - rsd
- maximum relative standard deviation allowed (default = 0.05) 
 - Since
- 2.1.0 
 
-    def approx_count_distinct(columnName: String): ColumnAggregate function: returns the approximate number of distinct items in a group. Aggregate function: returns the approximate number of distinct items in a group. - Since
- 2.1.0 
 
-    def approx_count_distinct(e: Column): ColumnAggregate function: returns the approximate number of distinct items in a group. Aggregate function: returns the approximate number of distinct items in a group. - Since
- 2.1.0 
 
-    def approx_percentile(e: Column, percentage: Column, accuracy: Column): ColumnAggregate function: returns the approximate percentileof the numeric columncolwhich is the smallest value in the orderedcolvalues (sorted from least to greatest) such that no more thanpercentageofcolvalues is less than the value or equal to that value.Aggregate function: returns the approximate percentileof the numeric columncolwhich is the smallest value in the orderedcolvalues (sorted from least to greatest) such that no more thanpercentageofcolvalues is less than the value or equal to that value.If percentage is an array, each value must be between 0.0 and 1.0. If it is a single floating point value, it must be between 0.0 and 1.0. The accuracy parameter is a positive numeric literal which controls approximation accuracy at the cost of memory. Higher value of accuracy yields better accuracy, 1.0/accuracy is the relative error of the approximation. - Since
- 3.5.0 
 
-    def array(colName: String, colNames: String*): ColumnCreates a new array column. Creates a new array column. The input columns must all have the same data type. - Annotations
- @varargs()
- Since
- 1.4.0 
 
-    def array(cols: Column*): ColumnCreates a new array column. Creates a new array column. The input columns must all have the same data type. - Annotations
- @varargs()
- Since
- 1.4.0 
 
-    def array_agg(e: Column): ColumnAggregate function: returns a list of objects with duplicates. Aggregate function: returns a list of objects with duplicates. - Since
- 3.5.0 
- Note
- The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle. 
 
-    def array_append(column: Column, element: Any): ColumnReturns an ARRAY containing all elements from the source ARRAY as well as the new element. Returns an ARRAY containing all elements from the source ARRAY as well as the new element. The new element/column is located at end of the ARRAY. - Since
- 3.4.0 
 
-    def array_compact(column: Column): ColumnRemove all null elements from the given array. Remove all null elements from the given array. - Since
- 3.4.0 
 
-    def array_contains(column: Column, value: Any): ColumnReturns null if the array is null, true if the array contains value, and false otherwise.Returns null if the array is null, true if the array contains value, and false otherwise.- Since
- 1.5.0 
 
-    def array_distinct(e: Column): ColumnRemoves duplicate values from the array. Removes duplicate values from the array. - Since
- 2.4.0 
 
-    def array_except(col1: Column, col2: Column): ColumnReturns an array of the elements in the first array but not in the second array, without duplicates. Returns an array of the elements in the first array but not in the second array, without duplicates. The order of elements in the result is not determined - Since
- 2.4.0 
 
-    def array_insert(arr: Column, pos: Column, value: Column): ColumnAdds an item into a given array at a specified position Adds an item into a given array at a specified position - Since
- 3.4.0 
 
-    def array_intersect(col1: Column, col2: Column): ColumnReturns an array of the elements in the intersection of the given two arrays, without duplicates. Returns an array of the elements in the intersection of the given two arrays, without duplicates. - Since
- 2.4.0 
 
-    def array_join(column: Column, delimiter: String): ColumnConcatenates the elements of columnusing thedelimiter.Concatenates the elements of columnusing thedelimiter.- Since
- 2.4.0 
 
-    def array_join(column: Column, delimiter: String, nullReplacement: String): ColumnConcatenates the elements of columnusing thedelimiter.Concatenates the elements of columnusing thedelimiter. Null values are replaced withnullReplacement.- Since
- 2.4.0 
 
-    def array_max(e: Column): ColumnReturns the maximum value in the array. Returns the maximum value in the array. NaN is greater than any non-NaN elements for double/float type. NULL elements are skipped. - Since
- 2.4.0 
 
-    def array_min(e: Column): ColumnReturns the minimum value in the array. Returns the minimum value in the array. NaN is greater than any non-NaN elements for double/float type. NULL elements are skipped. - Since
- 2.4.0 
 
-    def array_position(column: Column, value: Any): ColumnLocates the position of the first occurrence of the value in the given array as long. Locates the position of the first occurrence of the value in the given array as long. Returns null if either of the arguments are null. - Since
- 2.4.0 
- Note
- The position is not zero based, but 1 based index. Returns 0 if value could not be found in array. 
 
-    def array_prepend(column: Column, element: Any): ColumnReturns an array containing value as well as all elements from array. Returns an array containing value as well as all elements from array. The new element is positioned at the beginning of the array. - Since
- 3.5.0 
 
-    def array_remove(column: Column, element: Any): ColumnRemove all elements that equal to element from the given array. Remove all elements that equal to element from the given array. - Since
- 2.4.0 
 
-    def array_repeat(e: Column, count: Int): ColumnCreates an array containing the left argument repeated the number of times given by the right argument. Creates an array containing the left argument repeated the number of times given by the right argument. - Since
- 2.4.0 
 
-    def array_repeat(left: Column, right: Column): ColumnCreates an array containing the left argument repeated the number of times given by the right argument. Creates an array containing the left argument repeated the number of times given by the right argument. - Since
- 2.4.0 
 
-    def array_size(e: Column): ColumnReturns the total number of elements in the array. Returns the total number of elements in the array. The function returns null for null input. - Since
- 3.5.0 
 
-    def array_sort(e: Column, comparator: (Column, Column) => Column): ColumnSorts the input array based on the given comparator function. Sorts the input array based on the given comparator function. The comparator will take two arguments representing two elements of the array. It returns a negative integer, 0, or a positive integer as the first element is less than, equal to, or greater than the second element. If the comparator function returns null, the function will fail and raise an error. - Since
- 3.4.0 
 
-    def array_sort(e: Column): ColumnSorts the input array in ascending order. Sorts the input array in ascending order. The elements of the input array must be orderable. NaN is greater than any non-NaN elements for double/float type. Null elements will be placed at the end of the returned array. - Since
- 2.4.0 
 
-    def array_union(col1: Column, col2: Column): ColumnReturns an array of the elements in the union of the given two arrays, without duplicates. Returns an array of the elements in the union of the given two arrays, without duplicates. - Since
- 2.4.0 
 
-    def arrays_overlap(a1: Column, a2: Column): ColumnReturns trueifa1anda2have at least one non-null element in common.Returns trueifa1anda2have at least one non-null element in common. If not and both the arrays are non-empty and any of them contains anull, it returnsnull. It returnsfalseotherwise.- Since
- 2.4.0 
 
-    def arrays_zip(e: Column*): ColumnReturns a merged array of structs in which the N-th struct contains all N-th values of input arrays. Returns a merged array of structs in which the N-th struct contains all N-th values of input arrays. - Annotations
- @varargs()
- Since
- 2.4.0 
 
-   final  def asInstanceOf[T0]: T0- Definition Classes
- Any
 
-    def asc(columnName: String): ColumnReturns a sort expression based on ascending order of the column. Returns a sort expression based on ascending order of the column. df.sort(asc("dept"), desc("age")) - Since
- 1.3.0 
 
-    def asc_nulls_first(columnName: String): ColumnReturns a sort expression based on ascending order of the column, and null values return before non-null values. Returns a sort expression based on ascending order of the column, and null values return before non-null values. df.sort(asc_nulls_first("dept"), desc("age")) - Since
- 2.1.0 
 
-    def asc_nulls_last(columnName: String): ColumnReturns a sort expression based on ascending order of the column, and null values appear after non-null values. Returns a sort expression based on ascending order of the column, and null values appear after non-null values. df.sort(asc_nulls_last("dept"), desc("age")) - Since
- 2.1.0 
 
-    def ascii(e: Column): ColumnComputes the numeric value of the first character of the string column, and returns the result as an int column. Computes the numeric value of the first character of the string column, and returns the result as an int column. - Since
- 1.5.0 
 
-    def asin(columnName: String): Column- returns
- inverse sine of - columnName, as if computed by- java.lang.Math.asin
 - Since
- 1.4.0 
 
-    def asin(e: Column): Column- returns
- inverse sine of - ein radians, as if computed by- java.lang.Math.asin
 - Since
- 1.4.0 
 
-    def asinh(columnName: String): Column- returns
- inverse hyperbolic sine of - columnName
 - Since
- 3.1.0 
 
-    def asinh(e: Column): Column- returns
- inverse hyperbolic sine of - e
 - Since
- 3.1.0 
 
-    def assert_true(c: Column, e: Column): ColumnReturns null if the condition is true; throws an exception with the error message otherwise. Returns null if the condition is true; throws an exception with the error message otherwise. - Since
- 3.1.0 
 
-    def assert_true(c: Column): ColumnReturns null if the condition is true, and throws an exception otherwise. Returns null if the condition is true, and throws an exception otherwise. - Since
- 3.1.0 
 
-    def atan(columnName: String): Column- returns
- inverse tangent of - columnName, as if computed by- java.lang.Math.atan
 - Since
- 1.4.0 
 
-    def atan(e: Column): Column- returns
- inverse tangent of - eas if computed by- java.lang.Math.atan
 - Since
- 1.4.0 
 
-    def atan2(yValue: Double, xName: String): Column- yValue
- coordinate on y-axis 
- xName
- coordinate on x-axis 
- returns
- the theta component of the point (r, theta) in polar coordinates that corresponds to the point (x, y) in Cartesian coordinates, as if computed by - java.lang.Math.atan2
 - Since
- 1.4.0 
 
-    def atan2(yValue: Double, x: Column): Column- yValue
- coordinate on y-axis 
- x
- coordinate on x-axis 
- returns
- the theta component of the point (r, theta) in polar coordinates that corresponds to the point (x, y) in Cartesian coordinates, as if computed by - java.lang.Math.atan2
 - Since
- 1.4.0 
 
-    def atan2(yName: String, xValue: Double): Column- yName
- coordinate on y-axis 
- xValue
- coordinate on x-axis 
- returns
- the theta component of the point (r, theta) in polar coordinates that corresponds to the point (x, y) in Cartesian coordinates, as if computed by - java.lang.Math.atan2
 - Since
- 1.4.0 
 
-    def atan2(y: Column, xValue: Double): Column- y
- coordinate on y-axis 
- xValue
- coordinate on x-axis 
- returns
- the theta component of the point (r, theta) in polar coordinates that corresponds to the point (x, y) in Cartesian coordinates, as if computed by - java.lang.Math.atan2
 - Since
- 1.4.0 
 
-    def atan2(yName: String, xName: String): Column- yName
- coordinate on y-axis 
- xName
- coordinate on x-axis 
- returns
- the theta component of the point (r, theta) in polar coordinates that corresponds to the point (x, y) in Cartesian coordinates, as if computed by - java.lang.Math.atan2
 - Since
- 1.4.0 
 
-    def atan2(yName: String, x: Column): Column- yName
- coordinate on y-axis 
- x
- coordinate on x-axis 
- returns
- the theta component of the point (r, theta) in polar coordinates that corresponds to the point (x, y) in Cartesian coordinates, as if computed by - java.lang.Math.atan2
 - Since
- 1.4.0 
 
-    def atan2(y: Column, xName: String): Column- y
- coordinate on y-axis 
- xName
- coordinate on x-axis 
- returns
- the theta component of the point (r, theta) in polar coordinates that corresponds to the point (x, y) in Cartesian coordinates, as if computed by - java.lang.Math.atan2
 - Since
- 1.4.0 
 
-    def atan2(y: Column, x: Column): Column- y
- coordinate on y-axis 
- x
- coordinate on x-axis 
- returns
- the theta component of the point (r, theta) in polar coordinates that corresponds to the point (x, y) in Cartesian coordinates, as if computed by - java.lang.Math.atan2
 - Since
- 1.4.0 
 
-    def atanh(columnName: String): Column- returns
- inverse hyperbolic tangent of - columnName
 - Since
- 3.1.0 
 
-    def atanh(e: Column): Column- returns
- inverse hyperbolic tangent of - e
 - Since
- 3.1.0 
 
-    def avg(columnName: String): ColumnAggregate function: returns the average of the values in a group. Aggregate function: returns the average of the values in a group. - Since
- 1.3.0 
 
-    def avg(e: Column): ColumnAggregate function: returns the average of the values in a group. Aggregate function: returns the average of the values in a group. - Since
- 1.3.0 
 
-    def base64(e: Column): ColumnComputes the BASE64 encoding of a binary column and returns it as a string column. Computes the BASE64 encoding of a binary column and returns it as a string column. This is the reverse of unbase64. - Since
- 1.5.0 
 
-    def bin(columnName: String): ColumnAn expression that returns the string representation of the binary value of the given long column. An expression that returns the string representation of the binary value of the given long column. For example, bin("12") returns "1100". - Since
- 1.5.0 
 
-    def bin(e: Column): ColumnAn expression that returns the string representation of the binary value of the given long column. An expression that returns the string representation of the binary value of the given long column. For example, bin("12") returns "1100". - Since
- 1.5.0 
 
-    def bit_and(e: Column): ColumnAggregate function: returns the bitwise AND of all non-null input values, or null if none. Aggregate function: returns the bitwise AND of all non-null input values, or null if none. - Since
- 3.5.0 
 
-    def bit_count(e: Column): ColumnReturns the number of bits that are set in the argument expr as an unsigned 64-bit integer, or NULL if the argument is NULL. Returns the number of bits that are set in the argument expr as an unsigned 64-bit integer, or NULL if the argument is NULL. - Since
- 3.5.0 
 
-    def bit_get(e: Column, pos: Column): ColumnReturns the value of the bit (0 or 1) at the specified position. Returns the value of the bit (0 or 1) at the specified position. The positions are numbered from right to left, starting at zero. The position argument cannot be negative. - Since
- 3.5.0 
 
-    def bit_length(e: Column): ColumnCalculates the bit length for the specified string column. Calculates the bit length for the specified string column. - Since
- 3.3.0 
 
-    def bit_or(e: Column): ColumnAggregate function: returns the bitwise OR of all non-null input values, or null if none. Aggregate function: returns the bitwise OR of all non-null input values, or null if none. - Since
- 3.5.0 
 
-    def bit_xor(e: Column): ColumnAggregate function: returns the bitwise XOR of all non-null input values, or null if none. Aggregate function: returns the bitwise XOR of all non-null input values, or null if none. - Since
- 3.5.0 
 
-    def bitmap_and_agg(col: Column): ColumnReturns a bitmap that is the bitwise AND of all of the bitmaps from the input column. Returns a bitmap that is the bitwise AND of all of the bitmaps from the input column. The input column should be bitmaps created from bitmap_construct_agg(). - Since
- 4.1.0 
 
-    def bitmap_bit_position(col: Column): ColumnReturns the bucket number for the given input column. Returns the bucket number for the given input column. - Since
- 3.5.0 
 
-    def bitmap_bucket_number(col: Column): ColumnReturns the bit position for the given input column. Returns the bit position for the given input column. - Since
- 3.5.0 
 
-    def bitmap_construct_agg(col: Column): ColumnReturns a bitmap with the positions of the bits set from all the values from the input column. Returns a bitmap with the positions of the bits set from all the values from the input column. The input column will most likely be bitmap_bit_position(). - Since
- 3.5.0 
 
-    def bitmap_count(col: Column): ColumnReturns the number of set bits in the input bitmap. Returns the number of set bits in the input bitmap. - Since
- 3.5.0 
 
-    def bitmap_or_agg(col: Column): ColumnReturns a bitmap that is the bitwise OR of all of the bitmaps from the input column. Returns a bitmap that is the bitwise OR of all of the bitmaps from the input column. The input column should be bitmaps created from bitmap_construct_agg(). - Since
- 3.5.0 
 
-    def bitwise_not(e: Column): ColumnComputes bitwise NOT (~) of a number. Computes bitwise NOT (~) of a number. - Since
- 3.2.0 
 
-    def bool_and(e: Column): ColumnAggregate function: returns true if all values of eare true.Aggregate function: returns true if all values of eare true.- Since
- 3.5.0 
 
-    def bool_or(e: Column): ColumnAggregate function: returns true if at least one value of eis true.Aggregate function: returns true if at least one value of eis true.- Since
- 3.5.0 
 
-    def broadcast[U](df: Dataset[U]): df.typeMarks a DataFrame as small enough for use in broadcast joins. Marks a DataFrame as small enough for use in broadcast joins. The following example marks the right DataFrame for broadcast hash join using joinKey.// left and right are DataFrames left.join(broadcast(right), "joinKey") - Since
- 1.5.0 
 
-    def bround(e: Column, scale: Column): ColumnRound the value of etoscaledecimal places with HALF_EVEN round mode ifscaleis greater than or equal to 0 or at integral part whenscaleis less than 0.Round the value of etoscaledecimal places with HALF_EVEN round mode ifscaleis greater than or equal to 0 or at integral part whenscaleis less than 0.- Since
- 4.0.0 
 
-    def bround(e: Column, scale: Int): ColumnRound the value of etoscaledecimal places with HALF_EVEN round mode ifscaleis greater than or equal to 0 or at integral part whenscaleis less than 0.Round the value of etoscaledecimal places with HALF_EVEN round mode ifscaleis greater than or equal to 0 or at integral part whenscaleis less than 0.- Since
- 2.0.0 
 
-    def bround(e: Column): ColumnReturns the value of the column erounded to 0 decimal places with HALF_EVEN round mode.Returns the value of the column erounded to 0 decimal places with HALF_EVEN round mode.- Since
- 2.0.0 
 
-    def btrim(str: Column, trim: Column): ColumnRemove the leading and trailing trimcharacters fromstr.Remove the leading and trailing trimcharacters fromstr.- Since
- 3.5.0 
 
-    def btrim(str: Column): ColumnRemoves the leading and trailing space characters from str.Removes the leading and trailing space characters from str.- Since
- 3.5.0 
 
-    def bucket(numBuckets: Int, e: Column): Column(Java-specific) A transform for any type that partitions by a hash of the input column. (Java-specific) A transform for any type that partitions by a hash of the input column. - Since
- 3.0.0 
 
-    def bucket(numBuckets: Column, e: Column): Column(Java-specific) A transform for any type that partitions by a hash of the input column. (Java-specific) A transform for any type that partitions by a hash of the input column. - Since
- 3.0.0 
 
-    def call_function(funcName: String, cols: Column*): ColumnCall a SQL function. Call a SQL function. - funcName
- function name that follows the SQL identifier syntax (can be quoted, can be qualified) 
- cols
- the expression parameters of function 
 - Annotations
- @varargs()
- Since
- 3.5.0 
 
-    def call_udf(udfName: String, cols: Column*): ColumnCall an user-defined function. Call an user-defined function. Example: import org.apache.spark.sql._ val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value") val spark = df.sparkSession spark.udf.register("simpleUDF", (v: Int) => v * v) df.select($"id", call_udf("simpleUDF", $"value")) - Annotations
- @varargs()
- Since
- 3.2.0 
 
-    def cardinality(e: Column): ColumnReturns length of array or map. Returns length of array or map. This is an alias of sizefunction.This function returns -1 for null input only if spark.sql.ansi.enabled is false and spark.sql.legacy.sizeOfNull is true. Otherwise, it returns null for null input. With the default settings, the function returns null for null input. - Since
- 3.5.0 
 
-    def cbrt(columnName: String): ColumnComputes the cube-root of the given column. Computes the cube-root of the given column. - Since
- 1.4.0 
 
-    def cbrt(e: Column): ColumnComputes the cube-root of the given value. Computes the cube-root of the given value. - Since
- 1.4.0 
 
-    def ceil(columnName: String): ColumnComputes the ceiling of the given value of eto 0 decimal places.Computes the ceiling of the given value of eto 0 decimal places.- Since
- 1.4.0 
 
-    def ceil(e: Column): ColumnComputes the ceiling of the given value of eto 0 decimal places.Computes the ceiling of the given value of eto 0 decimal places.- Since
- 1.4.0 
 
-    def ceil(e: Column, scale: Column): ColumnComputes the ceiling of the given value of etoscaledecimal places.Computes the ceiling of the given value of etoscaledecimal places.- Since
- 3.3.0 
 
-    def ceiling(e: Column): ColumnComputes the ceiling of the given value of eto 0 decimal places.Computes the ceiling of the given value of eto 0 decimal places.- Since
- 3.5.0 
 
-    def ceiling(e: Column, scale: Column): ColumnComputes the ceiling of the given value of etoscaledecimal places.Computes the ceiling of the given value of etoscaledecimal places.- Since
- 3.5.0 
 
-    def char(n: Column): ColumnReturns the ASCII character having the binary equivalent to n.Returns the ASCII character having the binary equivalent to n. If n is larger than 256 the result is equivalent to char(n % 256)- Since
- 3.5.0 
 
-    def char_length(str: Column): ColumnReturns the character length of string data or number of bytes of binary data. Returns the character length of string data or number of bytes of binary data. The length of string data includes the trailing spaces. The length of binary data includes binary zeros. - Since
- 3.5.0 
 
-    def character_length(str: Column): ColumnReturns the character length of string data or number of bytes of binary data. Returns the character length of string data or number of bytes of binary data. The length of string data includes the trailing spaces. The length of binary data includes binary zeros. - Since
- 3.5.0 
 
-    def chr(n: Column): ColumnReturns the ASCII character having the binary equivalent to n.Returns the ASCII character having the binary equivalent to n. If n is larger than 256 the result is equivalent to chr(n % 256)- Since
- 3.5.0 
 
-    def clone(): AnyRef- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.CloneNotSupportedException]) @IntrinsicCandidate() @native()
 
-    def coalesce(e: Column*): ColumnReturns the first column that is not null, or null if all inputs are null. Returns the first column that is not null, or null if all inputs are null. For example, coalesce(a, b, c)will return a if a is not null, or b if a is null and b is not null, or c if both a and b are null but c is not null.- Annotations
- @varargs()
- Since
- 1.3.0 
 
-    def col(colName: String): ColumnReturns a Column based on the given column name. Returns a Column based on the given column name. - Since
- 1.3.0 
 
-    def collate(e: Column, collation: String): ColumnMarks a given column with specified collation. Marks a given column with specified collation. - Since
- 4.0.0 
 
-    def collation(e: Column): ColumnReturns the collation name of a given column. Returns the collation name of a given column. - Since
- 4.0.0 
 
-    def collect_list(columnName: String): ColumnAggregate function: returns a list of objects with duplicates. Aggregate function: returns a list of objects with duplicates. - Since
- 1.6.0 
- Note
- The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle. 
 
-    def collect_list(e: Column): ColumnAggregate function: returns a list of objects with duplicates. Aggregate function: returns a list of objects with duplicates. - Since
- 1.6.0 
- Note
- The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle. 
 
-    def collect_set(columnName: String): ColumnAggregate function: returns a set of objects with duplicate elements eliminated. Aggregate function: returns a set of objects with duplicate elements eliminated. - Since
- 1.6.0 
- Note
- The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle. 
 
-    def collect_set(e: Column): ColumnAggregate function: returns a set of objects with duplicate elements eliminated. Aggregate function: returns a set of objects with duplicate elements eliminated. - Since
- 1.6.0 
- Note
- The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle. 
 
-    def column(colName: String): ColumnReturns a Column based on the given column name. 
-    def concat(exprs: Column*): ColumnConcatenates multiple input columns together into a single column. Concatenates multiple input columns together into a single column. The function works with strings, binary and compatible array columns. - Annotations
- @varargs()
- Since
- 1.5.0 
- Note
- Returns null if any of the input columns are null. 
 
-    def concat_ws(sep: String, exprs: Column*): ColumnConcatenates multiple input string columns together into a single string column, using the given separator. Concatenates multiple input string columns together into a single string column, using the given separator. - Annotations
- @varargs()
- Since
- 1.5.0 
- Note
- Input strings which are null are skipped. 
 
-    def contains(left: Column, right: Column): ColumnReturns a boolean. Returns a boolean. The value is True if right is found inside left. Returns NULL if either input expression is NULL. Otherwise, returns False. Both left or right must be of STRING or BINARY type. - Since
- 3.5.0 
 
-    def conv(num: Column, fromBase: Int, toBase: Int): ColumnConvert a number in a string column from one base to another. Convert a number in a string column from one base to another. - Since
- 1.5.0 
 
-    def convert_timezone(targetTz: Column, sourceTs: Column): ColumnConverts the timestamp without time zone sourceTsfrom the current time zone totargetTz.Converts the timestamp without time zone sourceTsfrom the current time zone totargetTz.- targetTz
- the time zone to which the input timestamp should be converted. 
- sourceTs
- a timestamp without time zone. 
 - Since
- 3.5.0 
 
-    def convert_timezone(sourceTz: Column, targetTz: Column, sourceTs: Column): ColumnConverts the timestamp without time zone sourceTsfrom thesourceTztime zone totargetTz.Converts the timestamp without time zone sourceTsfrom thesourceTztime zone totargetTz.- sourceTz
- the time zone for the input timestamp. If it is missed, the current session time zone is used as the source time zone. 
- targetTz
- the time zone to which the input timestamp should be converted. 
- sourceTs
- a timestamp without time zone. 
 - Since
- 3.5.0 
 
-    def corr(columnName1: String, columnName2: String): ColumnAggregate function: returns the Pearson Correlation Coefficient for two columns. Aggregate function: returns the Pearson Correlation Coefficient for two columns. - Since
- 1.6.0 
 
-    def corr(column1: Column, column2: Column): ColumnAggregate function: returns the Pearson Correlation Coefficient for two columns. Aggregate function: returns the Pearson Correlation Coefficient for two columns. - Since
- 1.6.0 
 
-    def cos(columnName: String): Column- columnName
- angle in radians 
- returns
- cosine of the angle, as if computed by - java.lang.Math.cos
 - Since
- 1.4.0 
 
-    def cos(e: Column): Column- e
- angle in radians 
- returns
- cosine of the angle, as if computed by - java.lang.Math.cos
 - Since
- 1.4.0 
 
-    def cosh(columnName: String): Column- columnName
- hyperbolic angle 
- returns
- hyperbolic cosine of the angle, as if computed by - java.lang.Math.cosh
 - Since
- 1.4.0 
 
-    def cosh(e: Column): Column- e
- hyperbolic angle 
- returns
- hyperbolic cosine of the angle, as if computed by - java.lang.Math.cosh
 - Since
- 1.4.0 
 
-    def cot(e: Column): Column- e
- angle in radians 
- returns
- cotangent of the angle 
 - Since
- 3.3.0 
 
-    def count(columnName: String): TypedColumn[Any, Long]Aggregate function: returns the number of items in a group. Aggregate function: returns the number of items in a group. - Since
- 1.3.0 
 
-    def count(e: Column): ColumnAggregate function: returns the number of items in a group. Aggregate function: returns the number of items in a group. - Since
- 1.3.0 
 
-    def countDistinct(columnName: String, columnNames: String*): ColumnAggregate function: returns the number of distinct items in a group. Aggregate function: returns the number of distinct items in a group. An alias of count_distinct, and it is encouraged to usecount_distinctdirectly.- Annotations
- @varargs()
- Since
- 1.3.0 
 
-    def countDistinct(expr: Column, exprs: Column*): ColumnAggregate function: returns the number of distinct items in a group. Aggregate function: returns the number of distinct items in a group. An alias of count_distinct, and it is encouraged to usecount_distinctdirectly.- Annotations
- @varargs()
- Since
- 1.3.0 
 
-    def count_distinct(expr: Column, exprs: Column*): ColumnAggregate function: returns the number of distinct items in a group. Aggregate function: returns the number of distinct items in a group. - Annotations
- @varargs()
- Since
- 3.2.0 
 
-    def count_if(e: Column): ColumnAggregate function: returns the number of TRUEvalues for the expression.Aggregate function: returns the number of TRUEvalues for the expression.- Since
- 3.5.0 
 
-    def count_min_sketch(e: Column, eps: Column, confidence: Column): ColumnReturns a count-min sketch of a column with the given esp, confidence and seed. Returns a count-min sketch of a column with the given esp, confidence and seed. The result is an array of bytes, which can be deserialized to a CountMinSketchbefore usage. Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space.- Since
- 4.0.0 
 
-    def count_min_sketch(e: Column, eps: Column, confidence: Column, seed: Column): ColumnReturns a count-min sketch of a column with the given esp, confidence and seed. Returns a count-min sketch of a column with the given esp, confidence and seed. The result is an array of bytes, which can be deserialized to a CountMinSketchbefore usage. Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space.- Since
- 3.5.0 
 
-    def covar_pop(columnName1: String, columnName2: String): ColumnAggregate function: returns the population covariance for two columns. Aggregate function: returns the population covariance for two columns. - Since
- 2.0.0 
 
-    def covar_pop(column1: Column, column2: Column): ColumnAggregate function: returns the population covariance for two columns. Aggregate function: returns the population covariance for two columns. - Since
- 2.0.0 
 
-    def covar_samp(columnName1: String, columnName2: String): ColumnAggregate function: returns the sample covariance for two columns. Aggregate function: returns the sample covariance for two columns. - Since
- 2.0.0 
 
-    def covar_samp(column1: Column, column2: Column): ColumnAggregate function: returns the sample covariance for two columns. Aggregate function: returns the sample covariance for two columns. - Since
- 2.0.0 
 
-    def crc32(e: Column): ColumnCalculates the cyclic redundancy check value (CRC32) of a binary column and returns the value as a bigint. Calculates the cyclic redundancy check value (CRC32) of a binary column and returns the value as a bigint. - Since
- 1.5.0 
 
-    def csc(e: Column): Column- e
- angle in radians 
- returns
- cosecant of the angle 
 - Since
- 3.3.0 
 
-    def cume_dist(): ColumnWindow function: returns the cumulative distribution of values within a window partition, i.e. Window function: returns the cumulative distribution of values within a window partition, i.e. the fraction of rows that are below the current row. N = total number of rows in the partition cumeDist(x) = number of values before (and including) x / N - Since
- 1.6.0 
 
-    def curdate(): ColumnReturns the current date at the start of query evaluation as a date column. Returns the current date at the start of query evaluation as a date column. All calls of current_date within the same query return the same value. - Since
- 3.5.0 
 
-    def current_catalog(): ColumnReturns the current catalog. Returns the current catalog. - Since
- 3.5.0 
 
-    def current_database(): ColumnReturns the current database. Returns the current database. - Since
- 3.5.0 
 
-    def current_date(): ColumnReturns the current date at the start of query evaluation as a date column. Returns the current date at the start of query evaluation as a date column. All calls of current_date within the same query return the same value. - Since
- 1.5.0 
 
-    def current_schema(): ColumnReturns the current schema. Returns the current schema. - Since
- 3.5.0 
 
-    def current_time(precision: Int): ColumnReturns the current time at the start of query evaluation. Returns the current time at the start of query evaluation. - precision
- An integer literal in the range [0..6], indicating how many fractional digits of seconds to include in the result. 
- returns
- A time. 
 - Since
- 4.1.0 
 
-    def current_time(): ColumnReturns the current time at the start of query evaluation. Returns the current time at the start of query evaluation. Note that the result will contain 6 fractional digits of seconds. - returns
- A time. 
 - Since
- 4.1.0 
 
-    def current_timestamp(): ColumnReturns the current timestamp at the start of query evaluation as a timestamp column. Returns the current timestamp at the start of query evaluation as a timestamp column. All calls of current_timestamp within the same query return the same value. - Since
- 1.5.0 
 
-    def current_timezone(): ColumnReturns the current session local timezone. Returns the current session local timezone. - Since
- 3.5.0 
 
-    def current_user(): ColumnReturns the user name of current execution context. Returns the user name of current execution context. - Since
- 3.5.0 
 
-    def date_add(start: Column, days: Column): ColumnReturns the date that is daysdays afterstartReturns the date that is daysdays afterstart- start
- A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as - yyyy-MM-ddor- yyyy-MM-dd HH:mm:ss.SSSS
- days
- A column of the number of days to add to - start, can be negative to subtract days
- returns
- A date, or null if - startwas a string that could not be cast to a date
 - Since
- 3.0.0 
 
-    def date_add(start: Column, days: Int): ColumnReturns the date that is daysdays afterstartReturns the date that is daysdays afterstart- start
- A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as - yyyy-MM-ddor- yyyy-MM-dd HH:mm:ss.SSSS
- days
- The number of days to add to - start, can be negative to subtract days
- returns
- A date, or null if - startwas a string that could not be cast to a date
 - Since
- 1.5.0 
 
-    def date_diff(end: Column, start: Column): ColumnReturns the number of days from starttoend.Returns the number of days from starttoend.Only considers the date part of the input. For example: dateddiff("2018-01-10 00:00:00", "2018-01-09 23:59:59") // returns 1 - end
- A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as - yyyy-MM-ddor- yyyy-MM-dd HH:mm:ss.SSSS
- start
- A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as - yyyy-MM-ddor- yyyy-MM-dd HH:mm:ss.SSSS
- returns
- An integer, or null if either - endor- startwere strings that could not be cast to a date. Negative if- endis before- start
 - Since
- 3.5.0 
 
-    def date_format(dateExpr: Column, format: String): ColumnConverts a date/timestamp/string to a value of string in the format specified by the date format given by the second argument. Converts a date/timestamp/string to a value of string in the format specified by the date format given by the second argument. See Datetime Patterns for valid date and time format patterns - dateExpr
- A date, timestamp or string. If a string, the data must be in a format that can be cast to a timestamp, such as - yyyy-MM-ddor- yyyy-MM-dd HH:mm:ss.SSSS
- format
- A pattern - dd.MM.yyyywould return a string like- 18.03.1993
- returns
- A string, or null if - dateExprwas a string that could not be cast to a timestamp
 - Since
- 1.5.0 
- Exceptions thrown
- IllegalArgumentExceptionif the- formatpattern is invalid
- Note
- Use specialized functions like year whenever possible as they benefit from a specialized implementation. 
 
-    def date_from_unix_date(days: Column): ColumnCreate date from the number of dayssince 1970-01-01.Create date from the number of dayssince 1970-01-01.- Since
- 3.5.0 
 
-    def date_part(field: Column, source: Column): ColumnExtracts a part of the date/timestamp or interval source. Extracts a part of the date/timestamp or interval source. - field
- selects which part of the source should be extracted, and supported string values are as same as the fields of the equivalent function - extract.
- source
- a date/timestamp or interval column from where - fieldshould be extracted.
- returns
- a part of the date/timestamp or interval source 
 - Since
- 3.5.0 
 
-    def date_sub(start: Column, days: Column): ColumnReturns the date that is daysdays beforestartReturns the date that is daysdays beforestart- start
- A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as - yyyy-MM-ddor- yyyy-MM-dd HH:mm:ss.SSSS
- days
- A column of the number of days to subtract from - start, can be negative to add days
- returns
- A date, or null if - startwas a string that could not be cast to a date
 - Since
- 3.0.0 
 
-    def date_sub(start: Column, days: Int): ColumnReturns the date that is daysdays beforestartReturns the date that is daysdays beforestart- start
- A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as - yyyy-MM-ddor- yyyy-MM-dd HH:mm:ss.SSSS
- days
- The number of days to subtract from - start, can be negative to add days
- returns
- A date, or null if - startwas a string that could not be cast to a date
 - Since
- 1.5.0 
 
-    def date_trunc(format: String, timestamp: Column): ColumnReturns timestamp truncated to the unit specified by the format. Returns timestamp truncated to the unit specified by the format. For example, date_trunc("year", "2018-11-19 12:01:19")returns 2018-01-01 00:00:00- timestamp
- A date, timestamp or string. If a string, the data must be in a format that can be cast to a timestamp, such as - yyyy-MM-ddor- yyyy-MM-dd HH:mm:ss.SSSS
- returns
- A timestamp, or null if - timestampwas a string that could not be cast to a timestamp or- formatwas an invalid value
 - Since
- 2.3.0 
 
-    def dateadd(start: Column, days: Column): ColumnReturns the date that is daysdays afterstartReturns the date that is daysdays afterstart- start
- A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as - yyyy-MM-ddor- yyyy-MM-dd HH:mm:ss.SSSS
- days
- A column of the number of days to add to - start, can be negative to subtract days
- returns
- A date, or null if - startwas a string that could not be cast to a date
 - Since
- 3.5.0 
 
-    def datediff(end: Column, start: Column): ColumnReturns the number of days from starttoend.Returns the number of days from starttoend.Only considers the date part of the input. For example: dateddiff("2018-01-10 00:00:00", "2018-01-09 23:59:59") // returns 1 - end
- A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as - yyyy-MM-ddor- yyyy-MM-dd HH:mm:ss.SSSS
- start
- A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as - yyyy-MM-ddor- yyyy-MM-dd HH:mm:ss.SSSS
- returns
- An integer, or null if either - endor- startwere strings that could not be cast to a date. Negative if- endis before- start
 - Since
- 1.5.0 
 
-    def datepart(field: Column, source: Column): ColumnExtracts a part of the date/timestamp or interval source. Extracts a part of the date/timestamp or interval source. - field
- selects which part of the source should be extracted, and supported string values are as same as the fields of the equivalent function - EXTRACT.
- source
- a date/timestamp or interval column from where - fieldshould be extracted.
- returns
- a part of the date/timestamp or interval source 
 - Since
- 3.5.0 
 
-    def day(e: Column): ColumnExtracts the day of the month as an integer from a given date/timestamp/string. Extracts the day of the month as an integer from a given date/timestamp/string. - returns
- An integer, or null if the input was a string that could not be cast to a date 
 - Since
- 3.5.0 
 
-    def dayname(timeExp: Column): ColumnExtracts the three-letter abbreviated day name from a given date/timestamp/string. Extracts the three-letter abbreviated day name from a given date/timestamp/string. - Since
- 4.0.0 
 
-    def dayofmonth(e: Column): ColumnExtracts the day of the month as an integer from a given date/timestamp/string. Extracts the day of the month as an integer from a given date/timestamp/string. - returns
- An integer, or null if the input was a string that could not be cast to a date 
 - Since
- 1.5.0 
 
-    def dayofweek(e: Column): ColumnExtracts the day of the week as an integer from a given date/timestamp/string. Extracts the day of the week as an integer from a given date/timestamp/string. Ranges from 1 for a Sunday through to 7 for a Saturday - returns
- An integer, or null if the input was a string that could not be cast to a date 
 - Since
- 2.3.0 
 
-    def dayofyear(e: Column): ColumnExtracts the day of the year as an integer from a given date/timestamp/string. Extracts the day of the year as an integer from a given date/timestamp/string. - returns
- An integer, or null if the input was a string that could not be cast to a date 
 - Since
- 1.5.0 
 
-    def days(e: Column): Column(Java-specific) A transform for timestamps and dates to partition data into days. (Java-specific) A transform for timestamps and dates to partition data into days. - Since
- 3.0.0 
 
-    def decode(value: Column, charset: String): ColumnComputes the first argument into a string from a binary using the provided character set (one of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16', 'UTF-32'). Computes the first argument into a string from a binary using the provided character set (one of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16', 'UTF-32'). If either argument is null, the result will also be null. - Since
- 1.5.0 
 
-    def degrees(columnName: String): ColumnConverts an angle measured in radians to an approximately equivalent angle measured in degrees. Converts an angle measured in radians to an approximately equivalent angle measured in degrees. - columnName
- angle in radians 
- returns
- angle in degrees, as if computed by - java.lang.Math.toDegrees
 - Since
- 2.1.0 
 
-    def degrees(e: Column): ColumnConverts an angle measured in radians to an approximately equivalent angle measured in degrees. Converts an angle measured in radians to an approximately equivalent angle measured in degrees. - e
- angle in radians 
- returns
- angle in degrees, as if computed by - java.lang.Math.toDegrees
 - Since
- 2.1.0 
 
-    def dense_rank(): ColumnWindow function: returns the rank of rows within a window partition, without any gaps. Window function: returns the rank of rows within a window partition, without any gaps. The difference between rank and dense_rank is that denseRank leaves no gaps in ranking sequence when there are ties. That is, if you were ranking a competition using dense_rank and had three people tie for second place, you would say that all three were in second place and that the next person came in third. Rank would give me sequential numbers, making the person that came in third place (after the ties) would register as coming in fifth. This is equivalent to the DENSE_RANK function in SQL. - Since
- 1.6.0 
 
-    def desc(columnName: String): ColumnReturns a sort expression based on the descending order of the column. Returns a sort expression based on the descending order of the column. df.sort(asc("dept"), desc("age")) - Since
- 1.3.0 
 
-    def desc_nulls_first(columnName: String): ColumnReturns a sort expression based on the descending order of the column, and null values appear before non-null values. Returns a sort expression based on the descending order of the column, and null values appear before non-null values. df.sort(asc("dept"), desc_nulls_first("age")) - Since
- 2.1.0 
 
-    def desc_nulls_last(columnName: String): ColumnReturns a sort expression based on the descending order of the column, and null values appear after non-null values. Returns a sort expression based on the descending order of the column, and null values appear after non-null values. df.sort(asc("dept"), desc_nulls_last("age")) - Since
- 2.1.0 
 
-    def e(): ColumnReturns Euler's number. Returns Euler's number. - Since
- 3.5.0 
 
-    def element_at(column: Column, value: Any): ColumnReturns element of array at given index in value if column is array. Returns element of array at given index in value if column is array. Returns value for the given key in value if column is map. - Since
- 2.4.0 
 
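A small sketch of both behaviors; note that array indexing here is 1-based (contrast with get, defined below, which is 0-based). df is an assumed example DataFrame:

import org.apache.spark.sql.functions.{array, element_at, lit, map}

df.select(element_at(array(lit("a"), lit("b")), 2))   // array lookup by 1-based index: "b"
df.select(element_at(map(lit("k"), lit("v")), "k"))   // map lookup by key: "v"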
-    def elt(inputs: Column*): ColumnReturns the n-th input, e.g., returnsinput2whennis 2.Returns the n-th input, e.g., returnsinput2whennis 2. The function returns NULL if the index exceeds the length of the array andspark.sql.ansi.enabledis set to false. Ifspark.sql.ansi.enabledis set to true, it throws ArrayIndexOutOfBoundsException for invalid indices.- Annotations
- @varargs()
- Since
- 3.5.0 
 
-    def encode(value: Column, charset: String): ColumnComputes the first argument into a binary from a string using the provided character set (one of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16', 'UTF-32'). Computes the first argument into a binary from a string using the provided character set (one of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16', 'UTF-32'). If either argument is null, the result will also be null. - Since
- 1.5.0 
 
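A round-trip sketch of encode and decode (df and its string column s are assumed example names):

import org.apache.spark.sql.functions.{col, decode, encode}

// String -> UTF-8 bytes -> String; the result equals the original column
df.select(decode(encode(col("s"), "UTF-8"), "UTF-8"))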
-    def endswith(str: Column, suffix: Column): ColumnReturns a boolean. Returns a boolean. The value is True if str ends with suffix. Returns NULL if either input expression is NULL. Otherwise, returns False. Both str and suffix must be of STRING or BINARY type. - Since
- 3.5.0 
 
-   final  def eq(arg0: AnyRef): Boolean- Definition Classes
- AnyRef
 
-    def equal_null(col1: Column, col2: Column): ColumnReturns same result as the EQUAL(=) operator for non-null operands, but returns true if both are null, false if one of them is null. Returns same result as the EQUAL(=) operator for non-null operands, but returns true if both are null, false if one of them is null. - Since
- 3.5.0 
 
-    def equals(arg0: AnyRef): Boolean- Definition Classes
- AnyRef → Any
 
-    def every(e: Column): ColumnAggregate function: returns true if all values of eare true.Aggregate function: returns true if all values of eare true.- Since
- 3.5.0 
 
-    def exists(column: Column, f: (Column) => Column): ColumnReturns whether a predicate holds for one or more elements in the array. Returns whether a predicate holds for one or more elements in the array. df.select(exists(col("i"), _ % 2 === 0)) - column
- the input array column 
- f
- col => predicate, the Boolean predicate to check the input column 
 - Since
- 3.0.0 
 
-    def exp(columnName: String): ColumnComputes the exponential of the given column. Computes the exponential of the given column. - Since
- 1.4.0 
 
-    def exp(e: Column): ColumnComputes the exponential of the given value. Computes the exponential of the given value. - Since
- 1.4.0 
 
-    def explode(e: Column): ColumnCreates a new row for each element in the given array or map column. Creates a new row for each element in the given array or map column. Uses the default column name colfor elements in the array andkeyandvaluefor elements in the map unless specified otherwise.- Since
- 1.3.0 
 
-    def explode_outer(e: Column): ColumnCreates a new row for each element in the given array or map column. Creates a new row for each element in the given array or map column. Uses the default column name colfor elements in the array andkeyandvaluefor elements in the map unless specified otherwise. Unlike explode, if the array/map is null or empty then null is produced.- Since
- 2.2.0 
 
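A sketch contrasting the two generators (df with an id column and an array column tags are assumed example names):

import org.apache.spark.sql.functions.{col, explode, explode_outer}

// explode drops rows whose array is null or empty;
// explode_outer keeps them, producing a null element instead
df.select(col("id"), explode(col("tags")))
df.select(col("id"), explode_outer(col("tags")))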
-    def expm1(columnName: String): ColumnComputes the exponential of the given column minus one. Computes the exponential of the given column minus one. - Since
- 1.4.0 
 
-    def expm1(e: Column): ColumnComputes the exponential of the given value minus one. Computes the exponential of the given value minus one. - Since
- 1.4.0 
 
-    def expr(expr: String): ColumnParses the expression string into the column that it represents, similar to Dataset#selectExpr. Parses the expression string into the column that it represents, similar to Dataset#selectExpr. 
// get the number of words of each length
df.groupBy(expr("length(word)")).count() 
-    def extract(field: Column, source: Column): ColumnExtracts a part of the date/timestamp or interval source. Extracts a part of the date/timestamp or interval source. - field
- selects which part of the source should be extracted. 
- source
- a date/timestamp or interval column from where - fieldshould be extracted.
- returns
- a part of the date/timestamp or interval source 
 - Since
- 3.5.0 
 
-    def factorial(e: Column): ColumnComputes the factorial of the given value. Computes the factorial of the given value. - Since
- 1.5.0 
 
-    def filter(column: Column, f: (Column, Column) => Column): ColumnReturns an array of elements for which a predicate holds in a given array. Returns an array of elements for which a predicate holds in a given array. df.select(filter(col("s"), (x, i) => i % 2 === 0)) - column
- the input array column 
- f
- (col, index) => predicate, the Boolean predicate to filter the input column given the index. Indices start at 0. 
 - Since
- 3.0.0 
 
-    def filter(column: Column, f: (Column) => Column): ColumnReturns an array of elements for which a predicate holds in a given array. Returns an array of elements for which a predicate holds in a given array. df.select(filter(col("s"), x => x % 2 === 0)) - column
- the input array column 
- f
- col => predicate, the Boolean predicate to filter the input column 
 - Since
- 3.0.0 
 
-    def find_in_set(str: Column, strArray: Column): ColumnReturns the index (1-based) of the given string ( str) in the comma-delimited list (strArray).Returns the index (1-based) of the given string ( str) in the comma-delimited list (strArray). Returns 0 if the string was not found, or if the given string (str) contains a comma.- Since
- 3.5.0 
 
-    def first(columnName: String): ColumnAggregate function: returns the first value of a column in a group. Aggregate function: returns the first value of a column in a group. The function by default returns the first values it sees. It will return the first non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned. - Since
- 1.3.0 
- Note
- The function is non-deterministic because its result depends on the order of the rows which may be non-deterministic after a shuffle. 
 
-    def first(e: Column): ColumnAggregate function: returns the first value in a group. Aggregate function: returns the first value in a group. The function by default returns the first values it sees. It will return the first non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned. - Since
- 1.3.0 
- Note
- The function is non-deterministic because its result depends on the order of the rows which may be non-deterministic after a shuffle. 
 
-    def first(columnName: String, ignoreNulls: Boolean): ColumnAggregate function: returns the first value of a column in a group. Aggregate function: returns the first value of a column in a group. The function by default returns the first values it sees. It will return the first non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned. - Since
- 2.0.0 
- Note
- The function is non-deterministic because its result depends on the order of the rows which may be non-deterministic after a shuffle. 
 
-    def first(e: Column, ignoreNulls: Boolean): ColumnAggregate function: returns the first value in a group. Aggregate function: returns the first value in a group. The function by default returns the first values it sees. It will return the first non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned. - Since
- 2.0.0 
- Note
- The function is non-deterministic because its result depends on the order of the rows which may be non-deterministic after a shuffle. 
 
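A usage sketch, assuming a DataFrame df with dept and bonus columns (names are illustrative). Because the result depends on row order, it is only well defined when the input ordering is deterministic:

import org.apache.spark.sql.functions.first

df.groupBy("dept").agg(first("bonus", ignoreNulls = true))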
-    def first_value(e: Column, ignoreNulls: Column): ColumnAggregate function: returns the first value in a group. Aggregate function: returns the first value in a group. The function by default returns the first values it sees. It will return the first non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned. - Since
- 3.5.0 
- Note
- The function is non-deterministic because its result depends on the order of the rows which may be non-deterministic after a shuffle. 
 
-    def first_value(e: Column): ColumnAggregate function: returns the first value in a group. Aggregate function: returns the first value in a group. - Since
- 3.5.0 
- Note
- The function is non-deterministic because its result depends on the order of the rows which may be non-deterministic after a shuffle. 
 
-    def flatten(e: Column): ColumnCreates a single array from an array of arrays. Creates a single array from an array of arrays. If a structure of nested arrays is deeper than two levels, only one level of nesting is removed. - Since
- 2.4.0 
 
-    def floor(columnName: String): ColumnComputes the floor of the given column value to 0 decimal places. Computes the floor of the given column value to 0 decimal places. - Since
- 1.4.0 
 
-    def floor(e: Column): ColumnComputes the floor of the given value of eto 0 decimal places.Computes the floor of the given value of eto 0 decimal places.- Since
- 1.4.0 
 
-    def floor(e: Column, scale: Column): ColumnComputes the floor of the given value of etoscaledecimal places.Computes the floor of the given value of etoscaledecimal places.- Since
- 3.3.0 
 
-    def forall(column: Column, f: (Column) => Column): ColumnReturns whether a predicate holds for every element in the array. Returns whether a predicate holds for every element in the array. df.select(forall(col("i"), x => x % 2 === 0)) - column
- the input array column 
- f
- col => predicate, the Boolean predicate to check the input column 
 - Since
- 3.0.0 
 
-    def format_number(x: Column, d: Int): ColumnFormats numeric column x to a format like '#,###,###.##', rounded to d decimal places with HALF_EVEN round mode, and returns the result as a string column. Formats numeric column x to a format like '#,###,###.##', rounded to d decimal places with HALF_EVEN round mode, and returns the result as a string column. If d is 0, the result has no decimal point or fractional part. If d is less than 0, the result will be null. - Since
- 1.5.0 
 
-    def format_string(format: String, arguments: Column*): ColumnFormats the arguments in printf-style and returns the result as a string column. Formats the arguments in printf-style and returns the result as a string column. - Annotations
- @varargs()
- Since
- 1.5.0 
 
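A short sketch of both formatters (df, name and revenue are assumed example names):

import org.apache.spark.sql.functions.{col, format_number, format_string}

df.select(format_number(col("revenue"), 2))                              // e.g. "1,234,567.89"
df.select(format_string("%s earned %.2f", col("name"), col("revenue")))  // printf-style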
-    def from_csv(e: Column, schema: Column, options: Map[String, String]): Column(Java-specific) Parses a column containing a CSV string into a StructTypewith the specified schema.(Java-specific) Parses a column containing a CSV string into a StructTypewith the specified schema. Returnsnull, in the case of an unparseable string.- e
- a string column containing CSV data. 
- schema
- the schema to use when parsing the CSV string 
- options
- options to control how the CSV is parsed. Accepts the same options as the CSV data source. See Data Source Option in the version you use. 
 - Since
- 3.0.0 
 
-    def from_csv(e: Column, schema: StructType, options: Map[String, String]): ColumnParses a column containing a CSV string into a StructTypewith the specified schema.Parses a column containing a CSV string into a StructTypewith the specified schema. Returnsnull, in the case of an unparseable string.- e
- a string column containing CSV data. 
- schema
- the schema to use when parsing the CSV string 
- options
- options to control how the CSV is parsed. Accepts the same options as the CSV data source. See Data Source Option in the version you use. 
 - Since
- 3.0.0 
 
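A minimal from_csv sketch using the StructType variant (df and its raw string column are assumed example names):

import org.apache.spark.sql.functions.{col, from_csv}
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val schema = new StructType().add("name", StringType).add("age", IntegerType)
// Unparseable rows yield null rather than failing the query
df.select(from_csv(col("raw"), schema, Map.empty[String, String]))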
-    def from_json(e: Column, schema: Column, options: Map[String, String]): Column(Java-specific) Parses a column containing a JSON string into a MapTypewithStringTypeas keys type,StructTypeorArrayTypeofStructTypes with the specified schema.(Java-specific) Parses a column containing a JSON string into a MapTypewithStringTypeas keys type,StructTypeorArrayTypeofStructTypes with the specified schema. Returnsnull, in the case of an unparseable string.- e
- a string column containing JSON data. 
- schema
- the schema to use when parsing the json string 
- options
- options to control how the json is parsed. Accepts the same options as the json data source. See Data Source Option in the version you use. 
 - Since
- 2.4.0 
 
-    def from_json(e: Column, schema: Column): Column(Scala-specific) Parses a column containing a JSON string into a MapTypewithStringTypeas keys type,StructTypeorArrayTypeofStructTypes with the specified schema.(Scala-specific) Parses a column containing a JSON string into a MapTypewithStringTypeas keys type,StructTypeorArrayTypeofStructTypes with the specified schema. Returnsnull, in the case of an unparseable string.- e
- a string column containing JSON data. 
- schema
- the schema to use when parsing the json string 
 - Since
- 2.4.0 
 
-    def from_json(e: Column, schema: String, options: Map[String, String]): Column(Scala-specific) Parses a column containing a JSON string into a MapTypewithStringTypeas keys type,StructTypeorArrayTypewith the specified schema.(Scala-specific) Parses a column containing a JSON string into a MapTypewithStringTypeas keys type,StructTypeorArrayTypewith the specified schema. Returnsnull, in the case of an unparseable string.- e
- a string column containing JSON data. 
- schema
- the schema as a DDL-formatted string. 
- options
- options to control how the json is parsed. Accepts the same options as the json data source. See Data Source Option in the version you use. 
 - Since
- 2.3.0 
 
-    def from_json(e: Column, schema: String, options: Map[String, String]): Column(Java-specific) Parses a column containing a JSON string into a MapTypewithStringTypeas keys type,StructTypeorArrayTypewith the specified schema.(Java-specific) Parses a column containing a JSON string into a MapTypewithStringTypeas keys type,StructTypeorArrayTypewith the specified schema. Returnsnull, in the case of an unparseable string.- e
- a string column containing JSON data. 
- schema
- the schema as a DDL-formatted string. 
- options
- options to control how the json is parsed. Accepts the same options as the json data source. See Data Source Option in the version you use. 
 - Since
- 2.1.0 
 
-    def from_json(e: Column, schema: DataType): ColumnParses a column containing a JSON string into a MapTypewithStringTypeas keys type,StructTypeorArrayTypewith the specified schema.Parses a column containing a JSON string into a MapTypewithStringTypeas keys type,StructTypeorArrayTypewith the specified schema. Returnsnull, in the case of an unparseable string.- e
- a string column containing JSON data. 
- schema
- the schema to use when parsing the json string 
 - Since
- 2.2.0 
 
-    def from_json(e: Column, schema: StructType): ColumnParses a column containing a JSON string into a StructTypewith the specified schema.Parses a column containing a JSON string into a StructTypewith the specified schema. Returnsnull, in the case of an unparseable string.- e
- a string column containing JSON data. 
- schema
- the schema to use when parsing the json string 
 - Since
- 2.1.0 
 
-    def from_json(e: Column, schema: DataType, options: Map[String, String]): Column(Java-specific) Parses a column containing a JSON string into a MapTypewithStringTypeas keys type,StructTypeorArrayTypewith the specified schema.(Java-specific) Parses a column containing a JSON string into a MapTypewithStringTypeas keys type,StructTypeorArrayTypewith the specified schema. Returnsnull, in the case of an unparseable string.- e
- a string column containing JSON data. 
- schema
- the schema to use when parsing the json string 
- options
- options to control how the json is parsed. Accepts the same options as the json data source. See Data Source Option in the version you use. 
 - Since
- 2.2.0 
 
-    def from_json(e: Column, schema: StructType, options: Map[String, String]): Column(Java-specific) Parses a column containing a JSON string into a StructTypewith the specified schema.(Java-specific) Parses a column containing a JSON string into a StructTypewith the specified schema. Returnsnull, in the case of an unparseable string.- e
- a string column containing JSON data. 
- schema
- the schema to use when parsing the json string 
- options
- options to control how the json is parsed. Accepts the same options as the json data source. See Data Source Option in the version you use. 
 - Since
- 2.1.0 
 
-    def from_json(e: Column, schema: DataType, options: Map[String, String]): Column(Scala-specific) Parses a column containing a JSON string into a MapTypewithStringTypeas keys type,StructTypeorArrayTypewith the specified schema.(Scala-specific) Parses a column containing a JSON string into a MapTypewithStringTypeas keys type,StructTypeorArrayTypewith the specified schema. Returnsnull, in the case of an unparseable string.- e
- a string column containing JSON data. 
- schema
- the schema to use when parsing the json string 
- options
- options to control how the json is parsed. Accepts the same options as the json data source. See Data Source Option in the version you use. 
 - Since
- 2.2.0 
 
-    def from_json(e: Column, schema: StructType, options: Map[String, String]): Column(Scala-specific) Parses a column containing a JSON string into a StructTypewith the specified schema.(Scala-specific) Parses a column containing a JSON string into a StructTypewith the specified schema. Returnsnull, in the case of an unparseable string.- e
- a string column containing JSON data. 
- schema
- the schema to use when parsing the json string 
- options
- options to control how the json is parsed. Accepts the same options as the json data source. See Data Source Option in the version you use. 
 - Since
- 2.1.0 
 
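A minimal from_json sketch using the StructType variant (df and its json column are assumed example names):

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val schema = new StructType().add("id", IntegerType).add("name", StringType)
df.select(from_json(col("json"), schema).alias("parsed"))
  .select(col("parsed.id"), col("parsed.name"))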
-    def from_unixtime(ut: Column, f: String): ColumnConverts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the given format. Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the given format. See Datetime Patterns for valid date and time format patterns. - ut
- A number of a type that is castable to a long, such as string or integer. Can be negative for timestamps before the unix epoch 
- f
- A date time pattern that the input will be formatted to 
- returns
- A string, or null if - utwas a string that could not be cast to a long or- fwas an invalid date time pattern
 - Since
- 1.5.0 
 
-    def from_unixtime(ut: Column): ColumnConverts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the yyyy-MM-dd HH:mm:ss format. Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the yyyy-MM-dd HH:mm:ss format. - ut
- A number of a type that is castable to a long, such as string or integer. Can be negative for timestamps before the unix epoch 
- returns
- A string, or null if the input was a string that could not be cast to a long 
 - Since
- 1.5.0 
 
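A usage sketch of both overloads (df and its epoch column are assumed example names); the rendering uses the session time zone:

import org.apache.spark.sql.functions.{col, from_unixtime}

df.select(from_unixtime(col("epoch")))                 // default yyyy-MM-dd HH:mm:ss format
df.select(from_unixtime(col("epoch"), "yyyy-MM-dd"))   // custom pattern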
-    def from_utc_timestamp(ts: Column, tz: Column): ColumnGiven a timestamp like '2017-07-14 02:40:00.0', interprets it as a time in UTC, and renders that time as a timestamp in the given time zone. Given a timestamp like '2017-07-14 02:40:00.0', interprets it as a time in UTC, and renders that time as a timestamp in the given time zone. For example, 'GMT+1' would yield '2017-07-14 03:40:00.0'. - Since
- 2.4.0 
 
-    def from_utc_timestamp(ts: Column, tz: String): ColumnGiven a timestamp like '2017-07-14 02:40:00.0', interprets it as a time in UTC, and renders that time as a timestamp in the given time zone. Given a timestamp like '2017-07-14 02:40:00.0', interprets it as a time in UTC, and renders that time as a timestamp in the given time zone. For example, 'GMT+1' would yield '2017-07-14 03:40:00.0'. - ts
- A date, timestamp or string. If a string, the data must be in a format that can be cast to a timestamp, such as - yyyy-MM-ddor- yyyy-MM-dd HH:mm:ss.SSSS
- tz
- A string detailing the time zone ID that the input should be adjusted to. It should be in the format of either region-based zone IDs or zone offsets. Region IDs must have the form 'area/city', such as 'America/Los_Angeles'. Zone offsets must be in the format '(+|-)HH:mm', for example '-08:00' or '+01:00'. Also 'UTC' and 'Z' are supported as aliases of '+00:00'. Other short names are not recommended to use because they can be ambiguous. 
- returns
- A timestamp, or null if - tswas a string that could not be cast to a timestamp or- tzwas an invalid value
 - Since
- 1.5.0 
 
-    def from_xml(e: Column, schema: StructType): ColumnParses a column containing an XML string into the data type corresponding to the specified schema. Parses a column containing an XML string into the data type corresponding to the specified schema. Returns null, in the case of an unparseable string.- e
- a string column containing XML data. 
- schema
- the schema to use when parsing the XML string 
 - Since
- 4.0.0 
 
-    def from_xml(e: Column, schema: Column, options: Map[String, String]): Column(Java-specific) Parses a column containing an XML string into a StructTypewith the specified schema.(Java-specific) Parses a column containing an XML string into a StructTypewith the specified schema. Returnsnull, in the case of an unparseable string.- e
- a string column containing XML data. 
- schema
- the schema to use when parsing the XML string 
- options
- options to control how the XML is parsed. Accepts the same options as the XML data source. See Data Source Option in the version you use. 
 - Since
- 4.0.0 
 
-    def from_xml(e: Column, schema: Column): Column(Java-specific) Parses a column containing an XML string into a StructTypewith the specified schema.(Java-specific) Parses a column containing an XML string into a StructTypewith the specified schema. Returnsnull, in the case of an unparseable string.- e
- a string column containing XML data. 
- schema
- the schema to use when parsing the XML string 
 - Since
- 4.0.0 
 
-    def from_xml(e: Column, schema: String, options: Map[String, String]): Column(Java-specific) Parses a column containing an XML string into a StructTypewith the specified schema.(Java-specific) Parses a column containing an XML string into a StructTypewith the specified schema. Returnsnull, in the case of an unparseable string.- e
- a string column containing XML data. 
- schema
- the schema as a DDL-formatted string. 
- options
- options to control how the XML is parsed. Accepts the same options as the XML data source. See Data Source Option in the version you use. 
 - Since
- 4.0.0 
 
-    def from_xml(e: Column, schema: StructType, options: Map[String, String]): ColumnParses a column containing an XML string into the data type corresponding to the specified schema. Parses a column containing an XML string into the data type corresponding to the specified schema. Returns null, in the case of an unparseable string.- e
- a string column containing XML data. 
- schema
- the schema to use when parsing the XML string 
- options
- options to control how the XML is parsed. Accepts the same options as the XML data source. See Data Source Option in the version you use. 
 - Since
- 4.0.0 
 
-    def get(column: Column, index: Column): ColumnReturns element of array at given (0-based) index. Returns element of array at given (0-based) index. If the index points outside of the array boundaries, then this function returns NULL. - Since
- 3.4.0 
 
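A sketch contrasting get with element_at on the same array (df and its array column arr are assumed example names):

import org.apache.spark.sql.functions.{col, element_at, get, lit}

// get is 0-based and returns NULL for out-of-bounds indices;
// element_at addresses the same first element with index 1
df.select(get(col("arr"), lit(0)), element_at(col("arr"), 1))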
-   final  def getClass(): Class[_ <: AnyRef]- Definition Classes
- AnyRef → Any
- Annotations
- @IntrinsicCandidate() @native()
 
-    def get_json_object(e: Column, path: String): ColumnExtracts json object from a json string based on json path specified, and returns json string of the extracted json object. Extracts json object from a json string based on json path specified, and returns json string of the extracted json object. It will return null if the input json string is invalid. - Since
- 1.6.0 
 
-    def getbit(e: Column, pos: Column): ColumnReturns the value of the bit (0 or 1) at the specified position. Returns the value of the bit (0 or 1) at the specified position. The positions are numbered from right to left, starting at zero. The position argument cannot be negative. - Since
- 3.5.0 
 
-    def greatest(columnName: String, columnNames: String*): ColumnReturns the greatest value of the list of column names, skipping null values. Returns the greatest value of the list of column names, skipping null values. This function takes at least 2 parameters. It will return null iff all parameters are null. - Annotations
- @varargs()
- Since
- 1.5.0 
 
-    def greatest(exprs: Column*): ColumnReturns the greatest value of the list of values, skipping null values. Returns the greatest value of the list of values, skipping null values. This function takes at least 2 parameters. It will return null iff all parameters are null. - Annotations
- @varargs()
- Since
- 1.5.0 
 
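A row-wise sketch combining greatest with least (defined further down); df, q1, q2 and q3 are assumed example names:

import org.apache.spark.sql.functions.{col, greatest, least}

// Per-row max and min across columns, skipping nulls
df.select(greatest(col("q1"), col("q2"), col("q3")), least(col("q1"), col("q2"), col("q3")))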
-    def grouping(columnName: String): ColumnAggregate function: indicates whether a specified column in a GROUP BY list is aggregated or not, returns 1 for aggregated or 0 for not aggregated in the result set. Aggregate function: indicates whether a specified column in a GROUP BY list is aggregated or not, returns 1 for aggregated or 0 for not aggregated in the result set. - Since
- 2.0.0 
 
-    def grouping(e: Column): ColumnAggregate function: indicates whether a specified column in a GROUP BY list is aggregated or not, returns 1 for aggregated or 0 for not aggregated in the result set. Aggregate function: indicates whether a specified column in a GROUP BY list is aggregated or not, returns 1 for aggregated or 0 for not aggregated in the result set. - Since
- 2.0.0 
 
-    def grouping_id(colName: String, colNames: String*): ColumnAggregate function: returns the level of grouping, equal to Aggregate function: returns the level of grouping, equal to (grouping(c1) << (n-1)) + (grouping(c2) << (n-2)) + ... + grouping(cn) - Annotations
- @varargs()
- Since
- 2.0.0 
- Note
- The list of columns should match with grouping columns exactly. 
 
-    def grouping_id(cols: Column*): ColumnAggregate function: returns the level of grouping, equal to Aggregate function: returns the level of grouping, equal to (grouping(c1) << (n-1)) + (grouping(c2) << (n-2)) + ... + grouping(cn) - Annotations
- @varargs()
- Since
- 2.0.0 
- Note
- The list of columns should match with grouping columns exactly, or empty (means all the grouping columns). 
 
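A cube sketch showing how grouping and grouping_id mark subtotal rows (df, dept, city and sales are assumed example names):

import org.apache.spark.sql.functions.{grouping, grouping_id, sum}

// grouping("dept") is 1 on rows where dept is aggregated away;
// grouping_id() encodes the full grouping level as a bit vector
df.cube("dept", "city").agg(sum("sales"), grouping("dept"), grouping_id())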
-    def hash(cols: Column*): ColumnCalculates the hash code of given columns, and returns the result as an int column. Calculates the hash code of given columns, and returns the result as an int column. - Annotations
- @varargs()
- Since
- 2.0.0 
 
-    def hashCode(): Int- Definition Classes
- AnyRef → Any
- Annotations
- @IntrinsicCandidate() @native()
 
-    def hex(column: Column): ColumnComputes hex value of the given column. Computes hex value of the given column. - Since
- 1.5.0 
 
-    def histogram_numeric(e: Column, nBins: Column): ColumnAggregate function: computes a histogram on numeric 'expr' using nBins bins. Aggregate function: computes a histogram on numeric 'expr' using nBins bins. The return value is an array of (x,y) pairs representing the centers of the histogram's bins. As the value of 'nBins' is increased, the histogram approximation gets finer-grained, but may yield artifacts around outliers. In practice, 20-40 histogram bins appear to work well, with more bins being required for skewed or smaller datasets. Note that this function creates a histogram with non-uniform bin widths. It offers no guarantees in terms of the mean-squared-error of the histogram, but in practice is comparable to the histograms produced by the R/S-Plus statistical computing packages. Note: the output type of the 'x' field in the return value is propagated from the input value consumed in the aggregate function. - Since
- 3.5.0 
 
-    def hll_sketch_agg(columnName: String): ColumnAggregate function: returns the updatable binary representation of the Datasketches HllSketch configured with default lgConfigK value. Aggregate function: returns the updatable binary representation of the Datasketches HllSketch configured with default lgConfigK value. - Since
- 3.5.0 
 
-    def hll_sketch_agg(e: Column): ColumnAggregate function: returns the updatable binary representation of the Datasketches HllSketch configured with default lgConfigK value. Aggregate function: returns the updatable binary representation of the Datasketches HllSketch configured with default lgConfigK value. - Since
- 3.5.0 
 
-    def hll_sketch_agg(columnName: String, lgConfigK: Int): ColumnAggregate function: returns the updatable binary representation of the Datasketches HllSketch configured with lgConfigK arg. Aggregate function: returns the updatable binary representation of the Datasketches HllSketch configured with lgConfigK arg. - Since
- 3.5.0 
 
-    def hll_sketch_agg(e: Column, lgConfigK: Int): ColumnAggregate function: returns the updatable binary representation of the Datasketches HllSketch configured with lgConfigK arg. Aggregate function: returns the updatable binary representation of the Datasketches HllSketch configured with lgConfigK arg. - Since
- 3.5.0 
 
-    def hll_sketch_agg(e: Column, lgConfigK: Column): ColumnAggregate function: returns the updatable binary representation of the Datasketches HllSketch configured with lgConfigK arg. Aggregate function: returns the updatable binary representation of the Datasketches HllSketch configured with lgConfigK arg. - Since
- 3.5.0 
 
-    def hll_sketch_estimate(columnName: String): ColumnReturns the estimated number of unique values given the binary representation of a Datasketches HllSketch. Returns the estimated number of unique values given the binary representation of a Datasketches HllSketch. - Since
- 3.5.0 
 
-    def hll_sketch_estimate(c: Column): ColumnReturns the estimated number of unique values given the binary representation of a Datasketches HllSketch. Returns the estimated number of unique values given the binary representation of a Datasketches HllSketch. - Since
- 3.5.0 
 
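A sketch of approximate distinct counting that combines the two functions (df, dept and user_id are assumed example names):

import org.apache.spark.sql.functions.{col, hll_sketch_agg, hll_sketch_estimate}

// Build one HllSketch per group, then read off the cardinality estimate
df.groupBy("dept").agg(hll_sketch_estimate(hll_sketch_agg(col("user_id"))))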
-    def hll_union(columnName1: String, columnName2: String, allowDifferentLgConfigK: Boolean): ColumnMerges two binary representations of Datasketches HllSketch objects, using a Datasketches Union object. Merges two binary representations of Datasketches HllSketch objects, using a Datasketches Union object. Throws an exception if sketches have different lgConfigK values and allowDifferentLgConfigK is set to false. - Since
- 3.5.0 
 
-    def hll_union(c1: Column, c2: Column, allowDifferentLgConfigK: Boolean): ColumnMerges two binary representations of Datasketches HllSketch objects, using a Datasketches Union object. Merges two binary representations of Datasketches HllSketch objects, using a Datasketches Union object. Throws an exception if sketches have different lgConfigK values and allowDifferentLgConfigK is set to false. - Since
- 3.5.0 
 
-    def hll_union(columnName1: String, columnName2: String): ColumnMerges two binary representations of Datasketches HllSketch objects, using a Datasketches Union object. Merges two binary representations of Datasketches HllSketch objects, using a Datasketches Union object. Throws an exception if sketches have different lgConfigK values. - Since
- 3.5.0 
 
-    def hll_union(c1: Column, c2: Column): ColumnMerges two binary representations of Datasketches HllSketch objects, using a Datasketches Union object. Merges two binary representations of Datasketches HllSketch objects, using a Datasketches Union object. Throws an exception if sketches have different lgConfigK values. - Since
- 3.5.0 
 
-    def hll_union_agg(columnName: String): ColumnAggregate function: returns the updatable binary representation of the Datasketches HllSketch, generated by merging previously created Datasketches HllSketch instances via a Datasketches Union instance. Aggregate function: returns the updatable binary representation of the Datasketches HllSketch, generated by merging previously created Datasketches HllSketch instances via a Datasketches Union instance. Throws an exception if sketches have different lgConfigK values. - Since
- 3.5.0 
 
-    def hll_union_agg(e: Column): ColumnAggregate function: returns the updatable binary representation of the Datasketches HllSketch, generated by merging previously created Datasketches HllSketch instances via a Datasketches Union instance. Aggregate function: returns the updatable binary representation of the Datasketches HllSketch, generated by merging previously created Datasketches HllSketch instances via a Datasketches Union instance. Throws an exception if sketches have different lgConfigK values. - Since
- 3.5.0 
 
-    def hll_union_agg(columnName: String, allowDifferentLgConfigK: Boolean): ColumnAggregate function: returns the updatable binary representation of the Datasketches HllSketch, generated by merging previously created Datasketches HllSketch instances via a Datasketches Union instance. Aggregate function: returns the updatable binary representation of the Datasketches HllSketch, generated by merging previously created Datasketches HllSketch instances via a Datasketches Union instance. Throws an exception if sketches have different lgConfigK values and allowDifferentLgConfigK is set to false. - Since
- 3.5.0 
 
-    def hll_union_agg(e: Column, allowDifferentLgConfigK: Boolean): ColumnAggregate function: returns the updatable binary representation of the Datasketches HllSketch, generated by merging previously created Datasketches HllSketch instances via a Datasketches Union instance. Aggregate function: returns the updatable binary representation of the Datasketches HllSketch, generated by merging previously created Datasketches HllSketch instances via a Datasketches Union instance. Throws an exception if sketches have different lgConfigK values and allowDifferentLgConfigK is set to false. - Since
- 3.5.0 
 
-    def hll_union_agg(e: Column, allowDifferentLgConfigK: Column): ColumnAggregate function: returns the updatable binary representation of the Datasketches HllSketch, generated by merging previously created Datasketches HllSketch instances via a Datasketches Union instance. Aggregate function: returns the updatable binary representation of the Datasketches HllSketch, generated by merging previously created Datasketches HllSketch instances via a Datasketches Union instance. Throws an exception if sketches have different lgConfigK values and allowDifferentLgConfigK is set to false. - Since
- 3.5.0 
 
-    def hour(e: Column): ColumnExtracts the hours as an integer from a given date/time/timestamp/string. Extracts the hours as an integer from a given date/time/timestamp/string. - returns
- An integer, or null if the input was a string that could not be cast to a date 
 - Since
- 1.5.0 
 
-    def hours(e: Column): Column(Java-specific) A transform for timestamps to partition data into hours. (Java-specific) A transform for timestamps to partition data into hours. - Since
- 3.0.0 
 
-    def hypot(l: Double, rightName: String): ColumnComputes sqrt(a^2 + b^2) without intermediate overflow or underflow. Computes sqrt(a^2 + b^2) without intermediate overflow or underflow. - Since
- 1.4.0 
 
-    def hypot(l: Double, r: Column): ColumnComputes sqrt(a^2 + b^2) without intermediate overflow or underflow. Computes sqrt(a^2 + b^2) without intermediate overflow or underflow. - Since
- 1.4.0 
 
-    def hypot(leftName: String, r: Double): ColumnComputes sqrt(a^2 + b^2) without intermediate overflow or underflow. Computes sqrt(a^2 + b^2) without intermediate overflow or underflow. - Since
- 1.4.0 
 
-    def hypot(l: Column, r: Double): ColumnComputes sqrt(a^2 + b^2) without intermediate overflow or underflow. Computes sqrt(a^2 + b^2) without intermediate overflow or underflow. - Since
- 1.4.0 
 
-    def hypot(leftName: String, rightName: String): ColumnComputes sqrt(a^2 + b^2) without intermediate overflow or underflow. Computes sqrt(a^2 + b^2) without intermediate overflow or underflow. - Since
- 1.4.0 
 
-    def hypot(leftName: String, r: Column): ColumnComputes sqrt(a^2 + b^2) without intermediate overflow or underflow. Computes sqrt(a^2 + b^2) without intermediate overflow or underflow. - Since
- 1.4.0 
 
-    def hypot(l: Column, rightName: String): ColumnComputes sqrt(a^2 + b^2) without intermediate overflow or underflow. Computes sqrt(a^2 + b^2) without intermediate overflow or underflow. - Since
- 1.4.0 
 
-    def hypot(l: Column, r: Column): ColumnComputes sqrt(a^2 + b^2) without intermediate overflow or underflow. Computes sqrt(a^2 + b^2) without intermediate overflow or underflow. - Since
- 1.4.0 
 
-    def ifnull(col1: Column, col2: Column): ColumnReturns col2ifcol1is null, orcol1otherwise.Returns col2ifcol1is null, orcol1otherwise.- Since
- 3.5.0 
 
-    def ilike(str: Column, pattern: Column): ColumnReturns true if str matches patternwithescapeChar('\') case-insensitively, null if any arguments are null, false otherwise.Returns true if str matches patternwithescapeChar('\') case-insensitively, null if any arguments are null, false otherwise.- Since
- 3.5.0 
 
-    def ilike(str: Column, pattern: Column, escapeChar: Column): ColumnReturns true if str matches patternwithescapeCharcase-insensitively, null if any arguments are null, false otherwise.Returns true if str matches patternwithescapeCharcase-insensitively, null if any arguments are null, false otherwise.- Since
- 3.5.0 
 
-    def initcap(e: Column): ColumnReturns a new string column by converting the first letter of each word to uppercase. Returns a new string column by converting the first letter of each word to uppercase. Words are delimited by whitespace. For example, "hello world" will become "Hello World". - Since
- 1.5.0 
 
-    def inline(e: Column): ColumnCreates a new row for each element in the given array of structs. Creates a new row for each element in the given array of structs. - Since
- 3.4.0 
 
-    def inline_outer(e: Column): ColumnCreates a new row for each element in the given array of structs. Creates a new row for each element in the given array of structs. Unlike inline, if the array is null or empty then null is produced for each nested column. - Since
- 3.4.0 
 
-    def input_file_block_length(): ColumnReturns the length of the block being read, or -1 if not available. Returns the length of the block being read, or -1 if not available. - Since
- 3.5.0 
 
-    def input_file_block_start(): ColumnReturns the start offset of the block being read, or -1 if not available. Returns the start offset of the block being read, or -1 if not available. - Since
- 3.5.0 
 
-    def input_file_name(): ColumnCreates a string column for the file name of the current Spark task. Creates a string column for the file name of the current Spark task. - Since
- 1.6.0 
 
-    def instr(str: Column, substring: Column): ColumnLocate the position of the first occurrence of substr column in the given string. Locate the position of the first occurrence of substr column in the given string. Returns null if either of the arguments are null. - Since
- 4.0.0 
- Note
- The position is not zero-based, but a 1-based index. Returns 0 if substr could not be found in str. 
 
-    def instr(str: Column, substring: String): ColumnLocate the position of the first occurrence of substr column in the given string. Locate the position of the first occurrence of substr column in the given string. Returns null if either of the arguments are null. - Since
- 1.5.0 
- Note
- The position is not zero-based, but a 1-based index. Returns 0 if substr could not be found in str. 
 
-   final  def isInstanceOf[T0]: Boolean- Definition Classes
- Any
 
-    def is_valid_utf8(str: Column): ColumnReturns true if the input is a valid UTF-8 string, otherwise returns false. Returns true if the input is a valid UTF-8 string, otherwise returns false. - Since
- 4.0.0 
 
-    def is_variant_null(v: Column): ColumnCheck if a variant value is a variant null. Check if a variant value is a variant null. Returns true if and only if the input is a variant null and false otherwise (including in the case of SQL NULL). - v
- a variant column. 
 - Since
- 4.0.0 
 
-    def isnan(e: Column): ColumnReturn true iff the column is NaN. Return true iff the column is NaN. - Since
- 1.6.0 
 
-    def isnotnull(col: Column): ColumnReturns true if colis not null, or false otherwise.Returns true if colis not null, or false otherwise.- Since
- 3.5.0 
 
-    def isnull(e: Column): ColumnReturn true iff the column is null. Return true iff the column is null. - Since
- 1.6.0 
 
-    def java_method(cols: Column*): ColumnCalls a method with reflection. Calls a method with reflection. - Annotations
- @varargs()
- Since
- 3.5.0 
 
-    def json_array_length(e: Column): ColumnReturns the number of elements in the outermost JSON array. Returns the number of elements in the outermost JSON array. NULLis returned in case of any other valid JSON string,NULLor an invalid JSON.- Since
- 3.5.0 
 
-    def json_object_keys(e: Column): ColumnReturns all the keys of the outermost JSON object as an array. Returns all the keys of the outermost JSON object as an array. If a valid JSON object is given, all the keys of the outermost object will be returned as an array. If it is any other valid JSON string, an invalid JSON string or an empty string, the function returns null. - Since
- 3.5.0 
 
-    def json_tuple(json: Column, fields: String*): ColumnCreates a new row for a json column according to the given field names. Creates a new row for a json column according to the given field names. - Annotations
- @varargs()
- Since
- 1.6.0 
 
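A sketch extracting individual fields from a JSON string column without declaring a schema, using get_json_object (defined above) and json_tuple (df and its j column are assumed example names):

import org.apache.spark.sql.functions.{col, get_json_object, json_tuple}

df.select(get_json_object(col("j"), "$.name"))    // one field via a JSON path
df.select(json_tuple(col("j"), "name", "age"))    // one output column per field name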
-    def kurtosis(columnName: String): ColumnAggregate function: returns the kurtosis of the values in a group. Aggregate function: returns the kurtosis of the values in a group. - Since
- 1.6.0 
 
-    def kurtosis(e: Column): ColumnAggregate function: returns the kurtosis of the values in a group. Aggregate function: returns the kurtosis of the values in a group. - Since
- 1.6.0 
 
-    def lag(e: Column, offset: Int, defaultValue: Any, ignoreNulls: Boolean): ColumnWindow function: returns the value that is offsetrows before the current row, anddefaultValueif there are fewer thanoffsetrows before the current row.Window function: returns the value that is offsetrows before the current row, anddefaultValueif there are fewer thanoffsetrows before the current row.ignoreNullsdetermines whether null values of row are included in or eliminated from the calculation. For example, anoffsetof one will return the previous row at any given point in the window partition.This is equivalent to the LAG function in SQL. - Since
- 3.2.0 
 
-    def lag(e: Column, offset: Int, defaultValue: Any): ColumnWindow function: returns the value that is offsetrows before the current row, anddefaultValueif there are fewer thanoffsetrows before the current row.Window function: returns the value that is offsetrows before the current row, anddefaultValueif there are fewer thanoffsetrows before the current row. For example, anoffsetof one will return the previous row at any given point in the window partition.This is equivalent to the LAG function in SQL. - Since
- 1.4.0 
 
-    def lag(columnName: String, offset: Int, defaultValue: Any): ColumnWindow function: returns the value that is offsetrows before the current row, anddefaultValueif there are fewer thanoffsetrows before the current row.Window function: returns the value that is offsetrows before the current row, anddefaultValueif there are fewer thanoffsetrows before the current row. For example, anoffsetof one will return the previous row at any given point in the window partition.This is equivalent to the LAG function in SQL. - Since
- 1.4.0 
 
-    def lag(columnName: String, offset: Int): ColumnWindow function: returns the value that is offsetrows before the current row, andnullif there are fewer thanoffsetrows before the current row.Window function: returns the value that is offsetrows before the current row, andnullif there are fewer thanoffsetrows before the current row. For example, anoffsetof one will return the previous row at any given point in the window partition.This is equivalent to the LAG function in SQL. - Since
- 1.4.0 
 
-    def lag(e: Column, offset: Int): ColumnWindow function: returns the value that is offsetrows before the current row, andnullif there are fewer thanoffsetrows before the current row.Window function: returns the value that is offsetrows before the current row, andnullif there are fewer thanoffsetrows before the current row. For example, anoffsetof one will return the previous row at any given point in the window partition.This is equivalent to the LAG function in SQL. - Since
- 1.4.0 
 
-    def last(columnName: String): ColumnAggregate function: returns the last value of the column in a group. Aggregate function: returns the last value of the column in a group. The function by default returns the last values it sees. It will return the last non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned. - Since
- 1.3.0 
- Note
- The function is non-deterministic because its result depends on the order of the rows which may be non-deterministic after a shuffle. 
 
-    def last(e: Column): ColumnAggregate function: returns the last value in a group. Aggregate function: returns the last value in a group. The function by default returns the last values it sees. It will return the last non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned. - Since
- 1.3.0 
- Note
- The function is non-deterministic because its result depends on the order of the rows which may be non-deterministic after a shuffle. 
 
-    def last(columnName: String, ignoreNulls: Boolean): ColumnAggregate function: returns the last value of the column in a group. Aggregate function: returns the last value of the column in a group. The function by default returns the last values it sees. It will return the last non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned. - Since
- 2.0.0 
- Note
- The function is non-deterministic because its result depends on the order of the rows which may be non-deterministic after a shuffle. 
 
-    def last(e: Column, ignoreNulls: Boolean): ColumnAggregate function: returns the last value in a group. Aggregate function: returns the last value in a group. The function by default returns the last values it sees. It will return the last non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned. - Since
- 2.0.0 
- Note
- The function is non-deterministic because its result depends on the order of the rows which may be non-deterministic after a shuffle. 
 
-    def last_day(e: Column): ColumnReturns the last day of the month which the given date belongs to. Returns the last day of the month which the given date belongs to. For example, input "2015-07-27" returns "2015-07-31" since July 31 is the last day of the month in July 2015. - e
- A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as - yyyy-MM-ddor- yyyy-MM-dd HH:mm:ss.SSSS
- returns
- A date, or null if the input was a string that could not be cast to a date 
 - Since
- 1.5.0 
 
-    def last_value(e: Column, ignoreNulls: Column): ColumnAggregate function: returns the last value in a group. Aggregate function: returns the last value in a group. The function by default returns the last values it sees. It will return the last non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned. - Since
- 3.5.0 
- Note
- The function is non-deterministic because its result depends on the order of the rows which may be non-deterministic after a shuffle. 
 
-    def last_value(e: Column): ColumnAggregate function: returns the last value in a group. Aggregate function: returns the last value in a group. - Since
- 3.5.0 
- Note
- The function is non-deterministic because its result depends on the order of the rows which may be non-deterministic after a shuffle. 
 
-    def lcase(str: Column): ColumnReturns strwith all characters changed to lowercase.Returns strwith all characters changed to lowercase.- Since
- 3.5.0 
 
-    def lead(e: Column, offset: Int, defaultValue: Any, ignoreNulls: Boolean): ColumnWindow function: returns the value that is offsetrows after the current row, anddefaultValueif there are fewer thanoffsetrows after the current row.Window function: returns the value that is offsetrows after the current row, anddefaultValueif there are fewer thanoffsetrows after the current row.ignoreNullsdetermines whether null values of row are included in or eliminated from the calculation. The default value ofignoreNullsis false. For example, anoffsetof one will return the next row at any given point in the window partition.This is equivalent to the LEAD function in SQL. - Since
- 3.2.0 
 
-    def lead(e: Column, offset: Int, defaultValue: Any): ColumnWindow function: returns the value that is offsetrows after the current row, anddefaultValueif there are fewer thanoffsetrows after the current row.Window function: returns the value that is offsetrows after the current row, anddefaultValueif there are fewer thanoffsetrows after the current row. For example, anoffsetof one will return the next row at any given point in the window partition.This is equivalent to the LEAD function in SQL. - Since
- 1.4.0 
 
-    def lead(columnName: String, offset: Int, defaultValue: Any): ColumnWindow function: returns the value that is offsetrows after the current row, anddefaultValueif there are fewer thanoffsetrows after the current row.Window function: returns the value that is offsetrows after the current row, anddefaultValueif there are fewer thanoffsetrows after the current row. For example, anoffsetof one will return the next row at any given point in the window partition.This is equivalent to the LEAD function in SQL. - Since
- 1.4.0 
 
-    def lead(e: Column, offset: Int): ColumnWindow function: returns the value that is offsetrows after the current row, andnullif there are fewer thanoffsetrows after the current row.Window function: returns the value that is offsetrows after the current row, andnullif there are fewer thanoffsetrows after the current row. For example, anoffsetof one will return the next row at any given point in the window partition.This is equivalent to the LEAD function in SQL. - Since
- 1.4.0 
 
-    def lead(columnName: String, offset: Int): ColumnWindow function: returns the value that is offsetrows after the current row, andnullif there are fewer thanoffsetrows after the current row.Window function: returns the value that is offsetrows after the current row, andnullif there are fewer thanoffsetrows after the current row. For example, anoffsetof one will return the next row at any given point in the window partition.This is equivalent to the LEAD function in SQL. - Since
- 1.4.0 
 
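A window sketch combining lag (defined above) and lead (df, ticker, day and close are assumed example names):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag, lead}

val w = Window.partitionBy("ticker").orderBy("day")
// Previous and next closing price; null at the partition edges
df.select(col("close"), lag("close", 1).over(w), lead("close", 1).over(w))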
-    def least(columnName: String, columnNames: String*): ColumnReturns the least value of the list of column names, skipping null values. Returns the least value of the list of column names, skipping null values. This function takes at least 2 parameters. It will return null iff all parameters are null. - Annotations
- @varargs()
- Since
- 1.5.0 
 
-    def least(exprs: Column*): ColumnReturns the least value of the list of values, skipping null values. Returns the least value of the list of values, skipping null values. This function takes at least 2 parameters. It will return null iff all parameters are null. - Annotations
- @varargs()
- Since
- 1.5.0 
 
-    def left(str: Column, len: Column): ColumnReturns the leftmost len(lencan be string type) characters from the stringstr, iflenis less than or equal to 0 the result is an empty string.Returns the leftmost len(lencan be string type) characters from the stringstr, iflenis less than or equal to 0 the result is an empty string.- Since
- 3.5.0 
 
-    def len(e: Column): ColumnComputes the character length of a given string or number of bytes of a binary string. Computes the character length of a given string or number of bytes of a binary string. The length of character strings includes the trailing spaces. The length of binary strings includes binary zeros. - Since
- 3.5.0 
 
-    def length(e: Column): ColumnComputes the character length of a given string or number of bytes of a binary string. Computes the character length of a given string or number of bytes of a binary string. The length of character strings includes the trailing spaces. The length of binary strings includes binary zeros. - Since
- 1.5.0 
 
-    def levenshtein(l: Column, r: Column): ColumnComputes the Levenshtein distance of the two given string columns. Computes the Levenshtein distance of the two given string columns. - Since
- 1.5.0 
 
-    def levenshtein(l: Column, r: Column, threshold: Int): ColumnComputes the Levenshtein distance of the two given string columns if it's less than or equal to a given threshold. Computes the Levenshtein distance of the two given string columns if it's less than or equal to a given threshold. - returns
- result distance, or -1 if the distance exceeds the threshold 
 - Since
- 3.5.0 
 
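A sketch of both variants; the threshold form yields -1 once the distance is known to exceed the bound (df is an assumed example DataFrame):

import org.apache.spark.sql.functions.{levenshtein, lit}

df.select(levenshtein(lit("kitten"), lit("sitting")))      // 3
df.select(levenshtein(lit("kitten"), lit("sitting"), 2))   // -1, since 3 > 2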
-    def like(str: Column, pattern: Column): ColumnReturns true if str matches patternwithescapeChar('\'), null if any arguments are null, false otherwise.Returns true if str matches patternwithescapeChar('\'), null if any arguments are null, false otherwise.- Since
- 3.5.0 
 
-    def like(str: Column, pattern: Column, escapeChar: Column): ColumnReturns true if str matches patternwithescapeChar, null if any arguments are null, false otherwise.Returns true if str matches patternwithescapeChar, null if any arguments are null, false otherwise.- Since
- 3.5.0 
 
-    def listagg(e: Column, delimiter: Column): ColumnAggregate function: returns the concatenation of non-null input values, separated by the delimiter. Aggregate function: returns the concatenation of non-null input values, separated by the delimiter. - Since
- 4.0.0 
 
-    def listagg(e: Column): ColumnAggregate function: returns the concatenation of non-null input values. Aggregate function: returns the concatenation of non-null input values. - Since
- 4.0.0 
 
-    def listagg_distinct(e: Column, delimiter: Column): ColumnAggregate function: returns the concatenation of distinct non-null input values, separated by the delimiter. Aggregate function: returns the concatenation of distinct non-null input values, separated by the delimiter. - Since
- 4.0.0 
 
-    def listagg_distinct(e: Column): ColumnAggregate function: returns the concatenation of distinct non-null input values. Aggregate function: returns the concatenation of distinct non-null input values. - Since
- 4.0.0 
 
-    def lit(literal: Any): ColumnCreates a Column of literal value. 
-    def ln(e: Column): ColumnComputes the natural logarithm of the given value. Computes the natural logarithm of the given value. - Since
- 3.5.0 
 
-    def localtimestamp(): ColumnReturns the current timestamp without time zone at the start of query evaluation as a timestamp without time zone column. Returns the current timestamp without time zone at the start of query evaluation as a timestamp without time zone column. All calls of localtimestamp within the same query return the same value. - Since
- 3.3.0 
 
-    def locate(substr: String, str: Column, pos: Int): ColumnLocate the position of the first occurrence of substr in a string column, after position pos. Locate the position of the first occurrence of substr in a string column, after position pos. - Since
- 1.5.0 
- Note
- The position is not zero-based, but a 1-based index. Returns 0 if substr could not be found in str. 
 
-    def locate(substr: String, str: Column): ColumnLocate the position of the first occurrence of substr. Locate the position of the first occurrence of substr. - Since
- 1.5.0 
- Note
- The position is not zero-based, but a 1-based index. Returns 0 if substr could not be found in str. 
 
-    def log(base: Double, columnName: String): Column Returns the first argument-base logarithm of the second argument. - Since
- 1.4.0 

-    def log(base: Double, a: Column): Column Returns the first argument-base logarithm of the second argument. - Since
- 1.4.0 

-    def log(columnName: String): Column Computes the natural logarithm of the given column. - Since
- 1.4.0 

-    def log(e: Column): Column Computes the natural logarithm of the given value. - Since
- 1.4.0 

-    def log10(columnName: String): Column Computes the logarithm of the given value in base 10. - Since
- 1.4.0 

-    def log10(e: Column): Column Computes the logarithm of the given value in base 10. - Since
- 1.4.0 

-    def log1p(columnName: String): Column Computes the natural logarithm of the given column plus one. - Since
- 1.4.0 

-    def log1p(e: Column): Column Computes the natural logarithm of the given value plus one. - Since
- 1.4.0 

-    def log2(columnName: String): Column Computes the logarithm of the given value in base 2. - Since
- 1.5.0 

-    def log2(expr: Column): Column Computes the logarithm of the given column in base 2. - Since
- 1.5.0 
 
-    def lower(e: Column): Column Converts a string column to lower case. - Since
- 1.3.0 
 
-    def lpad(str: Column, len: Column, pad: Column): Column Left-pad the string column with pad to a length of len. If the string column is longer than len, the return value is shortened to len characters. - Since
- 4.0.0 

-    def lpad(str: Column, len: Int, pad: Array[Byte]): Column Left-pad the binary column with pad to a byte length of len. If the binary column is longer than len, the return value is shortened to len bytes. - Since
- 3.3.0 

-    def lpad(str: Column, len: Int, pad: String): Column Left-pad the string column with pad to a length of len. If the string column is longer than len, the return value is shortened to len characters. - Since
- 1.5.0 
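
Example: a short padding/truncation sketch (hypothetical data; assumes a SparkSession named spark):
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq("7", "12345").toDF("s")
// Pads short strings on the left and truncates longer ones to len characters.
df.select(lpad($"s", 3, "0")).show()  // "007" and "123"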
 
-    def ltrim(e: Column, trim: Column): Column Trim the specified character string from the left end of the specified string column. - Since
- 4.0.0 

-    def ltrim(e: Column, trimString: String): Column Trim the specified character string from the left end of the specified string column. - Since
- 2.3.0 

-    def ltrim(e: Column): Column Trim the spaces from the left end of the specified string value. - Since
- 1.5.0 
 
-    def make_date(year: Column, month: Column, day: Column): Column - returns
- A date created from year, month and day fields. 
 - Since
- 3.3.0 
 
-    def make_dt_interval(): Column Make DayTimeIntervalType duration. - Since
- 3.5.0 

-    def make_dt_interval(days: Column): Column Make DayTimeIntervalType duration from days. - Since
- 3.5.0 

-    def make_dt_interval(days: Column, hours: Column): Column Make DayTimeIntervalType duration from days and hours. - Since
- 3.5.0 

-    def make_dt_interval(days: Column, hours: Column, mins: Column): Column Make DayTimeIntervalType duration from days, hours and mins. - Since
- 3.5.0 

-    def make_dt_interval(days: Column, hours: Column, mins: Column, secs: Column): Column Make DayTimeIntervalType duration from days, hours, mins and secs. - Since
- 3.5.0 
 
-    def make_interval(): Column Make interval. - Since
- 3.5.0 

-    def make_interval(years: Column): Column Make interval from years. - Since
- 3.5.0 

-    def make_interval(years: Column, months: Column): Column Make interval from years and months. - Since
- 3.5.0 

-    def make_interval(years: Column, months: Column, weeks: Column): Column Make interval from years, months and weeks. - Since
- 3.5.0 

-    def make_interval(years: Column, months: Column, weeks: Column, days: Column): Column Make interval from years, months, weeks and days. - Since
- 3.5.0 

-    def make_interval(years: Column, months: Column, weeks: Column, days: Column, hours: Column): Column Make interval from years, months, weeks, days and hours. - Since
- 3.5.0 

-    def make_interval(years: Column, months: Column, weeks: Column, days: Column, hours: Column, mins: Column): Column Make interval from years, months, weeks, days, hours and mins. - Since
- 3.5.0 

-    def make_interval(years: Column, months: Column, weeks: Column, days: Column, hours: Column, mins: Column, secs: Column): Column Make interval from years, months, weeks, days, hours, mins and secs. - Since
- 3.5.0 
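
Example: a sketch that builds a one-month interval and adds it to a date (hypothetical data; assumes a SparkSession named spark):
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq("2024-01-31").toDF("d")
// Adding 1 month to 2024-01-31 clamps to the end of February: 2024-02-29.
df.select(col("d").cast("date") + make_interval(lit(0), lit(1), lit(0), lit(0), lit(0), lit(0), lit(0))).show()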
 
-    def make_time(hour: Column, minute: Column, second: Column): Column Create time from hour, minute and second fields. For invalid inputs it will throw an error. - hour
- the hour to represent, from 0 to 23 
- minute
- the minute to represent, from 0 to 59 
- second
- the second to represent, from 0 to 59.999999 
 - Since
- 4.1.0 
 
-    def make_timestamp(date: Column, time: Column): Column Create a local date-time from date and time fields. - Since
- 4.1.0 
 
-    def make_timestamp(date: Column, time: Column, timezone: Column): Column Create a local date-time from date, time, and timezone fields. - Since
- 4.1.0 
 
-    def make_timestamp(years: Column, months: Column, days: Column, hours: Column, mins: Column, secs: Column): Column Create timestamp from years, months, days, hours, mins and secs fields. The result data type is consistent with the value of the configuration spark.sql.timestampType. If the configuration spark.sql.ansi.enabled is false, the function returns NULL on invalid inputs. Otherwise, it will throw an error instead. - Since
- 3.5.0 
 
-    def make_timestamp(years: Column, months: Column, days: Column, hours: Column, mins: Column, secs: Column, timezone: Column): Column Create timestamp from years, months, days, hours, mins, secs and timezone fields. The result data type is consistent with the value of the configuration spark.sql.timestampType. If the configuration spark.sql.ansi.enabled is false, the function returns NULL on invalid inputs. Otherwise, it will throw an error instead. - Since
- 3.5.0 
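
Example: a field-wise construction sketch using the six-argument overload (hypothetical values; assumes a SparkSession named spark):
import org.apache.spark.sql.functions._
// Builds 2024-02-29 12:30:45.123; with spark.sql.ansi.enabled=false, an invalid
// combination such as February 30 would yield NULL instead of an error.
spark.range(1).select(make_timestamp(lit(2024), lit(2), lit(29), lit(12), lit(30), lit(45.123))).show(false)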
 
-    def make_timestamp_ltz(years: Column, months: Column, days: Column, hours: Column, mins: Column, secs: Column): Column Create a timestamp with local time zone from years, months, days, hours, mins and secs fields. If the configuration spark.sql.ansi.enabled is false, the function returns NULL on invalid inputs. Otherwise, it will throw an error instead. - Since
- 3.5.0 
 
-    def make_timestamp_ltz(years: Column, months: Column, days: Column, hours: Column, mins: Column, secs: Column, timezone: Column): Column Create a timestamp with local time zone from years, months, days, hours, mins, secs and timezone fields. If the configuration spark.sql.ansi.enabled is false, the function returns NULL on invalid inputs. Otherwise, it will throw an error instead. - Since
- 3.5.0 
 
-    def make_timestamp_ntz(date: Column, time: Column): Column Create a local date-time from date and time fields. - Since
- 4.1.0 
 
-    def make_timestamp_ntz(years: Column, months: Column, days: Column, hours: Column, mins: Column, secs: Column): Column Create local date-time from years, months, days, hours, mins and secs fields. If the configuration spark.sql.ansi.enabled is false, the function returns NULL on invalid inputs. Otherwise, it will throw an error instead. - Since
- 3.5.0 
 
-    def make_valid_utf8(str: Column): Column Returns a new string in which all invalid UTF-8 byte sequences, if any, are replaced by the Unicode replacement character (U+FFFD). - Since
- 4.0.0 
 
-    def make_ym_interval(): Column Make year-month interval. - Since
- 3.5.0 

-    def make_ym_interval(years: Column): Column Make year-month interval from years. - Since
- 3.5.0 

-    def make_ym_interval(years: Column, months: Column): Column Make year-month interval from years and months. - Since
- 3.5.0 
 
-    def map(cols: Column*): Column Creates a new map column. The input columns must be grouped as key-value pairs, e.g. (key1, value1, key2, value2, ...). The key columns must all have the same data type, and can't be null. The value columns must all have the same data type. - Annotations
- @varargs()
- Since
- 2.0.0 
 
-    def map_concat(cols: Column*): Column Returns the union of all the given maps. - Annotations
- @varargs()
- Since
- 2.4.0 
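
Example: a construction sketch for map and map_concat (hypothetical keys and values; assumes a SparkSession named spark):
import org.apache.spark.sql.functions._
val df = spark.range(1).select(
  map(lit("a"), lit(1), lit("b"), lit(2)).as("m1"),
  map(lit("c"), lit(3)).as("m2"))
df.select(map_concat(col("m1"), col("m2"))).show(false)  // {a -> 1, b -> 2, c -> 3}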
 
-    def map_contains_key(column: Column, key: Any): Column Returns true if the map contains the key. - Since
- 3.3.0 
 
-    def map_entries(e: Column): Column Returns an unordered array of all entries in the given map. - Since
- 3.0.0 
 
-    def map_filter(expr: Column, f: (Column, Column) => Column): Column Returns a map whose key-value pairs satisfy a predicate. df.select(map_filter(col("m"), (k, v) => k * 10 === v)) - expr
- the input map column 
- f
- (key, value) => predicate, the Boolean predicate to filter the input map column 
 - Since
- 3.0.0 
 
-    def map_from_arrays(keys: Column, values: Column): Column Creates a new map column. The array in the first column is used for keys. The array in the second column is used for values. All elements in the array for key should not be null. - Since
- 2.4.0 
 
-    def map_from_entries(e: Column): Column Returns a map created from the given array of entries. - Since
- 2.4.0 

-    def map_keys(e: Column): Column Returns an unordered array containing the keys of the map. - Since
- 2.3.0 

-    def map_values(e: Column): Column Returns an unordered array containing the values of the map. - Since
- 2.3.0 
 
-    def map_zip_with(left: Column, right: Column, f: (Column, Column, Column) => Column): Column Merge two given maps, key-wise into a single map using a function. df.select(map_zip_with(df("m1"), df("m2"), (k, v1, v2) => k === v1 + v2)) - left
- the left input map column 
- right
- the right input map column 
- f
- (key, value1, value2) => new_value, the lambda function to merge the map values 
 - Since
- 3.0.0 
 
-    def mask(input: Column, upperChar: Column, lowerChar: Column, digitChar: Column, otherChar: Column): Column Masks the given string value. This can be useful for creating copies of tables with sensitive information removed. - input
- string value to mask. Supported types: STRING, VARCHAR, CHAR 
- upperChar
- character to replace upper-case characters with. Specify NULL to retain original character. 
- lowerChar
- character to replace lower-case characters with. Specify NULL to retain original character. 
- digitChar
- character to replace digit characters with. Specify NULL to retain original character. 
- otherChar
- character to replace all other characters with. Specify NULL to retain original character. 
 - Since
- 3.5.0 
 
-    def mask(input: Column, upperChar: Column, lowerChar: Column, digitChar: Column): Column Masks the given string value. The function replaces upper-case, lower-case characters and numbers with the characters specified respectively. This can be useful for creating copies of tables with sensitive information removed. - input
- string value to mask. Supported types: STRING, VARCHAR, CHAR 
- upperChar
- character to replace upper-case characters with. Specify NULL to retain original character. 
- lowerChar
- character to replace lower-case characters with. Specify NULL to retain original character. 
- digitChar
- character to replace digit characters with. Specify NULL to retain original character. 
 - Since
- 3.5.0 
 
-    def mask(input: Column, upperChar: Column, lowerChar: Column): Column Masks the given string value. The function replaces upper-case and lower-case characters with the characters specified respectively, and numbers with 'n'. This can be useful for creating copies of tables with sensitive information removed. - input
- string value to mask. Supported types: STRING, VARCHAR, CHAR 
- upperChar
- character to replace upper-case characters with. Specify NULL to retain original character. 
- lowerChar
- character to replace lower-case characters with. Specify NULL to retain original character. 
 - Since
- 3.5.0 
 
-    def mask(input: Column, upperChar: Column): Column Masks the given string value. The function replaces upper-case characters with the specified character, lower-case characters with 'x', and numbers with 'n'. This can be useful for creating copies of tables with sensitive information removed. - input
- string value to mask. Supported types: STRING, VARCHAR, CHAR 
- upperChar
- character to replace upper-case characters with. Specify NULL to retain original character. 
 - Since
- 3.5.0 
 
-    def mask(input: Column): Column Masks the given string value. The function replaces characters with 'X' or 'x', and numbers with 'n'. This can be useful for creating copies of tables with sensitive information removed. - input
- string value to mask. Supported types: STRING, VARCHAR, CHAR 
 - Since
- 3.5.0 
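
Example: a masking sketch using the defaults (hypothetical data; assumes a SparkSession named spark):
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq("AbCD123-@$#").toDF("s")
// By default upper-case becomes 'X', lower-case 'x', digits 'n', and other characters are kept.
df.select(mask($"s")).show()  // XxXXnnn-@$#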
 
-    def max(columnName: String): Column Aggregate function: returns the maximum value of the column in a group. - Since
- 1.3.0 

-    def max(e: Column): Column Aggregate function: returns the maximum value of the expression in a group. - Since
- 1.3.0 

-    def max_by(e: Column, ord: Column): Column Aggregate function: returns the value associated with the maximum value of ord. - Since
- 3.3.0 
- Note
- The function is non-deterministic: when multiple rows tie on the maximum value of ord, any of their e values may be returned. 
 
-    def md5(e: Column): Column Calculates the MD5 digest of a binary column and returns the value as a 32 character hex string. - Since
- 1.5.0 

-    def mean(columnName: String): Column Aggregate function: returns the average of the values in a group. Alias for avg. - Since
- 1.4.0 

-    def mean(e: Column): Column Aggregate function: returns the average of the values in a group. Alias for avg. - Since
- 1.4.0 

-    def median(e: Column): Column Aggregate function: returns the median of the values in a group. - Since
- 3.4.0 

-    def min(columnName: String): Column Aggregate function: returns the minimum value of the column in a group. - Since
- 1.3.0 

-    def min(e: Column): Column Aggregate function: returns the minimum value of the expression in a group. - Since
- 1.3.0 
 
-    def min_by(e: Column, ord: Column): Column Aggregate function: returns the value associated with the minimum value of ord. - Since
- 3.3.0 
- Note
- The function is non-deterministic: when multiple rows tie on the minimum value of ord, any of their e values may be returned. 
 
-    def minute(e: Column): Column Extracts the minutes as an integer from a given date/time/timestamp/string. - returns
- An integer, or null if the input was a string that could not be cast to a date 
 - Since
- 1.5.0 
 
-    def mode(e: Column, deterministic: Boolean): Column Aggregate function: returns the most frequent value in a group. When multiple values have the same greatest frequency, any of them may be returned if deterministic is false or not defined, or the lowest value is returned if deterministic is true. - Since
- 4.0.0 
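
Example: a sketch of the deterministic flag (hypothetical data; assumes a SparkSession named spark):
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq("b", "b", "c", "c", "a").toDF("v")
// "b" and "c" tie for most frequent; deterministic = true breaks the tie with the lowest value.
df.select(mode($"v", deterministic = true)).show()  // b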
 
-    def mode(e: Column): Column Aggregate function: returns the most frequent value in a group. - Since
- 3.4.0 
 
-    def monotonically_increasing_id(): Column A column expression that generates monotonically increasing 64-bit integers. The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has fewer than 1 billion partitions, and each partition has fewer than 8 billion records. As an example, consider a DataFrame with two partitions, each with 3 records. This expression would return the following IDs: 0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594. - Since
- 1.6.0 
 
-    def month(e: Column): Column Extracts the month as an integer from a given date/timestamp/string. - returns
- An integer, or null if the input was a string that could not be cast to a date 
 - Since
- 1.5.0 

-    def monthname(timeExp: Column): Column Extracts the three-letter abbreviated month name from a given date/timestamp/string. - Since
- 4.0.0 

-    def months(e: Column): Column (Java-specific) A transform for timestamps and dates to partition data into months. - Since
- 3.0.0 
 
-    def months_between(end: Column, start: Column, roundOff: Boolean): Column Returns number of months between dates end and start. If roundOff is set to true, the result is rounded off to 8 digits; it is not rounded otherwise. - Since
- 2.4.0 

-    def months_between(end: Column, start: Column): Column Returns number of months between dates start and end. A whole number is returned if both inputs have the same day of month or both are the last day of their respective months. Otherwise, the difference is calculated assuming 31 days per month. For example: months_between("2017-11-14", "2017-07-14") // returns 4.0 months_between("2017-01-01", "2017-01-10") // returns 0.29032258 months_between("2017-06-01", "2017-06-16 12:00:00") // returns -0.5 - end
- A date, timestamp or string. If a string, the data must be in a format that can be cast to a timestamp, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS 
- start
- A date, timestamp or string. If a string, the data must be in a format that can be cast to a timestamp, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS 
- returns
- A double, or null if either end or start were strings that could not be cast to a timestamp. Negative if end is before start 
 - Since
- 1.5.0 
 
-    def named_struct(cols: Column*): Column Creates a struct with the given field names and values. - Annotations
- @varargs()
- Since
- 3.5.0 
 
-    def nanvl(col1: Column, col2: Column): Column Returns col1 if it is not NaN, or col2 if col1 is NaN. Both inputs should be floating point columns (DoubleType or FloatType). - Since
- 1.5.0 
 
-   final  def ne(arg0: AnyRef): Boolean - Definition Classes
- AnyRef
 
-    def negate(e: Column): Column Unary minus, i.e. negate the expression. // Select the amount column and negate all values. // Scala: df.select( -df("amount") ) // Java: df.select( negate(df.col("amount")) ); - Since
- 1.3.0 
 
-    def negative(e: Column): Column Returns the negated value. - Since
- 3.5.0 
 
-    def next_day(date: Column, dayOfWeek: Column): Column Returns the first date which is later than the value of the date column that is on the specified day of the week. For example, next_day('2015-07-27', "Sunday") returns 2015-08-02 because that is the first Sunday after 2015-07-27. - date
- A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS 
- dayOfWeek
- A column of the day of week. Case insensitive, and accepts: "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun" 
- returns
- A date, or null if date was a string that could not be cast to a date or if dayOfWeek was an invalid value 
 - Since
- 3.2.0 

-    def next_day(date: Column, dayOfWeek: String): Column Returns the first date which is later than the value of the date column that is on the specified day of the week. For example, next_day('2015-07-27', "Sunday") returns 2015-08-02 because that is the first Sunday after 2015-07-27. - date
- A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS 
- dayOfWeek
- Case insensitive, and accepts: "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun" 
- returns
- A date, or null if date was a string that could not be cast to a date or if dayOfWeek was an invalid value 
 - Since
- 1.5.0 
 
-    def not(e: Column): Column Inversion of boolean expression, i.e. NOT. // Scala: select rows that are not active (isActive === false) df.filter( !df("isActive") ) // Java: df.filter( not(df.col("isActive")) ); - Since
- 1.3.0 
 
-   final  def notify(): Unit - Definition Classes
- AnyRef
- Annotations
- @IntrinsicCandidate() @native()
 
-   final  def notifyAll(): Unit - Definition Classes
- AnyRef
- Annotations
- @IntrinsicCandidate() @native()
 
-    def now(): Column Returns the current timestamp at the start of query evaluation. - Since
- 3.5.0 
 
-    def nth_value(e: Column, offset: Int): Column Window function: returns the value that is the offsetth row of the window frame (counting from 1), and null if the size of the window frame is less than offset rows. This is equivalent to the nth_value function in SQL. - Since
- 3.1.0 

-    def nth_value(e: Column, offset: Int, ignoreNulls: Boolean): Column Window function: returns the value that is the offsetth row of the window frame (counting from 1), and null if the size of the window frame is less than offset rows. It will return the offsetth non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned. This is equivalent to the nth_value function in SQL. - Since
- 3.1.0 

-    def ntile(n: Int): Column Window function: returns the ntile group id (from 1 to n inclusive) in an ordered window partition. For example, if n is 4, the first quarter of the rows will get value 1, the second quarter will get 2, the third quarter will get 3, and the last quarter will get 4. This is equivalent to the NTILE function in SQL. - Since
- 1.4.0 
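
Example: a window sketch combining nth_value and ntile (hypothetical data; assumes a SparkSession named spark):
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._
val df = Seq(("a", 1), ("a", 2), ("a", 3), ("a", 4)).toDF("k", "v")
val w = Window.partitionBy("k").orderBy("v")
// nth_value yields null until the running frame reaches 2 rows; ntile(2) splits the partition into halves.
df.select($"v", nth_value($"v", 2).over(w), ntile(2).over(w)).show()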
 
-    def nullif(col1: Column, col2: Column): Column Returns null if col1 equals col2, or col1 otherwise. - Since
- 3.5.0 

-    def nullifzero(col: Column): Column Returns null if col is equal to zero, or col otherwise. - Since
- 4.0.0 

-    def nvl(col1: Column, col2: Column): Column Returns col2 if col1 is null, or col1 otherwise. - Since
- 3.5.0 

-    def nvl2(col1: Column, col2: Column, col3: Column): Column Returns col2 if col1 is not null, or col3 otherwise. - Since
- 3.5.0 
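
Example: a null-handling sketch covering nullif, nvl and nvl2 (hypothetical data; assumes a SparkSession named spark):
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq((Some(0), Some(5)), (None, Some(7))).toDF("a", "b")
// Row 1: NULL (0 equals 0), 0, "has a". Row 2: NULL, 7, "no a".
df.select(nullif($"a", lit(0)), nvl($"a", $"b"), nvl2($"a", lit("has a"), lit("no a"))).show()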
 
-    def octet_length(e: Column): Column Calculates the byte length for the specified string column. - Since
- 3.3.0 
 
-    def overlay(src: Column, replace: Column, pos: Column): Column Overlay the specified portion of src with replace, starting from byte position pos of src. - Since
- 3.0.0 

-    def overlay(src: Column, replace: Column, pos: Column, len: Column): Column Overlay the specified portion of src with replace, starting from byte position pos of src and proceeding for len bytes. - Since
- 3.0.0 
 
-    def parse_json(json: Column): Column Parses a JSON string and constructs a Variant value. - json
- a string column that contains JSON data. 
 - Since
- 4.0.0 
 
-    def parse_url(url: Column, partToExtract: Column): Column Extracts a part from a URL. - Since
- 3.5.0 

-    def parse_url(url: Column, partToExtract: Column, key: Column): Column Extracts a part from a URL. - Since
- 3.5.0 
 
-    def percent_rank(): Column Window function: returns the relative rank (i.e. percentile) of rows within a window partition. This is computed by: (rank of row in its partition - 1) / (number of rows in the partition - 1) This is equivalent to the PERCENT_RANK function in SQL. - Since
- 1.6.0 
 
-    def percentile(e: Column, percentage: Column, frequency: Column): Column Aggregate function: returns the exact percentile(s) of numeric column expr at the given percentage(s) with value range in [0.0, 1.0]. - Since
- 3.5.0 

-    def percentile(e: Column, percentage: Column): Column Aggregate function: returns the exact percentile(s) of numeric column expr at the given percentage(s) with value range in [0.0, 1.0]. - Since
- 3.5.0 

-    def percentile_approx(e: Column, percentage: Column, accuracy: Column): Column Aggregate function: returns the approximate percentile of the numeric column col which is the smallest value in the ordered col values (sorted from least to greatest) such that no more than percentage of col values is less than the value or equal to that value. If percentage is an array, each value must be between 0.0 and 1.0. If it is a single floating point value, it must be between 0.0 and 1.0. The accuracy parameter is a positive numeric literal which controls approximation accuracy at the cost of memory. A higher value of accuracy yields better accuracy; 1.0/accuracy is the relative error of the approximation. - Since
- 3.1.0 
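
Example: an approximation sketch (hypothetical data; assumes a SparkSession named spark):
import org.apache.spark.sql.functions._
import spark.implicits._
val df = (1 to 100).toDF("x")
// With accuracy 10000 the relative error is at most 1/10000, so the median comes back as ~50.
df.select(percentile_approx($"x", lit(0.5), lit(10000))).show()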
 
-    def pi(): Column Returns Pi. - Since
- 3.5.0 
 
-    def pmod(dividend: Column, divisor: Column): Column Returns the positive value of dividend mod divisor. - Since
- 1.5.0 
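
Example: a sign-behaviour sketch contrasting pmod with the % operator (assumes a SparkSession named spark):
import org.apache.spark.sql.functions._
// pmod(-7, 3) is 2, whereas -7 % 3 is -1.
spark.range(1).select(pmod(lit(-7), lit(3)), lit(-7) % 3).show()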
 
-    def posexplode(e: Column): Column Creates a new row for each element with position in the given array or map column. Uses the default column name pos for position, col for elements in the array, and key and value for elements in the map, unless specified otherwise. - Since
- 2.1.0 

-    def posexplode_outer(e: Column): Column Creates a new row for each element with position in the given array or map column. Uses the default column name pos for position, col for elements in the array, and key and value for elements in the map, unless specified otherwise. Unlike posexplode, if the array/map is null or empty then the row (null, null) is produced. - Since
- 2.2.0 
 
-    def position(substr: Column, str: Column): Column Returns the position of the first occurrence of substr in str after position 1. The return value is 1-based. - Since
- 3.5.0 
 
-    def position(substr: Column, str: Column, start: Column): Column Returns the position of the first occurrence of substr in str after position start. The given start and return value are 1-based. - Since
- 3.5.0 
 
-    def positive(e: Column): Column Returns the value. - Since
- 3.5.0 
 
-    def pow(l: Double, rightName: String): Column Returns the value of the first argument raised to the power of the second argument. - Since
- 1.4.0 

-    def pow(l: Double, r: Column): Column Returns the value of the first argument raised to the power of the second argument. - Since
- 1.4.0 

-    def pow(leftName: String, r: Double): Column Returns the value of the first argument raised to the power of the second argument. - Since
- 1.4.0 

-    def pow(l: Column, r: Double): Column Returns the value of the first argument raised to the power of the second argument. - Since
- 1.4.0 

-    def pow(leftName: String, rightName: String): Column Returns the value of the first argument raised to the power of the second argument. - Since
- 1.4.0 

-    def pow(leftName: String, r: Column): Column Returns the value of the first argument raised to the power of the second argument. - Since
- 1.4.0 

-    def pow(l: Column, rightName: String): Column Returns the value of the first argument raised to the power of the second argument. - Since
- 1.4.0 

-    def pow(l: Column, r: Column): Column Returns the value of the first argument raised to the power of the second argument. - Since
- 1.4.0 

-    def power(l: Column, r: Column): Column Returns the value of the first argument raised to the power of the second argument. - Since
- 3.5.0 
 
-    def printf(format: Column, arguments: Column*): Column Formats the arguments in printf-style and returns the result as a string column. - Annotations
- @varargs()
- Since
- 3.5.0 
 
-    def product(e: Column): Column Aggregate function: returns the product of all numerical elements in a group. - Since
- 3.2.0 
 
-    def quarter(e: Column): Column Extracts the quarter as an integer from a given date/timestamp/string. - returns
- An integer, or null if the input was a string that could not be cast to a date 
 - Since
- 1.5.0 
 
-    def quote(str: Column): Column Returns str enclosed by single quotes, with each instance of a single quote in it preceded by a backslash. - Since
- 4.1.0 
 
-    def radians(columnName: String): Column Converts an angle measured in degrees to an approximately equivalent angle measured in radians. - columnName
- angle in degrees 
- returns
- angle in radians, as if computed by java.lang.Math.toRadians 
 - Since
- 2.1.0 

-    def radians(e: Column): Column Converts an angle measured in degrees to an approximately equivalent angle measured in radians. - e
- angle in degrees 
- returns
- angle in radians, as if computed by java.lang.Math.toRadians 
 - Since
- 2.1.0 
 
-    def raise_error(c: Column): Column Throws an exception with the provided error message. - Since
- 3.1.0 
 
-    def rand(): Column Generate a random column with independent and identically distributed (i.i.d.) samples uniformly distributed in [0.0, 1.0). - Since
- 1.4.0 
- Note
- The function is non-deterministic in the general case. 

-    def rand(seed: Long): Column Generate a random column with independent and identically distributed (i.i.d.) samples uniformly distributed in [0.0, 1.0). - Since
- 1.4.0 
- Note
- The function is non-deterministic in the general case. 

-    def randn(): Column Generate a column with independent and identically distributed (i.i.d.) samples from the standard normal distribution. - Since
- 1.4.0 
- Note
- The function is non-deterministic in the general case. 

-    def randn(seed: Long): Column Generate a column with independent and identically distributed (i.i.d.) samples from the standard normal distribution. - Since
- 1.4.0 
- Note
- The function is non-deterministic in the general case. 
 
-    def random(): Column Returns a random value with independent and identically distributed (i.i.d.) uniformly distributed values in [0, 1). - Since
- 3.5.0 

-    def random(seed: Column): Column Returns a random value with independent and identically distributed (i.i.d.) uniformly distributed values in [0, 1). - Since
- 3.5.0 

-    def randstr(length: Column, seed: Column): Column Returns a string of the specified length whose characters are chosen uniformly at random from the following pool of characters: 0-9, a-z, A-Z, with the chosen random seed. The string length must be a constant two-byte or four-byte integer (SMALLINT or INT, respectively). - Since
- 4.0.0 

-    def randstr(length: Column): Column Returns a string of the specified length whose characters are chosen uniformly at random from the following pool of characters: 0-9, a-z, A-Z. The string length must be a constant two-byte or four-byte integer (SMALLINT or INT, respectively). - Since
- 4.0.0 
 
-    def rank(): Column Window function: returns the rank of rows within a window partition. The difference between rank and dense_rank is that dense_rank leaves no gaps in the ranking sequence when there are ties. That is, if you were ranking a competition using dense_rank and had three people tie for second place, you would say that all three were in second place and that the next person came in third. Rank, by contrast, would give sequential numbers, so the person that came in third place (after the ties) would register as coming in fifth. This is equivalent to the RANK function in SQL. - Since
- 1.4.0 
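
Example: a tie-handling sketch contrasting rank and dense_rank (hypothetical scores; assumes a SparkSession named spark):
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._
val df = Seq(100, 90, 90, 80).toDF("score")
val w = Window.orderBy($"score".desc)
// rank: 1, 2, 2, 4 (gap after the tie); dense_rank: 1, 2, 2, 3 (no gap).
df.select($"score", rank().over(w), dense_rank().over(w)).show()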
 
-    def reduce(expr: Column, initialValue: Column, merge: (Column, Column) => Column): Column Applies a binary operator to an initial state and all elements in the array, and reduces this to a single state. df.select(reduce(col("i"), lit(0), (acc, x) => acc + x)) - expr
- the input array column 
- initialValue
- the initial value 
- merge
- (combined_value, input_value) => combined_value, the merge function to merge an input value to the combined_value 
 - Since
- 3.5.0 
 
-    def reduce(expr: Column, initialValue: Column, merge: (Column, Column) => Column, finish: (Column) => Column): Column Applies a binary operator to an initial state and all elements in the array, and reduces this to a single state. The final state is converted into the final result by applying a finish function. df.select(reduce(col("i"), lit(0), (acc, x) => acc + x, _ * 10)) - expr
- the input array column 
- initialValue
- the initial value 
- merge
- (combined_value, input_value) => combined_value, the merge function to merge an input value to the combined_value 
- finish
- combined_value => final_value, the lambda function to convert the combined value of all inputs to final result 
 - Since
- 3.5.0 
 
-    def reflect(cols: Column*): Column Calls a method with reflection. - Annotations
- @varargs()
- Since
- 3.5.0 
 
-    def regexp(str: Column, regexp: Column): Column Returns true if str matches regexp, or false otherwise. - Since
- 3.5.0 

-    def regexp_count(str: Column, regexp: Column): Column Returns a count of the number of times that the regular expression pattern regexp is matched in the string str. - Since
- 3.5.0 
 
-    def regexp_extract(e: Column, exp: String, groupIdx: Int): Column Extract a specific group matched by a Java regex, from the specified string column. If the regex did not match, or the specified group did not match, an empty string is returned. If the specified group index exceeds the group count of the regex, an IllegalArgumentException will be thrown. - Since
- 1.5.0 
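
Example: a group-extraction sketch (hypothetical pattern and data; assumes a SparkSession named spark):
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq("order-2024-0042").toDF("s")
// Group 2 captures the trailing digits; a non-matching row would yield "".
df.select(regexp_extract($"s", "order-(\\d{4})-(\\d+)", 2)).show()  // 0042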
 
-    def regexp_extract_all(str: Column, regexp: Column, idx: Column): Column Extract all strings in str that match the regexp expression and correspond to the given regex group index. - Since
- 3.5.0 

-    def regexp_extract_all(str: Column, regexp: Column): Column Extract all strings in str that match the regexp expression and correspond to the first regex group index. - Since
- 3.5.0 
 
-    def regexp_instr(str: Column, regexp: Column, idx: Column): Column Searches a string for a regular expression and returns an integer that indicates the beginning position of the matched substring. Positions are 1-based, not 0-based. If no match is found, returns 0. - Since
- 3.5.0 

-    def regexp_instr(str: Column, regexp: Column): Column Searches a string for a regular expression and returns an integer that indicates the beginning position of the matched substring. Positions are 1-based, not 0-based. If no match is found, returns 0. - Since
- 3.5.0 

-    def regexp_like(str: Column, regexp: Column): Column Returns true if str matches regexp, or false otherwise. - Since
- 3.5.0 
 
-    def regexp_replace(e: Column, pattern: Column, replacement: Column): Column Replace all substrings of the specified string value that match regexp with rep. - Since
- 2.1.0 

-    def regexp_replace(e: Column, pattern: String, replacement: String): Column Replace all substrings of the specified string value that match regexp with rep. - Since
- 1.5.0 

-    def regexp_substr(str: Column, regexp: Column): Column Returns the substring that matches the regular expression regexp within the string str. If the regular expression is not found, the result is null. - Since
- 3.5.0 
 
-    def regr_avgx(y: Column, x: Column): Column Aggregate function: returns the average of the independent variable for non-null pairs in a group, where y is the dependent variable and x is the independent variable. - Since
- 3.5.0 

-    def regr_avgy(y: Column, x: Column): Column Aggregate function: returns the average of the dependent variable for non-null pairs in a group, where y is the dependent variable and x is the independent variable. - Since
- 3.5.0 

-    def regr_count(y: Column, x: Column): Column Aggregate function: returns the number of non-null number pairs in a group, where y is the dependent variable and x is the independent variable. - Since
- 3.5.0 

-    def regr_intercept(y: Column, x: Column): Column Aggregate function: returns the intercept of the univariate linear regression line for non-null pairs in a group, where y is the dependent variable and x is the independent variable. - Since
- 3.5.0 

-    def regr_r2(y: Column, x: Column): Column Aggregate function: returns the coefficient of determination for non-null pairs in a group, where y is the dependent variable and x is the independent variable. - Since
- 3.5.0 

-    def regr_slope(y: Column, x: Column): Column Aggregate function: returns the slope of the linear regression line for non-null pairs in a group, where y is the dependent variable and x is the independent variable. - Since
- 3.5.0 

-    def regr_sxx(y: Column, x: Column): Column Aggregate function: returns REGR_COUNT(y, x) * VAR_POP(x) for non-null pairs in a group, where y is the dependent variable and x is the independent variable. - Since
- 3.5.0 

-    def regr_sxy(y: Column, x: Column): Column Aggregate function: returns REGR_COUNT(y, x) * COVAR_POP(y, x) for non-null pairs in a group, where y is the dependent variable and x is the independent variable. - Since
- 3.5.0 

-    def regr_syy(y: Column, x: Column): Column Aggregate function: returns REGR_COUNT(y, x) * VAR_POP(y) for non-null pairs in a group, where y is the dependent variable and x is the independent variable. - Since
- 3.5.0 
 
-    def repeat(str: Column, n: Column): Column Repeats a string column n times, and returns it as a new string column. - Since
- 4.0.0 

-    def repeat(str: Column, n: Int): Column Repeats a string column n times, and returns it as a new string column. - Since
- 1.5.0 
 
-    def replace(src: Column, search: Column): Column Replaces all occurrences of search in src with an empty string, i.e. removes them. - src
- A column of string to be replaced 
- search
- A column of string. If search is not found in src, src is returned unchanged. 
 - Since
- 3.5.0 

-    def replace(src: Column, search: Column, replace: Column): Column Replaces all occurrences of search with replace. - src
- A column of string to be replaced 
- search
- A column of string. If search is not found in src, src is returned unchanged. 
- replace
- A column of string. If replace is not specified or is an empty string, nothing replaces the string that is removed from src. 
 - Since
- 3.5.0 
 
-    def reverse(e: Column): Column Returns a reversed string or an array with reverse order of elements. - Since
- 1.5.0 
 
-    def right(str: Column, len: Column): Column Returns the rightmost len (len can be string type) characters from the string str; if len is less than or equal to 0 the result is an empty string. - Since
- 3.5.0 
 
-    def rint(columnName: String): Column Returns the double value that is closest in value to the argument and is equal to a mathematical integer. - Since
- 1.4.0 

-    def rint(e: Column): Column Returns the double value that is closest in value to the argument and is equal to a mathematical integer. - Since
- 1.4.0 

-    def rlike(str: Column, regexp: Column): Column Returns true if str matches regexp, or false otherwise. - Since
- 3.5.0 
 
-    def round(e: Column, scale: Column): Column Round the value of e to scale decimal places with HALF_UP round mode if scale is greater than or equal to 0, or at the integral part when scale is less than 0. - Since
- 4.0.0 

-    def round(e: Column, scale: Int): Column Round the value of e to scale decimal places with HALF_UP round mode if scale is greater than or equal to 0, or at the integral part when scale is less than 0. - Since
- 1.5.0 

-    def round(e: Column): Column Returns the value of the column e rounded to 0 decimal places with HALF_UP round mode. - Since
- 1.5.0 
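
Example: a scale sketch, including a negative scale (assumes a SparkSession named spark):
import org.apache.spark.sql.functions._
// HALF_UP rounding: 2.5 -> 3; scale 1 keeps one decimal; scale -1 rounds at the tens digit.
spark.range(1).select(round(lit(2.5)), round(lit(123.456), 1), round(lit(123.456), -1)).show()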
 
-    def row_number(): Column Window function: returns a sequential number starting at 1 within a window partition. - Since
- 1.6.0 
 
-    def rpad(str: Column, len: Column, pad: Column): Column Right-pad the string column with pad to a length of len. If the string column is longer than len, the return value is shortened to len characters. - Since
- 4.0.0 

-    def rpad(str: Column, len: Int, pad: Array[Byte]): Column Right-pad the binary column with pad to a byte length of len. If the binary column is longer than len, the return value is shortened to len bytes. - Since
- 3.3.0 

-    def rpad(str: Column, len: Int, pad: String): Column Right-pad the string column with pad to a length of len. If the string column is longer than len, the return value is shortened to len characters. - Since
- 1.5.0 
 
-    def rtrim(e: Column, trim: Column): Column Trim the specified character string from the right end of the specified string column. - Since
- 4.0.0 

-    def rtrim(e: Column, trimString: String): Column Trim the specified character string from the right end of the specified string column. - Since
- 2.3.0 

-    def rtrim(e: Column): Column Trim the spaces from the right end of the specified string value. - Since
- 1.5.0 
 
-    def schema_of_csv(csv: Column, options: Map[String, String]): Column Parses a CSV string and infers its schema in DDL format using options. - csv
- a foldable string column containing a CSV string. 
- options
- options to control how the CSV is parsed. Accepts the same options as the CSV data source. See Data Source Option in the version you use. 
- returns
- a column with string literal containing schema in DDL format. 
 - Since
- 3.0.0 
 
-    def schema_of_csv(csv: Column): Column Parses a CSV string and infers its schema in DDL format. - csv
- a foldable string column containing a CSV string. 
 - Since
- 3.0.0 
 
-    def schema_of_csv(csv: String): Column Parses a CSV string and infers its schema in DDL format. - csv
- a CSV string. 
 - Since
- 3.0.0 
 
-    def schema_of_json(json: Column, options: Map[String, String]): Column Parses a JSON string and infers its schema in DDL format using options. - json
- a foldable string column containing JSON data. 
- options
- options to control how the JSON is parsed. Accepts the same options as the JSON data source. See Data Source Option in the version you use. 
- returns
- a column with string literal containing schema in DDL format. 
 - Since
- 3.0.0 
 
-    def schema_of_json(json: Column): Column Parses a JSON string and infers its schema in DDL format. - json
- a foldable string column containing a JSON string. 
 - Since
- 2.4.0 
 
-    def schema_of_json(json: String): Column Parses a JSON string and infers its schema in DDL format. - json
- a JSON string. 
 - Since
- 2.4.0 
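
Example: an inference sketch (hypothetical JSON; assumes a SparkSession named spark):
import org.apache.spark.sql.functions._
spark.range(1).select(schema_of_json("""{"id": 1, "tags": ["a", "b"]}""")).show(false)
// e.g. STRUCT<id: BIGINT, tags: ARRAY<STRING>>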
 
-    def schema_of_variant(v: Column): Column Returns schema in the SQL format of a variant. - v
- a variant column. 
 - Since
- 4.0.0 
 
-    def schema_of_variant_agg(v: Column): Column Returns the merged schema in the SQL format of a variant column. - v
- a variant column. 
 - Since
- 4.0.0 
 
-    def schema_of_xml(xml: Column, options: Map[String, String]): Column Parses an XML string and infers its schema in DDL format using options. - xml
- a foldable string column containing XML data. 
- options
- options to control how the XML is parsed. Accepts the same options as the XML data source. See Data Source Option in the version you use. 
- returns
- a column with string literal containing schema in DDL format. 
 - Since
- 4.0.0 
 
-    def schema_of_xml(xml: Column): Column Parses an XML string and infers its schema in DDL format. - xml
- a foldable string column containing an XML string. 
 - Since
- 4.0.0 
 
-    def schema_of_xml(xml: String): Column Parses an XML string and infers its schema in DDL format. - xml
- an XML string. 
 - Since
- 4.0.0 
 
-    def sec(e: Column): Column - e
- angle in radians 
- returns
- secant of the angle 
 - Since
- 3.3.0 
 
-    def second(e: Column): Column Extracts the seconds as an integer from a given date/time/timestamp/string. - returns
- An integer, or null if the input was a string that could not be cast to a timestamp 
 - Since
- 1.5.0 
 
-    def sentences(string: Column): Column Splits a string into arrays of sentences, where each sentence is an array of words. The default locale is used. - Since
- 3.2.0 
 
-    def sentences(string: Column, language: Column): Column Splits a string into arrays of sentences, where each sentence is an array of words. The default country ('') is used. - Since
- 4.0.0 
 
-    def sentences(string: Column, language: Column, country: Column): Column Splits a string into arrays of sentences, where each sentence is an array of words. - Since
- 3.2.0 
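For example (a sketch assuming an active SparkSession spark and import spark.implicits._):

import org.apache.spark.sql.functions._
import spark.implicits._
Seq("Hi there! Good morning.").toDF("text")
  .select(sentences($"text"))
  .show(false)   // [[Hi, there], [Good, morning]]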
 
-    def sequence(start: Column, stop: Column): Column Generate a sequence of integers from start to stop, incrementing by 1 if start is less than or equal to stop, otherwise -1. - Since
- 2.4.0 
 
-    def sequence(start: Column, stop: Column, step: Column): Column Generate a sequence of integers from start to stop, incrementing by step. - Since
- 2.4.0 
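For example (assumes an active SparkSession spark):

import org.apache.spark.sql.functions._
// Integers from 1 to 9 with an explicit step of 2.
spark.range(1).select(sequence(lit(1), lit(9), lit(2))).show(false)
// [1, 3, 5, 7, 9]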
 
-    def session_user(): Column Returns the user name of the current execution context. - Since
- 4.0.0 
 
-    def session_window(timeColumn: Column, gapDuration: Column): Column Generates a session window given a timestamp specifying column. A session window is one of the dynamic windows, meaning the length of the window varies according to the given inputs. For a static gap duration, the length of the session window is defined as "the timestamp of the latest input of the session + gap duration", so when new inputs are bound to the current session window, the end time of the session window can be expanded according to the new inputs. Besides a static gap duration value, users can also provide an expression to specify the gap duration dynamically based on the input row. With a dynamic gap duration, the closing of a session window does not depend on the latest input anymore. A session window's range is the union of all events' ranges, which are determined by event start time and the evaluated gap duration during query execution. Note that rows with negative or zero gap duration will be filtered out from the aggregation. Windows can support microsecond precision; gap durations on the order of months are not supported. For a streaming query, you may use the function current_timestamp to generate windows on processing time. - timeColumn
- The column or the expression to use as the timestamp for windowing by time. The time column must be of TimestampType or TimestampNTZType. 
- gapDuration
- A column specifying the timeout of the session. It can be a static value, e.g. 10 minutes or 1 second, or an expression/UDF that specifies the gap duration dynamically based on the input row. 
 - Since
- 3.2.0 
 
-    def session_window(timeColumn: Column, gapDuration: String): Column Generates a session window given a timestamp specifying column. A session window is one of the dynamic windows, meaning the length of the window varies according to the given inputs. The length of the session window is defined as "the timestamp of the latest input of the session + gap duration", so when new inputs are bound to the current session window, the end time of the session window can be expanded according to the new inputs. Windows can support microsecond precision; gap durations on the order of months are not supported. For a streaming query, you may use the function current_timestamp to generate windows on processing time. - timeColumn
- The column or the expression to use as the timestamp for windowing by time. The time column must be of TimestampType or TimestampNTZType. 
- gapDuration
- A string specifying the timeout of the session, e.g. 10 minutes, 1 second. Check org.apache.spark.unsafe.types.CalendarInterval for valid duration identifiers. 
 - Since
- 3.2.0 
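A batch-mode sketch of sessionizing events with a 5-minute static gap (assumes spark.implicits._; the user/ts columns are illustrative):

import org.apache.spark.sql.functions._
import spark.implicits._
val events = Seq(
  ("u1", "2024-01-01 10:00:00"),
  ("u1", "2024-01-01 10:03:00"),  // falls in the same session (within the gap)
  ("u1", "2024-01-01 11:00:00")   // starts a new session
).toDF("user", "ts").withColumn("ts", to_timestamp($"ts"))
events.groupBy($"user", session_window($"ts", "5 minutes")).count().show(false)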
 
-    def sha(col: Column): Column Returns a sha1 hash value as a hex string of the col. - Since
- 3.5.0 
 
-    def sha1(e: Column): Column Calculates the SHA-1 digest of a binary column and returns the value as a 40 character hex string. - Since
- 1.5.0 
 
-    def sha2(e: Column, numBits: Int): Column Calculates the SHA-2 family of hash functions of a binary column and returns the value as a hex string. - e
- column to compute SHA-2 on. 
- numBits
- one of 224, 256, 384, or 512. 
 - Since
- 1.5.0 
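For example (assumes spark.implicits._; the string column is cast to binary explicitly):

import org.apache.spark.sql.functions._
import spark.implicits._
Seq("Spark").toDF("s")
  .select(sha2($"s".cast("binary"), 256))  // 64-character hex digest
  .show(false)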
 
-    def shiftleft(e: Column, numBits: Int): Column Shift the given value numBits left. If the given value is a long value, this function will return a long value else it will return an integer value. - Since
- 3.2.0 
 
-    def shiftright(e: Column, numBits: Int): Column (Signed) shift the given value numBits right. If the given value is a long value, it will return a long value else it will return an integer value. - Since
- 3.2.0 
 
-    def shiftrightunsigned(e: Column, numBits: Int): Column Unsigned shift the given value numBits right. If the given value is a long value, it will return a long value else it will return an integer value. - Since
- 3.2.0 
 
-    def shuffle(e: Column, seed: Column): Column Returns a random permutation of the given array. - Since
- 4.0.0 
- Note
- The function is non-deterministic. 
 
-    def shuffle(e: Column): Column Returns a random permutation of the given array. - Since
- 2.4.0 
- Note
- The function is non-deterministic. 
 
-    def sign(e: Column): Column Computes the signum of the given value. - Since
- 3.5.0 
 
-    def signum(columnName: String): Column Computes the signum of the given column. - Since
- 1.4.0 
 
-    def signum(e: Column): Column Computes the signum of the given value. - Since
- 1.4.0 
 
-    def sin(columnName: String): Column- columnName
- angle in radians 
- returns
- sine of the angle, as if computed by - java.lang.Math.sin
 - Since
- 1.4.0 
 
-    def sin(e: Column): Column- e
- angle in radians 
- returns
- sine of the angle, as if computed by - java.lang.Math.sin
 - Since
- 1.4.0 
 
-    def sinh(columnName: String): Column- columnName
- hyperbolic angle 
- returns
- hyperbolic sine of the given value, as if computed by - java.lang.Math.sinh
 - Since
- 1.4.0 
 
-    def sinh(e: Column): Column- e
- hyperbolic angle 
- returns
- hyperbolic sine of the given value, as if computed by - java.lang.Math.sinh
 - Since
- 1.4.0 
 
-    def size(e: Column): Column Returns length of array or map. This function returns -1 for null input only if spark.sql.ansi.enabled is false and spark.sql.legacy.sizeOfNull is true. Otherwise, it returns null for null input. With the default settings, the function returns null for null input. - Since
- 1.5.0 
 
-    def skewness(columnName: String): Column Aggregate function: returns the skewness of the values in a group. - Since
- 1.6.0 
 
-    def skewness(e: Column): Column Aggregate function: returns the skewness of the values in a group. - Since
- 1.6.0 
 
-    def slice(x: Column, start: Column, length: Column): Column Returns an array containing all the elements in x from index start (or starting from the end if start is negative) with the specified length. - x
- the array column to be sliced 
- start
- the starting index 
- length
- the length of the slice 
 - Since
- 3.1.0 
 
-    def slice(x: Column, start: Int, length: Int): Column Returns an array containing all the elements in x from index start (or starting from the end if start is negative) with the specified length. - x
- the array column to be sliced 
- start
- the starting index 
- length
- the length of the slice 
 - Since
- 2.4.0 
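For example (assumes spark.implicits._; note the 1-based start):

import org.apache.spark.sql.functions._
import spark.implicits._
Seq(Seq(1, 2, 3, 4)).toDF("xs")
  .select(slice($"xs", 2, 2))  // [2, 3]
  .show()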
 
-    def some(e: Column): Column Aggregate function: returns true if at least one value of e is true. - Since
- 3.5.0 
 
-    def sort_array(e: Column, asc: Boolean): Column Sorts the input array for the given column in ascending or descending order, according to the natural ordering of the array elements. NaN is greater than any non-NaN elements for double/float type. Null elements will be placed at the beginning of the returned array in ascending order or at the end of the returned array in descending order. - Since
- 1.5.0 
 
-    def sort_array(e: Column): Column Sorts the input array for the given column in ascending order, according to the natural ordering of the array elements. Null elements will be placed at the beginning of the returned array. - Since
- 1.5.0 
 
-    def soundex(e: Column): Column Returns the soundex code for the specified expression. - Since
- 1.5.0 
 
-    def spark_partition_id(): Column Partition ID. - Since
- 1.6.0 
- Note
- This is non-deterministic because it depends on data partitioning and task scheduling. 
 
-    def split(str: Column, pattern: Column, limit: Column): Column Splits str around matches of the given pattern. - str
- a string expression to split 
- pattern
- a column of string representing a regular expression. The regex string should be a Java regular expression. 
- limit
- a column of integer expression which controls the number of times the regex is applied. - limit greater than 0: The resulting array's length will not be more than limit, and the resulting array's last entry will contain all input beyond the last matched regex.
- limit less than or equal to 0: regex will be applied as many times as possible, and the resulting array can be of any size.
 
 - Since
- 4.0.0 
 
-    def split(str: Column, pattern: String, limit: Int): Column Splits str around matches of the given pattern. - str
- a string expression to split 
- pattern
- a string representing a regular expression. The regex string should be a Java regular expression. 
- limit
- an integer expression which controls the number of times the regex is applied. - limit greater than 0: The resulting array's length will not be more than limit, and the resulting array's last entry will contain all input beyond the last matched regex.
- limit less than or equal to 0: regex will be applied as many times as possible, and the resulting array can be of any size.
 
 - Since
- 3.0.0 
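The effect of limit, sketched (assumes spark.implicits._):

import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq("a,b,c,d").toDF("s")
df.select(split($"s", ",", 2)).show(false)   // [a, b,c,d] -- at most 2 entries
df.select(split($"s", ",", -1)).show(false)  // [a, b, c, d]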
 
-    def split(str: Column, pattern: Column): Column Splits str around matches of the given pattern. - str
- a string expression to split 
- pattern
- a column of string representing a regular expression. The regex string should be a Java regular expression. 
 - Since
- 4.0.0 
 
-    def split(str: Column, pattern: String): Column Splits str around matches of the given pattern. - str
- a string expression to split 
- pattern
- a string representing a regular expression. The regex string should be a Java regular expression. 
 - Since
- 1.5.0 
 
-    def split_part(str: Column, delimiter: Column, partNum: Column): Column Splits str by delimiter and returns the requested part of the split (1-based). If any input is null, returns null. If partNum is out of range of split parts, returns an empty string. If partNum is 0, throws an error. If partNum is negative, the parts are counted backward from the end of the string. If the delimiter is an empty string, the str is not split. - Since
- 3.5.0 
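For example (assumes spark.implicits._):

import org.apache.spark.sql.functions._
import spark.implicits._
Seq("org~apache~spark").toDF("s")
  .select(split_part($"s", lit("~"), lit(2)))  // "apache"
  .show()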
 
-    def sqrt(colName: String): Column Computes the square root of the specified float value. - Since
- 1.5.0 
 
-    def sqrt(e: Column): Column Computes the square root of the specified float value. - Since
- 1.3.0 
 
-    def stack(cols: Column*): Column Separates col1, ..., colk into n rows. Uses column names col0, col1, etc. by default unless specified otherwise. - Annotations
- @varargs()
- Since
- 3.5.0 
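For example, unpivoting one wide row into two narrow rows (a sketch assuming an active SparkSession spark):

import org.apache.spark.sql.functions._
// The first argument is the number of rows to produce.
spark.range(1)
  .select(stack(lit(2), lit("a"), lit(1), lit("b"), lit(2)))
  .show()  // two rows: (a, 1) and (b, 2), in columns col0 and col1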
 
-    def startswith(str: Column, prefix: Column): Column Returns a boolean. The value is True if str starts with prefix. Returns NULL if either input expression is NULL. Otherwise, returns False. Both str and prefix must be of STRING or BINARY type. - Since
- 3.5.0 
 
-    def std(e: Column): Column Aggregate function: alias for stddev_samp. - Since
- 3.5.0 
 
-    def stddev(columnName: String): Column Aggregate function: alias for stddev_samp. - Since
- 1.6.0 
 
-    def stddev(e: Column): Column Aggregate function: alias for stddev_samp. - Since
- 1.6.0 
 
-    def stddev_pop(columnName: String): Column Aggregate function: returns the population standard deviation of the expression in a group. - Since
- 1.6.0 
 
-    def stddev_pop(e: Column): Column Aggregate function: returns the population standard deviation of the expression in a group. - Since
- 1.6.0 
 
-    def stddev_samp(columnName: String): Column Aggregate function: returns the sample standard deviation of the expression in a group. - Since
- 1.6.0 
 
-    def stddev_samp(e: Column): Column Aggregate function: returns the sample standard deviation of the expression in a group. - Since
- 1.6.0 
 
-    def str_to_map(text: Column): Column Creates a map after splitting the text into key/value pairs using delimiters. - Since
- 3.5.0 
 
-    def str_to_map(text: Column, pairDelim: Column): Column Creates a map after splitting the text into key/value pairs using delimiters. The pairDelim is treated as a regular expression. - Since
- 3.5.0 
 
-    def str_to_map(text: Column, pairDelim: Column, keyValueDelim: Column): Column Creates a map after splitting the text into key/value pairs using delimiters. Both pairDelim and keyValueDelim are treated as regular expressions. - Since
- 3.5.0 
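For example (assumes spark.implicits._):

import org.apache.spark.sql.functions._
import spark.implicits._
Seq("a:1,b:2").toDF("s")
  .select(str_to_map($"s", lit(","), lit(":")))  // {a -> 1, b -> 2}
  .show(false)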
 
-    def string_agg(e: Column, delimiter: Column): Column Aggregate function: returns the concatenation of non-null input values, separated by the delimiter. Alias for listagg. - Since
- 4.0.0 
 
-    def string_agg(e: Column): Column Aggregate function: returns the concatenation of non-null input values. Alias for listagg. - Since
- 4.0.0 
 
-    def string_agg_distinct(e: Column, delimiter: Column): Column Aggregate function: returns the concatenation of distinct non-null input values, separated by the delimiter. Alias for listagg. - Since
- 4.0.0 
 
-    def string_agg_distinct(e: Column): Column Aggregate function: returns the concatenation of distinct non-null input values. Alias for listagg. - Since
- 4.0.0 
 
-    def struct(colName: String, colNames: String*): Column Creates a new struct column that composes multiple input columns. - Annotations
- @varargs()
- Since
- 1.4.0 
 
-    def struct(cols: Column*): Column Creates a new struct column. If the input column is a column in a DataFrame, or a derived column expression that is named (i.e. aliased), its name would be retained as the StructField's name; otherwise, the newly generated StructField's name would be auto generated as col with a suffix index + 1, i.e. col1, col2, col3, ... - Annotations
- @varargs()
- Since
- 1.4.0 
 
-    def substr(str: Column, pos: Column): Column Returns the substring of str that starts at pos, or the slice of byte array that starts at pos. - Since
- 3.5.0 
 
-    def substr(str: Column, pos: Column, len: Column): Column Returns the substring of str that starts at pos and is of length len, or the slice of byte array that starts at pos and is of length len. - Since
- 3.5.0 
 
-    def substring(str: Column, pos: Column, len: Column): Column Substring starts at pos and is of length len when str is String type, or returns the slice of byte array that starts at pos in byte and is of length len when str is Binary type. - Since
- 4.0.0 
- Note
- The position is not zero-based, but a 1-based index. 
 
-    def substring(str: Column, pos: Int, len: Int): Column Substring starts at pos and is of length len when str is String type, or returns the slice of byte array that starts at pos in byte and is of length len when str is Binary type. - Since
- 1.5.0 
- Note
- The position is not zero-based, but a 1-based index. 
 
-    def substring_index(str: Column, delim: String, count: Int): Column Returns the substring from string str before count occurrences of the delimiter delim. If count is positive, everything to the left of the final delimiter (counting from the left) is returned. If count is negative, everything to the right of the final delimiter (counting from the right) is returned. substring_index performs a case-sensitive match when searching for delim. - Since
- 1.5.0 
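For example (assumes spark.implicits._):

import org.apache.spark.sql.functions._
import spark.implicits._
val hosts = Seq("www.apache.org").toDF("s")
hosts.select(substring_index($"s", ".", 2)).show()   // www.apache
hosts.select(substring_index($"s", ".", -2)).show()  // apache.org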
-    def sum(columnName: String): Column Aggregate function: returns the sum of all values in the given column. - Since
- 1.3.0 
 
-    def sum(e: Column): Column Aggregate function: returns the sum of all values in the expression. - Since
- 1.3.0 
 
-    def sum_distinct(e: Column): Column Aggregate function: returns the sum of distinct values in the expression. - Since
- 3.2.0 
 
-   final  def synchronized[T0](arg0: => T0): T0- Definition Classes
- AnyRef
 
-    def tan(columnName: String): Column- columnName
- angle in radians 
- returns
- tangent of the given value, as if computed by - java.lang.Math.tan
 - Since
- 1.4.0 
 
-    def tan(e: Column): Column- e
- angle in radians 
- returns
- tangent of the given value, as if computed by - java.lang.Math.tan
 - Since
- 1.4.0 
 
-    def tanh(columnName: String): Column- columnName
- hyperbolic angle 
- returns
- hyperbolic tangent of the given value, as if computed by - java.lang.Math.tanh
 - Since
- 1.4.0 
 
-    def tanh(e: Column): Column- e
- hyperbolic angle 
- returns
- hyperbolic tangent of the given value, as if computed by - java.lang.Math.tanh
 - Since
- 1.4.0 
 
-    def theta_difference(columnName1: String, columnName2: String): Column Subtracts two binary representations of Datasketches ThetaSketch objects in the input columns using a Datasketches AnotB object. - Since
- 4.1.0 
 
-    def theta_difference(c1: Column, c2: Column): Column Subtracts two binary representations of Datasketches ThetaSketch objects in the input columns using a Datasketches AnotB object. - Since
- 4.1.0 
 
-    def theta_intersection(columnName1: String, columnName2: String): Column Intersects two binary representations of Datasketches ThetaSketch objects in the input columns using a Datasketches Intersection object. - Since
- 4.1.0 
 
-    def theta_intersection(c1: Column, c2: Column): Column Intersects two binary representations of Datasketches ThetaSketch objects in the input columns using a Datasketches Intersection object. - Since
- 4.1.0 
 
-    def theta_intersection_agg(columnName: String): Column Aggregate function: returns the compact binary representation of the Datasketches ThetaSketch, generated by intersecting the Datasketches ThetaSketch instances in the input column via a Datasketches Intersection instance. - Since
- 4.1.0 
 
-    def theta_intersection_agg(e: Column): Column Aggregate function: returns the compact binary representation of the Datasketches ThetaSketch, generated by intersecting the Datasketches ThetaSketch instances in the input column via a Datasketches Intersection instance. - Since
- 4.1.0 
 
-    def theta_sketch_agg(columnName: String): Column Aggregate function: returns the compact binary representation of the Datasketches ThetaSketch built with the values in the input column and configured with the default value of 12 for lgNomEntries. - Since
- 4.1.0 
 
-    def theta_sketch_agg(e: Column): Column Aggregate function: returns the compact binary representation of the Datasketches ThetaSketch built with the values in the input column and configured with the default value of 12 for lgNomEntries. - Since
- 4.1.0 
 
-    def theta_sketch_agg(columnName: String, lgNomEntries: Int): Column Aggregate function: returns the compact binary representation of the Datasketches ThetaSketch built with the values in the input column and configured with the lgNomEntries nominal entries. - Since
- 4.1.0 
 
-    def theta_sketch_agg(e: Column, lgNomEntries: Int): Column Aggregate function: returns the compact binary representation of the Datasketches ThetaSketch built with the values in the input column and configured with the lgNomEntries nominal entries. - Since
- 4.1.0 
 
-    def theta_sketch_agg(e: Column, lgNomEntries: Column): Column Aggregate function: returns the compact binary representation of the Datasketches ThetaSketch built with the values in the input column and configured with the lgNomEntries nominal entries. - Since
- 4.1.0 
 
-    def theta_sketch_estimate(columnName: String): Column Returns the estimated number of unique values given the binary representation of a Datasketches ThetaSketch. - Since
- 4.1.0 
 
-    def theta_sketch_estimate(c: Column): Column Returns the estimated number of unique values given the binary representation of a Datasketches ThetaSketch. - Since
- 4.1.0 
 
-    def theta_union(c1: Column, c2: Column, lgNomEntries: Column): Column Unions two binary representations of Datasketches ThetaSketch objects in the input columns using a Datasketches Union object. It allows the configuration of lgNomEntries log nominal entries for the union buffer. - Since
- 4.1.0 
 
-    def theta_union(columnName1: String, columnName2: String, lgNomEntries: Int): Column Unions two binary representations of Datasketches ThetaSketch objects in the input columns using a Datasketches Union object. It allows the configuration of lgNomEntries log nominal entries for the union buffer. - Since
- 4.1.0 
 
-    def theta_union(c1: Column, c2: Column, lgNomEntries: Int): Column Unions two binary representations of Datasketches ThetaSketch objects in the input columns using a Datasketches Union object. It allows the configuration of lgNomEntries log nominal entries for the union buffer. - Since
- 4.1.0 
 
-    def theta_union(columnName1: String, columnName2: String): Column Unions two binary representations of Datasketches ThetaSketch objects in the input columns using a Datasketches Union object. It is configured with the default value of 12 for lgNomEntries. - Since
- 4.1.0 
 
-    def theta_union(c1: Column, c2: Column): Column Unions two binary representations of Datasketches ThetaSketch objects in the input columns using a Datasketches Union object. It is configured with the default value of 12 for lgNomEntries. - Since
- 4.1.0 
 
-    def theta_union_agg(columnName: String): Column Aggregate function: returns the compact binary representation of the Datasketches ThetaSketch, generated by the union of Datasketches ThetaSketch instances in the input column via a Datasketches Union instance. It is configured with the default value of 12 for lgNomEntries. - Since
- 4.1.0 
 
-    def theta_union_agg(e: Column): Column Aggregate function: returns the compact binary representation of the Datasketches ThetaSketch, generated by the union of Datasketches ThetaSketch instances in the input column via a Datasketches Union instance. It is configured with the default value of 12 for lgNomEntries. - Since
- 4.1.0 
 
-    def theta_union_agg(columnName: String, lgNomEntries: Int): Column Aggregate function: returns the compact binary representation of the Datasketches ThetaSketch, generated by the union of Datasketches ThetaSketch instances in the input column via a Datasketches Union instance. It allows the configuration of lgNomEntries log nominal entries for the union buffer. - Since
- 4.1.0 
 
-    def theta_union_agg(e: Column, lgNomEntries: Int): Column Aggregate function: returns the compact binary representation of the Datasketches ThetaSketch, generated by the union of Datasketches ThetaSketch instances in the input column via a Datasketches Union instance. It allows the configuration of lgNomEntries log nominal entries for the union buffer. - Since
- 4.1.0 
 
-    def theta_union_agg(e: Column, lgNomEntries: Column): Column Aggregate function: returns the compact binary representation of the Datasketches ThetaSketch, generated by the union of Datasketches ThetaSketch instances in the input column via a Datasketches Union instance. It allows the configuration of lgNomEntries log nominal entries for the union buffer. - Since
- 4.1.0 
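A typical flow is to aggregate values into a sketch and then estimate the distinct count from it; a minimal sketch of that pattern (requires Spark 4.1+, assumes spark.implicits._):

import org.apache.spark.sql.functions._
import spark.implicits._
val sk = Seq(1, 2, 2, 3, 3, 3).toDF("v")
  .agg(theta_sketch_agg($"v").as("sk"))         // compact binary ThetaSketch
sk.select(theta_sketch_estimate($"sk")).show()  // approximately 3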
 
-    def time_diff(unit: Column, start: Column, end: Column): Column Returns the difference between two times, measured in specified units. Throws a SparkIllegalArgumentException in case the specified unit is not supported. - unit
- A STRING representing the unit of the time difference. Supported units are: "HOUR", "MINUTE", "SECOND", "MILLISECOND", and "MICROSECOND". The unit is case-insensitive. 
- start
- A starting TIME. 
- end
- An ending TIME. 
- returns
- The difference between end and start, measured in the specified units. 
 - Since
- 4.1.0 
- Note
- If any of the inputs is NULL, the result is NULL. 
 
-    def time_trunc(unit: Column, time: Column): Column Returns time truncated to the unit. - unit
- A STRING representing the unit to truncate the time to. Supported units are: "HOUR", "MINUTE", "SECOND", "MILLISECOND", and "MICROSECOND". The unit is case-insensitive. 
- time
- A TIME to truncate. 
- returns
- A TIME truncated to the specified unit. 
 - Since
- 4.1.0 
- Exceptions thrown
- IllegalArgumentException if the unit is not supported.
- Note
- If any of the inputs is NULL, the result is NULL. 
 
-    def timestamp_add(unit: String, quantity: Column, ts: Column): Column Adds the specified number of units to the given timestamp. - Since
- 4.0.0 
 
-    def timestamp_diff(unit: String, start: Column, end: Column): Column Gets the difference between the timestamps in the specified units by truncating the fraction part. - Since
- 4.0.0 
 
-    def timestamp_micros(e: Column): Column Creates timestamp from the number of microseconds since UTC epoch. - Since
- 3.5.0 
 
-    def timestamp_millis(e: Column): Column Creates timestamp from the number of milliseconds since UTC epoch. - Since
- 3.5.0 
 
-    def timestamp_seconds(e: Column): Column Converts the number of seconds from the Unix epoch (1970-01-01T00:00:00Z) to a timestamp. - Since
- 3.1.0 
 
-    def toString(): String- Definition Classes
- AnyRef → Any
 
-    def to_binary(e: Column): Column Converts the input e to a binary value based on the default format "hex". The function returns NULL if at least one of the input parameters is NULL. - Since
- 3.5.0 
 
-    def to_binary(e: Column, f: Column): Column Converts the input e to a binary value based on the supplied format. The format can be a case-insensitive string literal of "hex", "utf-8", "utf8", or "base64". By default, the binary format for conversion is "hex" if format is omitted. The function returns NULL if at least one of the input parameters is NULL. - Since
- 3.5.0 
 
-    def to_char(e: Column, format: Column): Column Convert e to a string based on the format. Throws an exception if the conversion fails. The format can consist of the following characters, case insensitive: '0' or '9': Specifies an expected digit between 0 and 9. A sequence of 0 or 9 in the format string matches a sequence of digits in the input value, generating a result string of the same length as the corresponding sequence in the format string. The result string is left-padded with zeros if the 0/9 sequence comprises more digits than the matching part of the decimal value, starts with 0, and is before the decimal point. Otherwise, it is padded with spaces. '.' or 'D': Specifies the position of the decimal point (optional, only allowed once). ',' or 'G': Specifies the position of the grouping (thousands) separator (,). There must be a 0 or 9 to the left and right of each grouping separator. '$': Specifies the location of the $ currency sign. This character may only be specified once. 'S' or 'MI': Specifies the position of a '-' or '+' sign (optional, only allowed once at the beginning or end of the format string). Note that 'S' prints '+' for positive values but 'MI' prints a space. 'PR': Only allowed at the end of the format string; specifies that the result string will be wrapped by angle brackets if the input value is negative. If e is a datetime, format shall be a valid datetime pattern, see Datetime Patterns (https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html). If e is a binary, it is converted to a string in one of the formats: 'base64': a base 64 string. 'hex': a string in the hexadecimal format. 'utf-8': the input binary is decoded to UTF-8 string. - Since
- 3.5.0 
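For example (assumes spark.implicits._):

import org.apache.spark.sql.functions._
import spark.implicits._
Seq(78.12).toDF("x")
  .select(to_char($"x", lit("$99.99")))  // "$78.12"
  .show()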
 
-    def to_csv(e: Column): Column Converts a column containing a StructType into a CSV string with the specified schema. Throws an exception in the case of an unsupported type. - e
- a column containing a struct. 
 - Since
- 3.0.0 
 
-    def to_csv(e: Column, options: Map[String, String]): Column (Java-specific) Converts a column containing a StructType into a CSV string with the specified schema. Throws an exception in the case of an unsupported type. - e
- a column containing a struct. 
- options
- options to control how the struct column is converted into a CSV string. It accepts the same options as the CSV data source. See Data Source Option in the version you use. 
 - Since
- 3.0.0 
 
-    def to_date(e: Column, fmt: String): Column Converts the column into a DateType with a specified format. See Datetime Patterns for valid date and time format patterns. - e
- A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS
- fmt
- A date time pattern detailing the format of e when e is a string
- returns
- A date, or null if e was a string that could not be cast to a date or fmt was an invalid format
 - Since
- 2.2.0 
 
-    def to_date(e: Column): Column Converts the column into DateType by casting rules to DateType. - Since
- 1.5.0 
 
-    def to_json(e: Column): Column Converts a column containing a StructType, ArrayType or a MapType into a JSON string with the specified schema. Throws an exception in the case of an unsupported type. - e
- a column containing a struct, an array or a map. 
 - Since
- 2.1.0 
 
-    def to_json(e: Column, options: Map[String, String]): Column (Java-specific) Converts a column containing a StructType, ArrayType or a MapType into a JSON string with the specified schema. Throws an exception in the case of an unsupported type. - e
- a column containing a struct, an array or a map. 
- options
- options to control how the struct column is converted into a JSON string. Accepts the same options as the JSON data source. See Data Source Option in the version you use. Additionally, the function supports the pretty option, which enables pretty JSON generation. 
 - Since
- 2.1.0 
 
-    def to_json(e: Column, options: Map[String, String]): Column (Scala-specific) Converts a column containing a StructType, ArrayType or a MapType into a JSON string with the specified schema. Throws an exception in the case of an unsupported type. - e
- a column containing a struct, an array or a map. 
- options
- options to control how the struct column is converted into a JSON string. Accepts the same options as the JSON data source. See Data Source Option in the version you use. Additionally, the function supports the pretty option, which enables pretty JSON generation. 
 - Since
- 2.1.0 
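For example (assumes spark.implicits._):

import org.apache.spark.sql.functions._
import spark.implicits._
Seq((1, "alice")).toDF("id", "name")
  .select(to_json(struct($"id", $"name")))  // {"id":1,"name":"alice"}
  .show(false)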
 
-    def to_number(e: Column, format: Column): Column Convert string 'e' to a number based on the string format 'format'. Throws an exception if the conversion fails. The format can consist of the following characters, case insensitive: '0' or '9': Specifies an expected digit between 0 and 9. A sequence of 0 or 9 in the format string matches a sequence of digits in the input string. If the 0/9 sequence starts with 0 and is before the decimal point, it can only match a digit sequence of the same size. Otherwise, if the sequence starts with 9 or is after the decimal point, it can match a digit sequence that has the same or smaller size. '.' or 'D': Specifies the position of the decimal point (optional, only allowed once). ',' or 'G': Specifies the position of the grouping (thousands) separator (,). There must be a 0 or 9 to the left and right of each grouping separator. 'expr' must match the grouping separator relevant for the size of the number. '$': Specifies the location of the $ currency sign. This character may only be specified once. 'S' or 'MI': Specifies the position of a '-' or '+' sign (optional, only allowed once at the beginning or end of the format string). Note that 'S' allows '-' but 'MI' does not. 'PR': Only allowed at the end of the format string; specifies that 'expr' indicates a negative number with wrapping angled brackets. - Since
- 3.5.0 
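to_number is the inverse direction of to_char; for example (assumes spark.implicits._):

import org.apache.spark.sql.functions._
import spark.implicits._
Seq("$78.12").toDF("s")
  .select(to_number($"s", lit("$99.99")))  // 78.12 as a decimal
  .show()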
 
-    def to_time(str: Column, format: Column): Column Parses a string value to a time value. See Datetime Patterns for valid time format patterns. - str
- A string to be parsed to time. 
- format
- A time format pattern to follow. 
- returns
- A time, or raises an error if the input is malformed. 
 - Since
- 4.1.0 
 
-    def to_time(str: Column): Column Parses a string value to a time value. - str
- A string to be parsed to time. 
- returns
- A time, or raises an error if the input is malformed. 
 - Since
- 4.1.0 
 
-    def to_timestamp(s: Column, fmt: String): Column Converts time string with the given pattern to timestamp. See Datetime Patterns for valid date and time format patterns. - s
- A date, timestamp or string. If a string, the data must be in a format that can be cast to a timestamp, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS
- fmt
- A date time pattern detailing the format of s when s is a string
- returns
- A timestamp, or null if s was a string that could not be cast to a timestamp or fmt was an invalid format
 - Since
- 2.2.0 
 
-    def to_timestamp(s: Column): Column Converts to a timestamp by casting rules to TimestampType. - s
- A date, timestamp or string. If a string, the data must be in a format that can be cast to a timestamp, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS
- returns
- A timestamp, or null if the input was a string that could not be cast to a timestamp 
 - Since
- 2.2.0 
 
-    def to_timestamp_ltz(timestamp: Column): Column Parses the timestamp expression with the default format to a timestamp with local time zone. The default format follows casting rules to a timestamp. Returns null with invalid input. - Since
- 3.5.0 
 
-    def to_timestamp_ltz(timestamp: Column, format: Column): Column Parses the timestamp expression with the format expression to a timestamp with local time zone. Returns null with invalid input. - Since
- 3.5.0 
 
-    def to_timestamp_ntz(timestamp: Column): Column Parses the timestamp expression with the default format to a timestamp without time zone. The default format follows casting rules to a timestamp. Returns null with invalid input. - Since
- 3.5.0 
 
-    def to_timestamp_ntz(timestamp: Column, format: Column): Column Parses the timestamp expression with the format expression to a timestamp without time zone. Returns null with invalid input. - Since
- 3.5.0 
 
-    def to_unix_timestamp(timeExp: Column): Column Returns the UNIX timestamp of the given time. - Since
- 3.5.0 
 
-    def to_unix_timestamp(timeExp: Column, format: Column): Column Returns the UNIX timestamp of the given time. - Since
- 3.5.0 
 
-    def to_utc_timestamp(ts: Column, tz: Column): Column Given a timestamp like '2017-07-14 02:40:00.0', interprets it as a time in the given time zone, and renders that time as a timestamp in UTC. For example, 'GMT+1' would yield '2017-07-14 01:40:00.0'. - Since
- 2.4.0 
 
-    def to_utc_timestamp(ts: Column, tz: String): Column Given a timestamp like '2017-07-14 02:40:00.0', interprets it as a time in the given time zone, and renders that time as a timestamp in UTC. For example, 'GMT+1' would yield '2017-07-14 01:40:00.0'. - ts
- A date, timestamp or string. If a string, the data must be in a format that can be cast to a timestamp, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS
- tz
- A string detailing the time zone ID that the input should be adjusted to. It should be in the format of either region-based zone IDs or zone offsets. Region IDs must have the form 'area/city', such as 'America/Los_Angeles'. Zone offsets must be in the format '(+|-)HH:mm', for example '-08:00' or '+01:00'. Also 'UTC' and 'Z' are supported as aliases of '+00:00'. Other short names are not recommended to use because they can be ambiguous. 
- returns
- A timestamp, or null if ts was a string that could not be cast to a timestamp or tz was an invalid value
 - Since
- 1.5.0 
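For example (assumes spark.implicits._; the displayed value also depends on the session time zone):

import org.apache.spark.sql.functions._
import spark.implicits._
Seq("2017-07-14 02:40:00").toDF("ts")
  .select(to_utc_timestamp($"ts", "GMT+1"))  // 2017-07-14 01:40:00
  .show(false)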
 
-    def to_varchar(e: Column, format: Column): Column Convert e to a string based on the format. Throws an exception if the conversion fails. The format can consist of the following characters, case insensitive: '0' or '9': Specifies an expected digit between 0 and 9. A sequence of 0 or 9 in the format string matches a sequence of digits in the input value, generating a result string of the same length as the corresponding sequence in the format string. The result string is left-padded with zeros if the 0/9 sequence comprises more digits than the matching part of the decimal value, starts with 0, and is before the decimal point. Otherwise, it is padded with spaces. '.' or 'D': Specifies the position of the decimal point (optional, only allowed once). ',' or 'G': Specifies the position of the grouping (thousands) separator (,). There must be a 0 or 9 to the left and right of each grouping separator. '$': Specifies the location of the $ currency sign. This character may only be specified once. 'S' or 'MI': Specifies the position of a '-' or '+' sign (optional, only allowed once at the beginning or end of the format string). Note that 'S' prints '+' for positive values but 'MI' prints a space. 'PR': Only allowed at the end of the format string; specifies that the result string will be wrapped by angle brackets if the input value is negative. If e is a datetime, format shall be a valid datetime pattern, see Datetime Patterns (https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html). If e is a binary, it is converted to a string in one of the formats: 'base64': a base 64 string. 'hex': a string in the hexadecimal format. 'utf-8': the input binary is decoded to UTF-8 string. - Since
- 3.5.0 
 
-    def to_variant_object(col: Column): Column Converts a column containing nested inputs (array/map/struct) into a variant, where maps and structs are converted to variant objects, which are unordered unlike SQL structs. Input maps can only have string keys. - col
- a column with a nested schema or column name. 
 - Since
- 4.0.0 
 
-    def to_xml(e: Column): Column Converts a column containing a StructType into an XML string with the specified schema. Throws an exception in the case of an unsupported type. - e
- a column containing a struct. 
 - Since
- 4.0.0 
 
-    def to_xml(e: Column, options: Map[String, String]): Column (Java-specific) Converts a column containing a StructType into an XML string with the specified schema. Throws an exception in the case of an unsupported type. - e
- a column containing a struct. 
- options
- options to control how the struct column is converted into an XML string. It accepts the same options as the XML data source. See Data Source Option in the version you use. 
 - Since
- 4.0.0 
 
-    def transform(column: Column, f: (Column, Column) => Column): Column Returns an array of elements after applying a transformation to each element in the input array. df.select(transform(col("i"), (x, i) => x + i)) - column
- the input array column 
- f
- (col, index) => transformed_col, the lambda function to transform the input column given the index. Indices start at 0. 
 - Since
- 3.0.0 
 
-    def transform(column: Column, f: (Column) => Column): Column Returns an array of elements after applying a transformation to each element in the input array. df.select(transform(col("i"), x => x + 1)) - column
- the input array column 
- f
- col => transformed_col, the lambda function to transform the input column 
 - Since
- 3.0.0 
 
-    def transform_keys(expr: Column, f: (Column, Column) => Column): Column Applies a function to every key-value pair in a map and returns a map with the results of those applications as the new keys for the pairs. df.select(transform_keys(col("i"), (k, v) => k + v)) - expr
- the input map column 
- f
- (key, value) => new_key, the lambda function to transform the key of input map column 
 - Since
- 3.0.0 
 
-    def transform_values(expr: Column, f: (Column, Column) => Column): Column Applies a function to every key-value pair in a map and returns a map with the results of those applications as the new values for the pairs. df.select(transform_values(col("i"), (k, v) => k + v)) - expr
- the input map column 
- f
- (key, value) => new_value, the lambda function to transform the value of input map column 
 - Since
- 3.0.0 
 
-    def translate(src: Column, matchingString: String, replaceString: String): Column Translate any character in the src by a character in replaceString. The characters in replaceString correspond to the characters in matchingString. The translate will happen when any character in the string matches the character in the matchingString. - Since
- 1.5.0 
 
-    def trim(e: Column, trim: Column): Column Trim the specified character from both ends for the specified string column. - Since
- 4.0.0 
 
-    def trim(e: Column, trimString: String): Column Trim the specified character from both ends for the specified string column. - Since
- 2.3.0 
 
-    def trim(e: Column): Column Trim the spaces from both ends for the specified string column. - Since
- 1.5.0 
 
-    def trunc(date: Column, format: String): Column Returns date truncated to the unit specified by the format. For example, trunc("2018-11-19 12:01:19", "year") returns 2018-01-01. - date
- A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS
- returns
- A date, or null if date was a string that could not be cast to a date or format was an invalid value
 - Since
- 1.5.0 
 
-    def try_add(left: Column, right: Column): Column Returns the sum of left and right, and the result is null on overflow. The acceptable input types are the same as the + operator. - Since
- 3.5.0 
 
-    def try_aes_decrypt(input: Column, key: Column): Column Returns a decrypted value of input. - Since
- 3.5.0 
- See also
- org.apache.spark.sql.functions.try_aes_decrypt(Column, Column, Column, Column, Column)
 
-    def try_aes_decrypt(input: Column, key: Column, mode: Column): Column Returns a decrypted value of input. - Since
- 3.5.0 
- See also
- org.apache.spark.sql.functions.try_aes_decrypt(Column, Column, Column, Column, Column)
 
-    def try_aes_decrypt(input: Column, key: Column, mode: Column, padding: Column): Column Returns a decrypted value of input. - Since
- 3.5.0 
- See also
- org.apache.spark.sql.functions.try_aes_decrypt(Column, Column, Column, Column, Column)
 
-    def try_aes_decrypt(input: Column, key: Column, mode: Column, padding: Column, aad: Column): Column This is a special version of aes_decrypt that performs the same operation, but returns a NULL value instead of raising an error if the decryption cannot be performed. - input
- The binary value to decrypt. 
- key
- The passphrase to use to decrypt the data. 
- mode
- Specifies which block cipher mode should be used to decrypt messages. Valid modes: ECB, GCM, CBC. 
- padding
- Specifies how to pad messages whose length is not a multiple of the block size. Valid values: PKCS, NONE, DEFAULT. The DEFAULT padding means PKCS for ECB, NONE for GCM and PKCS for CBC. 
- aad
- Optional additional authenticated data. Only supported for GCM mode. This can be any free-form input and must be provided for both encryption and decryption. 
 - Since
- 3.5.0 
 
-    def try_avg(e: Column): Column Returns the mean calculated from values of a group and the result is null on overflow. - Since
- 3.5.0 
 
-    def try_divide(left: Column, right: Column): Column Returns dividend/divisor. It always performs floating point division. Its result is always null if divisor is 0. - Since
- 3.5.0 
 
-    def try_element_at(column: Column, value: Column): Column (array, index) - Returns element of array at given (1-based) index. If index is 0, Spark will throw an error. If index < 0, accesses elements from the last to the first. The function always returns NULL if the index exceeds the length of the array. (map, key) - Returns value for given key. The function always returns NULL if the key is not contained in the map. - Since
- 3.5.0 
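The try_ variants trade runtime errors for NULLs, which makes them easy to demonstrate side by side (a sketch assuming an active SparkSession spark):

import org.apache.spark.sql.functions._
spark.range(1).select(
  try_divide(lit(6), lit(0)),                    // null instead of a division error
  try_element_at(array(lit(1), lit(2)), lit(5))  // null: index beyond array length
).show()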
 
-    def try_make_interval(years: Column): Column This is a special version of make_interval that performs the same operation, but returns a NULL value instead of raising an error if the interval cannot be created. - Since
- 4.0.0 
 
-    def try_make_interval(years: Column, months: Column): Column This is a special version of make_interval that performs the same operation, but returns a NULL value instead of raising an error if the interval cannot be created. - Since
- 4.0.0 
 
-    def try_make_interval(years: Column, months: Column, weeks: Column): Column This is a special version of make_interval that performs the same operation, but returns a NULL value instead of raising an error if the interval cannot be created. - Since
- 4.0.0 
 
-    def try_make_interval(years: Column, months: Column, weeks: Column, days: Column): ColumnThis is a special version of make_intervalthat performs the same operation, but returns a NULL value instead of raising an error if interval cannot be created.This is a special version of make_intervalthat performs the same operation, but returns a NULL value instead of raising an error if interval cannot be created.- Since
- 4.0.0 
 
-    def try_make_interval(years: Column, months: Column, weeks: Column, days: Column, hours: Column): ColumnThis is a special version of make_intervalthat performs the same operation, but returns a NULL value instead of raising an error if interval cannot be created.This is a special version of make_intervalthat performs the same operation, but returns a NULL value instead of raising an error if interval cannot be created.- Since
- 4.0.0 
 
-    def try_make_interval(years: Column, months: Column, weeks: Column, days: Column, hours: Column, mins: Column): ColumnThis is a special version of make_intervalthat performs the same operation, but returns a NULL value instead of raising an error if interval cannot be created.This is a special version of make_intervalthat performs the same operation, but returns a NULL value instead of raising an error if interval cannot be created.- Since
- 4.0.0 
 
-    def try_make_interval(years: Column, months: Column, weeks: Column, days: Column, hours: Column, mins: Column, secs: Column): ColumnThis is a special version of make_intervalthat performs the same operation, but returns a NULL value instead of raising an error if interval cannot be created.This is a special version of make_intervalthat performs the same operation, but returns a NULL value instead of raising an error if interval cannot be created.- Since
- 4.0.0 
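A hypothetical overflow case illustrating the NULL-instead-of-error contract; the specific values are only an assumption chosen to overflow the interval arithmetic:

  val df = Seq(Int.MaxValue).toDF("y")
  df.select(try_make_interval($"y", lit(Int.MaxValue))).show()   // NULL, no error raised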
 
-    def try_make_timestamp(date: Column, time: Column): ColumnTry to create a local date-time from date and time fields. - Since
- 4.1.0

-    def try_make_timestamp(date: Column, time: Column, timezone: Column): ColumnTry to create a local date-time from date, time, and timezone fields. - Since
- 4.1.0

-    def try_make_timestamp(years: Column, months: Column, days: Column, hours: Column, mins: Column, secs: Column): ColumnTry to create a timestamp from years, months, days, hours, mins, and secs fields. The result data type is consistent with the value of configuration spark.sql.timestampType. The function returns NULL on invalid inputs.- Since
- 4.0.0

-    def try_make_timestamp(years: Column, months: Column, days: Column, hours: Column, mins: Column, secs: Column, timezone: Column): ColumnTry to create a timestamp from years, months, days, hours, mins, secs and timezone fields. The result data type is consistent with the value of configuration spark.sql.timestampType. The function returns NULL on invalid inputs.- Since
- 4.0.0

-    def try_make_timestamp_ltz(years: Column, months: Column, days: Column, hours: Column, mins: Column, secs: Column): ColumnTry to create the current timestamp with local time zone from years, months, days, hours, mins and secs fields. The function returns NULL on invalid inputs. - Since
- 4.0.0

-    def try_make_timestamp_ltz(years: Column, months: Column, days: Column, hours: Column, mins: Column, secs: Column, timezone: Column): ColumnTry to create the current timestamp with local time zone from years, months, days, hours, mins, secs and timezone fields. The function returns NULL on invalid inputs. - Since
- 4.0.0

-    def try_make_timestamp_ntz(date: Column, time: Column): ColumnTry to create a local date-time from date and time fields. - Since
- 4.1.0

-    def try_make_timestamp_ntz(years: Column, months: Column, days: Column, hours: Column, mins: Column, secs: Column): ColumnTry to create a local date-time from years, months, days, hours, mins, secs fields. The function returns NULL on invalid inputs. - Since
- 4.0.0 
 
-    def try_mod(left: Column, right: Column): ColumnReturns the remainder of dividend/divisor. Its result is always null if divisor is 0.- Since
- 4.0.0 
 
-    def try_multiply(left: Column, right: Column): ColumnReturns left*right and the result is null on overflow. The acceptable input types are the same as the * operator.- Since
- 3.5.0 
 
-    def try_parse_json(json: Column): ColumnParses a JSON string and constructs a Variant value. Returns null if the input string is not a valid JSON value. - json
- a string column that contains JSON data. 
 - Since
- 4.0.0 
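A short sketch contrasting valid and invalid JSON input:

  val df = Seq("""{"a": 1}""", "{oops").toDF("j")
  df.select(try_parse_json($"j").as("v")).show()
  // row 1: a VARIANT value; row 2: NULL, where parse_json would have raised an error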
 
-    def try_parse_url(url: Column, partToExtract: Column): ColumnExtracts a part from a URL. - Since
- 4.0.0

-    def try_parse_url(url: Column, partToExtract: Column, key: Column): ColumnExtracts a part from a URL. - Since
- 4.0.0 
 
-    def try_reflect(cols: Column*): ColumnThis is a special version of reflect that performs the same operation, but returns a NULL value instead of raising an error if the invoked method throws an exception.- Annotations
- @varargs()
- Since
- 4.0.0 
 
-    def try_subtract(left: Column, right: Column): ColumnReturns left-right and the result is null on overflow. The acceptable input types are the same as the - operator.- Since
- 3.5.0 
 
-    def try_sum(e: Column): ColumnReturns the sum calculated from values of a group and the result is null on overflow. - Since
- 3.5.0 
 
-    def try_to_binary(e: Column): ColumnThis is a special version of to_binary that performs the same operation, but returns a NULL value instead of raising an error if the conversion cannot be performed.- Since
- 3.5.0

-    def try_to_binary(e: Column, f: Column): ColumnThis is a special version of to_binary that performs the same operation, but returns a NULL value instead of raising an error if the conversion cannot be performed.- Since
- 3.5.0 
 
-    def try_to_date(e: Column, fmt: String): ColumnThis is a special version of to_date that performs the same operation, but returns a NULL value instead of raising an error if the date cannot be created.- Since
- 4.0.0

-    def try_to_date(e: Column): ColumnThis is a special version of to_date that performs the same operation, but returns a NULL value instead of raising an error if the date cannot be created.- Since
- 4.0.0 
 
-    def try_to_number(e: Column, format: Column): ColumnConverts string e to a number based on the string format format. Returns NULL if the string e does not match the expected format. The format follows the same semantics as the to_number function.- Since
- 3.5.0 
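A sketch using the to_number format semantics ('9' for a digit, '$' for a currency sign):

  val df = Seq("$78.12", "not-a-number").toDF("s")
  df.select(try_to_number($"s", lit("$99.99"))).show()   // 78.12, then NULL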
 
-    def try_to_time(str: Column, format: Column): ColumnParses a string value to a time value. See Datetime Patterns for valid time format patterns. - str
- A string to be parsed to time. 
- format
- A time format pattern to follow. 
- returns
- A time, or null if the input is malformed. 
 - Since
- 4.1.0 
 
-    def try_to_time(str: Column): ColumnParses a string value to a time value. - str
- A string to be parsed to time. 
- returns
- A time, or null if the input is malformed. 
 - Since
- 4.1.0 
 
-    def try_to_timestamp(s: Column): ColumnParses s to a timestamp. The function always returns null on invalid input, whether or not ANSI SQL mode is enabled. It follows casting rules to a timestamp. The result data type is consistent with the value of configuration spark.sql.timestampType.- Since
- 3.5.0 
 
-    def try_to_timestamp(s: Column, format: Column): ColumnParses s with the format to a timestamp. The function always returns null on invalid input, whether or not ANSI SQL mode is enabled. The result data type is consistent with the value of configuration spark.sql.timestampType.- Since
- 3.5.0 
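A sketch with an impossible calendar date:

  val df = Seq("2024-02-30 10:00:00").toDF("s")   // February 30th does not exist
  df.select(
    try_to_timestamp($"s"),                                 // NULL
    try_to_timestamp($"s", lit("yyyy-MM-dd HH:mm:ss"))      // NULL, with or without ANSI mode
  ).show()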
 
-    def try_url_decode(str: Column): ColumnThis is a special version of url_decode that performs the same operation, but returns a NULL value instead of raising an error if the decoding cannot be performed.- Since
- 4.0.0 
 
-    def try_validate_utf8(str: Column): ColumnReturns the input value if it corresponds to a valid UTF-8 string, or NULL otherwise. - Since
- 4.0.0 
 
-    def try_variant_get(v: Column, path: Column, targetType: String): ColumnExtracts a sub-variant from v according to the path column, and then casts the sub-variant to targetType. Returns null if the path does not exist or the cast fails.- v
- a variant column. 
- path
- the column containing the extraction path strings. A valid path string should start with $ and be followed by zero or more segments like [123], .name, ['name'], or ["name"].
- targetType
- the target data type to cast into, in a DDL-formatted string. 
 - Since
- 4.0.0 
 
-    def try_variant_get(v: Column, path: String, targetType: String): ColumnExtracts a sub-variant from v according to the path string, and then casts the sub-variant to targetType. Returns null if the path does not exist or the cast fails.- v
- a variant column. 
- path
- the extraction path. A valid path should start with $ and be followed by zero or more segments like [123], .name, ['name'], or ["name"].
- targetType
- the target data type to cast into, in a DDL-formatted string. 
 - Since
- 4.0.0 
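A sketch pairing parse_json with try_variant_get (the JSON document is illustrative):

  val df = Seq("""{"a": {"b": [7, 8]}}""").toDF("j")
    .select(parse_json($"j").as("v"))
  df.select(
    try_variant_get($"v", "$.a.b[1]", "int"),    // 8
    try_variant_get($"v", "$.missing", "int"),   // NULL: path does not exist
    try_variant_get($"v", "$.a", "int")          // NULL: cast fails (variant_get would throw)
  ).show()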
 
-    def typedLit[T](literal: T)(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[T]): ColumnCreates a Column of literal value. An alias of typedlit, and it is encouraged to use typedlit directly.- Since
- 2.2.0 
 
-    def typedlit[T](literal: T)(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[T]): ColumnCreates a Column of literal value. The passed in object is returned directly if it is already a Column. If the object is a Scala Symbol, it is converted into a Column also. Otherwise, a new Column is created to represent the literal value. The difference between this function and lit is that this function can handle parameterized scala types e.g.: List, Seq and Map. - Since
- 3.2.0 
- Note
- typedlit will call expensive Scala reflection APIs. lit is preferred if parameterized Scala types are not used.
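A sketch of where typedlit helps over lit, assuming a DataFrame df:

  df.select(
    typedlit(Seq(1, 2, 3)).as("arr"),              // array<int> literal column
    typedlit(Map("a" -> 1, "b" -> 2)).as("m")      // map<string,int> literal; lit cannot build these
  )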
 
-    def typeof(col: Column): ColumnReturns a DDL-formatted type string for the data type of the input. - Since
- 3.5.0 
 
-    def ucase(str: Column): ColumnReturns str with all characters changed to uppercase.- Since
- 3.5.0 
 
-    def udaf[IN, BUF, OUT](agg: expressions.Aggregator[IN, BUF, OUT], inputEncoder: Encoder[IN]): UserDefinedFunctionObtains a UserDefinedFunction that wraps the given Aggregator so that it may be used with untyped Data Frames. Aggregator<IN, BUF, OUT> agg = // custom Aggregator Encoder<IN> enc = // input encoder // declare a UDF based on agg UserDefinedFunction aggUDF = udaf(agg, enc) DataFrame aggData = df.agg(aggUDF($"colname")) // register agg as a named function spark.udf.register("myAggName", udaf(agg, enc)) - IN
- the aggregator input type 
- BUF
- the aggregating buffer type 
- OUT
- the finalized output type 
- agg
- the typed Aggregator 
- inputEncoder
- a specific input encoder to use 
- returns
- a UserDefinedFunction that can be used as an aggregating expression 
 - Note
- This overloading takes an explicit input encoder, to support UDAF declarations in Java. 
 
-    def udaf[IN, BUF, OUT](agg: expressions.Aggregator[IN, BUF, OUT])(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[IN]): UserDefinedFunctionObtains a UserDefinedFunction that wraps the given Aggregator so that it may be used with untyped Data Frames. val agg = // Aggregator[IN, BUF, OUT] // declare a UDF based on agg val aggUDF = udaf(agg) val aggData = df.agg(aggUDF($"colname")) // register agg as a named function spark.udf.register("myAggName", udaf(agg)) - IN
- the aggregator input type 
- BUF
- the aggregating buffer type 
- OUT
- the finalized output type 
- agg
- the typed Aggregator 
- returns
- a UserDefinedFunction that can be used as an aggregating expression. 
 - Note
- The input encoder is inferred from the input type IN. 
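To make the skeletal snippets above concrete, here is a hypothetical sum-of-squares Aggregator wired through udaf; the object and column names are illustrative, not from the original docs:

  import org.apache.spark.sql.{Encoder, Encoders}
  import org.apache.spark.sql.expressions.Aggregator

  object SumOfSquares extends Aggregator[Double, Double, Double] {
    def zero: Double = 0.0                                    // the empty buffer
    def reduce(buf: Double, x: Double): Double = buf + x * x  // fold one input value in
    def merge(b1: Double, b2: Double): Double = b1 + b2       // combine partial buffers
    def finish(buf: Double): Double = buf                     // produce the final result
    def bufferEncoder: Encoder[Double] = Encoders.scalaDouble
    def outputEncoder: Encoder[Double] = Encoders.scalaDouble
  }

  val sumSq = udaf(SumOfSquares)         // input encoder inferred from IN = Double
  df.agg(sumSq($"price"))                // use as an untyped aggregate expression
  spark.udf.register("sum_sq", sumSq)    // or register it under a SQL name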
 
-    def udf(f: UDF10[_, _, _, _, _, _, _, _, _, _, _], returnType: DataType): UserDefinedFunctionDefines a Java UDF10 instance as user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().- Since
- 2.3.0

-    def udf(f: UDF9[_, _, _, _, _, _, _, _, _, _], returnType: DataType): UserDefinedFunctionDefines a Java UDF9 instance as user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().- Since
- 2.3.0

-    def udf(f: UDF8[_, _, _, _, _, _, _, _, _], returnType: DataType): UserDefinedFunctionDefines a Java UDF8 instance as user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().- Since
- 2.3.0

-    def udf(f: UDF7[_, _, _, _, _, _, _, _], returnType: DataType): UserDefinedFunctionDefines a Java UDF7 instance as user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().- Since
- 2.3.0

-    def udf(f: UDF6[_, _, _, _, _, _, _], returnType: DataType): UserDefinedFunctionDefines a Java UDF6 instance as user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().- Since
- 2.3.0

-    def udf(f: UDF5[_, _, _, _, _, _], returnType: DataType): UserDefinedFunctionDefines a Java UDF5 instance as user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().- Since
- 2.3.0

-    def udf(f: UDF4[_, _, _, _, _], returnType: DataType): UserDefinedFunctionDefines a Java UDF4 instance as user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().- Since
- 2.3.0

-    def udf(f: UDF3[_, _, _, _], returnType: DataType): UserDefinedFunctionDefines a Java UDF3 instance as user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().- Since
- 2.3.0

-    def udf(f: UDF2[_, _, _], returnType: DataType): UserDefinedFunctionDefines a Java UDF2 instance as user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().- Since
- 2.3.0

-    def udf(f: UDF1[_, _], returnType: DataType): UserDefinedFunctionDefines a Java UDF1 instance as user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().- Since
- 2.3.0

-    def udf(f: UDF0[_], returnType: DataType): UserDefinedFunctionDefines a Java UDF0 instance as user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().- Since
- 2.3.0 
 
-    def udf[RT, A1, A2, A3, A4, A5, A6, A7, A8, A9, A10](f: (A1, A2, A3, A4, A5, A6, A7, A8, A9, A10) => RT)(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[RT], arg1: scala.reflect.api.JavaUniverse.TypeTag[A1], arg2: scala.reflect.api.JavaUniverse.TypeTag[A2], arg3: scala.reflect.api.JavaUniverse.TypeTag[A3], arg4: scala.reflect.api.JavaUniverse.TypeTag[A4], arg5: scala.reflect.api.JavaUniverse.TypeTag[A5], arg6: scala.reflect.api.JavaUniverse.TypeTag[A6], arg7: scala.reflect.api.JavaUniverse.TypeTag[A7], arg8: scala.reflect.api.JavaUniverse.TypeTag[A8], arg9: scala.reflect.api.JavaUniverse.TypeTag[A9], arg10: scala.reflect.api.JavaUniverse.TypeTag[A10]): UserDefinedFunctionDefines a Scala closure of 10 arguments as user-defined function (UDF). The data types are automatically inferred based on the Scala closure's signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().- Since
- 1.3.0

-    def udf[RT, A1, A2, A3, A4, A5, A6, A7, A8, A9](f: (A1, A2, A3, A4, A5, A6, A7, A8, A9) => RT)(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[RT], arg1: scala.reflect.api.JavaUniverse.TypeTag[A1], arg2: scala.reflect.api.JavaUniverse.TypeTag[A2], arg3: scala.reflect.api.JavaUniverse.TypeTag[A3], arg4: scala.reflect.api.JavaUniverse.TypeTag[A4], arg5: scala.reflect.api.JavaUniverse.TypeTag[A5], arg6: scala.reflect.api.JavaUniverse.TypeTag[A6], arg7: scala.reflect.api.JavaUniverse.TypeTag[A7], arg8: scala.reflect.api.JavaUniverse.TypeTag[A8], arg9: scala.reflect.api.JavaUniverse.TypeTag[A9]): UserDefinedFunctionDefines a Scala closure of 9 arguments as user-defined function (UDF). The data types are automatically inferred based on the Scala closure's signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().- Since
- 1.3.0

-    def udf[RT, A1, A2, A3, A4, A5, A6, A7, A8](f: (A1, A2, A3, A4, A5, A6, A7, A8) => RT)(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[RT], arg1: scala.reflect.api.JavaUniverse.TypeTag[A1], arg2: scala.reflect.api.JavaUniverse.TypeTag[A2], arg3: scala.reflect.api.JavaUniverse.TypeTag[A3], arg4: scala.reflect.api.JavaUniverse.TypeTag[A4], arg5: scala.reflect.api.JavaUniverse.TypeTag[A5], arg6: scala.reflect.api.JavaUniverse.TypeTag[A6], arg7: scala.reflect.api.JavaUniverse.TypeTag[A7], arg8: scala.reflect.api.JavaUniverse.TypeTag[A8]): UserDefinedFunctionDefines a Scala closure of 8 arguments as user-defined function (UDF). The data types are automatically inferred based on the Scala closure's signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().- Since
- 1.3.0

-    def udf[RT, A1, A2, A3, A4, A5, A6, A7](f: (A1, A2, A3, A4, A5, A6, A7) => RT)(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[RT], arg1: scala.reflect.api.JavaUniverse.TypeTag[A1], arg2: scala.reflect.api.JavaUniverse.TypeTag[A2], arg3: scala.reflect.api.JavaUniverse.TypeTag[A3], arg4: scala.reflect.api.JavaUniverse.TypeTag[A4], arg5: scala.reflect.api.JavaUniverse.TypeTag[A5], arg6: scala.reflect.api.JavaUniverse.TypeTag[A6], arg7: scala.reflect.api.JavaUniverse.TypeTag[A7]): UserDefinedFunctionDefines a Scala closure of 7 arguments as user-defined function (UDF). The data types are automatically inferred based on the Scala closure's signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().- Since
- 1.3.0

-    def udf[RT, A1, A2, A3, A4, A5, A6](f: (A1, A2, A3, A4, A5, A6) => RT)(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[RT], arg1: scala.reflect.api.JavaUniverse.TypeTag[A1], arg2: scala.reflect.api.JavaUniverse.TypeTag[A2], arg3: scala.reflect.api.JavaUniverse.TypeTag[A3], arg4: scala.reflect.api.JavaUniverse.TypeTag[A4], arg5: scala.reflect.api.JavaUniverse.TypeTag[A5], arg6: scala.reflect.api.JavaUniverse.TypeTag[A6]): UserDefinedFunctionDefines a Scala closure of 6 arguments as user-defined function (UDF). The data types are automatically inferred based on the Scala closure's signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().- Since
- 1.3.0

-    def udf[RT, A1, A2, A3, A4, A5](f: (A1, A2, A3, A4, A5) => RT)(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[RT], arg1: scala.reflect.api.JavaUniverse.TypeTag[A1], arg2: scala.reflect.api.JavaUniverse.TypeTag[A2], arg3: scala.reflect.api.JavaUniverse.TypeTag[A3], arg4: scala.reflect.api.JavaUniverse.TypeTag[A4], arg5: scala.reflect.api.JavaUniverse.TypeTag[A5]): UserDefinedFunctionDefines a Scala closure of 5 arguments as user-defined function (UDF). The data types are automatically inferred based on the Scala closure's signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().- Since
- 1.3.0

-    def udf[RT, A1, A2, A3, A4](f: (A1, A2, A3, A4) => RT)(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[RT], arg1: scala.reflect.api.JavaUniverse.TypeTag[A1], arg2: scala.reflect.api.JavaUniverse.TypeTag[A2], arg3: scala.reflect.api.JavaUniverse.TypeTag[A3], arg4: scala.reflect.api.JavaUniverse.TypeTag[A4]): UserDefinedFunctionDefines a Scala closure of 4 arguments as user-defined function (UDF). The data types are automatically inferred based on the Scala closure's signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().- Since
- 1.3.0

-    def udf[RT, A1, A2, A3](f: (A1, A2, A3) => RT)(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[RT], arg1: scala.reflect.api.JavaUniverse.TypeTag[A1], arg2: scala.reflect.api.JavaUniverse.TypeTag[A2], arg3: scala.reflect.api.JavaUniverse.TypeTag[A3]): UserDefinedFunctionDefines a Scala closure of 3 arguments as user-defined function (UDF). The data types are automatically inferred based on the Scala closure's signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().- Since
- 1.3.0

-    def udf[RT, A1, A2](f: (A1, A2) => RT)(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[RT], arg1: scala.reflect.api.JavaUniverse.TypeTag[A1], arg2: scala.reflect.api.JavaUniverse.TypeTag[A2]): UserDefinedFunctionDefines a Scala closure of 2 arguments as user-defined function (UDF). The data types are automatically inferred based on the Scala closure's signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().- Since
- 1.3.0

-    def udf[RT, A1](f: (A1) => RT)(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[RT], arg1: scala.reflect.api.JavaUniverse.TypeTag[A1]): UserDefinedFunctionDefines a Scala closure of 1 argument as user-defined function (UDF). The data types are automatically inferred based on the Scala closure's signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().- Since
- 1.3.0

-    def udf[RT](f: () => RT)(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[RT]): UserDefinedFunctionDefines a Scala closure of 0 arguments as user-defined function (UDF). The data types are automatically inferred based on the Scala closure's signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().- Since
- 1.3.0 
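A minimal sketch of the Scala-closure variants (same spark/implicits assumptions as earlier sketches; df is an assumed DataFrame):

  val squared = udf((x: Int) => x * x)          // types inferred from the closure's signature
  df.select(squared($"id").as("id_squared"))

  // Mark a UDF nondeterministic when its output can vary between invocations:
  val roll = udf(() => scala.util.Random.nextInt(6)).asNondeterministic()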
 
-    def unbase64(e: Column): ColumnDecodes a BASE64 encoded string column and returns it as a binary column. This is the reverse of base64. - Since
- 1.5.0 
 
-    def unhex(column: Column): ColumnInverse of hex. Interprets each pair of characters as a hexadecimal number and converts to the byte representation of the number. - Since
- 1.5.0 
 
-    def uniform(min: Column, max: Column, seed: Column): ColumnReturns a random value drawn from independent and identically distributed (i.i.d.) values within the specified range of numbers, using the given random seed. The provided numbers specifying the minimum and maximum values of the range must be constant. If both of these numbers are integers, then the result will also be an integer. Otherwise if one or both of these are floating-point numbers, then the result will also be a floating-point number. - Since
- 4.0.0

-    def uniform(min: Column, max: Column): ColumnReturns a random value drawn from independent and identically distributed (i.i.d.) values within the specified range of numbers. The provided numbers specifying the minimum and maximum values of the range must be constant. If both of these numbers are integers, then the result will also be an integer. Otherwise if one or both of these are floating-point numbers, then the result will also be a floating-point number. - Since
- 4.0.0 
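A sketch of the integer and floating-point result rules (the seed value is arbitrary):

  df.select(
    uniform(lit(0), lit(10)).as("i"),               // integer result: both bounds are integers
    uniform(lit(0.0), lit(1.0), lit(42L)).as("d")   // double result; a fixed seed makes runs repeatable
  )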
 
-    def unix_date(e: Column): ColumnReturns the number of days since 1970-01-01. - Since
- 3.5.0

-    def unix_micros(e: Column): ColumnReturns the number of microseconds since 1970-01-01 00:00:00 UTC. - Since
- 3.5.0

-    def unix_millis(e: Column): ColumnReturns the number of milliseconds since 1970-01-01 00:00:00 UTC. Truncates higher levels of precision. - Since
- 3.5.0

-    def unix_seconds(e: Column): ColumnReturns the number of seconds since 1970-01-01 00:00:00 UTC. Truncates higher levels of precision. - Since
- 3.5.0 
 
-    def unix_timestamp(s: Column, p: String): ColumnConverts time string with given pattern to Unix timestamp (in seconds). See Datetime Patterns for valid date and time format patterns. - s
- A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS
- p
- A date time pattern detailing the format of s when s is a string
- returns
- A long, or null if s was a string that could not be cast to a date or p was an invalid format
 - Since
- 1.5.0 
 
-    def unix_timestamp(s: Column): ColumnConverts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in seconds), using the default timezone and the default locale. - s
- A date, timestamp or string. If a string, the data must be in the yyyy-MM-dd HH:mm:ss format
- returns
- A long, or null if the input was a string not of the correct format 
 - Since
- 1.5.0 
 
-    def unix_timestamp(): ColumnReturns the current Unix timestamp (in seconds) as a long. - Since
- 1.5.0 
- Note
- All calls of unix_timestamp within the same query return the same value (i.e. the current timestamp is calculated at the start of query evaluation).
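A sketch contrasting the default-format and explicit-pattern overloads:

  val df = Seq(("2024-07-01 12:00:00", "07/01/2024")).toDF("iso", "us")
  df.select(
    unix_timestamp($"iso"),                  // parsed with the default yyyy-MM-dd HH:mm:ss format
    unix_timestamp($"us", "MM/dd/yyyy"),     // parsed with an explicit pattern
    unix_timestamp($"us")                    // NULL: does not match the default format
  )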
 
-    def unwrap_udt(column: Column): ColumnUnwraps a UDT data type column into its underlying type. - Since
- 3.4.0 
 
-    def upper(e: Column): ColumnConverts a string column to upper case. - Since
- 1.3.0 
 
-    def url_decode(str: Column): ColumnDecodes a str in 'application/x-www-form-urlencoded' format using a specific encoding scheme.- Since
- 3.5.0

-    def url_encode(str: Column): ColumnTranslates a string into 'application/x-www-form-urlencoded' format using a specific encoding scheme. - Since
- 3.5.0

-    def user(): ColumnReturns the user name of the current execution context. - Since
- 3.5.0 
 
-    def uuid(seed: Column): ColumnReturns a universally unique identifier (UUID) string. The value is returned as a canonical UUID 36-character string. - Since
- 4.1.0

-    def uuid(): ColumnReturns a universally unique identifier (UUID) string. The value is returned as a canonical UUID 36-character string. - Since
- 3.5.0 
 
-    def validate_utf8(str: Column): ColumnReturns the input value if it corresponds to a valid UTF-8 string, or throws a SparkIllegalArgumentException otherwise. - Since
- 4.0.0 
 
-    def var_pop(columnName: String): ColumnAggregate function: returns the population variance of the values in a group. - Since
- 1.6.0

-    def var_pop(e: Column): ColumnAggregate function: returns the population variance of the values in a group. - Since
- 1.6.0

-    def var_samp(columnName: String): ColumnAggregate function: returns the unbiased variance of the values in a group. - Since
- 1.6.0

-    def var_samp(e: Column): ColumnAggregate function: returns the unbiased variance of the values in a group. - Since
- 1.6.0

-    def variance(columnName: String): ColumnAggregate function: alias for var_samp.- Since
- 1.6.0

-    def variance(e: Column): ColumnAggregate function: alias for var_samp.- Since
- 1.6.0 
 
-    def variant_get(v: Column, path: Column, targetType: String): ColumnExtracts a sub-variant from v according to the path column, and then casts the sub-variant to targetType. Returns null if the path does not exist. Throws an exception if the cast fails.- v
- a variant column. 
- path
- the column containing the extraction path strings. A valid path string should start with $ and be followed by zero or more segments like [123], .name, ['name'], or ["name"].
- targetType
- the target data type to cast into, in a DDL-formatted string. 
 - Since
- 4.0.0 
 
-    def variant_get(v: Column, path: String, targetType: String): ColumnExtracts a sub-variant from v according to the path string, and then casts the sub-variant to targetType. Returns null if the path does not exist. Throws an exception if the cast fails.- v
- a variant column. 
- path
- the extraction path. A valid path should start with $ and be followed by zero or more segments like [123], .name, ['name'], or ["name"].
- targetType
- the target data type to cast into, in a DDL-formatted string. 
 - Since
- 4.0.0 
 
-    def version(): ColumnReturns the Spark version. The string contains 2 fields, the first being a release version and the second being a git revision. - Since
- 3.5.0 
 
-   final  def wait(arg0: Long, arg1: Int): Unit- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException])
 
-   final  def wait(arg0: Long): Unit- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException]) @native()
 
-   final  def wait(): Unit- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException])
 
-    def weekday(e: Column): ColumnReturns the day of the week for date/timestamp (0 = Monday, 1 = Tuesday, ..., 6 = Sunday). - Since
- 3.5.0 
 
-    def weekofyear(e: Column): ColumnExtracts the week number as an integer from a given date/timestamp/string. A week is considered to start on a Monday and week 1 is the first week with more than 3 days, as defined by ISO 8601. - returns
- An integer, or null if the input was a string that could not be cast to a date 
 - Since
- 1.5.0 
 
-    def when(condition: Column, value: Any): ColumnEvaluates a list of conditions and returns one of multiple possible result expressions. If otherwise is not defined at the end, null is returned for unmatched conditions. // Example: encoding gender string column into integer. // Scala: people.select(when(people("gender") === "male", 0) .when(people("gender") === "female", 1) .otherwise(2)) // Java: people.select(when(col("gender").equalTo("male"), 0) .when(col("gender").equalTo("female"), 1) .otherwise(2)) - Since
- 1.4.0 
 
-    def width_bucket(v: Column, min: Column, max: Column, numBucket: Column): ColumnReturns the bucket number into which the value of this expression would fall after being evaluated. Note that input arguments must follow conditions listed below; otherwise, the method will return null. - v
- value to compute a bucket number in the histogram 
- min
- minimum value of the histogram 
- max
- maximum value of the histogram 
- numBucket
- the number of buckets 
- returns
- the bucket number into which the value would fall after being evaluated 
 - Since
- 3.5.0 
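A worked sketch: five equal-width buckets over [0.0, 10.0), so in-range values map to 1..5, values below min map to 0, and values above max map to numBucket + 1:

  val df = Seq(-1.0, 3.14, 10.5).toDF("v")
  df.select(width_bucket($"v", lit(0.0), lit(10.0), lit(5))).show()   // 0, 2, 6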
 
-    def window(timeColumn: Column, windowDuration: String): ColumnGenerates tumbling time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05). Windows can support microsecond precision. Windows in the order of months are not supported. The windows start beginning at 1970-01-01 00:00:00 UTC. The following example takes the average stock price for a one minute tumbling window: val df = ... // schema => timestamp: TimestampType, stockId: StringType, price: DoubleType df.groupBy(window($"timestamp", "1 minute"), $"stockId") .agg(mean("price")) The windows will look like: 09:00:00-09:01:00 09:01:00-09:02:00 09:02:00-09:03:00 ... For a streaming query, you may use the function current_timestamp to generate windows on processing time.- timeColumn
- The column or the expression to use as the timestamp for windowing by time. The time column must be of TimestampType or TimestampNTZType. 
- windowDuration
- A string specifying the width of the window, e.g. 10 minutes, 1 second. Check org.apache.spark.unsafe.types.CalendarInterval for valid duration identifiers.
 - Since
- 2.0.0 
 
-    def window(timeColumn: Column, windowDuration: String, slideDuration: String): ColumnBucketize rows into one or more time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05). Windows can support microsecond precision. Windows in the order of months are not supported. The windows start beginning at 1970-01-01 00:00:00 UTC. The following example takes the average stock price for a one minute window every 10 seconds: val df = ... // schema => timestamp: TimestampType, stockId: StringType, price: DoubleType df.groupBy(window($"timestamp", "1 minute", "10 seconds"), $"stockId") .agg(mean("price")) The windows will look like: 09:00:00-09:01:00 09:00:10-09:01:10 09:00:20-09:01:20 ... For a streaming query, you may use the function current_timestamp to generate windows on processing time.- timeColumn
- The column or the expression to use as the timestamp for windowing by time. The time column must be of TimestampType or TimestampNTZType. 
- windowDuration
- A string specifying the width of the window, e.g. 10 minutes, 1 second. Check org.apache.spark.unsafe.types.CalendarInterval for valid duration identifiers. Note that the duration is a fixed length of time, and does not vary over time according to a calendar. For example, 1 day always means 86,400,000 milliseconds, not a calendar day.
- slideDuration
- A string specifying the sliding interval of the window, e.g. 1 minute. A new window will be generated every slideDuration. Must be less than or equal to the windowDuration. Check org.apache.spark.unsafe.types.CalendarInterval for valid duration identifiers. This duration is likewise absolute, and does not vary according to a calendar.
 - Since
- 2.0.0 
 
-    def window(timeColumn: Column, windowDuration: String, slideDuration: String, startTime: String): ColumnBucketize rows into one or more time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05). Windows can support microsecond precision. Windows in the order of months are not supported. The following example takes the average stock price for a one minute window every 10 seconds starting 5 seconds after the hour: val df = ... // schema => timestamp: TimestampType, stockId: StringType, price: DoubleType df.groupBy(window($"timestamp", "1 minute", "10 seconds", "5 seconds"), $"stockId") .agg(mean("price")) The windows will look like: 09:00:05-09:01:05 09:00:15-09:01:15 09:00:25-09:01:25 ... For a streaming query, you may use the function current_timestamp to generate windows on processing time.- timeColumn
- The column or the expression to use as the timestamp for windowing by time. The time column must be of TimestampType or TimestampNTZType. 
- windowDuration
- A string specifying the width of the window, e.g. 10 minutes, 1 second. Check org.apache.spark.unsafe.types.CalendarInterval for valid duration identifiers. Note that the duration is a fixed length of time, and does not vary over time according to a calendar. For example, 1 day always means 86,400,000 milliseconds, not a calendar day.
- slideDuration
- A string specifying the sliding interval of the window, e.g. 1 minute. A new window will be generated every slideDuration. Must be less than or equal to the windowDuration. Check org.apache.spark.unsafe.types.CalendarInterval for valid duration identifiers. This duration is likewise absolute, and does not vary according to a calendar.
- startTime
- The offset with respect to 1970-01-01 00:00:00 UTC with which to start window intervals. For example, in order to have hourly tumbling windows that start 15 minutes past the hour, e.g. 12:15-13:15, 13:15-14:15... provide startTime as 15 minutes.
 - Since
- 2.0.0 
 
-    def window_time(windowColumn: Column): ColumnExtracts the event time from the window column. The window column is of StructType { start: Timestamp, end: Timestamp } where start is inclusive and end is exclusive. Since event time can support microsecond precision, window_time(window) = window.end - 1 microsecond. - windowColumn
- The window column (typically produced by window aggregation) of type StructType { start: Timestamp, end: Timestamp } 
 - Since
- 3.4.0 
 
-    def xpath(xml: Column, path: Column): ColumnReturns a string array of values within the nodes of xml that match the XPath expression. - Since
- 3.5.0

-    def xpath_boolean(xml: Column, path: Column): ColumnReturns true if the XPath expression evaluates to true, or if a matching node is found. - Since
- 3.5.0

-    def xpath_double(xml: Column, path: Column): ColumnReturns a double value, the value zero if no match is found, or NaN if a match is found but the value is non-numeric. - Since
- 3.5.0

-    def xpath_float(xml: Column, path: Column): ColumnReturns a float value, the value zero if no match is found, or NaN if a match is found but the value is non-numeric. - Since
- 3.5.0

-    def xpath_int(xml: Column, path: Column): ColumnReturns an integer value, or the value zero if no match is found, or if a match is found but the value is non-numeric. - Since
- 3.5.0

-    def xpath_long(xml: Column, path: Column): ColumnReturns a long integer value, or the value zero if no match is found, or if a match is found but the value is non-numeric. - Since
- 3.5.0

-    def xpath_number(xml: Column, path: Column): ColumnReturns a double value, the value zero if no match is found, or NaN if a match is found but the value is non-numeric. - Since
- 3.5.0

-    def xpath_short(xml: Column, path: Column): ColumnReturns a short integer value, or the value zero if no match is found, or if a match is found but the value is non-numeric. - Since
- 3.5.0

-    def xpath_string(xml: Column, path: Column): ColumnReturns the text contents of the first xml node that matches the XPath expression. - Since
- 3.5.0 
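A sketch covering the list, numeric, and string variants (the XML literal is illustrative):

  val df = Seq("<a><b>1</b><b>2</b></a>").toDF("xml")
  df.select(
    xpath($"xml", lit("a/b/text()")),      // ["1", "2"]
    xpath_int($"xml", lit("sum(a/b)")),    // 3
    xpath_string($"xml", lit("a/b[2]"))    // "2"
  )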
 
-    def xxhash64(cols: Column*): ColumnCalculates the hash code of given columns using the 64-bit variant of the xxHash algorithm, and returns the result as a long column. The hash computation uses an initial seed of 42. - Annotations
- @varargs()
- Since
- 3.0.0 
 
-    def year(e: Column): ColumnExtracts the year as an integer from a given date/timestamp/string. - returns
- An integer, or null if the input was a string that could not be cast to a date 
 - Since
- 1.5.0 
 
-    def years(e: Column): Column(Java-specific) A transform for timestamps and dates to partition data into years. - Since
- 3.0.0 
 
-    def zeroifnull(col: Column): ColumnReturns zero if col is null, or col otherwise.- Since
- 4.0.0 
 
-    def zip_with(left: Column, right: Column, f: (Column, Column) => Column): ColumnMerge two given arrays, element-wise, into a single array using a function. If one array is shorter, nulls are appended at the end to match the length of the longer array, before applying the function. df.select(zip_with(df1("val1"), df1("val2"), (x, y) => x + y)) - left
- the left input array column 
- right
- the right input array column 
- f
- (lCol, rCol) => col, the lambda function to merge two input columns into one column 
 - Since
- 3.0.0 
 
-  object partitioning
Deprecated Value Members
-    def approxCountDistinct(columnName: String, rsd: Double): Column- Annotations
- @deprecated
- Deprecated
- (Since version 2.1.0) Use approx_count_distinct 
- Since
- 1.3.0 
 
-    def approxCountDistinct(e: Column, rsd: Double): Column- Annotations
- @deprecated
- Deprecated
- (Since version 2.1.0) Use approx_count_distinct 
- Since
- 1.3.0 
 
-    def approxCountDistinct(columnName: String): Column- Annotations
- @deprecated
- Deprecated
- (Since version 2.1.0) Use approx_count_distinct 
- Since
- 1.3.0 
 
-    def approxCountDistinct(e: Column): Column- Annotations
- @deprecated
- Deprecated
- (Since version 2.1.0) Use approx_count_distinct 
- Since
- 1.3.0 
 
-    def bitwiseNOT(e: Column): ColumnComputes bitwise NOT (~) of a number. - Annotations
- @deprecated
- Deprecated
- (Since version 3.2.0) Use bitwise_not 
- Since
- 1.4.0 
 
-    def callUDF(udfName: String, cols: Column*): ColumnCalls a user-defined function. - Annotations
- @varargs() @deprecated
- Deprecated
- Use call_udf 
- Since
- 1.5.0 
 
-    def finalize(): Unit- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.Throwable]) @Deprecated
- Deprecated
- (Since version 9) 
 
-    def monotonicallyIncreasingId(): ColumnA column expression that generates monotonically increasing 64-bit integers. The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has fewer than 1 billion partitions, and each partition has fewer than 8 billion records. As an example, consider a DataFrame with two partitions, each with 3 records. This expression would return the following IDs: 0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594. - Annotations
- @deprecated
- Deprecated
- (Since version 2.0.0) Use monotonically_increasing_id() 
- Since
- 1.4.0 
 
-    def shiftLeft(e: Column, numBits: Int): ColumnShift the given value numBits left. If the given value is a long value, this function will return a long value else it will return an integer value. - Annotations
- @deprecated
- Deprecated
- (Since version 3.2.0) Use shiftleft 
- Since
- 1.5.0 
 
-    def shiftRight(e: Column, numBits: Int): Column(Signed) shift the given value numBits right. If the given value is a long value, it will return a long value else it will return an integer value. - Annotations
- @deprecated
- Deprecated
- (Since version 3.2.0) Use shiftright 
- Since
- 1.5.0 
 
-    def shiftRightUnsigned(e: Column, numBits: Int): ColumnUnsigned shift the given value numBits right. If the given value is a long value, it will return a long value else it will return an integer value. - Annotations
- @deprecated
- Deprecated
- (Since version 3.2.0) Use shiftrightunsigned 
- Since
- 1.5.0 
 
-    def sumDistinct(columnName: String): ColumnAggregate function: returns the sum of distinct values in the expression. - Annotations
- @deprecated
- Deprecated
- (Since version 3.2.0) Use sum_distinct 
- Since
- 1.3.0 
 
-    def sumDistinct(e: Column): ColumnAggregate function: returns the sum of distinct values in the expression. - Annotations
- @deprecated
- Deprecated
- (Since version 3.2.0) Use sum_distinct 
- Since
- 1.3.0 
 
-    def toDegrees(columnName: String): Column- Annotations
- @deprecated
- Deprecated
- (Since version 2.1.0) Use degrees 
- Since
- 1.4.0 
 
-    def toDegrees(e: Column): Column- Annotations
- @deprecated
- Deprecated
- (Since version 2.1.0) Use degrees 
- Since
- 1.4.0 
 
-    def toRadians(columnName: String): Column- Annotations
- @deprecated
- Deprecated
- (Since version 2.1.0) Use radians 
- Since
- 1.4.0 
 
-    def toRadians(e: Column): Column- Annotations
- @deprecated
- Deprecated
- (Since version 2.1.0) Use radians 
- Since
- 1.4.0 
 
-    def udf(f: AnyRef, dataType: DataType): UserDefinedFunctionDefines a deterministic user-defined function (UDF) using a Scala closure. For this variant, the caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic(). Note that, although the Scala closure can have primitive-type function arguments, it doesn't work well with null values. Because the Scala closure is passed in as Any type, there is no type information for the function arguments. Without the type information, Spark may blindly pass null to the Scala closure with a primitive-type argument, and the closure will see the default value of the Java type for the null argument, e.g. udf((x: Int) => x, IntegerType); the result is 0 for null input.- f
- A closure in Scala 
- dataType
- The output data type of the UDF 
 - Annotations
- @deprecated
- Deprecated
- (Since version 3.0.0) The Scala udf method with a return type parameter is deprecated. Please use the Scala udf method without a return type parameter.
- Since
- 2.0.0