Spark 3.5.2 ScalaDoc - org.apache.spark.sql.functions

def abs(e: Column): Column

Computes the absolute value of a numeric value.

Since: 1.3.0

def acos(columnName: String): Column

returns: inverse cosine of columnName, as if computed by java.lang.Math.acos

Since: 1.4.0

def acos(e: Column): Column

returns: inverse cosine of e in radians, as if computed by java.lang.Math.acos

Since: 1.4.0

def acosh(columnName: String): Column

returns: inverse hyperbolic cosine of columnName

Since: 3.1.0

def acosh(e: Column): Column

returns: inverse hyperbolic cosine of e

Since: 3.1.0

def add_months(startDate: Column, numMonths: Column): Column

Returns the date that is numMonths after startDate.

startDate: A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS
numMonths: A column of the number of months to add to startDate, can be negative to subtract months
returns: A date, or null if startDate was a string that could not be cast to a date

Since: 3.0.0

def add_months(startDate: Column, numMonths: Int): Column

Returns the date that is numMonths after startDate.

startDate: A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS
numMonths: The number of months to add to startDate, can be negative to subtract months
returns: A date, or null if startDate was a string that could not be cast to a date

Since: 1.5.0
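
A minimal sketch of both add_months overloads, assuming Spark 3.5 on the classpath, a local SparkSession, and default (non-ANSI) settings. The sample data and column names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{add_months, col, lit}

// Local session for illustration only.
val spark = SparkSession.builder().master("local[1]").appName("add-months-sketch").getOrCreate()
import spark.implicits._

// Hypothetical input: a valid date string and one that cannot be cast to a date.
val dates = Seq("2024-01-31", "not-a-date").toDF("start")

val shifted = dates.select(
  col("start"),
  add_months(col("start"), 1).as("plus_one"),       // Int overload
  add_months(col("start"), lit(-1)).as("minus_one") // Column overload; negative subtracts
)
shifted.show()
```

Under these assumptions the unparseable string yields null in both result columns, and a day of month that does not exist in the target month falls back to that month's last day.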

def aes_decrypt(input: Column, key: Column): Column

Returns a decrypted value of input.

Since: 3.5.0
See also: org.apache.spark.sql.functions.aes_decrypt(Column, Column, Column, Column, Column)

def aes_decrypt(input: Column, key: Column, mode: Column): Column

Returns a decrypted value of input.

Since: 3.5.0
See also: org.apache.spark.sql.functions.aes_decrypt(Column, Column, Column, Column, Column)

def aes_decrypt(input: Column, key: Column, mode: Column, padding: Column): Column

Returns a decrypted value of input.

Since: 3.5.0
See also: org.apache.spark.sql.functions.aes_decrypt(Column, Column, Column, Column, Column)

def aes_decrypt(input: Column, key: Column, mode: Column, padding: Column, aad: Column): Column

Returns a decrypted value of input using AES in mode with padding. Key lengths of 16, 24 and 32 bytes are supported. Supported combinations of (mode, padding) are ('ECB', 'PKCS'), ('GCM', 'NONE') and ('CBC', 'PKCS'). Optional additional authenticated data (AAD) is only supported for GCM. If provided for encryption, the identical AAD value must be provided for decryption. The default mode is GCM.

input: The binary value to decrypt.
key: The passphrase to use to decrypt the data.
mode: Specifies which block cipher mode should be used to decrypt messages. Valid modes: ECB, GCM, CBC.
padding: Specifies how to pad messages whose length is not a multiple of the block size. Valid values: PKCS, NONE, DEFAULT. The DEFAULT padding means PKCS for ECB, NONE for GCM and PKCS for CBC.
aad: Optional additional authenticated data. Only supported for GCM mode. This can be any free-form input and must be provided for both encryption and decryption.

Since: 3.5.0

def aes_encrypt(input: Column, key: Column): Column

Returns an encrypted value of input.

Since: 3.5.0
See also: org.apache.spark.sql.functions.aes_encrypt(Column, Column, Column, Column, Column, Column)

def aes_encrypt(input: Column, key: Column, mode: Column): Column

Returns an encrypted value of input.

Since: 3.5.0
See also: org.apache.spark.sql.functions.aes_encrypt(Column, Column, Column, Column, Column, Column)

def aes_encrypt(input: Column, key: Column, mode: Column, padding: Column): Column

Returns an encrypted value of input.

Since: 3.5.0
See also: org.apache.spark.sql.functions.aes_encrypt(Column, Column, Column, Column, Column, Column)

def aes_encrypt(input: Column, key: Column, mode: Column, padding: Column, iv: Column): Column

Returns an encrypted value of input.

Since: 3.5.0
See also: org.apache.spark.sql.functions.aes_encrypt(Column, Column, Column, Column, Column, Column)

def aes_encrypt(input: Column, key: Column, mode: Column, padding: Column, iv: Column, aad: Column): Column

Returns an encrypted value of input using AES in given mode with the specified padding. Key lengths of 16, 24 and 32 bytes are supported. Supported combinations of (mode, padding) are ('ECB', 'PKCS'), ('GCM', 'NONE') and ('CBC', 'PKCS'). Optional initialization vectors (IVs) are only supported for CBC and GCM modes. These must be 16 bytes for CBC and 12 bytes for GCM. If not provided, a random vector will be generated and prepended to the output. Optional additional authenticated data (AAD) is only supported for GCM. If provided for encryption, the identical AAD value must be provided for decryption. The default mode is GCM.

input: The binary value to encrypt.
key: The passphrase to use to encrypt the data.
mode: Specifies which block cipher mode should be used to encrypt messages. Valid modes: ECB, GCM, CBC.
padding: Specifies how to pad messages whose length is not a multiple of the block size. Valid values: PKCS, NONE, DEFAULT. The DEFAULT padding means PKCS for ECB, NONE for GCM and PKCS for CBC.
iv: Optional initialization vector. Only supported for CBC and GCM modes. Valid values: None or "". 16-byte array for CBC mode. 12-byte array for GCM mode.
aad: Optional additional authenticated data. Only supported for GCM mode. This can be any free-form input and must be provided for both encryption and decryption.

Since: 3.5.0
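
A hedged round-trip sketch using the two-argument overloads of aes_encrypt and aes_decrypt, assuming Spark 3.5 on the classpath and a local SparkSession. The key and the sample messages are hypothetical, and a hard-coded key is for illustration only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{aes_decrypt, aes_encrypt, col, lit}

// Local session for illustration only.
val spark = SparkSession.builder().master("local[1]").appName("aes-sketch").getOrCreate()
import spark.implicits._

// Hypothetical 16-byte key.
val key = lit("0000111122223333")

val messages = Seq("Spark", "SQL").toDF("msg")

val roundTrip = messages
  .withColumn("encrypted", aes_encrypt(col("msg"), key))                      // default mode and padding
  .withColumn("decrypted", aes_decrypt(col("encrypted"), key).cast("string")) // binary back to string

roundTrip.select("msg", "decrypted").show()
```

Because a random initialization vector is generated when none is supplied, the encrypted bytes differ between runs while the decrypted value stays the same.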

def aggregate(expr: Column, initialValue: Column, merge: (Column, Column) ⇒ Column): Column

Applies a binary operator to an initial state and all elements in the array, and reduces this to a single state.

df.select(aggregate(col("i"), lit(0), (acc, x) => acc + x))

expr: the input array column
initialValue: the initial value
merge: (combined_value, input_value) => combined_value, the merge function to merge an input value to the combined_value

Since: 3.0.0

def aggregate(expr: Column, initialValue: Column, merge: (Column, Column) ⇒ Column, finish: (Column) ⇒ Column): Column

Applies a binary operator to an initial state and all elements in the array, and reduces this to a single state. The final state is converted into the final result by applying a finish function.

df.select(aggregate(col("i"), lit(0), (acc, x) => acc + x, _ * 10))

expr: the input array column
initialValue: the initial value
merge: (combined_value, input_value) => combined_value, the merge function to merge an input value to the combined_value
finish: combined_value => final_value, the lambda function to convert the combined value of all inputs to final result

Since: 3.0.0
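
A small sketch of both aggregate overloads on an integer array column, assuming Spark 3.5 on the classpath and a local SparkSession. The data and output column names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{aggregate, col, lit}

// Local session for illustration only.
val spark = SparkSession.builder().master("local[1]").appName("aggregate-sketch").getOrCreate()
import spark.implicits._

val arrays = Seq(Seq(1, 2, 3), Seq(10, 20)).toDF("i")

val summed = arrays.select(
  // merge only: fold the elements into a running sum starting at 0
  aggregate(col("i"), lit(0), (acc, x) => acc + x).as("sum"),
  // merge plus finish: the final sum is multiplied by 10
  aggregate(col("i"), lit(0), (acc, x) => acc + x, _ * 10).as("sum_times_10")
)
summed.show()
```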

def any(e: Column): Column

Aggregate function: returns true if at least one value of e is true.

Since: 3.5.0

def any_value(e: Column, ignoreNulls: Column): Column

Aggregate function: returns some value of e for a group of rows. If ignoreNulls is true, returns only non-null values.

Since: 3.5.0

def any_value(e: Column): Column

Aggregate function: returns some value of e for a group of rows.

Since: 3.5.0

def approx_count_distinct(columnName: String, rsd: Double): Column

Aggregate function: returns the approximate number of distinct items in a group.

rsd: maximum relative standard deviation allowed (default = 0.05)

Since: 2.1.0

def approx_count_distinct(e: Column, rsd: Double): Column

Aggregate function: returns the approximate number of distinct items in a group.

rsd: maximum relative standard deviation allowed (default = 0.05)

Since: 2.1.0

def approx_count_distinct(columnName: String): Column

Aggregate function: returns the approximate number of distinct items in a group.

Since: 2.1.0

def approx_count_distinct(e: Column): Column

Aggregate function: returns the approximate number of distinct items in a group.

Since: 2.1.0

def approx_percentile(e: Column, percentage: Column, accuracy: Column): Column

Aggregate function: returns the approximate percentile of the numeric column e, which is the smallest value in the ordered values of e (sorted from least to greatest) such that no more than percentage of the values of e are less than or equal to that value.

percentage must be between 0.0 and 1.0. If it is an array, each element must be between 0.0 and 1.0.

The accuracy parameter is a positive numeric literal that controls approximation accuracy at the cost of memory. A higher value of accuracy yields better accuracy; 1.0/accuracy is the relative error of the approximation.

Since: 3.5.0

def array(colName: String, colNames: String*): Column

Creates a new array column. The input columns must all have the same data type.

Annotations: @varargs()
Since: 1.4.0

def array(cols: Column*): Column

Creates a new array column. The input columns must all have the same data type.

Annotations: @varargs()
Since: 1.4.0

def array_agg(e: Column): Column

Aggregate function: returns a list of objects with duplicates.

Since: 3.5.0
Note: The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle.

def array_append(column: Column, element: Any): Column

Returns an ARRAY containing all elements from the source ARRAY as well as the new element. The new element/column is located at the end of the ARRAY.

Since: 3.4.0

def array_compact(column: Column): Column

Removes all null elements from the given array.

Since: 3.4.0

def array_contains(column: Column, value: Any): Column

Returns null if the array is null, true if the array contains value, and false otherwise.

Since: 1.5.0

def array_distinct(e: Column): Column

Removes duplicate values from the array.

Since: 2.4.0

def array_except(col1: Column, col2: Column): Column

Returns an array of the elements in the first array but not in the second array, without duplicates. The order of elements in the result is not determined.

Since: 2.4.0

def array_insert(arr: Column, pos: Column, value: Column): Column

Adds an item into a given array at a specified position.

Since: 3.4.0

def array_intersect(col1: Column, col2: Column): Column

Returns an array of the elements in the intersection of the given two arrays, without duplicates.

Since: 2.4.0

def array_join(column: Column, delimiter: String): Column

Concatenates the elements of column using the delimiter.

Since: 2.4.0

def array_join(column: Column, delimiter: String, nullReplacement: String): Column

Concatenates the elements of column using the delimiter. Null values are replaced with nullReplacement.

Since: 2.4.0

def array_max(e: Column): Column

Returns the maximum value in the array. NaN is greater than any non-NaN elements for double/float type. NULL elements are skipped.

Since: 2.4.0

def array_min(e: Column): Column

Returns the minimum value in the array. NaN is greater than any non-NaN elements for double/float type. NULL elements are skipped.

Since: 2.4.0

def array_position(column: Column, value: Any): Column

Locates the position of the first occurrence of value in the given array and returns it as a long. Returns null if either of the arguments is null.

Since: 2.4.0
Note: The position is 1-based, not 0-based. Returns 0 if value could not be found in array.
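
A brief sketch of the 1-based result and the 0 returned for a missing value, assuming Spark 3.5 on the classpath and a local SparkSession. The data and column names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array_position, col}

// Local session for illustration only.
val spark = SparkSession.builder().master("local[1]").appName("array-position-sketch").getOrCreate()
import spark.implicits._

val letters = Seq(Seq("a", "b", "c", "b")).toDF("letters")

val positions = letters.select(
  array_position(col("letters"), "b").as("pos_b"), // first occurrence, counted from 1
  array_position(col("letters"), "z").as("pos_z")  // value not present in the array
)
positions.show()
```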

def array_prepend(column: Column, element: Any): Column

Returns an array containing element followed by all elements of column. The new element is positioned at the beginning of the array.

Since: 3.5.0

def array_remove(column: Column, element: Any): Column

Removes all elements equal to element from the given array.

Since: 2.4.0

def array_repeat(e: Column, count: Int): Column

Creates an array containing e repeated count times.

Since: 2.4.0

def array_repeat(left: Column, right: Column): Column

Creates an array containing the left argument repeated the number of times given by the right argument.

Since: 2.4.0

def array_size(e: Column): Column

Returns the total number of elements in the array. The function returns null for null input.

Since: 3.5.0

def array_sort(e: Column, comparator: (Column, Column) ⇒ Column): Column

Sorts the input array based on the given comparator function. The comparator will take two arguments representing two elements of the array. It returns a negative integer, 0, or a positive integer as the first element is less than, equal to, or greater than the second element. If the comparator function returns null, the function will fail and raise an error.

Since: 3.4.0
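
A sketch of a comparator that sorts in descending order, shown next to the default ascending sort. It assumes Spark 3.5 on the classpath and a local SparkSession; the data and column names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array_sort, col, when}

// Local session for illustration only.
val spark = SparkSession.builder().master("local[1]").appName("array-sort-sketch").getOrCreate()
import spark.implicits._

val nums = Seq(Seq(3, 1, 2)).toDF("nums")

val sorted = nums.select(
  array_sort(col("nums")).as("ascending"),
  // Reversed comparator: positive when x < y, so larger elements come first.
  array_sort(col("nums"), (x, y) => when(x < y, 1).when(x > y, -1).otherwise(0)).as("descending")
)
sorted.show()
```

The comparator always returns one of 1, -1 or 0 here, so it never produces the null result that would make the function fail.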

def array_sort(e: Column): Column

Sorts the input array in ascending order. The elements of the input array must be orderable. NaN is greater than any non-NaN elements for double/float type. Null elements will be placed at the end of the returned array.

Since: 2.4.0

def array_union(col1: Column, col2: Column): Column

Returns an array of the elements in the union of the given two arrays, without duplicates.

Since: 2.4.0

def arrays_overlap(a1: Column, a2: Column): Column

Returns true if a1 and a2 have at least one non-null element in common. If not and both the arrays are non-empty and any of them contains a null, it returns null. It returns false otherwise.

Since: 2.4.0
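
The three possible results of arrays_overlap can be sketched with literal arrays, assuming Spark 3.5 on the classpath and a local SparkSession. The output column names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, arrays_overlap, lit}

// Local session for illustration only.
val spark = SparkSession.builder().master("local[1]").appName("arrays-overlap-sketch").getOrCreate()

val cases = spark.range(1).select(
  // A shared non-null element: true.
  arrays_overlap(array(lit(1), lit(2)), array(lit(2), lit(3))).as("common_element"),
  // Nothing in common, both non-empty, one contains a null: null.
  arrays_overlap(array(lit(1), lit(null).cast("int")), array(lit(3))).as("null_present"),
  // Nothing in common and no nulls: false.
  arrays_overlap(array(lit(1)), array(lit(2))).as("disjoint")
)
cases.show()
```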

def arrays_zip(e: Column*): Column

Returns a merged array of structs in which the N-th struct contains all N-th values of input arrays.

Annotations: @varargs()
Since: 2.4.0

def asc(columnName: String): Column

Returns a sort expression based on ascending order of the column.

df.sort(asc("dept"), desc("age"))

Since: 1.3.0

def asc_nulls_first(columnName: String): Column

Returns a sort expression based on ascending order of the column, and null values appear before non-null values.

df.sort(asc_nulls_first("dept"), desc("age"))

Since: 2.1.0

def asc_nulls_last(columnName: String): Column

Returns a sort expression based on ascending order of the column, and null values appear after non-null values.

df.sort(asc_nulls_last("dept"), desc("age"))

Since: 2.1.0

def ascii(e: Column): Column

Computes the numeric value of the first character of the string column, and returns the result as an int column.

Since: 1.5.0

def asin(columnName: String): Column

returns: inverse sine of columnName, as if computed by java.lang.Math.asin

Since: 1.4.0

def asin(e: Column): Column

returns: inverse sine of e in radians, as if computed by java.lang.Math.asin

Since: 1.4.0

def asinh(columnName: String): Column

returns: inverse hyperbolic sine of columnName

Since: 3.1.0

def asinh(e: Column): Column

returns: inverse hyperbolic sine of e

Since: 3.1.0

def assert_true(c: Column, e: Column): Column

Returns null if the condition is true; throws an exception with the error message otherwise.

Since: 3.1.0

def assert_true(c: Column): Column

Returns null if the condition is true, and throws an exception otherwise.

Since: 3.1.0

def atan(columnName: String): Column

returns: inverse tangent of columnName, as if computed by java.lang.Math.atan

Since: 1.4.0

def atan(e: Column): Column

returns: inverse tangent of e, as if computed by java.lang.Math.atan

Since: 1.4.0

def atan2(yValue: Double, xName: String): Column

yValue: coordinate on y-axis
xName: coordinate on x-axis
returns: the theta component of the point (r, theta) in polar coordinates that corresponds to the point (x, y) in Cartesian coordinates, as if computed by java.lang.Math.atan2

Since: 1.4.0

def atan2(yValue: Double, x: Column): Column

yValue: coordinate on y-axis
x: coordinate on x-axis
returns: the theta component of the point (r, theta) in polar coordinates that corresponds to the point (x, y) in Cartesian coordinates, as if computed by java.lang.Math.atan2

Since: 1.4.0

def atan2(yName: String, xValue: Double): Column

yName: coordinate on y-axis
xValue: coordinate on x-axis
returns: the theta component of the point (r, theta) in polar coordinates that corresponds to the point (x, y) in Cartesian coordinates, as if computed by java.lang.Math.atan2

Since: 1.4.0

def atan2(y: Column, xValue: Double): Column

y: coordinate on y-axis
xValue: coordinate on x-axis
returns: the theta component of the point (r, theta) in polar coordinates that corresponds to the point (x, y) in Cartesian coordinates, as if computed by java.lang.Math.atan2

Since: 1.4.0

def atan2(yName: String, xName: String): Column

yName: coordinate on y-axis
xName: coordinate on x-axis
returns: the theta component of the point (r, theta) in polar coordinates that corresponds to the point (x, y) in Cartesian coordinates, as if computed by java.lang.Math.atan2

Since: 1.4.0

def atan2(yName: String, x: Column): Column

yName: coordinate on y-axis
x: coordinate on x-axis
returns: the theta component of the point (r, theta) in polar coordinates that corresponds to the point (x, y) in Cartesian coordinates, as if computed by java.lang.Math.atan2

Since: 1.4.0

def atan2(y: Column, xName: String): Column

y: coordinate on y-axis
xName: coordinate on x-axis
returns: the theta component of the point (r, theta) in polar coordinates that corresponds to the point (x, y) in Cartesian coordinates, as if computed by java.lang.Math.atan2

Since: 1.4.0

def atan2(y: Column, x: Column): Column

y: coordinate on y-axis
x: coordinate on x-axis
returns: the theta component of the point (r, theta) in polar coordinates that corresponds to the point (x, y) in Cartesian coordinates, as if computed by java.lang.Math.atan2

Since: 1.4.0

def atanh(columnName: String): Column

returns: inverse hyperbolic tangent of columnName

Since: 3.1.0

def atanh(e: Column): Column

returns: inverse hyperbolic tangent of e

Since: 3.1.0

def avg(columnName: String): Column

Aggregate function: returns the average of the values in a group.

Since: 1.3.0

def avg(e: Column): Column

Aggregate function: returns the average of the values in a group.

Since: 1.3.0

def base64(e: Column): Column

Computes the BASE64 encoding of a binary column and returns it as a string column. This is the reverse of unbase64.

Since: 1.5.0

def bin(columnName: String): Column

An expression that returns the string representation of the binary value of the given long column. For example, the value 12 yields "1100".

Since: 1.5.0

def bin(e: Column): Column

An expression that returns the string representation of the binary value of the given long column. For example, the value 12 yields "1100".

Since: 1.5.0

def bit_and(e: Column): Column

Aggregate function: returns the bitwise AND of all non-null input values, or null if none.

Since: 3.5.0

def bit_count(e: Column): Column

Returns the number of bits that are set in the argument e as an unsigned 64-bit integer, or NULL if the argument is NULL.

Since: 3.5.0

def bit_get(e: Column, pos: Column): Column

Returns the value of the bit (0 or 1) at the specified position. The positions are numbered from right to left, starting at zero. The position argument cannot be negative.

Since: 3.5.0

def bit_length(e: Column): Column

Calculates the bit length for the specified string column.

Since: 3.3.0

def bit_or(e: Column): Column

Aggregate function: returns the bitwise OR of all non-null input values, or null if none.

Since: 3.5.0

def bit_xor(e: Column): Column

Aggregate function: returns the bitwise XOR of all non-null input values, or null if none.

Since: 3.5.0

def bitmap_bit_position(col: Column): Column

Returns the bit position for the given input column.

Since: 3.5.0

def bitmap_bucket_number(col: Column): Column

Returns the bucket number for the given input column.

Since: 3.5.0

def bitmap_construct_agg(col: Column): Column

Returns a bitmap with the positions of the bits set from all the values from the input column. The input column will most likely be bitmap_bit_position().

Since: 3.5.0

def bitmap_count(col: Column): Column

Returns the number of set bits in the input bitmap.

Since: 3.5.0

def bitmap_or_agg(col: Column): Column

Returns a bitmap that is the bitwise OR of all of the bitmaps from the input column. The input column should be bitmaps created from bitmap_construct_agg().

Since: 3.5.0

def bitwise_not(e: Column): Column

Computes bitwise NOT (~) of a number.

Since: 3.2.0

def bool_and(e: Column): Column

Aggregate function: returns true if all values of e are true.

Since: 3.5.0

def bool_or(e: Column): Column

Aggregate function: returns true if at least one value of e is true.

Since: 3.5.0

def broadcast[T](df: Dataset[T]): Dataset[T]

Marks a DataFrame as small enough for use in broadcast joins.

The following example marks the right DataFrame for broadcast hash join using joinKey.

// left and right are DataFrames
left.join(broadcast(right), "joinKey")

Since: 1.5.0

def bround(e: Column, scale: Int): Column

Rounds the value of e to scale decimal places with HALF_EVEN round mode if scale is greater than or equal to 0, or at the integral part when scale is less than 0.

Since: 2.0.0

def bround(e: Column): Column

Returns the value of the column e rounded to 0 decimal places with HALF_EVEN round mode.

Since: 2.0.0
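
A short comparison of bround (HALF_EVEN) with round (HALF_UP) on values that sit exactly halfway between two integers. It assumes Spark 3.5 on the classpath and a local SparkSession; the data and column names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{bround, col, round}

// Local session for illustration only.
val spark = SparkSession.builder().master("local[1]").appName("bround-sketch").getOrCreate()
import spark.implicits._

val values = Seq(2.5, 3.5, -2.5).toDF("x")

val rounded = values.select(
  col("x"),
  round(col("x")).as("half_up"),    // ties go away from zero
  bround(col("x")).as("half_even")  // ties go to the nearest even integer
)
rounded.show()
```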

def btrim(str: Column, trim: Column): Column

Removes the leading and trailing trim characters from str.

Since: 3.5.0

def btrim(str: Column): Column

Removes the leading and trailing space characters from str.

Since: 3.5.0

def bucket(numBuckets: Int, e: Column): Column

A transform for any type that partitions by a hash of the input column.

Since: 3.0.0

def bucket(numBuckets: Column, e: Column): Column

A transform for any type that partitions by a hash of the input column.

Since: 3.0.0

def call_function(funcName: String, cols: Column*): Column

Calls a SQL function.

funcName: function name that follows the SQL identifier syntax (can be quoted, can be qualified)
cols: the expression parameters of function

Annotations: @varargs()
Since: 3.5.0

def call_udf(udfName: String, cols: Column*): Column

Calls a user-defined function. Example:

import org.apache.spark.sql.functions.call_udf

// Assumes a SparkSession named spark with spark.implicits._ in scope, as in spark-shell.
val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value")
spark.udf.register("simpleUDF", (v: Int) => v * v)
df.select($"id", call_udf("simpleUDF", $"value"))

Annotations: @varargs()
Since: 3.2.0

def cardinality(e: Column): Column

Returns length of array or map. This is an alias of size function.

The function returns null for null input if spark.sql.legacy.sizeOfNull is set to false or spark.sql.ansi.enabled is set to true. Otherwise, the function returns -1 for null input. With the default settings, the function returns -1 for null input.

Since: 3.5.0

def cbrt(columnName: String): Column

Computes the cube-root of the given column.

Since: 1.4.0

def cbrt(e: Column): Column

Computes the cube-root of the given value.

Since: 1.4.0

def ceil(columnName: String): Column

Computes the ceiling of the given column to 0 decimal places.

Since: 1.4.0

def ceil(e: Column): Column

Computes the ceiling of the given value of e to 0 decimal places.

Since: 1.4.0

def ceil(e: Column, scale: Column): Column

Computes the ceiling of the given value of e to scale decimal places.

Since: 3.3.0

def ceiling(e: Column): Column

Computes the ceiling of the given value of e to 0 decimal places.

Since: 3.5.0

def ceiling(e: Column, scale: Column): Column

Computes the ceiling of the given value of e to scale decimal places.

Since: 3.5.0

def char(n: Column): Column

Returns the ASCII character having the binary equivalent to n. If n is larger than 256, the result is equivalent to char(n % 256).

Since: 3.5.0

def char_length(str: Column): Column

Returns the character length of string data or number of bytes of binary data. The length of string data includes the trailing spaces. The length of binary data includes binary zeros.

Since: 3.5.0

def character_length(str: Column): Column

Returns the character length of string data or number of bytes of binary data. The length of string data includes the trailing spaces. The length of binary data includes binary zeros.

Since: 3.5.0

def chr(n: Column): Column

Returns the ASCII character having the binary equivalent to n. If n is larger than 256, the result is equivalent to chr(n % 256).

Since: 3.5.0

def coalesce(e: Column*): Column

Returns the first column that is not null, or null if all inputs are null.

For example, coalesce(a, b, c) will return a if a is not null, or b if a is null and b is not null, or c if both a and b are null but c is not null.

Annotations: @varargs()
Since: 1.3.0

def col(colName: String): Column

Returns a Column based on the given column name.

Since: 1.3.0

def collect_list(columnName: String): Column

Aggregate function: returns a list of objects with duplicates.

Since: 1.6.0
Note: The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle.

def collect_list(e: Column): Column

Aggregate function: returns a list of objects with duplicates.

Since: 1.6.0
Note: The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle.

def collect_set(columnName: String): Column

Aggregate function: returns a set of objects with duplicate elements eliminated.

Since: 1.6.0
Note: The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle.

def collect_set(e: Column): Column

Aggregate function: returns a set of objects with duplicate elements eliminated.

Since: 1.6.0
Note: The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle.

def column(colName: String): Column

Returns a Column based on the given column name. Alias of col.

Since: 1.3.0

def concat(exprs: Column*): Column

Concatenates multiple input columns together into a single column. The function works with strings, binary and compatible array columns.

Annotations: @varargs()
Since: 1.5.0
Note: Returns null if any of the input columns are null.

def concat_ws(sep: String, exprs: Column*): Column

Concatenates multiple input string columns together into a single string column, using the given separator.

Annotations: @varargs()
Since: 1.5.0
Note: Input strings which are null are skipped.
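
The different null handling of concat and concat_ws can be sketched side by side, assuming Spark 3.5 on the classpath and a local SparkSession. The data and column names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat, concat_ws}

// Local session for illustration only.
val spark = SparkSession.builder().master("local[1]").appName("concat-sketch").getOrCreate()
import spark.implicits._

// The first row has a null in column y.
val pairs = Seq[(String, String)](("a", null), ("b", "c")).toDF("x", "y")

val joined = pairs.select(
  concat(col("x"), col("y")).as("concat"),           // null if any input is null
  concat_ws("-", col("x"), col("y")).as("concat_ws") // null inputs are skipped
)
joined.show()
```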

def contains(left: Column, right: Column): Column

Returns a boolean. The value is True if right is found inside left. Returns NULL if either input expression is NULL. Otherwise, returns False. Both left and right must be of STRING or BINARY type.

Since: 3.5.0

def conv(num: Column, fromBase: Int, toBase: Int): Column

Convert a number in a string column from one base to another.

Since: 1.5.0

def convert_timezone(targetTz: Column, sourceTs: Column): Column

Converts the timestamp without time zone sourceTs from the current time zone to targetTz.

targetTz: the time zone to which the input timestamp should be converted.
sourceTs: a timestamp without time zone.

Since: 3.5.0

def convert_timezone(sourceTz: Column, targetTz: Column, sourceTs: Column): Column

Converts the timestamp without time zone sourceTs from the sourceTz time zone to targetTz.

sourceTz: the time zone for the input timestamp. If it is omitted, the current session time zone is used as the source time zone.
targetTz: the time zone to which the input timestamp should be converted.
sourceTs: a timestamp without time zone.

Since: 3.5.0

def corr(columnName1: String, columnName2: String): Column

Aggregate function: returns the Pearson Correlation Coefficient for two columns.

Since: 1.6.0

def corr(column1: Column, column2: Column): Column

Aggregate function: returns the Pearson Correlation Coefficient for two columns.

Since: 1.6.0

def cos(columnName: String): Column

columnName: angle in radians
returns: cosine of the angle, as if computed by java.lang.Math.cos

Since: 1.4.0

def cos(e: Column): Column

e: angle in radians
returns: cosine of the angle, as if computed by java.lang.Math.cos

Since: 1.4.0

def cosh(columnName: String): Column

columnName: hyperbolic angle
returns: hyperbolic cosine of the angle, as if computed by java.lang.Math.cosh

Since: 1.4.0

def cosh(e: Column): Column

e: hyperbolic angle
returns: hyperbolic cosine of the angle, as if computed by java.lang.Math.cosh

Since: 1.4.0

def cot(e: Column): Column

e: angle in radians
returns: cotangent of the angle

Since: 3.3.0

def count(columnName: String): TypedColumn[Any, Long]

Aggregate function: returns the number of items in a group.

Since: 1.3.0

def count(e: Column): Column

Aggregate function: returns the number of items in a group.

Since: 1.3.0

def countDistinct(columnName: String, columnNames: String*): Column

Aggregate function: returns the number of distinct items in a group.

An alias of count_distinct; using count_distinct directly is encouraged.

Annotations: @varargs()
Since: 1.3.0

def countDistinct(expr: Column, exprs: Column*): Column

Aggregate function: returns the number of distinct items in a group.

An alias of count_distinct; using count_distinct directly is encouraged.

Annotations: @varargs()
Since: 1.3.0

def count_distinct(expr: Column, exprs: Column*): Column

Aggregate function: returns the number of distinct items in a group.

Annotations: @varargs()
Since: 3.2.0

def count_if(e: Column): Column

Aggregate function: returns the number of TRUE values for the expression.

Since: 3.5.0

def count_min_sketch(e: Column, eps: Column, confidence: Column, seed: Column): Column

Returns a count-min sketch of a column with the given eps, confidence and seed. The result is an array of bytes, which can be deserialized to a CountMinSketch before usage. Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space.

Since: 3.5.0

def covar_pop(columnName1: String, columnName2: String): Column

Aggregate function: returns the population covariance for two columns.

Since: 2.0.0

def covar_pop(column1: Column, column2: Column): Column

Aggregate function: returns the population covariance for two columns.

Since: 2.0.0

def covar_samp(columnName1: String, columnName2: String): Column

Aggregate function: returns the sample covariance for two columns.

Since: 2.0.0

def covar_samp(column1: Column, column2: Column): Column

Aggregate function: returns the sample covariance for two columns.

Since: 2.0.0
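
The difference between covar_pop and covar_samp is only the divisor. The following is a minimal plain-Scala sketch of the two textbook formulas, not Spark's implementation; the object and method names are hypothetical and no Spark dependency is assumed.

```scala
object CovarianceSketch {
  // Population covariance: sum of products of deviations, divided by n.
  def covarPop(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.size == ys.size && xs.nonEmpty, "need two non-empty columns of equal length")
    val meanX = xs.sum / xs.size
    val meanY = ys.sum / ys.size
    xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum / xs.size
  }

  // Sample covariance: the same sum, divided by n - 1 instead of n.
  def covarSamp(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.size > 1, "sample covariance needs at least two rows")
    covarPop(xs, ys) * xs.size / (xs.size - 1)
  }
}
```

For the columns (1, 2, 3) and (2, 4, 6) the sum of deviation products is 4, so the population value is 4/3 and the sample value is 2.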

def crc32(e: Column): Column

Calculates the cyclic redundancy check value (CRC32) of a binary column and returns the value as a bigint.

Since: 1.5.0
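
As a rough illustration of the value being computed, the sketch below uses the JDK's java.util.zip.CRC32 on a byte array. It assumes the standard CRC-32 checksum and is not taken from Spark's source; the object name is hypothetical.

```scala
import java.nio.charset.StandardCharsets
import java.util.zip.CRC32

object Crc32Sketch {
  // CRC32 is an unsigned 32-bit value; a Long (bigint) holds it without going negative.
  def crc32(bytes: Array[Byte]): Long = {
    val checksum = new CRC32()
    checksum.update(bytes)
    checksum.getValue
  }

  def main(args: Array[String]): Unit = {
    // "123456789" is the conventional CRC-32 check input; its checksum is 0xCBF43926.
    val value = crc32("123456789".getBytes(StandardCharsets.US_ASCII))
    println(java.lang.Long.toHexString(value))
  }
}
```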

def csc(e: Column): Column

e: angle in radians
returns: cosecant of the angle

Since: 3.3.0

def cume_dist(): Column

Window function: returns the cumulative distribution of values within a window partition, i.e. the fraction of rows that are below the current row.

N = total number of rows in the partition
cumeDist(x) = number of values before (and including) x / N

Since: 1.6.0
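
The formula above can be traced on a small partition. This is a plain-Scala sketch of the arithmetic only, assuming the window is ordered ascending by the value itself; the names are hypothetical and this is not how Spark evaluates the window function.

```scala
object CumeDistSketch {
  // For each value x: (number of values before and including x) / N.
  def cumeDist(partition: Seq[Int]): Seq[Double] = {
    val n = partition.size.toDouble
    partition.map(x => partition.count(_ <= x) / n)
  }
}
```

With the partition (10, 20, 20, 30), the tied rows share the same result: 0.25, 0.75, 0.75, 1.0.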

def curdate(): Column

Returns the current date at the start of query evaluation as a date column. All calls of current_date within the same query return the same value.

Since: 3.5.0

def current_catalog(): Column

Returns the current catalog.

Since: 3.5.0

def current_database(): Column

Returns the current database.

Since: 3.5.0

def current_date(): Column

Returns the current date at the start of query evaluation as a date column. All calls of current_date within the same query return the same value.

Since: 1.5.0

def current_schema(): Column

Returns the current schema.

Since: 3.5.0

def current_timestamp(): Column

Returns the current timestamp at the start of query evaluation as a timestamp column. All calls of current_timestamp within the same query return the same value.

Since: 1.5.0

def current_timezone(): Column

Returns the current session local timezone.

Since: 3.5.0

def current_user(): Column

Returns the user name of the current execution context.

Since: 3.5.0

def date_add(start: Column, days: Column): Column

Returns the date that is days days after start

start: A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS
days: A column of the number of days to add to start, can be negative to subtract days
returns: A date, or null if start was a string that could not be cast to a date

Since: 3.0.0

def date_add(start: Column, days: Int): Column

Returns the date that is days days after start

start: A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS
days: The number of days to add to start, can be negative to subtract days
returns: A date, or null if start was a string that could not be cast to a date

Since: 1.5.0

def date_diff(end: Column, start: Column): Column

Returns the number of days from start to end.

Only considers the date part of the input. For example:

date_diff("2018-01-10 00:00:00", "2018-01-09 23:59:59")
// returns 1

end: A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS
start: A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS
returns: An integer, or null if either end or start were strings that could not be cast to a date. Negative if end is before start

Since: 3.5.0
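
To make the "date part only" rule concrete, here is a plain-Scala sketch using java.time. It mirrors the documented behaviour for well-formed timestamp strings and is not Spark's implementation; the object name and the fixed input pattern are assumptions for the example.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit

object DateDiffSketch {
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  // Drop the time of day first, then count whole days from start to end.
  def dateDiff(end: String, start: String): Long = {
    val endDate = LocalDateTime.parse(end, fmt).toLocalDate
    val startDate = LocalDateTime.parse(start, fmt).toLocalDate
    ChronoUnit.DAYS.between(startDate, endDate)
  }
}
```

The two timestamps in the example above are one second apart but fall on different calendar days, so the result is 1. Swapping the arguments gives -1.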

def date_format(dateExpr: Column, format: String): Column

Converts a date/timestamp/string to a value of string in the format specified by the date format given by the second argument.

See Datetime Patterns for valid date and time format patterns

dateExpr: A date, timestamp or string. If a string, the data must be in a format that can be cast to a timestamp, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS
format: A pattern dd.MM.yyyy would return a string like 18.03.1993
returns: A string, or null if dateExpr was a string that could not be cast to a timestamp

Since: 1.5.0
Exceptions thrown: IllegalArgumentException if the format pattern is invalid
Note: Use specialized functions like year whenever possible as they benefit from a specialized implementation.

def date_from_unix_date(days: Column): Column

Create date from the number of days since 1970-01-01.

Since: 3.5.0

def date_part(field: Column, source: Column): Column

Extracts a part of the date/timestamp or interval source.

field: selects which part of the source should be extracted; the supported string values are the same as the fields of the equivalent function extract.
source: a date/timestamp or interval column from which field should be extracted.
returns: a part of the date/timestamp or interval source

Since: 3.5.0

def date_sub(start: Column, days: Column): Column

Returns the date that is days days before start

start: A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS
days: A column of the number of days to subtract from start, can be negative to add days
returns: A date, or null if start was a string that could not be cast to a date

Since: 3.0.0

def date_sub(start: Column, days: Int): Column

Returns the date that is days days before start

start: A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS
days: The number of days to subtract from start, can be negative to add days
returns: A date, or null if start was a string that could not be cast to a date

Since: 1.5.0

def date_trunc(format: String, timestamp: Column): Column

Returns timestamp truncated to the unit specified by the format.

For example, date_trunc("year", "2018-11-19 12:01:19") returns 2018-01-01 00:00:00

timestamp: A date, timestamp or string. If a string, the data must be in a format that can be cast to a timestamp, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS
returns: A timestamp, or null if timestamp was a string that could not be cast to a timestamp or format was an invalid value

Since: 2.3.0

def dateadd(start: Column, days: Column): Column

Returns the date that is days days after start

start: A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS
days: A column of the number of days to add to start, can be negative to subtract days
returns: A date, or null if start was a string that could not be cast to a date

Since: 3.5.0

def datediff(end: Column, start: Column): Column

Returns the number of days from start to end.

Only considers the date part of the input. For example:

datediff("2018-01-10 00:00:00", "2018-01-09 23:59:59")
// returns 1

end: A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS
start: A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS
returns: An integer, or null if either end or start were strings that could not be cast to a date. Negative if end is before start

Since: 1.5.0

def datepart(field: Column, source: Column): Column

Extracts a part of the date/timestamp or interval source.

field: selects which part of the source should be extracted; the supported string values are the same as the fields of the equivalent function EXTRACT.
source: a date/timestamp or interval column from which field should be extracted.
returns: a part of the date/timestamp or interval source

Since: 3.5.0

def day(e: Column): Column

Extracts the day of the month as an integer from a given date/timestamp/string.

returns: An integer, or null if the input was a string that could not be cast to a date

Since: 3.5.0

def dayofmonth(e: Column): Column

Extracts the day of the month as an integer from a given date/timestamp/string.

returns: An integer, or null if the input was a string that could not be cast to a date

Since: 1.5.0

def dayofweek(e: Column): Column

Extracts the day of the week as an integer from a given date/timestamp/string. Ranges from 1 for a Sunday through to 7 for a Saturday.

returns: An integer, or null if the input was a string that could not be cast to a date

Since: 2.3.0
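
The Sunday = 1 numbering differs from java.time, where Monday = 1 and Sunday = 7. The plain-Scala sketch below shows one way to convert between the two conventions; it is only an illustration with a hypothetical object name, not Spark's code.

```scala
import java.time.LocalDate

object DayOfWeekSketch {
  // java.time: Monday = 1 ... Sunday = 7. Shift so Sunday = 1 ... Saturday = 7.
  def dayOfWeek(date: LocalDate): Int = date.getDayOfWeek.getValue % 7 + 1
}
```

For example, 2018-01-07 was a Sunday and maps to 1, while 2018-01-06 was a Saturday and maps to 7.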

def dayofyear(e: Column): Column

Extracts the day of the year as an integer from a given date/timestamp/string.

returns: An integer, or null if the input was a string that could not be cast to a date

Since: 1.5.0

def days(e: Column): Column

A transform for timestamps and dates to partition data into days.

Since: 3.0.0

def decode(value: Column, charset: String): Column

Decodes the first argument from a binary into a string using the provided character set (one of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16'). If either argument is null, the result will also be null.

Since: 1.5.0

def degrees(columnName: String): Column

Converts an angle measured in radians to an approximately equivalent angle measured in degrees.

columnName: angle in radians
returns: angle in degrees, as if computed by java.lang.Math.toDegrees

Since: 2.1.0

def degrees(e: Column): Column

Converts an angle measured in radians to an approximately equivalent angle measured in degrees.

e: angle in radians
returns: angle in degrees, as if computed by java.lang.Math.toDegrees

Since: 2.1.0

def dense_rank(): Column

Window function: returns the rank of rows within a window partition, without any gaps.

The difference between rank and dense_rank is that dense_rank leaves no gaps in the ranking sequence when there are ties. That is, if you were ranking a competition using dense_rank and had three people tie for second place, you would say that all three were in second place and that the next person came in third. rank, by contrast, would leave a gap after the ties, so the person who came next would register as coming in fifth.

This is equivalent to the DENSE_RANK function in SQL.

Since: 1.6.0
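
The competition example can be reproduced with a few lines of plain Scala. This is a sketch of the two numbering rules over an in-memory sequence, with hypothetical names; it does not use Spark's window machinery.

```scala
object RankSketch {
  // rank: 1 + number of strictly smaller values, so ties leave a gap after them.
  def rank(values: Seq[Int]): Seq[Int] =
    values.map(x => values.count(_ < x) + 1)

  // dense_rank: 1 + number of distinct strictly smaller values, so there are no gaps.
  def denseRank(values: Seq[Int]): Seq[Int] = {
    val distinctValues = values.distinct
    values.map(x => distinctValues.count(_ < x) + 1)
  }
}
```

For finishing times (1, 2, 2, 2, 3), rank yields 1, 2, 2, 2, 5 while denseRank yields 1, 2, 2, 2, 3.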

def desc(columnName: String): Column

Returns a sort expression based on the descending order of the column.

df.sort(asc("dept"), desc("age"))

Since: 1.3.0

def desc_nulls_first(columnName: String): Column

Returns a sort expression based on the descending order of the column, and null values appear before non-null values.

df.sort(asc("dept"), desc_nulls_first("age"))

Since: 2.1.0

def desc_nulls_last(columnName: String): Column

Returns a sort expression based on the descending order of the column, and null values appear after non-null values.

df.sort(asc("dept"), desc_nulls_last("age"))

Since: 2.1.0

def e(): Column

Returns Euler's number.

Since: 3.5.0

def element_at(column: Column, value: Any): Column

Returns the element of the array at the index given in value if column is an array. Returns the value for the key given in value if column is a map.

Since: 2.4.0

def elt(inputs: Column*): Column

Returns the n-th input, e.g., returns input2 when n is 2. The function returns NULL if the index exceeds the length of the array and spark.sql.ansi.enabled is set to false. If spark.sql.ansi.enabled is set to true, it throws ArrayIndexOutOfBoundsException for invalid indices.

Annotations: @varargs()
Since: 3.5.0

def encode(value: Column, charset: String): Column

Encodes the first argument from a string into a binary using the provided character set (one of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16'). If either argument is null, the result will also be null.

Since: 1.5.0

def endswith(str: Column, suffix: Column): Column

Returns a boolean. The value is True if str ends with suffix. Returns NULL if either input expression is NULL. Otherwise, returns False. Both str and suffix must be of STRING or BINARY type.

Since: 3.5.0

final def eq(arg0: AnyRef): Boolean

Definition Classes: AnyRef

def equal_null(col1: Column, col2: Column): Column

Returns the same result as the EQUAL(=) operator for non-null operands, but returns true if both are null and false if one of them is null.

Since: 3.5.0
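
The three cases can be written out as a truth table in plain Scala. In this sketch None stands in for SQL NULL; the object and method names are hypothetical and the code is an analogy, not Spark's implementation.

```scala
object EqualNullSketch {
  // None models SQL NULL; the result is always a plain Boolean, never NULL.
  def equalNull[A](left: Option[A], right: Option[A]): Boolean = (left, right) match {
    case (None, None)       => true   // both null
    case (Some(x), Some(y)) => x == y // ordinary equality
    case _                  => false  // exactly one side is null
  }
}
```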

def equals(arg0: Any): Boolean

Definition Classes: AnyRef → Any

def every(e: Column): Column

Aggregate function: returns true if all values of e are true.

Since: 3.5.0

def exists(column: Column, f: (Column) ⇒ Column): Column

Returns whether a predicate holds for one or more elements in the array.

df.select(exists(col("i"), _ % 2 === 0))

column: the input array column
f: col => predicate, the Boolean predicate to check the input column

Since: 3.0.0

def exp(columnName: String): Column

Computes the exponential of the given column.

Since: 1.4.0

def exp(e: Column): Column

Computes the exponential of the given value.

Since: 1.4.0

def explode(e: Column): Column

Creates a new row for each element in the given array or map column. Uses the default column name col for elements in the array and key and value for elements in the map unless specified otherwise.

Since: 1.3.0

def explode_outer(e: Column): Column

Creates a new row for each element in the given array or map column. Uses the default column name col for elements in the array and key and value for elements in the map unless specified otherwise. Unlike explode, if the array/map is null or empty then null is produced.

Since: 2.2.0

def expm1(columnName: String): Column

Computes the exponential of the given column minus one.

Since: 1.4.0

def expm1(e: Column): Column

Computes the exponential of the given value minus one.

Since: 1.4.0

def expr(expr: String): Column

Parses the expression string into the column that it represents, similar to Dataset#selectExpr.

// get the number of words of each length
df.groupBy(expr("length(word)")).count()

def extract(field: Column, source: Column): Column

Extracts a part of the date/timestamp or interval source.

field: selects which part of the source should be extracted.
source: a date/timestamp or interval column from which field should be extracted.
returns: a part of the date/timestamp or interval source

Since: 3.5.0

def factorial(e: Column): Column

Computes the factorial of the given value.

Since: 1.5.0

def filter(column: Column, f: (Column, Column) ⇒ Column): Column

Returns an array of elements for which a predicate holds in a given array.

df.select(filter(col("s"), (x, i) => i % 2 === 0))

column: the input array column
f: (col, index) => predicate, the Boolean predicate to filter the input column given the index. Indices start at 0.

Since: 3.0.0

def filter(column: Column, f: (Column) ⇒ Column): Column

Returns an array of elements for which a predicate holds in a given array.

df.select(filter(col("s"), x => x % 2 === 0))

column: the input array column
f: col => predicate, the Boolean predicate to filter the input column

Since: 3.0.0
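
The two filter overloads correspond closely to operations on ordinary Scala collections. The sketch below is an analogy over an in-memory Seq with hypothetical names; the Spark functions operate on array columns and take Column-typed lambdas instead.

```scala
object ArrayFilterSketch {
  val s: Seq[Int] = Seq(5, 6, 8, 9)

  // Analogue of filter(col("s"), x => x % 2 === 0): keep the even elements.
  val evenValues: Seq[Int] = s.filter(x => x % 2 == 0)

  // Analogue of filter(col("s"), (x, i) => i % 2 === 0): the index i starts at 0.
  val evenPositions: Seq[Int] =
    s.zipWithIndex.collect { case (x, i) if i % 2 == 0 => x }
}
```

Here evenValues is (6, 8) and evenPositions is (5, 8), which shows that the second overload tests the position rather than the element.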

def finalize(): Unit

Attributes: protected[lang]
Definition Classes: AnyRef
Annotations: @throws( classOf[java.lang.Throwable] )

def find_in_set(str: Column, strArray: Column): Column

Returns the index (1-based) of the given string (str) in the comma-delimited list (strArray). Returns 0 if the string was not found or if the given string (str) contains a comma.

Since: 3.5.0
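
The lookup rule can be sketched in plain Scala as follows. This models only the documented behaviour for non-null inputs; the object and method names are hypothetical and the code is not taken from Spark.

```scala
object FindInSetSketch {
  // 1-based position of str in the comma-delimited list; 0 when str is absent
  // or when str itself contains a comma.
  def findInSet(str: String, strArray: String): Int =
    if (str.contains(",")) 0
    else strArray.split(",", -1).indexOf(str) + 1
}
```

For example, findInSet("ab", "abc,b,ab,c,def") gives 3, because "ab" matches only the third entry exactly.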

def first(columnName: String): Column

Aggregate function: returns the first value of a column in a group.

The function by default returns the first value it sees. It will return the first non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned.

Since: 1.3.0
Note: The function is non-deterministic because its result depends on the order of the rows, which may be non-deterministic after a shuffle.

def first(e: Column): Column

Aggregate function: returns the first value in a group.

The function by default returns the first value it sees. It will return the first non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned.

Since: 1.3.0
Note: The function is non-deterministic because its result depends on the order of the rows, which may be non-deterministic after a shuffle.

def first(columnName: String, ignoreNulls: Boolean): Column

Aggregate function: returns the first value of a column in a group.

The function by default returns the first value it sees. It will return the first non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned.

Since: 2.0.0
Note: The function is non-deterministic because its result depends on the order of the rows, which may be non-deterministic after a shuffle.

def first(e: Column, ignoreNulls: Boolean): Column

Aggregate function: returns the first value in a group.

The function by default returns the first value it sees. It will return the first non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned.

Since: 2.0.0
Note: The function is non-deterministic because its result depends on the order of the rows, which may be non-deterministic after a shuffle.

def first_value(e: Column, ignoreNulls: Column): Column

Aggregate function: returns the first value in a group.

The function by default returns the first value it sees. It will return the first non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned.

Since: 3.5.0
Note: The function is non-deterministic because its result depends on the order of the rows, which may be non-deterministic after a shuffle.

def first_value(e: Column): Column

Aggregate function: returns the first value in a group.

Since: 3.5.0
Note: The function is non-deterministic because its result depends on the order of the rows, which may be non-deterministic after a shuffle.

def flatten(e: Column): Column

Creates a single array from an array of arrays. If a structure of nested arrays is deeper than two levels, only one level of nesting is removed.

Since: 2.4.0
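
The "only one level of nesting is removed" rule behaves like flatten on nested Scala collections. The sketch below is an analogy over in-memory sequences with hypothetical names, not a Spark call.

```scala
object FlattenSketch {
  // Three levels of nesting: an array of arrays of arrays.
  val nested: Seq[Seq[Seq[Int]]] = Seq(Seq(Seq(1, 2), Seq(3)), Seq(Seq(4)))

  // One level is removed; the result is still an array of arrays.
  val flattenedOnce: Seq[Seq[Int]] = nested.flatten
}
```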

def floor(columnName: String): Column

Computes the floor of the given column value to 0 decimal places.

Since: 1.4.0

def floor(e: Column): Column

Computes the floor of the given value of e to 0 decimal places.

Since: 1.4.0

def floor(e: Column, scale: Column): Column

Computes the floor of the given value of e to scale decimal places.

Since: 3.3.0

def forall(column: Column, f: (Column) ⇒ Column): Column

Returns whether a predicate holds for every element in the array.

df.select(forall(col("i"), x => x % 2 === 0))

column: the input array column
f: col => predicate, the Boolean predicate to check the input column

Since: 3.0.0

def format_number(x: Column, d: Int): Column

Formats numeric column x to a format like '#,###,###.##', rounded to d decimal places with HALF_EVEN round mode, and returns the result as a string column.

If d is 0, the result has no decimal point or fractional part. If d is less than 0, the result will be null.

Since: 1.5.0

def format_string(format: String, arguments: Column*): Column

Formats the arguments in printf-style and returns the result as a string column.

Annotations: @varargs()
Since: 1.5.0

def from_csv(e: Column, schema: Column, options: Map[String, String]): Column

(Java-specific) Parses a column containing a CSV string into a StructType with the specified schema. Returns null, in the case of an unparseable string.

e: a string column containing CSV data.
schema: the schema to use when parsing the CSV string
options: options to control how the CSV is parsed. Accepts the same options as the CSV data source. See Data Source Option in the version you use.

Since: 3.0.0

def from_csv(e: Column, schema: StructType, options: Map[String, String]): Column

Parses a column containing a CSV string into a StructType with the specified schema. Returns null, in the case of an unparseable string.

e: a string column containing CSV data.
schema: the schema to use when parsing the CSV string
options: options to control how the CSV is parsed. Accepts the same options as the CSV data source. See Data Source Option in the version you use.

Since: 3.0.0

def from_json(e: Column, schema: Column, options: Map[String, String]): Column

(Java-specific) Parses a column containing a JSON string into a MapType with StringType as keys type, StructType or ArrayType of StructTypes with the specified schema. Returns null, in the case of an unparseable string.

e: a string column containing JSON data.
schema: the schema to use when parsing the json string
options: options to control how the json is parsed. Accepts the same options as the json data source. See Data Source Option in the version you use.

Since: 2.4.0

def from_json(e: Column, schema: Column): Column

(Scala-specific) Parses a column containing a JSON string into a MapType with StringType as keys type, StructType or ArrayType of StructTypes with the specified schema. Returns null, in the case of an unparseable string.

e: a string column containing JSON data.
schema: the schema to use when parsing the json string

Since: 2.4.0

def from_json(e: Column, schema: String, options: Map[String, String]): Column

(Scala-specific) Parses a column containing a JSON string into a MapType with StringType as keys type, StructType or ArrayType with the specified schema. Returns null, in the case of an unparseable string.

e: a string column containing JSON data.
schema: the schema as a DDL-formatted string.
options: options to control how the json is parsed. Accepts the same options as the json data source. See Data Source Option in the version you use.

Since: 2.3.0

def from_json(e: Column, schema: String, options: Map[String, String]): Column

(Java-specific) Parses a column containing a JSON string into a MapType with StringType as keys type, StructType or ArrayType with the specified schema. Returns null, in the case of an unparseable string.

e: a string column containing JSON data.
schema: the schema as a DDL-formatted string.
options: options to control how the json is parsed. Accepts the same options as the json data source. See Data Source Option in the version you use.

Since: 2.1.0

def from_json(e: Column, schema: DataType): Column

Parses a column containing a JSON string into a MapType with StringType as keys type, StructType or ArrayType with the specified schema. Returns null, in the case of an unparseable string.

e: a string column containing JSON data.
schema: the schema to use when parsing the json string

Since: 2.2.0

def from_json(e: Column, schema: StructType): Column

Parses a column containing a JSON string into a StructType with the specified schema. Returns null, in the case of an unparseable string.

e: a string column containing JSON data.
schema: the schema to use when parsing the json string

Since: 2.1.0

def from_json(e: Column, schema: DataType, options: Map[String, String]): Column

(Java-specific) Parses a column containing a JSON string into a MapType with StringType as keys type, StructType or ArrayType with the specified schema. Returns null, in the case of an unparseable string.

e: a string column containing JSON data.
schema: the schema to use when parsing the json string
options: options to control how the json is parsed. Accepts the same options as the json data source. See Data Source Option in the version you use.

Since: 2.2.0

def from_json(e: Column, schema: StructType, options: Map[String, String]): Column

(Java-specific) Parses a column containing a JSON string into a StructType with the specified schema. Returns null, in the case of an unparseable string.

e: a string column containing JSON data.
schema: the schema to use when parsing the json string
options: options to control how the json is parsed. Accepts the same options as the json data source. See Data Source Option in the version you use.

Since: 2.1.0

def from_json(e: Column, schema: DataType, options: Map[String, String]): Column

(Scala-specific) Parses a column containing a JSON string into a MapType with StringType as keys type, StructType or ArrayType with the specified schema. Returns null, in the case of an unparseable string.

e: a string column containing JSON data.
schema: the schema to use when parsing the json string
options: options to control how the json is parsed. Accepts the same options as the json data source. See Data Source Option in the version you use.

Since: 2.2.0

def from_json(e: Column, schema: StructType, options: Map[String, String]): Column

(Scala-specific) Parses a column containing a JSON string into a StructType with the specified schema. Returns null, in the case of an unparseable string.

e: a string column containing JSON data.
schema: the schema to use when parsing the json string
options: options to control how the json is parsed. Accepts the same options as the json data source. See Data Source Option in the version you use.

Since: 2.1.0

def from_unixtime(ut: Column, f: String): Column

Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the given format.

See Datetime Patterns for valid date and time format patterns

ut: A number of a type that is castable to a long, such as string or integer. Can be negative for timestamps before the unix epoch
f: A date time pattern that the input will be formatted to
returns: A string, or null if ut was a string that could not be cast to a long or f was an invalid date time pattern

Since: 1.5.0

def from_unixtime(ut: Column): Column

Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the yyyy-MM-dd HH:mm:ss format.

ut: A number of a type that is castable to a long, such as string or integer. Can be negative for timestamps before the unix epoch
returns: A string, or null if the input was a string that could not be cast to a long

Since: 1.5.0
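
The conversion from epoch seconds to a formatted string can be sketched with java.time. This is a plain-Scala illustration with hypothetical names, not Spark's implementation. The time zone is passed explicitly (defaulting to UTC as an assumption of this example) so the output does not depend on the machine, whereas the Spark function uses the current system time zone as described above.

```scala
import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter

object FromUnixtimeSketch {
  // seconds may be negative for moments before the unix epoch.
  def fromUnixtime(
      seconds: Long,
      pattern: String = "yyyy-MM-dd HH:mm:ss",
      zone: ZoneId = ZoneId.of("UTC")): String =
    Instant.ofEpochSecond(seconds).atZone(zone).format(DateTimeFormatter.ofPattern(pattern))
}
```

With these defaults, 0 formats as "1970-01-01 00:00:00" and -1 as "1969-12-31 23:59:59".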

def from_utc_timestamp(ts: Column, tz: Column): Column

Given a timestamp like '2017-07-14 02:40:00.0', interprets it as a time in UTC, and renders that time as a timestamp in the given time zone. For example, 'GMT+1' would yield '2017-07-14 03:40:00.0'.

Since: 2.4.0

def from_utc_timestamp(ts: Column, tz: String): Column

Given a timestamp like '2017-07-14 02:40:00.0', interprets it as a time in UTC, and renders that time as a timestamp in the given time zone. For example, 'GMT+1' would yield '2017-07-14 03:40:00.0'.

ts: A date, timestamp or string. If a string, the data must be in a format that can be cast to a timestamp, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS
tz: A string detailing the time zone ID that the input should be adjusted to. It should be in the format of either region-based zone IDs or zone offsets. Region IDs must have the form 'area/city', such as 'America/Los_Angeles'. Zone offsets must be in the format '(+|-)HH:mm', for example '-08:00' or '+01:00'. Also 'UTC' and 'Z' are supported as aliases of '+00:00'. Other short names are not recommended because they can be ambiguous.
returns: A timestamp, or null if ts was a string that could not be cast to a timestamp or tz was an invalid value

Since: 1.5.0
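
The "interpret as UTC, render in the given zone" step can be sketched with java.time. This plain-Scala example uses hypothetical names and is only an analogy for the documented behaviour, not Spark's implementation.

```scala
import java.time.{LocalDateTime, ZoneId, ZoneOffset}

object FromUtcTimestampSketch {
  // Read the wall-clock value as UTC, then show the same instant in the target zone.
  def fromUtcTimestamp(ts: LocalDateTime, tz: String): LocalDateTime =
    ts.atZone(ZoneOffset.UTC).withZoneSameInstant(ZoneId.of(tz)).toLocalDateTime
}
```

For 2017-07-14 02:40:00 and the offset '+01:00', the result is 2017-07-14 03:40:00, matching the example above. A region ID such as 'America/Los_Angeles' is accepted by ZoneId.of in the same way.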

def get(column: Column, index: Column): Column

Returns element of array at given (0-based) index. If the index points outside of the array boundaries, then this function returns NULL.

Since: 3.4.0

final def getClass(): Class[_]

Definition Classes: AnyRef → Any
Annotations: @native()

def get_json_object(e: Column, path: String): Column

Extracts json object from a json string based on json path specified, and returns json string of the extracted json object. It will return null if the input json string is invalid.

Since: 1.6.0

def getbit(e: Column, pos: Column): Column

Returns the value of the bit (0 or 1) at the specified position. The positions are numbered from right to left, starting at zero. The position argument cannot be negative.

Since: 3.5.0
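
Right-to-left, zero-based bit numbering amounts to a shift and a mask. The plain-Scala sketch below illustrates this on a Long; the names are hypothetical and the bounds check reflects this example's own assumptions rather than Spark's error handling.

```scala
object GetBitSketch {
  // Position 0 is the least significant (rightmost) bit.
  def getBit(value: Long, pos: Int): Int = {
    require(pos >= 0 && pos < 64, "position must be within the 64 bits of a Long")
    ((value >> pos) & 1L).toInt
  }
}
```

The number 11 is 1011 in binary, so positions 0, 1 and 3 hold 1 and position 2 holds 0.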

def greatest(columnName: String, columnNames: String*): Column

Returns the greatest value of the list of column names, skipping null values. This function takes at least 2 parameters. It will return null iff all parameters are null.

Annotations: @varargs()
Since: 1.5.0

def greatest(exprs: Column*): Column

Returns the greatest value of the list of values, skipping null values. This function takes at least 2 parameters. It will return null iff all parameters are null.

Annotations: @varargs()
Since: 1.5.0
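
The null-skipping rule can be modelled in plain Scala, with None standing in for a SQL null. This is a sketch of the documented behaviour for integer inputs only; GreatestSketch and its greatest method are hypothetical names, not part of Spark.

```scala
object GreatestSketch {
  // None models a SQL null; the result is None only if every input is None.
  def greatest(values: Option[Int]*): Option[Int] = {
    require(values.length >= 2, "this function takes at least 2 parameters")
    val nonNull = values.collect { case Some(v) => v }
    if (nonNull.isEmpty) None else Some(nonNull.max)
  }
}
```

Here greatest(Some(1), None, Some(3)) gives Some(3), while greatest(None, None) gives None.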

def grouping(columnName: String): Column

Aggregate function: indicates whether a specified column in a GROUP BY list is aggregated or not, returns 1 for aggregated or 0 for not aggregated in the result set.

Since: 2.0.0

def grouping(e: Column): Column

Aggregate function: indicates whether a specified column in a GROUP BY list is aggregated or not, returns 1 for aggregated or 0 for not aggregated in the result set.

Since: 2.0.0

def grouping_id(colName: String, colNames: String*): Column

Aggregate function: returns the level of grouping, equal to

(grouping(c1) << (n-1)) + (grouping(c2) << (n-2)) + ... + grouping(cn)

Since: 2.0.0
Note: The list of columns should match the grouping columns exactly.

def grouping_id(cols: Column*): Column

Aggregate function: returns the level of grouping, equal to

(grouping(c1) << (n-1)) + (grouping(c2) << (n-2)) + ... + grouping(cn)

Since: 2.0.0
Note: The list of columns should match the grouping columns exactly, or be empty (meaning all the grouping columns).
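
The grouping level formula is a bit-packing of the individual grouping flags. The plain-Scala sketch below evaluates it for a given list of flags; GroupingIdSketch and groupingId are hypothetical names, and the flags are assumed to be already available as 0 or 1 values in column order.

```scala
object GroupingIdSketch {
  // `flags` holds grouping(c1), ..., grouping(cn), each 0 or 1.
  // The flag of c1 lands in the most significant bit, cn in the least.
  def groupingId(flags: Seq[Int]): Int = {
    val n = flags.length
    flags.zipWithIndex.map { case (g, i) => g << (n - 1 - i) }.sum
  }
}
```

For two grouping columns, the flag pairs (0, 0), (0, 1), (1, 0) and (1, 1) map to the levels 0, 1, 2 and 3.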

def hash(cols: Column*): Column

Calculates the hash code of given columns, and returns the result as an int column.

Annotations: @varargs()
Since: 2.0.0

def hashCode(): Int

Definition Classes: AnyRef → Any
Annotations: @native()

def hex(column: Column): Column

Computes hex value of the given column.

Since: 1.5.0

def histogram_numeric(e: Column, nBins: Column): Column

Aggregate function: computes a histogram on numeric e using nBins bins. The return value is an array of (x,y) pairs representing the centers of the histogram's bins. As the value of nBins is increased, the histogram approximation gets finer-grained, but may yield artifacts around outliers. In practice, 20-40 histogram bins appear to work well, with more bins being required for skewed or smaller datasets. Note that this function creates a histogram with non-uniform bin widths. It offers no guarantees in terms of the mean-squared-error of the histogram, but in practice is comparable to the histograms produced by the R/S-Plus statistical computing packages. Note: the output type of the 'x' field in the return value is propagated from the input value consumed in the aggregate function.

Since: 3.5.0

def hll_sketch_agg(columnName: String): Column

Aggregate function: returns the updatable binary representation of the Datasketches HllSketch configured with default lgConfigK value.

Since: 3.5.0

def hll_sketch_agg(e: Column): Column

Aggregate function: returns the updatable binary representation of the Datasketches HllSketch configured with default lgConfigK value.

Since: 3.5.0

def hll_sketch_agg(columnName: String, lgConfigK: Int): Column

Aggregate function: returns the updatable binary representation of the Datasketches HllSketch configured with lgConfigK arg.

Since: 3.5.0

def hll_sketch_agg(e: Column, lgConfigK: Int): Column

Aggregate function: returns the updatable binary representation of the Datasketches HllSketch configured with lgConfigK arg.

Since: 3.5.0

def hll_sketch_agg(e: Column, lgConfigK: Column): Column

Aggregate function: returns the updatable binary representation of the Datasketches HllSketch configured with lgConfigK arg.

Since: 3.5.0

def hll_sketch_estimate(columnName: String): Column

Returns the estimated number of unique values given the binary representation of a Datasketches HllSketch.

Since: 3.5.0

def hll_sketch_estimate(c: Column): Column

Returns the estimated number of unique values given the binary representation of a Datasketches HllSketch.

Since: 3.5.0

def hll_union(columnName1: String, columnName2: String, allowDifferentLgConfigK: Boolean): Column

Merges two binary representations of Datasketches HllSketch objects, using a Datasketches Union object. Throws an exception if sketches have different lgConfigK values and allowDifferentLgConfigK is set to false.

Since: 3.5.0

def hll_union(c1: Column, c2: Column, allowDifferentLgConfigK: Boolean): Column

Merges two binary representations of Datasketches HllSketch objects, using a Datasketches Union object. Throws an exception if sketches have different lgConfigK values and allowDifferentLgConfigK is set to false.

Since: 3.5.0

def hll_union(columnName1: String, columnName2: String): Column

Merges two binary representations of Datasketches HllSketch objects, using a Datasketches Union object. Throws an exception if sketches have different lgConfigK values.

Since: 3.5.0

def hll_union(c1: Column, c2: Column): Column

Merges two binary representations of Datasketches HllSketch objects, using a Datasketches Union object. Throws an exception if sketches have different lgConfigK values.

Since: 3.5.0

def hll_union_agg(columnName: String): Column

Aggregate function: returns the updatable binary representation of the Datasketches HllSketch, generated by merging previously created Datasketches HllSketch instances via a Datasketches Union instance. Throws an exception if sketches have different lgConfigK values.

Since: 3.5.0

def hll_union_agg(e: Column): Column

Aggregate function: returns the updatable binary representation of the Datasketches HllSketch, generated by merging previously created Datasketches HllSketch instances via a Datasketches Union instance. Throws an exception if sketches have different lgConfigK values.

Since: 3.5.0

def hll_union_agg(columnName: String, allowDifferentLgConfigK: Boolean): Column

Aggregate function: returns the updatable binary representation of the Datasketches HllSketch, generated by merging previously created Datasketches HllSketch instances via a Datasketches Union instance. Throws an exception if sketches have different lgConfigK values and allowDifferentLgConfigK is set to false.

Since: 3.5.0

def hll_union_agg(e: Column, allowDifferentLgConfigK: Boolean): Column

Aggregate function: returns the updatable binary representation of the Datasketches HllSketch, generated by merging previously created Datasketches HllSketch instances via a Datasketches Union instance. Throws an exception if sketches have different lgConfigK values and allowDifferentLgConfigK is set to false.

Since: 3.5.0

def hll_union_agg(e: Column, allowDifferentLgConfigK: Column): Column

Aggregate function: returns the updatable binary representation of the Datasketches HllSketch, generated by merging previously created Datasketches HllSketch instances via a Datasketches Union instance. Throws an exception if sketches have different lgConfigK values and allowDifferentLgConfigK is set to false.

Since: 3.5.0

def hour(e: Column): Column

Extracts the hours as an integer from a given date/timestamp/string.

returns: An integer, or null if the input was a string that could not be cast to a date

Since: 1.5.0

def hours(e: Column): Column

A transform for timestamps to partition data into hours.

Since: 3.0.0

def hypot(l: Double, rightName: String): Column

Computes sqrt(a² + b²) without intermediate overflow or underflow.

Since: 1.4.0

def hypot(l: Double, r: Column): Column

Computes sqrt(a² + b²) without intermediate overflow or underflow.

Since: 1.4.0

def hypot(leftName: String, r: Double): Column

Computes sqrt(a² + b²) without intermediate overflow or underflow.

Since: 1.4.0

def hypot(l: Column, r: Double): Column

Computes sqrt(a² + b²) without intermediate overflow or underflow.

Since: 1.4.0

def hypot(leftName: String, rightName: String): Column

Computes sqrt(a² + b²) without intermediate overflow or underflow.

Since: 1.4.0

def hypot(leftName: String, r: Column): Column

Computes sqrt(a² + b²) without intermediate overflow or underflow.

Since: 1.4.0

def hypot(l: Column, rightName: String): Column

Computes sqrt(a² + b²) without intermediate overflow or underflow.

Since: 1.4.0

def hypot(l: Column, r: Column): Column

Computes sqrt(a² + b²) without intermediate overflow or underflow.

Since: 1.4.0
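
The phrase "without intermediate overflow" is easiest to see by comparing a naive formula with java.lang.Math.hypot on large inputs. The sketch below is a plain-JVM illustration; HypotSketch, naive and safe are hypothetical names, and that Spark's result matches Math.hypot is an assumption here rather than something this page states.

```scala
object HypotSketch {
  // Squaring first can overflow to Infinity even when the true result is representable.
  def naive(a: Double, b: Double): Double = math.sqrt(a * a + b * b)

  // Math.hypot avoids the intermediate overflow.
  def safe(a: Double, b: Double): Double = java.lang.Math.hypot(a, b)
}
```

With a = 3e200 and b = 4e200, the naive version returns Infinity because a * a exceeds the Double range, while the safe version returns a value close to 5e200.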

def ifnull(col1: Column, col2: Column): Column

Returns col2 if col1 is null, or col1 otherwise.

Since: 3.5.0

def ilike(str: Column, pattern: Column): Column

Returns true if str matches pattern with escapeChar('\') case-insensitively, null if any arguments are null, false otherwise.

Since: 3.5.0

def ilike(str: Column, pattern: Column, escapeChar: Column): Column

Returns true if str matches pattern with escapeChar case-insensitively, null if any arguments are null, false otherwise.

Since: 3.5.0

def initcap(e: Column): Column

Returns a new string column by converting the first letter of each word to uppercase. Words are delimited by whitespace.

For example, "hello world" will become "Hello World".

Since: 1.5.0

def inline(e: Column): Column

Creates a new row for each element in the given array of structs.

Since: 3.4.0

def inline_outer(e: Column): Column

Creates a new row for each element in the given array of structs. Unlike inline, if the array is null or empty then null is produced for each nested column.

Since: 3.4.0

def input_file_block_length(): Column

Returns the length of the block being read, or -1 if not available.

Since: 3.5.0

def input_file_block_start(): Column

Returns the start offset of the block being read, or -1 if not available.

Since: 3.5.0

def input_file_name(): Column

Creates a string column for the file name of the current Spark task.

Since: 1.6.0

def instr(str: Column, substring: String): Column

Locates the position of the first occurrence of substring in the given string column. Returns null if either of the arguments is null.

Since: 1.5.0
Note: The position is 1-based, not zero-based. Returns 0 if substring could not be found in str.
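
The 1-based position convention can be modelled on plain strings by shifting the result of indexOf by one. This is a sketch of the documented behaviour for non-null inputs only; InstrSketch is a hypothetical name and null handling is deliberately left out.

```scala
object InstrSketch {
  // indexOf is 0-based and returns -1 when nothing is found,
  // so adding 1 gives a 1-based position and 0 for "not found".
  def instr(str: String, substring: String): Int =
    str.indexOf(substring) + 1
}
```

For the string "hello", searching for "l" gives 3, searching for "h" gives 1, and searching for "z" gives 0.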

final def isInstanceOf[T0]: Boolean

Definition Classes: Any

def isnan(e: Column): Column

Return true iff the column is NaN.

Since: 1.6.0

def isnotnull(col: Column): Column

Returns true if col is not null, or false otherwise.

Since: 3.5.0

def isnull(e: Column): Column

Return true iff the column is null.

Since: 1.6.0

def java_method(cols: Column*): Column

Calls a method with reflection.

Since: 3.5.0

def json_array_length(jsonArray: Column): Column

Returns the number of elements in the outermost JSON array. NULL is returned in case of any other valid JSON string, NULL or an invalid JSON.

Since: 3.5.0

def json_object_keys(json: Column): Column

Returns all the keys of the outermost JSON object as an array. If a valid JSON object is given, all the keys of the outermost object will be returned as an array. If it is any other valid JSON string, an invalid JSON string or an empty string, the function returns null.

Since: 3.5.0

def json_tuple(json: Column, fields: String*): Column

Creates a new row for a json column according to the given field names.

Annotations: @varargs()
Since: 1.6.0

def kurtosis(columnName: String): Column

Aggregate function: returns the kurtosis of the values in a group.

Since: 1.6.0

def kurtosis(e: Column): Column

Aggregate function: returns the kurtosis of the values in a group.

Since: 1.6.0

def lag(e: Column, offset: Int, defaultValue: Any, ignoreNulls: Boolean): Column

Window function: returns the value that is offset rows before the current row, and defaultValue if there are fewer than offset rows before the current row. ignoreNulls determines whether null values are included in or eliminated from the calculation. For example, an offset of one will return the previous row at any given point in the window partition.

This is equivalent to the LAG function in SQL.

Since: 3.2.0

def lag(e: Column, offset: Int, defaultValue: Any): Column

Window function: returns the value that is offset rows before the current row, and defaultValue if there are fewer than offset rows before the current row. For example, an offset of one will return the previous row at any given point in the window partition.

This is equivalent to the LAG function in SQL.

Since: 1.4.0

def lag(columnName: String, offset: Int, defaultValue: Any): Column

Window function: returns the value that is offset rows before the current row, and defaultValue if there are fewer than offset rows before the current row. For example, an offset of one will return the previous row at any given point in the window partition.

This is equivalent to the LAG function in SQL.

Since: 1.4.0

def lag(columnName: String, offset: Int): Column

Window function: returns the value that is offset rows before the current row, and null if there are fewer than offset rows before the current row. For example, an offset of one will return the previous row at any given point in the window partition.

This is equivalent to the LAG function in SQL.

Since: 1.4.0

def lag(e: Column, offset: Int): Column

Window function: returns the value that is offset rows before the current row, and null if there are fewer than offset rows before the current row. For example, an offset of one will return the previous row at any given point in the window partition.

This is equivalent to the LAG function in SQL.

Since: 1.4.0
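
The offset and default-value behaviour of lag can be modelled on an ordered in-memory sequence that stands in for the rows of one window partition. This is a plain-Scala sketch, not Spark's implementation: LagSketch is a hypothetical name, None models a SQL null, and ignoreNulls is not modelled.

```scala
object LagSketch {
  // For each row i, look `offset` rows back; fall back to `defaultValue`
  // (None by default) when that position lies outside the partition.
  def lag[A](rows: Seq[A], offset: Int, defaultValue: Option[A] = None): Seq[Option[A]] =
    rows.indices.map(i => rows.lift(i - offset).orElse(defaultValue))
}
```

For the rows 10, 20, 30 and an offset of 1, the result is None, Some(10), Some(20). With an offset of 2 and a default of 0, it is Some(0), Some(0), Some(10).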

def last(columnName: String): Column

Aggregate function: returns the last value of the column in a group.

By default, the function returns the last value it sees. It will return the last non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned.

Since: 1.3.0
Note: The function is non-deterministic because its result depends on the order of the rows, which may be non-deterministic after a shuffle.

def last(e: Column): Column

Aggregate function: returns the last value in a group.

By default, the function returns the last value it sees. It will return the last non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned.

Since: 1.3.0
Note: The function is non-deterministic because its result depends on the order of the rows, which may be non-deterministic after a shuffle.

def last(columnName: String, ignoreNulls: Boolean): Column

Aggregate function: returns the last value of the column in a group.

By default, the function returns the last value it sees. It will return the last non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned.

Since: 2.0.0
Note: The function is non-deterministic because its result depends on the order of the rows, which may be non-deterministic after a shuffle.

def last(e: Column, ignoreNulls: Boolean): Column

Aggregate function: returns the last value in a group.

By default, the function returns the last value it sees. It will return the last non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned.

Since: 2.0.0
Note: The function is non-deterministic because its result depends on the order of the rows, which may be non-deterministic after a shuffle.
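
The effect of ignoreNulls can be modelled on an in-memory sequence in which None stands for a SQL null. This is a plain-Scala sketch with hypothetical names (LastSketch, last). It assumes a fixed row order, which a real group does not guarantee after a shuffle.

```scala
object LastSketch {
  // With ignoreNulls = false, the last element is returned even if it is null (None).
  // With ignoreNulls = true, the last non-null element is returned, if any.
  def last[A](values: Seq[Option[A]], ignoreNulls: Boolean = false): Option[A] =
    if (ignoreNulls) values.collect { case Some(v) => v }.lastOption
    else values.lastOption.flatMap(x => x)
}
```

For the values 1, 2, null, the default call returns None, while ignoreNulls = true returns Some(2).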

def last_day(e: Column): Column

Returns the last day of the month which the given date belongs to. For example, input "2015-07-27" returns "2015-07-31" since July 31 is the last day of the month in July 2015.

e: A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS
returns: A date, or null if the input was a string that could not be cast to a date

Since: 1.5.0
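
The same calendar computation can be reproduced with java.time for a quick sanity check. This is a plain-JVM sketch, not Spark's implementation; LastDaySketch and lastDay are hypothetical names, and the input is assumed to be a valid yyyy-MM-dd string.

```scala
import java.time.LocalDate

object LastDaySketch {
  // Move the parsed date to the final day of its own month.
  def lastDay(date: String): LocalDate = {
    val d = LocalDate.parse(date)
    d.withDayOfMonth(d.lengthOfMonth())
  }
}
```

The input "2015-07-27" yields 2015-07-31, and a leap-year input such as "2016-02-10" yields 2016-02-29.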

def last_value(e: Column, ignoreNulls: Column): Column

Aggregate function: returns the last value in a group.

By default, the function returns the last value it sees. It will return the last non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned.

Since: 3.5.0
Note: The function is non-deterministic because its result depends on the order of the rows, which may be non-deterministic after a shuffle.

def last_value(e: Column): Column

Aggregate function: returns the last value in a group.

Since: 3.5.0
Note: The function is non-deterministic because its result depends on the order of the rows, which may be non-deterministic after a shuffle.

def lcase(str: Column): Column

Returns str with all characters changed to lowercase.

Since: 3.5.0

def lead(e: Column, offset: Int, defaultValue: Any, ignoreNulls: Boolean): Column

Window function: returns the value that is offset rows after the current row, and defaultValue if there are fewer than offset rows after the current row. ignoreNulls determines whether null values are included in or eliminated from the calculation. The default value of ignoreNulls is false. For example, an offset of one will return the next row at any given point in the window partition.

This is equivalent to the LEAD function in SQL.

Since: 3.2.0

def lead(e: Column, offset: Int, defaultValue: Any): Column

Window function: returns the value that is offset rows after the current row, and defaultValue if there are fewer than offset rows after the current row. For example, an offset of one will return the next row at any given point in the window partition.

This is equivalent to the LEAD function in SQL.

Since: 1.4.0

def lead(columnName: String, offset: Int, defaultValue: Any): Column

Window function: returns the value that is offset rows after the current row, and defaultValue if there are fewer than offset rows after the current row. For example, an offset of one will return the next row at any given point in the window partition.

This is equivalent to the LEAD function in SQL.

Since: 1.4.0

def lead(e: Column, offset: Int): Column

Window function: returns the value that is offset rows after the current row, and null if there are fewer than offset rows after the current row. For example, an offset of one will return the next row at any given point in the window partition.

This is equivalent to the LEAD function in SQL.

Since: 1.4.0

def lead(columnName: String, offset: Int): Column

Window function: returns the value that is offset rows after the current row, and null if there are fewer than offset rows after the current row. For example, an offset of one will return the next row at any given point in the window partition.

This is equivalent to the LEAD function in SQL.

Since: 1.4.0

def least(columnName: String, columnNames: String*): Column

Returns the least value of the list of column names, skipping null values. This function takes at least 2 parameters. It will return null iff all parameters are null.

Annotations: @varargs()
Since: 1.5.0

def least(exprs: Column*): Column

Returns the least value of the list of values, skipping null values. This function takes at least 2 parameters. It will return null iff all parameters are null.

Annotations: @varargs()
Since: 1.5.0

def left(str: Column, len: Column): Column

Returns the leftmost len characters from the string str (len can be of string type). If len is less than or equal to 0, the result is an empty string.

Since: 3.5.0

def len(e: Column): Column

Computes the character length of a given string or number of bytes of a binary string. The length of character strings includes trailing spaces. The length of binary strings includes binary zeros.

Since: 3.5.0

def length(e: Column): Column

Computes the character length of a given string or number of bytes of a binary string. The length of character strings includes trailing spaces. The length of binary strings includes binary zeros.

Since: 1.5.0

def levenshtein(l: Column, r: Column): Column

Computes the Levenshtein distance of the two given string columns.

Since: 1.5.0

def levenshtein(l: Column, r: Column, threshold: Int): Column

Computes the Levenshtein distance of the two given string columns if it's less than or equal to a given threshold.

returns: the resulting distance, or -1 if the distance exceeds threshold

Since: 3.5.0
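
The threshold variant can be modelled in plain Scala by computing the ordinary Levenshtein distance and then applying the cut-off. This is a textbook dynamic-programming sketch, not Spark's implementation; LevenshteinSketch is a hypothetical name, and mapping every distance above the threshold to -1 follows the description above.

```scala
object LevenshteinSketch {
  // Classic two-row dynamic programme: prev(j) is the distance between
  // the already processed prefix of `l` and the first j characters of `r`.
  def levenshtein(l: String, r: String): Int = {
    var prev = (0 to r.length).toArray
    for (i <- 1 to l.length) {
      val curr = new Array[Int](r.length + 1)
      curr(0) = i
      for (j <- 1 to r.length) {
        val cost = if (l(i - 1) == r(j - 1)) 0 else 1
        curr(j) = math.min(math.min(curr(j - 1) + 1, prev(j) + 1), prev(j - 1) + cost)
      }
      prev = curr
    }
    prev(r.length)
  }

  // Distance if it is less than or equal to `threshold`, otherwise -1.
  def levenshtein(l: String, r: String, threshold: Int): Int = {
    val d = levenshtein(l, r)
    if (d <= threshold) d else -1
  }
}
```

The distance between "kitten" and "sitting" is 3, so a threshold of 3 returns 3 and a threshold of 2 returns -1.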

def like(str: Column, pattern: Column): Column

Returns true if str matches pattern with escapeChar('\'), null if any arguments are null, false otherwise.

Since: 3.5.0

def like(str: Column, pattern: Column, escapeChar: Column): Column

Returns true if str matches pattern with escapeChar, null if any arguments are null, false otherwise.

Since: 3.5.0

def lit(literal: Any): Column

Creates a Column of literal value.

The passed-in object is returned directly if it is already a Column. If the object is a Scala Symbol, it is also converted into a Column. Otherwise, a new Column is created to represent the literal value.

Since: 1.3.0

def ln(e: Column): Column

Computes the natural logarithm of the given value.

Since: 3.5.0

def localtimestamp(): Column

Returns the current timestamp without time zone at the start of query evaluation as a timestamp without time zone column. All calls of localtimestamp within the same query return the same value.

Since: 3.3.0

def locate(substr: String, str: Column, pos: Int): Column

Locate the position of the first occurrence of substr in a string column, after position pos.

Since: 1.5.0
Note: The position is 1-based, not zero-based. Returns 0 if substr could not be found in str.

def locate(substr: String, str: Column): Column

Locate the position of the first occurrence of substr.

Since: 1.5.0
Note: The position is 1-based, not zero-based. Returns 0 if substr could not be found in str.

def log(base: Double, columnName: String): Column

Returns the first argument-base logarithm of the second argument.

Since: 1.4.0

def log(base: Double, a: Column): Column

Returns the first argument-base logarithm of the second argument.

Since: 1.4.0

def log(columnName: String): Column

Computes the natural logarithm of the given column.

Since: 1.4.0

def log(e: Column): Column

Computes the natural logarithm of the given value.

Since: 1.4.0

def log10(columnName: String): Column

Computes the logarithm of the given value in base 10.

Since: 1.4.0

def log10(e: Column): Column

Computes the logarithm of the given value in base 10.

Since: 1.4.0

def log1p(columnName: String): Column

Computes the natural logarithm of the given column plus one.

Since: 1.4.0

def log1p(e: Column): Column

Computes the natural logarithm of the given value plus one.

Since: 1.4.0
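
"The given value plus one" refers to the argument of the logarithm, that is ln(1 + x). A dedicated function for this is useful because adding 1 to a very small x in floating point loses x entirely. The sketch below shows the difference with java.lang.Math.log1p; Log1pSketch, naive and precise are hypothetical names, and that Spark's result matches Math.log1p is an assumption.

```scala
object Log1pSketch {
  // 1.0 + x rounds to exactly 1.0 for tiny x, so the logarithm collapses to 0.
  def naive(x: Double): Double = math.log(1.0 + x)

  // log1p evaluates ln(1 + x) without forming 1 + x first.
  def precise(x: Double): Double = java.lang.Math.log1p(x)
}
```

For x = 1e-20, the naive version returns 0.0, while the precise version returns a value close to 1e-20.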

def log2(columnName: String): Column

Computes the logarithm of the given value in base 2.

Since: 1.5.0

def log2(expr: Column): Column

Computes the logarithm of the given column in base 2.

Since: 1.5.0

def lower(e: Column): Column

Converts a string column to lower case.

Since: 1.3.0

def lpad(str: Column, len: Int, pad: Array[Byte]): Column

Left-pad the binary column with pad to a byte length of len. If the binary column is longer than len, the return value is shortened to len bytes.

Since: 3.3.0

def lpad(str: Column, len: Int, pad: String): Column

Left-pad the string column with pad to a length of len. If the string column is longer than len, the return value is shortened to len characters.

Since: 1.5.0
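
The padding and truncation rules for the string variant can be modelled on plain strings. This is a sketch of the documented behaviour, not Spark's implementation: LpadSketch is a hypothetical name, and its handling of an empty pad or a non-positive len is an assumption that the description does not cover.

```scala
object LpadSketch {
  def lpad(str: String, len: Int, pad: String): String =
    if (len <= 0) ""                                        // assumption: nothing to return
    else if (str.length >= len || pad.isEmpty) str.take(len) // shortened to len characters
    else {
      val missing = len - str.length
      // Repeat the pad often enough, then cut it to the exact missing width.
      val fill = (pad * (missing / pad.length + 1)).take(missing)
      fill + str
    }
}
```

Here lpad("7", 3, "0") gives "007", lpad("hi", 5, "ab") gives "abahi", and lpad("hello", 3, "x") gives "hel".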

def ltrim(e: Column, trimString: String): Column

Trim the specified character string from left end for the specified string column.

Since: 2.3.0

def ltrim(e: Column): Column

Trim the spaces from left end for the specified string value.

Since: 1.5.0

def make_date(year: Column, month: Column, day: Column): Column

returns: A date created from year, month and day fields.

Since: 3.3.0

def make_dt_interval(): Column

Make DayTimeIntervalType duration.

Since: 3.5.0

def make_dt_interval(days: Column): Column

Make DayTimeIntervalType duration from days.

Since: 3.5.0

def make_dt_interval(days: Column, hours: Column): Column

Make DayTimeIntervalType duration from days and hours.

Since: 3.5.0

def make_dt_interval(days: Column, hours: Column, mins: Column): Column

Make DayTimeIntervalType duration from days, hours and mins.

Since: 3.5.0

def make_dt_interval(days: Column, hours: Column, mins: Column, secs: Column): Column

Make DayTimeIntervalType duration from days, hours, mins and secs.

Since: 3.5.0

def make_interval(): Column

Make interval.

Since: 3.5.0

def make_interval(years: Column): Column

Make interval from years.

Since: 3.5.0

def make_interval(years: Column, months: Column): Column

Make interval from years and months.

Since: 3.5.0

def make_interval(years: Column, months: Column, weeks: Column): Column

Make interval from years, months and weeks.

Since: 3.5.0

def make_interval(years: Column, months: Column, weeks: Column, days: Column): Column

Make interval from years, months, weeks and days.

Since: 3.5.0

def make_interval(years: Column, months: Column, weeks: Column, days: Column, hours: Column): Column

Make interval from years, months, weeks, days and hours.

Since: 3.5.0

def make_interval(years: Column, months: Column, weeks: Column, days: Column, hours: Column, mins: Column): Column

Make interval from years, months, weeks, days, hours and mins.

Since: 3.5.0

def make_interval(years: Column, months: Column, weeks: Column, days: Column, hours: Column, mins: Column, secs: Column): Column

Make interval from years, months, weeks, days, hours, mins and secs.

Since: 3.5.0

def make_timestamp(years: Column, months: Column, days: Column, hours: Column, mins: Column, secs: Column): Column

Create timestamp from years, months, days, hours, mins and secs fields. The result data type is consistent with the value of configuration spark.sql.timestampType. If the configuration spark.sql.ansi.enabled is false, the function returns NULL on invalid inputs. Otherwise, it will throw an error instead.

Since: 3.5.0

def make_timestamp(years: Column, months: Column, days: Column, hours: Column, mins: Column, secs: Column, timezone: Column): Column

Create timestamp from years, months, days, hours, mins, secs and timezone fields. The result data type is consistent with the value of configuration spark.sql.timestampType. If the configuration spark.sql.ansi.enabled is false, the function returns NULL on invalid inputs. Otherwise, it will throw an error instead.

Since: 3.5.0

def make_timestamp_ltz(years: Column, months: Column, days: Column, hours: Column, mins: Column, secs: Column): Column

Create the current timestamp with local time zone from years, months, days, hours, mins and secs fields. If the configuration spark.sql.ansi.enabled is false, the function returns NULL on invalid inputs. Otherwise, it will throw an error instead.

Since: 3.5.0

def make_timestamp_ltz(years: Column, months: Column, days: Column, hours: Column, mins: Column, secs: Column, timezone: Column): Column

Create the current timestamp with local time zone from years, months, days, hours, mins, secs and timezone fields. If the configuration spark.sql.ansi.enabled is false, the function returns NULL on invalid inputs. Otherwise, it will throw an error instead.

Since: 3.5.0

def make_timestamp_ntz(years: Column, months: Column, days: Column, hours: Column, mins: Column, secs: Column): Column

Create local date-time from years, months, days, hours, mins, secs fields. If the configuration spark.sql.ansi.enabled is false, the function returns NULL on invalid inputs. Otherwise, it will throw an error instead.

Since: 3.5.0

def make_ym_interval(): Column

Make year-month interval.

Since: 3.5.0

def make_ym_interval(years: Column): Column

Make year-month interval from years.

Since: 3.5.0

def make_ym_interval(years: Column, months: Column): Column

Make year-month interval from years, months.

Since: 3.5.0

def map(cols: Column*): Column

Creates a new map column. The input columns must be grouped as key-value pairs, e.g. (key1, value1, key2, value2, ...). The key columns must all have the same data type, and can't be null. The value columns must all have the same data type.

Annotations: @varargs()
Since: 2.0

def map_concat(cols: Column*): Column

Returns the union of all the given maps.

Annotations: @varargs()
Since: 2.4.0

def map_contains_key(column: Column, key: Any): Column

Returns true if the map contains the key.

Since: 3.3.0

def map_entries(e: Column): Column

Returns an unordered array of all entries in the given map.

Since: 3.0.0

def map_filter(expr: Column, f: (Column, Column) ⇒ Column): Column

Returns a map whose key-value pairs satisfy a predicate.

df.select(map_filter(col("m"), (k, v) => k * 10 === v))

expr: the input map column
f: (key, value) => predicate, the Boolean predicate to filter the input map column

Since: 3.0.0

def map_from_arrays(keys: Column, values: Column): Column

Creates a new map column. The array in the first column is used for keys, and the array in the second column is used for values. No element of the key array may be null.

Since: 2.4

def map_from_entries(e: Column): Column

Returns a map created from the given array of entries.

Since: 2.4.0

def map_keys(e: Column): Column

Returns an unordered array containing the keys of the map.

Since: 2.3.0

def map_values(e: Column): Column

Returns an unordered array containing the values of the map.

Since: 2.3.0

def map_zip_with(left: Column, right: Column, f: (Column, Column, Column) ⇒ Column): Column

Merge two given maps, key-wise into a single map using a function.

df.select(map_zip_with(df("m1"), df("m2"), (k, v1, v2) => k === v1 + v2))

left: the left input map column
right: the right input map column
f: (key, value1, value2) => new_value, the lambda function to merge the map values
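
As a rough illustration of the key-wise merge, the following plain-Scala analogue works on ordinary Map values rather than Spark columns. The object and method names are hypothetical, and None stands in for the NULL that is passed when a key is present in only one of the two maps.

```scala
object MapZipWithSketch {
  // Plain-Scala analogue (not Spark code): merge two maps over the union of their keys.
  // A key missing from one side is passed to f as None, standing in for SQL NULL.
  def zipWith[K, A, B, C](left: Map[K, A], right: Map[K, B])(
      f: (K, Option[A], Option[B]) => C): Map[K, C] =
    (left.keySet ++ right.keySet)
      .map(k => k -> f(k, left.get(k), right.get(k)))
      .toMap
}
```

For instance, summing the values of Map("a" -> 1, "b" -> 2) and Map("b" -> 10, "c" -> 20), with a missing side treated as 0, yields one entry per key in the union of both key sets.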

Since: 3.0.0

def mask(input: Column, upperChar: Column, lowerChar: Column, digitChar: Column, otherChar: Column): Column

Masks the given string value. This can be useful for creating copies of tables with sensitive information removed.

input: string value to mask. Supported types: STRING, VARCHAR, CHAR
upperChar: character to replace upper-case characters with. Specify NULL to retain original character.
lowerChar: character to replace lower-case characters with. Specify NULL to retain original character.
digitChar: character to replace digit characters with. Specify NULL to retain original character.
otherChar: character to replace all other characters with. Specify NULL to retain original character.

Since: 3.5.0

def mask(input: Column, upperChar: Column, lowerChar: Column, digitChar: Column): Column

Masks the given string value. The function replaces upper-case, lower-case characters and numbers with the characters specified respectively. This can be useful for creating copies of tables with sensitive information removed.

input: string value to mask. Supported types: STRING, VARCHAR, CHAR
upperChar: character to replace upper-case characters with. Specify NULL to retain original character.
lowerChar: character to replace lower-case characters with. Specify NULL to retain original character.
digitChar: character to replace digit characters with. Specify NULL to retain original character.

Since: 3.5.0

def mask(input: Column, upperChar: Column, lowerChar: Column): Column

Masks the given string value. The function replaces upper-case and lower-case characters with the characters specified respectively, and numbers with 'n'. This can be useful for creating copies of tables with sensitive information removed.

input: string value to mask. Supported types: STRING, VARCHAR, CHAR
upperChar: character to replace upper-case characters with. Specify NULL to retain original character.
lowerChar: character to replace lower-case characters with. Specify NULL to retain original character.

Since: 3.5.0

def mask(input: Column, upperChar: Column): Column

Masks the given string value. The function replaces upper-case characters with the specified character, lower-case characters with 'x', and numbers with 'n'. This can be useful for creating copies of tables with sensitive information removed.

input: string value to mask. Supported types: STRING, VARCHAR, CHAR
upperChar: character to replace upper-case characters with. Specify NULL to retain original character.

Since: 3.5.0

def mask(input: Column): Column

Masks the given string value. The function replaces upper-case characters with 'X', lower-case characters with 'x', and numbers with 'n'. This can be useful for creating copies of tables with sensitive information removed.

input: string value to mask. Supported types: STRING, VARCHAR, CHAR
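
The replacement rules shared by the mask overloads can be sketched in plain Scala on an ordinary String. This is only an analogue under stated assumptions, not the Spark implementation: the names are hypothetical, None plays the role of NULL (retain the original character), and other characters are assumed to be retained when otherChar is not given.

```scala
object MaskSketch {
  // Plain-Scala analogue of the masking rules; None means "retain the original character".
  // Assumption: otherChar defaults to None, so punctuation and symbols are left untouched.
  def mask(
      input: String,
      upperChar: Option[Char] = Some('X'),
      lowerChar: Option[Char] = Some('x'),
      digitChar: Option[Char] = Some('n'),
      otherChar: Option[Char] = None): String =
    input.map { c =>
      if (c.isUpper) upperChar.getOrElse(c)
      else if (c.isLower) lowerChar.getOrElse(c)
      else if (c.isDigit) digitChar.getOrElse(c)
      else otherChar.getOrElse(c)
    }
}
```

With the defaults, MaskSketch.mask("AbCD123-@$#") gives "XxXXnnn-@$#"; passing digitChar = None keeps the digits as they are.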

Since: 3.5.0

def max(columnName: String): Column

Aggregate function: returns the maximum value of the column in a group.

Since: 1.3.0

def max(e: Column): Column

Aggregate function: returns the maximum value of the expression in a group.

Since: 1.3.0

def max_by(e: Column, ord: Column): Column

Aggregate function: returns the value associated with the maximum value of ord.

Since: 3.3.0

def md5(e: Column): Column

Calculates the MD5 digest of a binary column and returns the value as a 32 character hex string.

Since: 1.5.0

def mean(columnName: String): Column

Aggregate function: returns the average of the values in a group. Alias for avg.

Since: 1.4.0

def mean(e: Column): Column

Aggregate function: returns the average of the values in a group. Alias for avg.

Since: 1.4.0

def median(e: Column): Column

Aggregate function: returns the median of the values in a group.

Since: 3.4.0

def min(columnName: String): Column

Aggregate function: returns the minimum value of the column in a group.

Since: 1.3.0

def min(e: Column): Column

Aggregate function: returns the minimum value of the expression in a group.

Since: 1.3.0

def min_by(e: Column, ord: Column): Column

Aggregate function: returns the value associated with the minimum value of ord.

Since: 3.3.0

def minute(e: Column): Column

Extracts the minutes as an integer from a given date/timestamp/string.

returns: An integer, or null if the input was a string that could not be cast to a date

Since: 1.5.0

def mode(e: Column): Column

Aggregate function: returns the most frequent value in a group.

Since: 3.4.0

def monotonically_increasing_id(): Column

A column expression that generates monotonically increasing 64-bit integers.

The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has fewer than 1 billion partitions, and each partition has fewer than 8 billion records.

As an example, consider a DataFrame with two partitions, each with 3 records. This expression would return the following IDs:

0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594.
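
The bit layout described above can be reproduced with a small plain-Scala sketch. The object and method names are hypothetical; the code only mirrors the documented layout and is not the Spark implementation.

```scala
object MonotonicIdSketch {
  // Partition ID in the upper 31 bits, record number within the partition in the lower 33 bits.
  def id(partitionId: Int, recordNumber: Long): Long =
    (partitionId.toLong << 33) | recordNumber

  // IDs for a hypothetical layout of `partitions` partitions with `recordsPerPartition` records each.
  def ids(partitions: Int, recordsPerPartition: Int): Seq[Long] =
    for {
      p <- 0 until partitions
      r <- 0 until recordsPerPartition
    } yield id(p, r.toLong)
}
```

MonotonicIdSketch.ids(2, 3) produces the six IDs listed in the example: the second partition starts at 1L << 33.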

Since: 1.6.0

def month(e: Column): Column

Extracts the month as an integer from a given date/timestamp/string.

returns: An integer, or null if the input was a string that could not be cast to a date

Since: 1.5.0

def months(e: Column): Column

A transform for timestamps and dates to partition data into months.

Since: 3.0.0

def months_between(end: Column, start: Column, roundOff: Boolean): Column

Returns number of months between dates end and start. If roundOff is set to true, the result is rounded off to 8 digits; it is not rounded otherwise.

Since: 2.4.0

def months_between(end: Column, start: Column): Column

Returns number of months between dates end and start.

A whole number is returned if both inputs have the same day of month or both are the last day of their respective months. Otherwise, the difference is calculated assuming 31 days per month.

For example:

months_between("2017-11-14", "2017-07-14")  // returns 4.0
months_between("2017-01-01", "2017-01-10")  // returns -0.29032258
months_between("2017-06-01", "2017-06-16 12:00:00")  // returns -0.5

end: A date, timestamp or string. If a string, the data must be in a format that can be cast to a timestamp, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS
start: A date, timestamp or string. If a string, the data must be in a format that can be cast to a timestamp, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS
returns: A double, or null if either end or start were strings that could not be cast to a timestamp. Negative if end is before start
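
The 31-days-per-month rule can be sketched in plain Scala with java.time. This is an illustrative approximation of the documented behaviour, not the Spark implementation: the names are hypothetical, inputs are assumed to be already parsed into LocalDateTime, and rounding of the result is omitted.

```scala
import java.time.{LocalDateTime, YearMonth}

object MonthsBetweenSketch {
  private def isLastDayOfMonth(t: LocalDateTime): Boolean =
    t.getDayOfMonth == YearMonth.from(t).lengthOfMonth

  def monthsBetween(end: LocalDateTime, start: LocalDateTime): Double = {
    // Difference in calendar months, ignoring the day and time parts.
    val wholeMonths =
      (end.getYear - start.getYear) * 12 + (end.getMonthValue - start.getMonthValue)
    val sameDay = end.getDayOfMonth == start.getDayOfMonth
    if (sameDay || (isLastDayOfMonth(end) && isLastDayOfMonth(start))) {
      // Same day of month, or both on the last day of their months: whole number.
      wholeMonths.toDouble
    } else {
      // Otherwise the leftover days (including the time of day) count as a fraction of 31 days.
      val secondsPerDay = 86400.0
      val dayDiff = (end.getDayOfMonth - start.getDayOfMonth) +
        (end.toLocalTime.toSecondOfDay - start.toLocalTime.toSecondOfDay) / secondsPerDay
      wholeMonths + dayDiff / 31.0
    }
  }
}
```

With end 2017-11-14 and start 2017-07-14 the sketch returns 4.0, and with end 2017-06-01 and start 2017-06-16 12:00:00 it returns -0.5, since the 15.5-day gap is half of 31 days and end precedes start.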

Since: 1.5.0

def named_struct(cols: Column*): Column

Creates a struct with the given field names and values.

Since: 3.5.0

def nanvl(col1: Column, col2: Column): Column

Returns col1 if it is not NaN, or col2 if col1 is NaN.

Both inputs should be floating point columns (DoubleType or FloatType).

Since: 1.5.0

final def ne(arg0: AnyRef): Boolean

Definition Classes: AnyRef

def negate(e: Column): Column

Unary minus, i.e. negate the expression.

// Select the amount column and negate all values.
// Scala:
df.select( -df("amount") )

// Java:
df.select( negate(df.col("amount")) );

Since: 1.3.0

def negative(e: Column): Column

Returns the negated value.

Since: 3.5.0

def next_day(date: Column, dayOfWeek: Column): Column

Returns the first date which is later than the value of the date column that is on the specified day of the week.

For example, next_day('2015-07-27', "Sunday") returns 2015-08-02 because that is the first Sunday after 2015-07-27.

date: A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS
dayOfWeek: A column of the day of week. Case insensitive, and accepts: "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"
returns: A date, or null if date was a string that could not be cast to a date or if dayOfWeek was an invalid value

Since: 3.2.0

def next_day(date: Column, dayOfWeek: String): Column

Returns the first date which is later than the value of the date column that is on the specified day of the week.

For example, next_day('2015-07-27', "Sunday") returns 2015-08-02 because that is the first Sunday after 2015-07-27.

date: A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS
dayOfWeek: Case insensitive, and accepts: "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"
returns: A date, or null if date was a string that could not be cast to a date or if dayOfWeek was an invalid value
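
For comparison, the same "first matching day strictly after the date" rule can be expressed in plain Scala with java.time. This is an analogue on LocalDate values, not Spark code, and the object and method names are hypothetical.

```scala
import java.time.{DayOfWeek, LocalDate}
import java.time.temporal.TemporalAdjusters

object NextDaySketch {
  // First date strictly after `date` that falls on `dayOfWeek`.
  def nextDay(date: LocalDate, dayOfWeek: DayOfWeek): LocalDate =
    date.`with`(TemporalAdjusters.next(dayOfWeek))
}
```

NextDaySketch.nextDay(LocalDate.parse("2015-07-27"), DayOfWeek.SUNDAY) gives 2015-08-02, matching the example above. A date that already falls on the requested day moves forward a full week.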

Since: 1.5.0

def not(e: Column): Column

Inversion of boolean expression, i.e. NOT.

// Scala: select rows that are not active (isActive === false)
df.filter( !df("isActive") )

// Java:
df.filter( not(df.col("isActive")) );

Since: 1.3.0

final def notify(): Unit

Definition Classes: AnyRef
Annotations: @native()

final def notifyAll(): Unit

Definition Classes: AnyRef
Annotations: @native()

def now(): Column

Returns the current timestamp at the start of query evaluation.

Since: 3.5.0

def nth_value(e: Column, offset: Int): Column

Window function: returns the value in the offset-th row of the window frame (counting from 1), or null if the window frame has fewer than offset rows.

This is equivalent to the nth_value function in SQL.

Since: 3.1.0

def nth_value(e: Column, offset: Int, ignoreNulls: Boolean): Column

Window function: returns the value in the offset-th row of the window frame (counting from 1), or null if the window frame has fewer than offset rows.

When ignoreNulls is set to true, it returns the offset-th non-null value it sees. If all values are null, then null is returned.

This is equivalent to the nth_value function in SQL.
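
The effect of ignoreNulls can be sketched on an ordinary Scala sequence that represents one window frame. This is a plain-Scala analogue with hypothetical names, where None stands in for NULL.

```scala
object NthValueSketch {
  // frame: the values of one window frame in frame order; None stands in for SQL NULL.
  def nthValue[A](frame: Seq[Option[A]], offset: Int, ignoreNulls: Boolean = false): Option[A] = {
    require(offset >= 1, "offset counts from 1")
    if (ignoreNulls) frame.flatten.lift(offset - 1) // skip nulls, then take the offset-th value
    else frame.lift(offset - 1).flatten             // take the offset-th row as it is
  }
}
```

For the frame Seq(None, Some(1), Some(2)), an offset of 2 gives Some(1) by default and Some(2) with ignoreNulls = true; an offset of 4 gives None in both cases because the frame is too short.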

Since: 3.1.0

def ntile(n: Int): Column

Window function: returns the ntile group id (from 1 to n inclusive) in an ordered window partition. For example, if n is 4, the first quarter of the rows will get value 1, the second quarter will get 2, the third quarter will get 3, and the last quarter will get 4.

This is equivalent to the NTILE function in SQL.
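
The bucket assignment can be sketched in plain Scala for a partition of rowCount ordered rows. The names are hypothetical, and the handling of row counts that are not divisible by n (earlier buckets receive one extra row) is an assumption based on the usual SQL NTILE behaviour, not something stated in this entry.

```scala
object NtileSketch {
  // Bucket number (1 to n) for each of rowCount ordered rows.
  // Assumption: when rowCount is not divisible by n, the first (rowCount % n) buckets get one extra row.
  def ntile(rowCount: Int, n: Int): Seq[Int] = {
    require(n > 0, "n must be positive")
    val base = rowCount / n
    val extra = rowCount % n
    (1 to n).flatMap { bucket =>
      val size = if (bucket <= extra) base + 1 else base
      Seq.fill(size)(bucket)
    }
  }
}
```

For 8 rows and n = 4, each quarter of the rows gets its own bucket number: 1, 1, 2, 2, 3, 3, 4, 4.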

Since: 1.4.0

def nullif(col1: Column, col2: Column): Column

Returns null if col1 equals col2, or col1 otherwise.

Since: 3.5.0

def nvl(col1: Column, col2: Column): Column

Returns col2 if col1 is null, or col1 otherwise.

Since: 3.5.0

def nvl2(col1: Column, col2: Column, col3: Column): Column

Returns col2 if col1 is not null, or col3 otherwise.
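
The null-handling rules of nullif, nvl and nvl2 can be compared side by side in a plain-Scala analogue that uses Option in place of nullable columns. The object and method names are hypothetical and this is not Spark code.

```scala
object NullHandlingSketch {
  // None stands in for SQL NULL throughout.

  // null if col1 equals col2, col1 otherwise (a null col1 stays null).
  def nullif[A](col1: Option[A], col2: Option[A]): Option[A] =
    if (col1.isDefined && col1 == col2) None else col1

  // col2 if col1 is null, col1 otherwise.
  def nvl[A](col1: Option[A], col2: Option[A]): Option[A] =
    col1.orElse(col2)

  // col2 if col1 is not null, col3 otherwise.
  def nvl2[A, B](col1: Option[A], col2: Option[B], col3: Option[B]): Option[B] =
    if (col1.isDefined) col2 else col3
}
```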

Since: 3.5.0

def octet_length(e: Column): Column

Calculates the byte length for the specified string column.

Since: 3.3.0

def overlay(src: Column, replace: Column, pos: Column): Column

Overlay the specified portion of src with replace, starting from byte position pos of src.

Since: 3.0.0

def overlay(src: Column, replace: Column, pos: Column, len: Column): Column

Overlay the specified portion of src with replace, starting from byte position pos of src and proceeding for len bytes.
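
A minimal plain-Scala sketch of the overlay rule on ordinary strings follows. It is an analogue with hypothetical names, and it makes two assumptions not stated in the entries above: positions and lengths are counted in characters rather than bytes, and when len is not given it defaults to the length of replace.

```scala
object OverlaySketch {
  // pos is 1-based. A negative len means "not given"; it then defaults to replace.length (assumption).
  def overlay(src: String, replace: String, pos: Int, len: Int = -1): String = {
    val n = if (len < 0) replace.length else len
    // Keep everything before pos, insert replace, then resume n characters after pos.
    src.take(pos - 1) + replace + src.drop(pos - 1 + n)
  }
}
```

For example, overlaying "_" at position 6 of "Spark SQL" gives "Spark_SQL", while overlaying "ANSI " at position 7 with len = 0 inserts without replacing and gives "Spark ANSI SQL".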

Since: 3.0.0

def parse_url(url: Column, partToExtract: Column): Column

Extracts a part from a URL.

Since: 3.5.0

def parse_url(url: Column, partToExtract: Column, key: Column): Column

Extracts a part from a URL.

Since: 3.5.0

def percent_rank(): Column

Window function: returns the relative rank (i.e. percentile) of rows within a window partition.

This is computed by:

(rank of row in its partition - 1) / (number of rows in the partition - 1)

This is equivalent to the PERCENT_RANK function in SQL.
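
The formula can be checked with a small plain-Scala sketch over one already-sorted partition. The names are hypothetical; ties are given the rank of their first occurrence, and a single-row partition is assumed to yield 0.0 to avoid dividing by zero.

```scala
object PercentRankSketch {
  // sortedValues: the ordering values of one partition, already sorted ascending.
  def percentRank(sortedValues: Seq[Int]): Seq[Double] = {
    val n = sortedValues.length
    sortedValues.map { v =>
      // Ties share the rank of their first occurrence (rank counts from 1).
      val rank = sortedValues.indexOf(v) + 1
      // Assumption: a partition with a single row gets 0.0.
      if (n > 1) (rank - 1).toDouble / (n - 1) else 0.0
    }
  }
}
```

For the values 10, 20, 20, 30 the ranks are 1, 2, 2, 4, so the results are 0, 1/3, 1/3 and 1.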

Since: 1.6.0

def percentile(e: Column, percentage: Column, frequency: Column): Column

Aggregate function: returns the exact percentile(s) of the numeric column e at the given percentage(s), each of which must be in the range [0.0, 1.0].

Since: 3.5.0

def percentile(e: Column, percentage: Column): Column

Aggregate function: returns the exact percentile(s) of the numeric column e at the given percentage(s), each of which must be in the range [0.0, 1.0].

Since: 3.5.0

def percentile_approx(e: Column, percentage: Column, accuracy: Column): Column

Aggregate function: returns the approximate percentile of the numeric column e, which is the smallest value in the ordered values of e (sorted from least to greatest) such that no more than percentage of the values is less than or equal to that value.

If percentage is an array, each value must be between 0.0 and 1.0. If it is a single floating point value, it must be between 0.0 and 1.0.

The accuracy parameter is a positive numeric literal which controls approximation accuracy at the cost of memory. A higher value of accuracy yields better accuracy; 1.0/accuracy is the relative error of the approximation.

Since: 3.1.0

def pi(): Column

Returns Pi.

Since: 3.5.0

def pmod(dividend: Column, divisor: Column): Column

Returns the positive value of dividend mod divisor.
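
The difference from the plain % operator shows up with a negative dividend. The following is a minimal sketch of the usual positive-modulus arithmetic for Int values, with hypothetical names; it is not the Spark implementation and is meant for a positive divisor.

```scala
object PmodSketch {
  // For a positive divisor the result is always in the range [0, divisor).
  def pmod(dividend: Int, divisor: Int): Int = {
    val r = dividend % divisor // may be negative when dividend is negative
    if (r < 0) (r + divisor) % divisor else r
  }
}
```

For example, -7 % 3 is -1 in Scala, while PmodSketch.pmod(-7, 3) is 2.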

Since: 1.5.0

def posexplode(e: Column): Column

Creates a new row for each element with position in the given array or map column. Uses the default column name pos for position, and col for elements in the array and key and value for elements in the map unless specified otherwise.

Since: 2.1.0

def posexplode_outer(e: Column): Column

Creates a new row for each element with position in the given array or map column. Uses the default column name pos for position, and col for elements in the array and key and value for elements in the map unless specified otherwise. Unlike posexplode, if the array/map is null or empty then the row (null, null) is produced.

Since: 2.2.0

def position(substr: Column, str: Column): Column

Returns the position of the first occurrence of substr in str after position 1. The return value is 1-based.

Since: 3.5.0

def position(substr: Column, str: Column, start: Column): Column

Returns the position of the first occurrence of substr in str after position start. The given start and return value are 1-based.
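
The 1-based convention can be illustrated with a plain-Scala analogue on ordinary strings. The names are hypothetical, and returning 0 when substr is not found or when start is below 1 is an assumption that follows the usual SQL convention rather than this entry.

```scala
object PositionSketch {
  // start and the result are 1-based; 0 means "not found" (assumption).
  def position(substr: String, str: String, start: Int = 1): Int =
    if (start < 1) 0 else str.indexOf(substr, start - 1) + 1
}
```

PositionSketch.position("bar", "foobarbar") gives 4, and with start = 5 the search skips the first match and gives 7.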

Since: 3.5.0

def positive(e: Column): Column

Returns the value.

Since: 3.5.0

def pow(l: Double, rightName: String): Column