org.apache.spark.sql

object functions

Commonly used functions available for DataFrame operations. Using functions defined here provides a little bit more compile-time safety to make sure the function exists.

Spark also includes more built-in functions that are less common and are not defined here. You can still access them (and all the functions defined here) using the functions.expr() API and calling them through a SQL expression string. You can find the entire list of functions in the SQL API documentation for your Spark version; see also the latest list.

As an example, isnan is a function that is defined here. You can use isnan(col("myCol")) to invoke the isnan function. This way, the programming language's compiler ensures that isnan exists and is of the proper form. You can also use expr("isnan(myCol)") to invoke the same function. In this case, Spark itself ensures that isnan exists when it analyzes the query.

regr_count is an example of a function that is built-in but not defined here, because it is less commonly used. To invoke it, use expr("regr_count(yCol, xCol)").
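
The two invocation styles described above can be compared side by side. The following is a minimal sketch, assuming a local SparkSession and a hypothetical DataFrame with Double columns named yCol and xCol (the session settings, column names and data are illustrative, not part of the API):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, expr, isnan}

// Assumption: a throwaway local session; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("functions-expr-sketch").getOrCreate()
import spark.implicits._

// Hypothetical data: two Double columns named yCol and xCol.
val df = Seq((1.0, 2.0), (Double.NaN, 3.0), (4.0, 5.0)).toDF("yCol", "xCol")

// Typed API: the Scala compiler checks that isnan exists and takes a Column.
val viaFunction = df.select(isnan(col("yCol")).as("is_nan"))

// SQL string: Spark resolves isnan only when it analyzes the query.
val viaExpr = df.select(expr("isnan(yCol)").as("is_nan"))

// regr_count is not defined on this object, so it is reached through expr.
val pairs = df.select(expr("regr_count(yCol, xCol)").as("n"))
```

A misspelled name such as isnann fails at compile time in the first form, but only at query analysis time in the second.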

These function APIs usually provide only methods with a Column signature, because a Column-based signature can support not only columns but also other types, such as a native string. The other variants currently exist for historical reasons.

Annotations
@Stable()
Source
functions.scala
Since

1.3.0

Linear Supertypes
AnyRef, Any

Value Members

  1. final def !=(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  2. final def ##(): Int
    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  4. def abs(e: Column): Column

    Computes the absolute value of a numeric value.

    Since

    1.3.0

  5. def acos(columnName: String): Column

    returns

    inverse cosine of columnName in radians, as if computed by java.lang.Math.acos

    Since

    1.4.0

  6. def acos(e: Column): Column

    returns

    inverse cosine of e in radians, as if computed by java.lang.Math.acos

    Since

    1.4.0

  7. def acosh(columnName: String): Column

    returns

    inverse hyperbolic cosine of columnName

    Since

    3.1.0

  8. def acosh(e: Column): Column

    returns

    inverse hyperbolic cosine of e

    Since

    3.1.0

  9. def add_months(startDate: Column, numMonths: Column): Column

    Returns the date that is numMonths after startDate.

    startDate

    A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS

    numMonths

    A column of the number of months to add to startDate, can be negative to subtract months

    returns

    A date, or null if startDate was a string that could not be cast to a date

    Since

    3.0.0

  10. def add_months(startDate: Column, numMonths: Int): Column

    Returns the date that is numMonths after startDate.

    startDate

    A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS

    numMonths

    The number of months to add to startDate, can be negative to subtract months

    returns

    A date, or null if startDate was a string that could not be cast to a date

    Since

    1.5.0
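
    The two add_months overloads can be sketched as follows. This is a minimal illustration, assuming a local SparkSession and hypothetical columns start (a yyyy-MM-dd string) and n (an Int offset):

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{add_months, col}

    // Assumption: a throwaway local session; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder().master("local[1]").appName("add-months-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical input: date strings plus a per-row month offset.
    val df = Seq(("2015-01-31", 1), ("2015-03-15", -2)).toDF("start", "n")

    val shifted = df.select(
      add_months(col("start"), 1).as("plus_one"),      // Int overload: same offset for every row
      add_months(col("start"), col("n")).as("plus_n")  // Column overload: offset read from column n
    )
    val rows = shifted.collect()
    ```

    The strings are cast to dates before the months are added, and a negative offset moves the date backwards.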

  11. def aggregate(expr: Column, initialValue: Column, merge: (Column, Column) ⇒ Column): Column

    Applies a binary operator to an initial state and all elements in the array, and reduces this to a single state.

    df.select(aggregate(col("i"), lit(0), (acc, x) => acc + x))

    expr

    the input array column

    initialValue

    the initial value

    merge

    (combined_value, input_value) => combined_value, the merge function to merge an input value to the combined_value

    Since

    3.0.0

  12. def aggregate(expr: Column, initialValue: Column, merge: (Column, Column) ⇒ Column, finish: (Column) ⇒ Column): Column

    Applies a binary operator to an initial state and all elements in the array, and reduces this to a single state. The final state is converted into the final result by applying a finish function.

    df.select(aggregate(col("i"), lit(0), (acc, x) => acc + x, _ * 10))

    expr

    the input array column

    initialValue

    the initial value

    merge

    (combined_value, input_value) => combined_value, the merge function to merge an input value to the combined_value

    finish

    combined_value => final_value, the lambda function to convert the combined value of all inputs to final result

    Since

    3.0.0
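
    Both aggregate overloads can be tried on a small array column. The following is a minimal sketch, assuming a local SparkSession and a hypothetical array<int> column named i:

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{aggregate, col, lit}

    // Assumption: a throwaway local session; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder().master("local[1]").appName("aggregate-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical input: one integer array per row.
    val df = Seq(Seq(1, 2, 3), Seq(4, 5)).toDF("i")

    val sums = df.select(
      // merge only: fold the array into a running sum, starting from 0
      aggregate(col("i"), lit(0), (acc, x) => acc + x).as("sum"),
      // merge + finish: same fold, then multiply the final state by 10
      aggregate(col("i"), lit(0), (acc, x) => acc + x, acc => acc * 10).as("sum_times_10")
    )
    val rows = sums.collect().map(r => (r.getInt(0), r.getInt(1))).toSeq
    // rows == Seq((6, 60), (9, 90))
    ```

    In this sketch the initial value lit(0) has the same type as the array elements, so the merge result type matches the state type.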

  13. def approx_count_distinct(columnName: String, rsd: Double): Column

    Aggregate function: returns the approximate number of distinct items in a group.

    rsd

    maximum relative standard deviation allowed (default = 0.05)

    Since

    2.1.0

  14. def approx_count_distinct(e: Column, rsd: Double): Column

    Aggregate function: returns the approximate number of distinct items in a group.

    rsd

    maximum relative standard deviation allowed (default = 0.05)

    Since

    2.1.0

  15. def approx_count_distinct(columnName: String): Column

    Aggregate function: returns the approximate number of distinct items in a group.

    Since

    2.1.0

  16. def approx_count_distinct(e: Column): Column

    Aggregate function: returns the approximate number of distinct items in a group.

    Since

    2.1.0

  17. def array(colName: String, colNames: String*): Column

    Creates a new array column. The input columns must all have the same data type.

    Annotations
    @varargs()
    Since

    1.4.0

  18. def array(cols: Column*): Column

    Creates a new array column. The input columns must all have the same data type.

    Annotations
    @varargs()
    Since

    1.4.0

  19. def array_append(column: Column, element: Any): Column

    Returns an ARRAY containing all elements from the source ARRAY as well as the new element. The new element/column is located at the end of the ARRAY.

    Since

    3.4.0

  20. def array_compact(column: Column): Column

    Removes all null elements from the given array.

    Since

    3.4.0

  21. def array_contains(column: Column, value: Any): Column

    Returns null if the array is null, true if the array contains value, and false otherwise.

    Since

    1.5.0

  22. def array_distinct(e: Column): Column

    Removes duplicate values from the array.

    Since

    2.4.0

  23. def array_except(col1: Column, col2: Column): Column

    Returns an array of the elements in the first array but not in the second array, without duplicates. The order of elements in the result is not determined.

    Since

    2.4.0

  24. def array_insert(arr: Column, pos: Column, value: Column): Column

    Adds an item into a given array at a specified position.

    Since

    3.4.0

  25. def array_intersect(col1: Column, col2: Column): Column

    Returns an array of the elements in the intersection of the given two arrays, without duplicates.

    Since

    2.4.0

  26. def array_join(column: Column, delimiter: String): Column

    Concatenates the elements of column using the delimiter.

    Since

    2.4.0

  27. def array_join(column: Column, delimiter: String, nullReplacement: String): Column

    Concatenates the elements of column using the delimiter. Null values are replaced with nullReplacement.

    Since

    2.4.0

  28. def array_max(e: Column): Column

    Returns the maximum value in the array. NaN is greater than any non-NaN element for double/float type. NULL elements are skipped.

    Since

    2.4.0

  29. def array_min(e: Column): Column

    Returns the minimum value in the array. NaN is greater than any non-NaN element for double/float type. NULL elements are skipped.

    Since

    2.4.0

  30. def array_position(column: Column, value: Any): Column

    Locates the position of the first occurrence of the value in the given array and returns it as a long. Returns null if either of the arguments is null.

    Since

    2.4.0

    Note

    The position is 1-based, not 0-based. Returns 0 if the value is not found in the array.
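
    The indexing convention in the note can be checked with a short sketch, assuming a local SparkSession and a hypothetical string array column named letters:

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{array_position, col}

    // Assumption: a throwaway local session; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder().master("local[1]").appName("array-position-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical input: "b" occurs at the 2nd and 4th positions.
    val df = Seq(Seq("a", "b", "c", "b")).toDF("letters")

    val row = df.select(
      array_position(col("letters"), "b").as("pos_b"),  // first occurrence, counted from 1
      array_position(col("letters"), "z").as("pos_z")   // value not present
    ).head()

    val posB = row.getLong(0)  // 2
    val posZ = row.getLong(1)  // 0
    ```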

  31. def array_remove(column: Column, element: Any): Column

    Removes all elements equal to element from the given array.

    Since

    2.4.0

  32. def array_repeat(e: Column, count: Int): Column

    Creates an array containing the left argument repeated the number of times given by the right argument.

    Since

    2.4.0

  33. def array_repeat(left: Column, right: Column): Column

    Creates an array containing the left argument repeated the number of times given by the right argument.

    Since

    2.4.0

  34. def array_sort(e: Column, comparator: (Column, Column) ⇒ Column): Column

    Sorts the input array based on the given comparator function. The comparator will take two arguments representing two elements of the array. It returns a negative integer, 0, or a positive integer as the first element is less than, equal to, or greater than the second element. If the comparator function returns null, the function will fail and raise an error.

    Since

    3.4.0

  35. def array_sort(e: Column): Column

    Sorts the input array in ascending order. The elements of the input array must be orderable. NaN is greater than any non-NaN element for double/float type. Null elements will be placed at the end of the returned array.

    Since

    2.4.0
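
    The two array_sort overloads can be contrasted in a short sketch. It assumes a local SparkSession and a hypothetical integer array column named xs; the comparator shown (built with when/otherwise) is one possible way to express descending order, not the only one:

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{array_sort, col, when}

    // Assumption: a throwaway local session; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder().master("local[1]").appName("array-sort-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical input: a single unsorted integer array.
    val df = Seq(Seq(3, 1, 2)).toDF("xs")

    val row = df.select(
      // Default ordering: ascending.
      array_sort(col("xs")).as("asc"),
      // Comparator returns a positive, negative or zero integer; the signs are
      // flipped relative to natural order, which yields a descending sort.
      array_sort(col("xs"), (a, b) => when(a < b, 1).when(a > b, -1).otherwise(0)).as("desc")
    ).head()

    val ascending = row.getSeq[Int](0)   // Seq(1, 2, 3)
    val descending = row.getSeq[Int](1)  // Seq(3, 2, 1)
    ```

    The otherwise(0) branch keeps the comparator from returning null, which the description above says would raise an error.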

  36. def array_union(col1: Column, col2: Column): Column

    Returns an array of the elements in the union of the given two arrays, without duplicates.

    Since

    2.4.0

  37. def arrays_overlap(a1: Column, a2: Column): Column

    Returns true if a1 and a2 have at least one non-null element in common. Otherwise, if both arrays are non-empty and either of them contains a null, it returns null. In all remaining cases it returns false.

    Since

    2.4.0
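
    The three possible outcomes of arrays_overlap can be sketched with literal arrays. This is a minimal illustration, assuming a local SparkSession; the arrays are built inline with array and lit rather than read from data:

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{array, arrays_overlap, lit}

    // Assumption: a throwaway local session; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder().master("local[1]").appName("arrays-overlap-sketch").getOrCreate()

    val row = spark.range(1).select(
      // A shared non-null element (2): true.
      arrays_overlap(array(lit(1), lit(2)), array(lit(2), lit(3))).as("common"),
      // Nothing shared, both non-empty, one array contains a null: null.
      arrays_overlap(array(lit(1), lit(null).cast("int")), array(lit(3))).as("null_involved"),
      // Nothing shared and no nulls: false.
      arrays_overlap(array(lit(1)), array(lit(3))).as("disjoint")
    ).head()

    val common = row.getBoolean(0)
    val nullInvolved = row.isNullAt(1)
    val disjoint = row.getBoolean(2)
    ```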

  38. def arrays_zip(e: Column*): Column

    Returns a merged array of structs in which the N-th struct contains all N-th values of input arrays.

    Annotations
    @varargs()
    Since

    2.4.0

  39. final def asInstanceOf[T0]: T0
    Definition Classes
    Any
  40. def asc(columnName: String): Column

    Returns a sort expression based on ascending order of the column.

    df.sort(asc("dept"), desc("age"))

    Since

    1.3.0

  41. def asc_nulls_first(columnName: String): Column

    Returns a sort expression based on ascending order of the column, and null values appear before non-null values.

    df.sort(asc_nulls_first("dept"), desc("age"))

    Since

    2.1.0

  42. def asc_nulls_last(columnName: String): Column

    Returns a sort expression based on ascending order of the column, and null values appear after non-null values.

    df.sort(asc_nulls_last("dept"), desc("age"))

    Since

    2.1.0

  43. def ascii(e: Column): Column

    Computes the numeric value of the first character of the string column, and returns the result as an int column.

    Since

    1.5.0

  44. def asin(columnName: String): Column

    returns

    inverse sine of columnName in radians, as if computed by java.lang.Math.asin

    Since

    1.4.0

  45. def asin(e: Column): Column

    returns

    inverse sine of e in radians, as if computed by java.lang.Math.asin

    Since

    1.4.0

  46. def asinh(columnName: String): Column

    returns

    inverse hyperbolic sine of columnName

    Since

    3.1.0

  47. def asinh(e: Column): Column

    returns

    inverse hyperbolic sine of e

    Since

    3.1.0

  48. def assert_true(c: Column, e: Column): Column

    Returns null if the condition is true; throws an exception with the error message otherwise.

    Since

    3.1.0

  49. def assert_true(c: Column): Column

    Returns null if the condition is true, and throws an exception otherwise.

    Since

    3.1.0

  50. def atan(columnName: String): Column

    returns

    inverse tangent of columnName in radians, as if computed by java.lang.Math.atan

    Since

    1.4.0

  51. def atan(e: Column): Column

    returns

    inverse tangent of e in radians, as if computed by java.lang.Math.atan

    Since

    1.4.0

  52. def atan2(yValue: Double, xName: String): Column

    yValue

    coordinate on y-axis

    xName

    coordinate on x-axis

    returns

    the theta component of the point (r, theta) in polar coordinates that corresponds to the point (x, y) in Cartesian coordinates, as if computed by java.lang.Math.atan2

    Since

    1.4.0

  53. def atan2(yValue: Double, x: Column): Column

    yValue

    coordinate on y-axis

    x

    coordinate on x-axis

    returns

    the theta component of the point (r, theta) in polar coordinates that corresponds to the point (x, y) in Cartesian coordinates, as if computed by java.lang.Math.atan2

    Since

    1.4.0

  54. def atan2(yName: String, xValue: Double): Column

    yName

    coordinate on y-axis

    xValue

    coordinate on x-axis

    returns

    the theta component of the point (r, theta) in polar coordinates that corresponds to the point (x, y) in Cartesian coordinates, as if computed by java.lang.Math.atan2

    Since

    1.4.0

  55. def atan2(y: Column, xValue: Double): Column

    y

    coordinate on y-axis

    xValue

    coordinate on x-axis

    returns

    the theta component of the point (r, theta) in polar coordinates that corresponds to the point (x, y) in Cartesian coordinates, as if computed by java.lang.Math.atan2

    Since

    1.4.0

  56. def atan2(yName: String, xName: String): Column

    yName

    coordinate on y-axis

    xName

    coordinate on x-axis

    returns

    the theta component of the point (r, theta) in polar coordinates that corresponds to the point (x, y) in Cartesian coordinates, as if computed by java.lang.Math.atan2

    Since

    1.4.0

  57. def atan2(yName: String, x: Column): Column

    yName

    coordinate on y-axis

    x

    coordinate on x-axis

    returns

    the theta component of the point (r, theta) in polar coordinates that corresponds to the point (x, y) in Cartesian coordinates, as if computed by java.lang.Math.atan2

    Since

    1.4.0

  58. def atan2(y: Column, xName: String): Column

    y

    coordinate on y-axis

    xName

    coordinate on x-axis

    returns

    the theta component of the point (r, theta) in polar coordinates that corresponds to the point (x, y) in Cartesian coordinates, as if computed by java.lang.Math.atan2

    Since

    1.4.0

  59. def atan2(y: Column, x: Column): Column

    y

    coordinate on y-axis

    x

    coordinate on x-axis

    returns

    the theta component of the point (r, theta) in polar coordinates that corresponds to the point (x, y) in Cartesian coordinates, as if computed by java.lang.Math.atan2

    Since

    1.4.0
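
    All atan2 overloads take the y coordinate first and the x coordinate second. The following is a minimal sketch of the Column/Column overload, assuming a local SparkSession and hypothetical Double columns named y and x:

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{atan2, col}

    // Assumption: a throwaway local session; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder().master("local[1]").appName("atan2-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical points (y, x): one on the diagonal, one on the positive y-axis.
    val df = Seq((1.0, 1.0), (1.0, 0.0)).toDF("y", "x")

    // theta is the polar angle in radians, matching java.lang.Math.atan2(y, x).
    val thetas = df.select(atan2(col("y"), col("x")).as("theta")).collect().map(_.getDouble(0))
    // thetas(0) is approximately Pi / 4, thetas(1) approximately Pi / 2
    ```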

  60. def atanh(columnName: String): Column

    returns

    inverse hyperbolic tangent of columnName

    Since

    3.1.0

  61. def atanh(e: Column): Column

    returns

    inverse hyperbolic tangent of e

    Since

    3.1.0

  62. def avg(columnName: String): Column

    Aggregate function: returns the average of the values in a group.

    Since

    1.3.0

  63. def avg(e: Column): Column

    Aggregate function: returns the average of the values in a group.

    Since

    1.3.0

  64. def base64(e: Column): Column

    Computes the BASE64 encoding of a binary column and returns it as a string column. This is the reverse of unbase64.

    Since

    1.5.0

  65. def bin(columnName: String): Column

    An expression that returns the string representation of the binary value of the given long column. For example, bin("12") returns "1100".

    Since

    1.5.0

  66. def bin(e: Column): Column

    An expression that returns the string representation of the binary value of the given long column. For example, bin("12") returns "1100".

    Since

    1.5.0

  67. def bit_length(e: Column): Column

    Calculates the bit length for the specified string column.

    Since

    3.3.0

  68. def bitwise_not(e: Column): Column

    Computes bitwise NOT (~) of a number.

    Since

    3.2.0

  69. def broadcast[T](df: Dataset[T]): Dataset[T]

    Marks a DataFrame as small enough for use in broadcast joins.

    The following example marks the right DataFrame for broadcast hash join using joinKey.

    // left and right are DataFrames
    left.join(broadcast(right), "joinKey")

    Since

    1.5.0

  70. def bround(e: Column, scale: Int): Column

    Rounds the value of e to scale decimal places using HALF_EVEN rounding mode if scale is greater than or equal to 0, or at the integral part when scale is less than 0.

    Since

    2.0.0

  71. def bround(e: Column): Column

    Returns the value of the column e rounded to 0 decimal places with HALF_EVEN round mode.

    Since

    2.0.0
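
    HALF_EVEN rounding sends ties to the nearest even neighbour. The sketch below compares bround with round (also defined on this object, and used here on the assumption that it rounds ties away from zero, i.e. HALF_UP). It assumes a local SparkSession and a hypothetical Double column named x:

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{bround, col, round}

    // Assumption: a throwaway local session; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder().master("local[1]").appName("bround-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical input: values that sit exactly on a rounding tie.
    val df = Seq(0.5, 1.5, 2.5, 3.5).toDF("x")

    val rounded = df.select(
      bround(col("x")).as("half_even"),  // ties go to the even neighbour
      round(col("x")).as("half_up")      // ties go away from zero
    ).collect().map(r => (r.getDouble(0), r.getDouble(1))).toSeq
    // rounded == Seq((0.0, 1.0), (2.0, 2.0), (2.0, 3.0), (4.0, 4.0))
    ```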

  72. def bucket(numBuckets: Int, e: Column): Column

    A transform for any type that partitions by a hash of the input column.

    Since

    3.0.0

  73. def bucket(numBuckets: Column, e: Column): Column

    A transform for any type that partitions by a hash of the input column.

    Since

    3.0.0

  74. def call_udf(udfName: String, cols: Column*): Column

    Calls a user-defined function. Example:

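    import org.apache.spark.sql._
    import org.apache.spark.sql.functions.call_udf

    // Assumes a spark-shell session, where spark.implicits._ (toDF, $"...") is already in scope.
    val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value")
    val spark = df.sparkSession
    // Register the UDF under a name, then call it by that name.
    spark.udf.register("simpleUDF", (v: Int) => v * v)
    df.select($"id", call_udf("simpleUDF", $"value"))

    Annotations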
    @varargs()
    Since

    3.2.0

  75. def cbrt(columnName: String): Column

    Computes the cube-root of the given column.

    Since

    1.4.0

  76. def cbrt(e: Column): Column

    Computes the cube-root of the given value.

    Since

    1.4.0

  77. def ceil(columnName: String): Column

    Computes the ceiling of the given column to 0 decimal places.

    Since

    1.4.0

  78. def ceil(e: Column): Column

    Computes the ceiling of the given value of e to 0 decimal places.

    Since

    1.4.0

  79. def ceil(e: Column, scale: Column): Column

    Computes the ceiling of the given value of e to scale decimal places.

    Since

    3.3.0

  80. def clone(): AnyRef
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
  81. def coalesce(e: Column*): Column

    Returns the first column that is not null, or null if all inputs are null.

    For example, coalesce(a, b, c) will return a if a is not null, or b if a is null and b is not null, or c if both a and b are null but c is not null.

    Annotations
    @varargs()
    Since

    1.3.0
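
    The coalesce(a, b, c) example above can be run as a short sketch, assuming a local SparkSession and hypothetical nullable string columns named a, b and c:

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{coalesce, col}

    // Assumption: a throwaway local session; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder().master("local[1]").appName("coalesce-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical input: None becomes SQL NULL in the resulting DataFrame.
    val df = Seq[(Option[String], Option[String], Option[String])](
      (Some("a"), Some("b"), Some("c")),
      (None, Some("b"), Some("c")),
      (None, None, Some("c")),
      (None, None, None)
    ).toDF("a", "b", "c")

    // For each row, pick the first non-null value scanning a, then b, then c.
    val firstNonNull = df
      .select(coalesce(col("a"), col("b"), col("c")).as("v"))
      .collect()
      .map(r => Option(r.getString(0)))
      .toSeq
    // firstNonNull == Seq(Some("a"), Some("b"), Some("c"), None)
    ```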

  82. def col(colName: String): Column

    Returns a Column based on the given column name.

    Since

    1.3.0

  83. def collect_list(columnName: String): Column

    Aggregate function: returns a list of objects with duplicates.

    Since

    1.6.0

    Note

    The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle.

  84. def collect_list(e: Column): Column

    Aggregate function: returns a list of objects with duplicates.

    Since

    1.6.0

    Note

    The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle.

  85. def collect_set(columnName: String): Column

    Aggregate function: returns a set of objects with duplicate elements eliminated.

    Since

    1.6.0

    Note

    The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle.

  86. def collect_set(e: Column): Column

    Aggregate function: returns a set of objects with duplicate elements eliminated.

    Since

    1.6.0

    Note

    The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle.

  87. def column(colName: String): Column

    Returns a Column based on the given column name. Alias of col.

    Since

    1.3.0

  88. def concat(exprs: Column*): Column

    Concatenates multiple input columns together into a single column. The function works with strings, binary and compatible array columns.

    Annotations
    @varargs()
    Since

    1.5.0

  89. def concat_ws(sep: String, exprs: Column*): Column

    Concatenates multiple input string columns together into a single string column, using the given separator.

    Annotations
    @varargs()
    Since

    1.5.0
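
    concat and concat_ws differ mainly in how the separator is supplied. The following is a minimal sketch, assuming a local SparkSession and hypothetical string columns named first and last:

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, concat, concat_ws, lit}

    // Assumption: a throwaway local session; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder().master("local[1]").appName("concat-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical input: a single row with two name parts.
    val df = Seq(("Ada", "Lovelace")).toDF("first", "last")

    val row = df.select(
      // concat: the separator has to be passed as its own (literal) column.
      concat(col("first"), lit(" "), col("last")).as("via_concat"),
      // concat_ws: the separator is the first argument and is placed between inputs.
      concat_ws(" ", col("first"), col("last")).as("via_concat_ws")
    ).head()

    val viaConcat = row.getString(0)
    val viaConcatWs = row.getString(1)
    // both are "Ada Lovelace"
    ```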

  90. def conv(num: Column, fromBase: Int, toBase: Int): Column

    Converts a number in a string column from one base to another.

    Since

    1.5.0
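
    A base conversion with conv can be sketched as follows, assuming a local SparkSession and hypothetical string columns named binDigits and decDigits that hold numbers written in base 2 and base 10:

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, conv}

    // Assumption: a throwaway local session; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder().master("local[1]").appName("conv-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical input: "100" read as binary, "12" read as decimal.
    val df = Seq(("100", "12")).toDF("binDigits", "decDigits")

    val row = df.select(
      conv(col("binDigits"), 2, 10).as("to_decimal"),  // base 2 -> base 10
      conv(col("decDigits"), 10, 2).as("to_binary")    // base 10 -> base 2
    ).head()

    val toDecimal = row.getString(0)  // "4"
    val toBinary = row.getString(1)   // "1100"
    ```

    Both the input and the result are strings, so the output can be fed back into conv for a further conversion.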

  91. def corr(columnName1: String, columnName2: String): Column

    Aggregate function: returns the Pearson Correlation Coefficient for two columns.

    Since

    1.6.0

  92. def corr(column1: Column, column2: Column): Column

    Aggregate function: returns the Pearson Correlation Coefficient for two columns.

    Since

    1.6.0

  93. def cos(columnName: String): Column

    columnName

    angle in radians

    returns

    cosine of the angle, as if computed by java.lang.Math.cos

    Since

    1.4.0

  94. def cos(e: Column): Column

    e

    angle in radians

    returns

    cosine of the angle, as if computed by java.lang.Math.cos

    Since

    1.4.0

  95. def cosh(columnName: String): Column

    columnName

    hyperbolic angle

    returns

    hyperbolic cosine of the angle, as if computed by java.lang.Math.cosh

    Since

    1.4.0

  96. def cosh(e: Column): Column

    e

    hyperbolic angle

    returns

    hyperbolic cosine of the angle, as if computed by java.lang.Math.cosh

    Since

    1.4.0

  97. def cot(e: Column): Column

    e

    angle in radians

    returns

    cotangent of the angle

    Since

    3.3.0

  98. def count(columnName: String): TypedColumn[Any, Long]

    Aggregate function: returns the number of items in a group.

    Since

    1.3.0

  99. def count(e: Column): Column

    Aggregate function: returns the number of items in a group.

    Since

    1.3.0

  100. def countDistinct(columnName: String, columnNames: String*): Column

    Aggregate function: returns the number of distinct items in a group.

    This is an alias of count_distinct; using count_distinct directly is encouraged.

    Annotations
    @varargs()
    Since

    1.3.0

  101. def countDistinct(expr: Column, exprs: Column*): Column

    Aggregate function: returns the number of distinct items in a group.

    This is an alias of count_distinct; using count_distinct directly is encouraged.

    Annotations
    @varargs()
    Since

    1.3.0

  102. def count_distinct(expr: Column, exprs: Column*): Column

    Aggregate function: returns the number of distinct items in a group.

    Annotations
    @varargs()
    Since

    3.2.0
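
    count_distinct accepts one or more columns; with several columns it counts distinct combinations. The following is a minimal sketch, assuming a local SparkSession and hypothetical columns k (String) and v (Int):

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, count_distinct}

    // Assumption: a throwaway local session; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder().master("local[1]").appName("count-distinct-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical input: key "a" has a repeated value, key "b" has a single row.
    val df = Seq(("a", 1), ("a", 1), ("a", 2), ("b", 3)).toDF("k", "v")

    // Distinct values of v within each group of k.
    val perKey = df.groupBy("k")
      .agg(count_distinct(col("v")).as("n"))
      .orderBy("k")
      .collect()
      .map(r => (r.getString(0), r.getLong(1)))
      .toSeq
    // perKey == Seq(("a", 2L), ("b", 1L))

    // Distinct (k, v) combinations over the whole DataFrame.
    val distinctPairs = df.agg(count_distinct(col("k"), col("v")).as("n")).head().getLong(0)
    // distinctPairs == 3L
    ```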

  103. def covar_pop(columnName1: String, columnName2: String): Column

    Aggregate function: returns the population covariance for two columns.

    Since

    2.0.0

  104. def covar_pop(column1: Column, column2: Column): Column

    Aggregate function: returns the population covariance for two columns.

    Since

    2.0.0

  105. def covar_samp(columnName1: String, columnName2: String): Column

    Aggregate function: returns the sample covariance for two columns.

    Since

    2.0.0

  106. def covar_samp(column1: Column, column2: Column): Column

    Aggregate function: returns the sample covariance for two columns.

    Since

    2.0.0

  107. def crc32(e: Column): Column

    Calculates the cyclic redundancy check value (CRC32) of a binary column and returns the value as a bigint.

    Since

    1.5.0

  108. def csc(e: Column): Column

    e

    angle in radians

    returns

    cosecant of the angle

    Since

    3.3.0

  109. def cume_dist(): Column

    Window function: returns the cumulative distribution of values within a window partition, i.e. the fraction of rows that are below the current row.

    N = total number of rows in the partition
    cumeDist(x) = number of values before (and including) x / N

    Since

    1.6.0
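
    The cumeDist formula above can be checked on a small partition. The following is a minimal sketch, assuming a local SparkSession and hypothetical columns grp (String) and score (Int); the window specification is built with org.apache.spark.sql.expressions.Window:

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, cume_dist}

    // Assumption: a throwaway local session; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder().master("local[1]").appName("cume-dist-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical input: one partition with N = 4 rows and a tie at score 20.
    val df = Seq(("a", 10), ("a", 20), ("a", 20), ("a", 30)).toDF("grp", "score")

    // cume_dist needs an ordered window; ties share the same result.
    val w = Window.partitionBy("grp").orderBy("score")

    val dist = df
      .select(col("score"), cume_dist().over(w).as("cd"))
      .orderBy("score")
      .collect()
      .map(_.getDouble(1))
      .toSeq
    // dist == Seq(0.25, 0.75, 0.75, 1.0): 1/4, 3/4, 3/4, 4/4
    ```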

  110. def current_date(): Column

    Returns the current date at the start of query evaluation as a date column. All calls of current_date within the same query return the same value.

    Since

    1.5.0

  111. def current_timestamp(): Column

    Returns the current timestamp at the start of query evaluation as a timestamp column. All calls of current_timestamp within the same query return the same value.

    Since

    1.5.0

  112. def date_add(start: Column, days: Column): Column

    Returns the date that is days days after start.

    start

    A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS

    days

    A column of the number of days to add to start, can be negative to subtract days

    returns

    A date, or null if start was a string that could not be cast to a date

    Since

    3.0.0

  113. def date_add(start: Column, days: Int): Column

    Returns the date that is days days after start.

    start

    A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS

    days

    The number of days to add to start, can be negative to subtract days

    returns

    A date, or null if start was a string that could not be cast to a date

    Since

    1.5.0

  114. def date_format(dateExpr: Column, format: String): Column

    Converts a date/timestamp/string to a string in the date format given by the second argument.

    See Datetime Patterns for valid date and time format patterns

    dateExpr

    A date, timestamp or string. If a string, the data must be in a format that can be cast to a timestamp, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS

    format

    A pattern dd.MM.yyyy would return a string like 18.03.1993

    returns

    A string, or null if dateExpr was a string that could not be cast to a timestamp

    Since

    1.5.0

    Exceptions thrown

    IllegalArgumentException if the format pattern is invalid

    Note

    Use specialized functions like year whenever possible as they benefit from a specialized implementation.

  115. def date_sub(start: Column, days: Column): Column

    Returns the date that is the specified number of days before start.

    start

    A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS

    days

    A column of the number of days to subtract from start, can be negative to add days

    returns

    A date, or null if start was a string that could not be cast to a date

    Since

    3.0.0

  116. def date_sub(start: Column, days: Int): Column

    Returns the date that is the specified number of days before start.

    start

    A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS

    days

    The number of days to subtract from start, can be negative to add days

    returns

    A date, or null if start was a string that could not be cast to a date

    Since

    1.5.0

  117. def date_trunc(format: String, timestamp: Column): Column

    Returns timestamp truncated to the unit specified by the format.

    For example, date_trunc("year", "2018-11-19 12:01:19") returns 2018-01-01 00:00:00

    timestamp

    A date, timestamp or string. If a string, the data must be in a format that can be cast to a timestamp, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS

    returns

    A timestamp, or null if timestamp was a string that could not be cast to a timestamp or format was an invalid value

    Since

    2.3.0

  118. def datediff(end: Column, start: Column): Column

    Returns the number of days from start to end.

    Only considers the date part of the input. For example:

    datediff("2018-01-10 00:00:00", "2018-01-09 23:59:59")
    // returns 1
    end

    A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS

    start

    A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS

    returns

    An integer, or null if either end or start were strings that could not be cast to a date. Negative if end is before start

    Since

    1.5.0
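
    The "date part only" rule can be modeled with java.time, independent of Spark. The sketch below is a hypothetical helper (DateDiffSketch.dayDiff is not part of this API) that assumes inputs in the yyyy-MM-dd HH:mm:ss form: it drops the time of day and then counts whole days from start to end.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit

// Plain-Scala model of "only the date part counts"; not Spark's implementation.
object DateDiffSketch {
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  // Truncate both inputs to their dates, then count days from start to end.
  def dayDiff(end: String, start: String): Long = {
    val endDate = LocalDateTime.parse(end, fmt).toLocalDate
    val startDate = LocalDateTime.parse(start, fmt).toLocalDate
    ChronoUnit.DAYS.between(startDate, endDate)
  }
}
```

    With the two timestamps from the example above, which are one second apart, the model returns 1; swapping the arguments returns -1.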

  119. def dayofmonth(e: Column): Column

    Extracts the day of the month as an integer from a given date/timestamp/string.

    returns

    An integer, or null if the input was a string that could not be cast to a date

    Since

    1.5.0

  120. def dayofweek(e: Column): Column

    Extracts the day of the week as an integer from a given date/timestamp/string. Ranges from 1 for a Sunday through to 7 for a Saturday

    returns

    An integer, or null if the input was a string that could not be cast to a date

    Since

    2.3.0

  121. def dayofyear(e: Column): Column

    Extracts the day of the year as an integer from a given date/timestamp/string.

    returns

    An integer, or null if the input was a string that could not be cast to a date

    Since

    1.5.0

  122. def days(e: Column): Column

    A transform for timestamps and dates to partition data into days.

    Since

    3.0.0

  123. def decode(value: Column, charset: String): Column

    Decodes the first argument from a binary into a string using the provided character set (one of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16'). If either argument is null, the result will also be null.

    Since

    1.5.0

  124. def degrees(columnName: String): Column

    Converts an angle measured in radians to an approximately equivalent angle measured in degrees.

    columnName

    angle in radians

    returns

    angle in degrees, as if computed by java.lang.Math.toDegrees

    Since

    2.1.0

  125. def degrees(e: Column): Column

    Converts an angle measured in radians to an approximately equivalent angle measured in degrees.

    e

    angle in radians

    returns

    angle in degrees, as if computed by java.lang.Math.toDegrees

    Since

    2.1.0

  126. def dense_rank(): Column

    Window function: returns the rank of rows within a window partition, without any gaps.

    The difference between rank and dense_rank is that dense_rank leaves no gaps in the ranking sequence when there are ties. That is, if you were ranking a competition using dense_rank and had three people tie for second place, you would say that all three were in second place and that the next person came in third. rank, by contrast, leaves gaps after ties, so the person who came next after the three-way tie would register as coming in fifth.

    This is equivalent to the DENSE_RANK function in SQL.

    Since

    1.6.0
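
    The competition example above can be reproduced with a small plain-Scala model. This is a sketch of the two ranking rules only, not of Spark's window execution; the object and method names are hypothetical, and higher scores are assumed to rank first.

```scala
// Plain-Scala model of rank versus dense_rank; names are illustrative only.
object RankSketch {
  // rank: 1 + the number of strictly better scores, so ties leave gaps.
  def rank(scores: Seq[Int]): Seq[Int] =
    scores.map(s => 1 + scores.count(_ > s))

  // dense_rank: 1 + the number of distinct strictly better scores, so no gaps.
  def denseRank(scores: Seq[Int]): Seq[Int] =
    scores.map(s => 1 + scores.distinct.count(_ > s))
}
```

    For the scores 100, 90, 90, 90, 80 the model gives ranks 1, 2, 2, 2, 5 and dense ranks 1, 2, 2, 2, 3, matching the three-way tie for second place described above.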

  127. def desc(columnName: String): Column

    Returns a sort expression based on the descending order of the column.

    df.sort(asc("dept"), desc("age"))
    Since

    1.3.0

  128. def desc_nulls_first(columnName: String): Column

    Returns a sort expression based on the descending order of the column, and null values appear before non-null values.

    df.sort(asc("dept"), desc_nulls_first("age"))
    Since

    2.1.0

  129. def desc_nulls_last(columnName: String): Column

    Returns a sort expression based on the descending order of the column, and null values appear after non-null values.

    df.sort(asc("dept"), desc_nulls_last("age"))
    Since

    2.1.0

  130. def element_at(column: Column, value: Any): Column

    Returns the element of the array at the index given by value if column is an array. Returns the value for the key given by value if column is a map.

    Since

    2.4.0

  131. def encode(value: Column, charset: String): Column

    Encodes the first argument from a string into a binary using the provided character set (one of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16'). If either argument is null, the result will also be null.

    Since

    1.5.0

  132. final def eq(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  133. def equals(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  134. def exists(column: Column, f: (Column) ⇒ Column): Column

    Returns whether a predicate holds for one or more elements in the array.

    df.select(exists(col("i"), _ % 2 === 0))
    column

    the input array column

    f

    col => predicate, the Boolean predicate to check the input column

    Since

    3.0.0

  135. def exp(columnName: String): Column

    Computes the exponential of the given column.

    Since

    1.4.0

  136. def exp(e: Column): Column

    Computes the exponential of the given value.

    Since

    1.4.0

  137. def explode(e: Column): Column

    Creates a new row for each element in the given array or map column. Uses the default column name col for elements in the array and key and value for elements in the map unless specified otherwise.

    Since

    1.3.0

  138. def explode_outer(e: Column): Column

    Creates a new row for each element in the given array or map column. Uses the default column name col for elements in the array and key and value for elements in the map unless specified otherwise. Unlike explode, if the array/map is null or empty then null is produced.

    Since

    2.2.0

  139. def expm1(columnName: String): Column

    Computes the exponential of the given column minus one.

    Since

    1.4.0

  140. def expm1(e: Column): Column

    Computes the exponential of the given value minus one.

    Since

    1.4.0

  141. def expr(expr: String): Column

    Parses the expression string into the column that it represents, similar to Dataset#selectExpr.

    // get the number of words of each length
    df.groupBy(expr("length(word)")).count()
  142. def factorial(e: Column): Column

    Computes the factorial of the given value.

    Since

    1.5.0

  143. def filter(column: Column, f: (Column, Column) ⇒ Column): Column

    Returns an array of elements for which a predicate holds in a given array.

    df.select(filter(col("s"), (x, i) => i % 2 === 0))
    column

    the input array column

    f

    (col, index) => predicate, the Boolean predicate to filter the input column given the index. Indices start at 0.

    Since

    3.0.0

  144. def filter(column: Column, f: (Column) ⇒ Column): Column

    Returns an array of elements for which a predicate holds in a given array.

    df.select(filter(col("s"), x => x % 2 === 0))
    column

    the input array column

    f

    col => predicate, the Boolean predicate to filter the input column

    Since

    3.0.0

  145. def finalize(): Unit
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  146. def first(columnName: String): Column

    Aggregate function: returns the first value of a column in a group.

    The function by default returns the first value it sees. It will return the first non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned.

    Since

    1.3.0

    Note

    The function is non-deterministic because its result depends on the order of the rows, which may be non-deterministic after a shuffle.

  147. def first(e: Column): Column

    Aggregate function: returns the first value in a group.

    The function by default returns the first value it sees. It will return the first non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned.

    Since

    1.3.0

    Note

    The function is non-deterministic because its result depends on the order of the rows, which may be non-deterministic after a shuffle.

  148. def first(columnName: String, ignoreNulls: Boolean): Column

    Aggregate function: returns the first value of a column in a group.

    The function by default returns the first value it sees. It will return the first non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned.

    Since

    2.0.0

    Note

    The function is non-deterministic because its result depends on the order of the rows, which may be non-deterministic after a shuffle.

  149. def first(e: Column, ignoreNulls: Boolean): Column

    Aggregate function: returns the first value in a group.

    The function by default returns the first value it sees. It will return the first non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned.

    Since

    2.0.0

    Note

    The function is non-deterministic because its result depends on the order of the rows, which may be non-deterministic after a shuffle.

  150. def flatten(e: Column): Column

    Creates a single array from an array of arrays. If a structure of nested arrays is deeper than two levels, only one level of nesting is removed.

    Since

    2.4.0

  151. def floor(columnName: String): Column

    Computes the floor of the given column value to 0 decimal places.

    Since

    1.4.0

  152. def floor(e: Column): Column

    Computes the floor of the given value of e to 0 decimal places.

    Since

    1.4.0

  153. def floor(e: Column, scale: Column): Column

    Computes the floor of the given value of e to scale decimal places.

    Since

    3.3.0

  154. def forall(column: Column, f: (Column) ⇒ Column): Column

    Returns whether a predicate holds for every element in the array.

    df.select(forall(col("i"), x => x % 2 === 0))
    column

    the input array column

    f

    col => predicate, the Boolean predicate to check the input column

    Since

    3.0.0

  155. def format_number(x: Column, d: Int): Column

    Formats numeric column x to a format like '#,###,###.##', rounded to d decimal places with HALF_EVEN round mode, and returns the result as a string column.

    If d is 0, the result has no decimal point or fractional part. If d is less than 0, the result will be null.

    Since

    1.5.0
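
    HALF_EVEN rounding sends exact ties to the nearest even digit, which can surprise readers who expect ties to round up. The sketch below illustrates only that rounding mode, using java.math.BigDecimal; it is not format_number itself and adds no grouping separators. The object and method names are hypothetical.

```scala
import java.math.{BigDecimal => JBigDecimal, RoundingMode}

// Illustrates HALF_EVEN rounding to d decimal places; names are illustrative only.
object HalfEvenSketch {
  // Parse the decimal literal exactly, then round ties toward the even neighbor.
  def roundHalfEven(value: String, d: Int): String =
    new JBigDecimal(value).setScale(d, RoundingMode.HALF_EVEN).toPlainString
}
```

    Under this mode 2.125 rounds to 2.12 while 2.135 rounds to 2.14, because in each case the tie goes to the even final digit.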

  156. def format_string(format: String, arguments: Column*): Column

    Formats the arguments in printf-style and returns the result as a string column.

    Annotations
    @varargs()
    Since

    1.5.0

  157. def from_csv(e: Column, schema: Column, options: Map[String, String]): Column

    (Java-specific) Parses a column containing a CSV string into a StructType with the specified schema. Returns null, in the case of an unparseable string.

    e

    a string column containing CSV data.

    schema

    the schema to use when parsing the CSV string

    options

    options to control how the CSV is parsed. Accepts the same options as the CSV data source. See Data Source Option in the version you use.

    Since

    3.0.0

  158. def from_csv(e: Column, schema: StructType, options: Map[String, String]): Column

    Parses a column containing a CSV string into a StructType with the specified schema. Returns null, in the case of an unparseable string.

    e

    a string column containing CSV data.

    schema

    the schema to use when parsing the CSV string

    options

    options to control how the CSV is parsed. Accepts the same options as the CSV data source. See Data Source Option in the version you use.

    Since

    3.0.0

  159. def from_json(e: Column, schema: Column, options: Map[String, String]): Column

    (Java-specific) Parses a column containing a JSON string into a MapType with StringType as keys type, StructType or ArrayType of StructTypes with the specified schema. Returns null, in the case of an unparseable string.

    e

    a string column containing JSON data.

    schema

    the schema to use when parsing the json string

    options

    options to control how the json is parsed. Accepts the same options as the json data source. See Data Source Option in the version you use.

    Since

    2.4.0

  160. def from_json(e: Column, schema: Column): Column

    (Scala-specific) Parses a column containing a JSON string into a MapType with StringType as keys type, StructType or ArrayType of StructTypes with the specified schema. Returns null, in the case of an unparseable string.

    e

    a string column containing JSON data.

    schema

    the schema to use when parsing the json string

    Since

    2.4.0

  161. def from_json(e: Column, schema: String, options: Map[String, String]): Column

    (Scala-specific) Parses a column containing a JSON string into a MapType with StringType as keys type, StructType or ArrayType with the specified schema. Returns null, in the case of an unparseable string.

    e

    a string column containing JSON data.

    schema

    the schema as a DDL-formatted string.

    options

    options to control how the json is parsed. Accepts the same options as the json data source. See Data Source Option in the version you use.

    Since

    2.3.0

  162. def from_json(e: Column, schema: String, options: Map[String, String]): Column

    (Java-specific) Parses a column containing a JSON string into a MapType with StringType as keys type, StructType or ArrayType with the specified schema. Returns null, in the case of an unparseable string.

    e

    a string column containing JSON data.

    schema

    the schema as a DDL-formatted string.

    options

    options to control how the json is parsed. Accepts the same options as the json data source. See Data Source Option in the version you use.

    Since

    2.1.0

  163. def from_json(e: Column, schema: DataType): Column

    Parses a column containing a JSON string into a MapType with StringType as keys type, StructType or ArrayType with the specified schema. Returns null, in the case of an unparseable string.

    e

    a string column containing JSON data.

    schema

    the schema to use when parsing the json string

    Since

    2.2.0

  164. def from_json(e: Column, schema: StructType): Column

    Parses a column containing a JSON string into a StructType with the specified schema. Returns null, in the case of an unparseable string.

    e

    a string column containing JSON data.

    schema

    the schema to use when parsing the json string

    Since

    2.1.0

  165. def from_json(e: Column, schema: DataType, options: Map[String, String]): Column

    (Java-specific) Parses a column containing a JSON string into a MapType with StringType as keys type, StructType or ArrayType with the specified schema. Returns null, in the case of an unparseable string.

    e

    a string column containing JSON data.

    schema

    the schema to use when parsing the json string

    options

    options to control how the json is parsed. Accepts the same options as the json data source. See Data Source Option in the version you use.

    Since

    2.2.0

  166. def from_json(e: Column, schema: StructType, options: Map[String, String]): Column

    (Java-specific) Parses a column containing a JSON string into a StructType with the specified schema. Returns null, in the case of an unparseable string.

    e

    a string column containing JSON data.

    schema

    the schema to use when parsing the json string

    options

    options to control how the json is parsed. Accepts the same options as the json data source. See Data Source Option in the version you use.

    Since

    2.1.0

  167. def from_json(e: Column, schema: DataType, options: Map[String, String]): Column

    (Scala-specific) Parses a column containing a JSON string into a MapType with StringType as keys type, StructType or ArrayType with the specified schema. Returns null, in the case of an unparseable string.

    e

    a string column containing JSON data.

    schema

    the schema to use when parsing the json string

    options

    options to control how the json is parsed. Accepts the same options as the json data source. See Data Source Option in the version you use.

    Since

    2.2.0

  168. def from_json(e: Column, schema: StructType, options: Map[String, String]): Column

    (Scala-specific) Parses a column containing a JSON string into a StructType with the specified schema. Returns null, in the case of an unparseable string.

    e

    a string column containing JSON data.

    schema

    the schema to use when parsing the json string

    options

    options to control how the json is parsed. Accepts the same options as the json data source. See Data Source Option in the version you use.

    Since

    2.1.0

  169. def from_unixtime(ut: Column, f: String): Column

    Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the given format.

    See Datetime Patterns for valid date and time format patterns

    ut

    A number of a type that is castable to a long, such as string or integer. Can be negative for timestamps before the unix epoch

    f

    A date time pattern that the input will be formatted to

    returns

    A string, or null if ut was a string that could not be cast to a long or f was an invalid date time pattern

    Since

    1.5.0

  170. def from_unixtime(ut: Column): Column

    Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the yyyy-MM-dd HH:mm:ss format.

    ut

    A number of a type that is castable to a long, such as string or integer. Can be negative for timestamps before the unix epoch

    returns

    A string, or null if the input was a string that could not be cast to a long

    Since

    1.5.0
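
    The conversion from epoch seconds to a formatted string can be modeled with java.time. This is a sketch under stated assumptions, not Spark's implementation: the helper name is hypothetical, and the time zone is passed explicitly here, whereas the function described above uses the current system time zone.

```scala
import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter

// Plain-Scala model of epoch seconds -> formatted string; names are illustrative only.
object FromUnixtimeSketch {
  // Interpret `seconds` as seconds since 1970-01-01 00:00:00 UTC and render it in `zone`.
  def render(seconds: Long, pattern: String, zone: ZoneId): String =
    DateTimeFormatter.ofPattern(pattern).withZone(zone).format(Instant.ofEpochSecond(seconds))
}
```

    Rendering 0 in UTC with the pattern yyyy-MM-dd HH:mm:ss gives 1970-01-01 00:00:00, and -1 gives 1969-12-31 23:59:59, which shows how negative inputs map to moments before the epoch.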

  171. def from_utc_timestamp(ts: Column, tz: Column): Column

    Given a timestamp like '2017-07-14 02:40:00.0', interprets it as a time in UTC, and renders that time as a timestamp in the given time zone. For example, 'GMT+1' would yield '2017-07-14 03:40:00.0'.

    Since

    2.4.0

  172. def from_utc_timestamp(ts: Column, tz: String): Column

    Given a timestamp like '2017-07-14 02:40:00.0', interprets it as a time in UTC, and renders that time as a timestamp in the given time zone. For example, 'GMT+1' would yield '2017-07-14 03:40:00.0'.

    ts

    A date, timestamp or string. If a string, the data must be in a format that can be cast to a timestamp, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS

    tz

    A string detailing the time zone ID that the input should be adjusted to. It should be in the format of either region-based zone IDs or zone offsets. Region IDs must have the form 'area/city', such as 'America/Los_Angeles'. Zone offsets must be in the format '(+|-)HH:mm', for example '-08:00' or '+01:00'. Also 'UTC' and 'Z' are supported as aliases of '+00:00'. Other short names are not recommended to use because they can be ambiguous.

    returns

    A timestamp, or null if ts was a string that could not be cast to a timestamp or tz was an invalid value

    Since

    1.5.0
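
    The "interpret as UTC, render in another zone" behavior can be modeled with java.time. The sketch below is a hypothetical helper, not Spark's implementation; it uses the zone offset form '+01:00' described for tz above.

```scala
import java.time.{LocalDateTime, ZoneId, ZoneOffset}

// Plain-Scala model of from_utc_timestamp semantics; names are illustrative only.
object FromUtcSketch {
  // Treat the wall-clock time as UTC, then read the same instant in the target zone.
  def fromUtc(ts: LocalDateTime, tz: String): LocalDateTime =
    ts.atOffset(ZoneOffset.UTC).atZoneSameInstant(ZoneId.of(tz)).toLocalDateTime
}
```

    For 2017-07-14 02:40:00 and the offset '+01:00' the model yields 2017-07-14 03:40:00, matching the example above.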

  173. def get(column: Column, index: Column): Column

    Returns element of array at given (0-based) index. If the index points outside of the array boundaries, then this function returns NULL.

    Since

    3.4.0

  174. final def getClass(): Class[_]
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  175. def get_json_object(e: Column, path: String): Column

    Extracts json object from a json string based on json path specified, and returns json string of the extracted json object. It will return null if the input json string is invalid.

    Since

    1.6.0

  176. def greatest(columnName: String, columnNames: String*): Column

    Returns the greatest value of the list of column names, skipping null values. This function takes at least 2 parameters. It will return null iff all parameters are null.

    Annotations
    @varargs()
    Since

    1.5.0

  177. def greatest(exprs: Column*): Column

    Returns the greatest value of the list of values, skipping null values. This function takes at least 2 parameters. It will return null iff all parameters are null.

    Annotations
    @varargs()
    Since

    1.5.0

  178. def grouping(columnName: String): Column

    Aggregate function: indicates whether a specified column in a GROUP BY list is aggregated or not, returns 1 for aggregated or 0 for not aggregated in the result set.

    Since

    2.0.0

  179. def grouping(e: Column): Column

    Aggregate function: indicates whether a specified column in a GROUP BY list is aggregated or not, returns 1 for aggregated or 0 for not aggregated in the result set.

    Since

    2.0.0

  180. def grouping_id(colName: String, colNames: String*): Column

    Aggregate function: returns the level of grouping, equal to

    (grouping(c1) << (n-1)) + (grouping(c2) << (n-2)) + ... + grouping(cn)
    Since

    2.0.0

    Note

    The list of columns should match with grouping columns exactly.
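
    The bit formula above can be worked through in plain Scala. This sketch models only the arithmetic; the object and method names are hypothetical, and the input is assumed to be the sequence of grouping(c1), ..., grouping(cn) values, each 0 or 1.

```scala
// Plain-Scala model of the grouping_id bit formula; names are illustrative only.
object GroupingIdSketch {
  // Shift the accumulator left one bit per column, so grouping(c1) ends up
  // in bit n-1 and grouping(cn) in bit 0.
  def groupingId(groupingBits: Seq[Int]): Long =
    groupingBits.foldLeft(0L)((acc, bit) => (acc << 1) + bit)
}
```

    For three grouping columns with grouping values 1, 0, 1 the result is (1 << 2) + (0 << 1) + 1 = 5.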

  181. def grouping_id(cols: Column*): Column

    Aggregate function: returns the level of grouping, equal to

    (grouping(c1) << (n-1)) + (grouping(c2) << (n-2)) + ... + grouping(cn)
    Since

    2.0.0

    Note

    The list of columns should match with grouping columns exactly, or empty (means all the grouping columns).

  182. def hash(cols: Column*): Column

    Calculates the hash code of given columns, and returns the result as an int column.

    Annotations
    @varargs()
    Since

    2.0.0

  183. def hashCode(): Int
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  184. def hex(column: Column): Column

    Computes hex value of the given column.

    Since

    1.5.0

  185. def hour(e: Column): Column

    Extracts the hours as an integer from a given date/timestamp/string.

    returns

    An integer, or null if the input was a string that could not be cast to a date

    Since

    1.5.0

  186. def hours(e: Column): Column

    A transform for timestamps to partition data into hours.

    Since

    3.0.0

  187. def hypot(l: Double, rightName: String): Column

    Computes sqrt(a^2 + b^2) without intermediate overflow or underflow.

    Since

    1.4.0

  188. def hypot(l: Double, r: Column): Column

    Computes sqrt(a^2 + b^2) without intermediate overflow or underflow.

    Since

    1.4.0

  189. def hypot(leftName: String, r: Double): Column

    Computes sqrt(a^2 + b^2) without intermediate overflow or underflow.

    Since

    1.4.0

  190. def hypot(l: Column, r: Double): Column

    Computes sqrt(a^2 + b^2) without intermediate overflow or underflow.

    Since

    1.4.0

  191. def hypot(leftName: String, rightName: String): Column

    Computes sqrt(a^2 + b^2) without intermediate overflow or underflow.

    Since

    1.4.0

  192. def hypot(leftName: String, r: Column): Column

    Computes sqrt(a^2 + b^2) without intermediate overflow or underflow.

    Since

    1.4.0

  193. def hypot(l: Column, rightName: String): Column

    Computes sqrt(a^2 + b^2) without intermediate overflow or underflow.

    Since

    1.4.0

  194. def hypot(l: Column, r: Column): Column

    Computes sqrt(a^2 + b^2) without intermediate overflow or underflow.

    Since

    1.4.0
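
    To see why "without intermediate overflow" matters, the following plain-Scala sketch compares the naive formula with the JVM's math.hypot, used here as a stand-in for the column function. HypotSketch, naive and safe are hypothetical names for illustration.

```scala
object HypotSketch {
  // Naive formula: a * a can overflow to Infinity even when the result is representable.
  def naive(a: Double, b: Double): Double = math.sqrt(a * a + b * b)

  // scala.math.hypot computes the same quantity without the intermediate overflow.
  def safe(a: Double, b: Double): Double = math.hypot(a, b)
}
```

    With a = 3e200 and b = 4e200, the naive version overflows to Infinity, while hypot returns a value close to 5e200.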

  195. def initcap(e: Column): Column

    Returns a new string column by converting the first letter of each word to uppercase. Words are delimited by whitespace.

    For example, "hello world" will become "Hello World".

    Since

    1.5.0

  196. def inline(e: Column): Column

    Creates a new row for each element in the given array of structs.

    Since

    3.4.0

  197. def inline_outer(e: Column): Column

    Creates a new row for each element in the given array of structs. Unlike inline, if the array is null or empty then null is produced for each nested column.

    Since

    3.4.0

  198. def input_file_name(): Column

    Creates a string column for the file name of the current Spark task.

    Since

    1.6.0

  199. def instr(str: Column, substring: String): Column

    Locates the position of the first occurrence of substring in the given string column. Returns null if either of the arguments is null.

    Since

    1.5.0

    Note

    The position is a 1-based index, not zero-based. Returns 0 if substring could not be found in str.
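
    The 1-based convention can be illustrated with a plain-Scala model. InstrSketch is a hypothetical name, and the sketch does not model null handling.

```scala
object InstrSketch {
  // 1-based position of the first occurrence of sub in str, or 0 if sub is absent.
  // String.indexOf is 0-based and returns -1 when absent, so adding 1 gives both cases.
  def instr(str: String, sub: String): Int = str.indexOf(sub) + 1
}
```

    For instance, instr("hello world", "world") is 7 and instr("abc", "z") is 0.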

  200. final def isInstanceOf[T0]: Boolean
    Definition Classes
    Any
  201. def isnan(e: Column): Column

    Return true iff the column is NaN.

    Since

    1.6.0

  202. def isnull(e: Column): Column

    Return true iff the column is null.

    Since

    1.6.0

  203. def json_tuple(json: Column, fields: String*): Column

    Creates a new row for a json column according to the given field names.

    Annotations
    @varargs()
    Since

    1.6.0

  204. def kurtosis(columnName: String): Column

    Aggregate function: returns the kurtosis of the values in a group.

    Since

    1.6.0

  205. def kurtosis(e: Column): Column

    Aggregate function: returns the kurtosis of the values in a group.

    Since

    1.6.0

  206. def lag(e: Column, offset: Int, defaultValue: Any, ignoreNulls: Boolean): Column

    Window function: returns the value that is offset rows before the current row, or defaultValue if there are fewer than offset rows before the current row. ignoreNulls determines whether null values are included in or eliminated from the calculation. For example, an offset of one will return the previous row at any given point in the window partition.

    This is equivalent to the LAG function in SQL.

    Since

    3.2.0

  207. def lag(e: Column, offset: Int, defaultValue: Any): Column

    Window function: returns the value that is offset rows before the current row, or defaultValue if there are fewer than offset rows before the current row. For example, an offset of one will return the previous row at any given point in the window partition.

    This is equivalent to the LAG function in SQL.

    Since

    1.4.0

  208. def lag(columnName: String, offset: Int, defaultValue: Any): Column

    Window function: returns the value that is offset rows before the current row, or defaultValue if there are fewer than offset rows before the current row. For example, an offset of one will return the previous row at any given point in the window partition.

    This is equivalent to the LAG function in SQL.

    Since

    1.4.0

  209. def lag(columnName: String, offset: Int): Column

    Window function: returns the value that is offset rows before the current row, or null if there are fewer than offset rows before the current row. For example, an offset of one will return the previous row at any given point in the window partition.

    This is equivalent to the LAG function in SQL.

    Since

    1.4.0

  210. def lag(e: Column, offset: Int): Column

    Window function: returns the value that is offset rows before the current row, or null if there are fewer than offset rows before the current row. For example, an offset of one will return the previous row at any given point in the window partition.

    This is equivalent to the LAG function in SQL.

    Since

    1.4.0

  211. def last(columnName: String): Column

    Aggregate function: returns the last value of the column in a group.

    By default, the function returns the last value it sees. It returns the last non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned.

    Since

    1.3.0

    Note

    The function is non-deterministic because its result depends on the order of the rows, which may be non-deterministic after a shuffle.

  212. def last(e: Column): Column

    Aggregate function: returns the last value in a group.

    By default, the function returns the last value it sees. It returns the last non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned.

    Since

    1.3.0

    Note

    The function is non-deterministic because its result depends on the order of the rows, which may be non-deterministic after a shuffle.

  213. def last(columnName: String, ignoreNulls: Boolean): Column

    Aggregate function: returns the last value of the column in a group.

    By default, the function returns the last value it sees. It returns the last non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned.

    Since

    2.0.0

    Note

    The function is non-deterministic because its result depends on the order of the rows, which may be non-deterministic after a shuffle.

  214. def last(e: Column, ignoreNulls: Boolean): Column

    Aggregate function: returns the last value in a group.

    By default, the function returns the last value it sees. It returns the last non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned.

    Since

    2.0.0

    Note

    The function is non-deterministic because its result depends on the order of the rows, which may be non-deterministic after a shuffle.
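
    The effect of ignoreNulls on last can be modelled in plain Scala, with None standing in for SQL null. LastSketch is a hypothetical name, and the input is assumed to be one group in whatever order its rows happen to arrive.

```scala
object LastSketch {
  // values models one group; None models a null value.
  def last[A](values: Seq[Option[A]], ignoreNulls: Boolean = false): Option[A] =
    if (ignoreNulls) values.flatten.lastOption // last non-null value, if any
    else values.lastOption.flatten             // last value, even if it is null
}
```

    For the group (1, 2, null), the default call yields null (None), while ignoreNulls = true yields 2.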

  215. def last_day(e: Column): Column

    Returns the last day of the month which the given date belongs to. For example, input "2015-07-27" returns "2015-07-31" since July 31 is the last day of the month in July 2015.

    e

    A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS

    returns

    A date, or null if the input was a string that could not be cast to a date

    Since

    1.5.0

  216. def lead(e: Column, offset: Int, defaultValue: Any, ignoreNulls: Boolean): Column

    Window function: returns the value that is offset rows after the current row, or defaultValue if there are fewer than offset rows after the current row. ignoreNulls determines whether null values are included in or eliminated from the calculation. The default value of ignoreNulls is false. For example, an offset of one will return the next row at any given point in the window partition.

    This is equivalent to the LEAD function in SQL.

    Since

    3.2.0

  217. def lead(e: Column, offset: Int, defaultValue: Any): Column

    Window function: returns the value that is offset rows after the current row, or defaultValue if there are fewer than offset rows after the current row. For example, an offset of one will return the next row at any given point in the window partition.

    This is equivalent to the LEAD function in SQL.

    Since

    1.4.0

  218. def lead(columnName: String, offset: Int, defaultValue: Any): Column

    Window function: returns the value that is offset rows after the current row, or defaultValue if there are fewer than offset rows after the current row. For example, an offset of one will return the next row at any given point in the window partition.

    This is equivalent to the LEAD function in SQL.

    Since

    1.4.0

  219. def lead(e: Column, offset: Int): Column

    Window function: returns the value that is offset rows after the current row, or null if there are fewer than offset rows after the current row. For example, an offset of one will return the next row at any given point in the window partition.

    This is equivalent to the LEAD function in SQL.

    Since

    1.4.0

  220. def lead(columnName: String, offset: Int): Column

    Window function: returns the value that is offset rows after the current row, or null if there are fewer than offset rows after the current row. For example, an offset of one will return the next row at any given point in the window partition.

    This is equivalent to the LEAD function in SQL.

    Since

    1.4.0
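
    The lag and lead semantics can be modelled on a single ordered window partition in plain Scala. This is a sketch under stated assumptions, not Spark's implementation: LagLeadSketch is a hypothetical name, None models null, and ignoreNulls is not modelled.

```scala
object LagLeadSketch {
  // rows is one window partition in window order.
  // For each row, returns the value `offset` rows before it, or `default` if there is none.
  def lag[A](rows: Vector[A], offset: Int, default: Option[A] = None): Vector[Option[A]] =
    rows.indices.toVector.map { i =>
      val j = i - offset
      if (j >= 0 && j < rows.length) Some(rows(j)) else default
    }

  // lead looks `offset` rows after the current row, which is lag with a negated offset.
  def lead[A](rows: Vector[A], offset: Int, default: Option[A] = None): Vector[Option[A]] =
    lag(rows, -offset, default)
}
```

    For the partition (10, 20, 30), lag with offset 1 gives (null, 10, 20), and lead with offset 1 and default 0 gives (20, 30, 0).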

  221. def least(columnName: String, columnNames: String*): Column

    Returns the least value of the list of column names, skipping null values. This function takes at least 2 parameters. It will return null iff all parameters are null.

    Annotations
    @varargs()
    Since

    1.5.0

  222. def least(exprs: Column*): Column

    Returns the least value of the list of values, skipping null values. This function takes at least 2 parameters. It will return null iff all parameters are null.

    Annotations
    @varargs()
    Since

    1.5.0

  223. def length(e: Column): Column

    Computes the character length of a given string or the number of bytes of a binary string. The length of character strings includes trailing spaces. The length of binary strings includes binary zeros.

    Since

    1.5.0

  224. def levenshtein(l: Column, r: Column): Column

    Computes the Levenshtein distance of the two given string columns.

    Since

    1.5.0

  225. def lit(literal: Any): Column

    Creates a Column of literal value.

    The passed in object is returned directly if it is already a Column. If the object is a Scala Symbol, it is converted into a Column also. Otherwise, a new Column is created to represent the literal value.

    Since

    1.3.0

  226. def localtimestamp(): Column

    Returns the current timestamp without time zone at the start of query evaluation as a timestamp without time zone column. All calls of localtimestamp within the same query return the same value.

    Since

    3.3.0

  227. def locate(substr: String, str: Column, pos: Int): Column

    Locate the position of the first occurrence of substr in a string column, after position pos.

    Since

    1.5.0

    Note

    The position is not zero based, but 1 based index. Returns 0 if substr could not be found in str.

  228. def locate(substr: String, str: Column): Column

    Locate the position of the first occurrence of substr.

    Since

    1.5.0

    Note

    The position is not zero based, but 1 based index. Returns 0 if substr could not be found in str.

  229. def log(base: Double, columnName: String): Column

    Returns the first argument-base logarithm of the second argument.

    Since

    1.4.0

  230. def log(base: Double, a: Column): Column

    Returns the first argument-base logarithm of the second argument.

    Since

    1.4.0

  231. def log(columnName: String): Column

    Computes the natural logarithm of the given column.

    Since

    1.4.0

  232. def log(e: Column): Column

    Computes the natural logarithm of the given value.

    Since

    1.4.0

  233. def log10(columnName: String): Column

    Computes the logarithm of the given value in base 10.

    Since

    1.4.0

  234. def log10(e: Column): Column

    Computes the logarithm of the given value in base 10.

    Since

    1.4.0

  235. def log1p(columnName: String): Column

    Computes the natural logarithm of the given column plus one.

    Since

    1.4.0

  236. def log1p(e: Column): Column

    Computes the natural logarithm of the given value plus one.

    Since

    1.4.0

  237. def log2(columnName: String): Column

    Computes the logarithm of the given value in base 2.

    Since

    1.5.0

  238. def log2(expr: Column): Column

    Computes the logarithm of the given column in base 2.

    Since

    1.5.0

  239. def lower(e: Column): Column

    Converts a string column to lower case.

    Since

    1.3.0

  240. def lpad(str: Column, len: Int, pad: Array[Byte]): Column

    Left-pad the binary column with pad to a byte length of len. If the binary column is longer than len, the return value is shortened to len bytes.

    Since

    3.3.0

  241. def lpad(str: Column, len: Int, pad: String): Column

    Left-pad the string column with pad to a length of len. If the string column is longer than len, the return value is shortened to len characters.

    Since

    1.5.0
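
    The pad-or-truncate behaviour described for the string variant of lpad can be modelled in plain Scala. LpadSketch is a hypothetical name; the sketch assumes the pad string is repeated and cut to fit, and it does not model nulls.

```scala
object LpadSketch {
  def lpad(s: String, len: Int, pad: String): String =
    if (s.length >= len) s.take(len) // longer input is shortened to len characters
    else {
      val missing = len - s.length
      // Repeat pad enough times to cover the gap, then cut it to the exact size.
      (pad * missing).take(missing) + s
    }
}
```

    Under these assumptions, lpad("hi", 5, "ab") gives "abahi" and lpad("hello", 3, "x") gives "hel".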

  242. def ltrim(e: Column, trimString: String): Column

    Trim the specified character string from left end for the specified string column.

    Since

    2.3.0

  243. def ltrim(e: Column): Column

    Trim the spaces from left end for the specified string value.

    Since

    1.5.0

  244. def make_date(year: Column, month: Column, day: Column): Column

    returns

    A date created from year, month and day fields.

    Since

    3.3.0

  245. def map(cols: Column*): Column

    Creates a new map column. The input columns must be grouped as key-value pairs, e.g. (key1, value1, key2, value2, ...). The key columns must all have the same data type, and can't be null. The value columns must all have the same data type.

    Annotations
    @varargs()
    Since

    2.0

  246. def map_concat(cols: Column*): Column

    Returns the union of all the given maps.

    Annotations
    @varargs()
    Since

    2.4.0

  247. def map_contains_key(column: Column, key: Any): Column

    Returns true if the map contains the key.

    Since

    3.3.0

  248. def map_entries(e: Column): Column

    Returns an unordered array of all entries in the given map.

    Since

    3.0.0

  249. def map_filter(expr: Column, f: (Column, Column) ⇒ Column): Column

    Returns a map whose key-value pairs satisfy a predicate.

    df.select(map_filter(col("m"), (k, v) => k * 10 === v))
    expr

    the input map column

    f

    (key, value) => predicate, the Boolean predicate to filter the input map column

    Since

    3.0.0

  250. def map_from_arrays(keys: Column, values: Column): Column

    Creates a new map column. The array in the first column is used for keys. The array in the second column is used for values. No element of the key array may be null.

    Since

    2.4

  251. def map_from_entries(e: Column): Column

    Returns a map created from the given array of entries.

    Since

    2.4.0

  252. def map_keys(e: Column): Column

    Returns an unordered array containing the keys of the map.

    Since

    2.3.0

  253. def map_values(e: Column): Column

    Returns an unordered array containing the values of the map.

    Since

    2.3.0

  254. def map_zip_with(left: Column, right: Column, f: (Column, Column, Column) ⇒ Column): Column

    Merge two given maps, key-wise into a single map using a function.

    df.select(map_zip_with(df("m1"), df("m2"), (k, v1, v2) => k === v1 + v2))
    left

    the left input map column

    right

    the right input map column

    f

    (key, value1, value2) => new_value, the lambda function to merge the map values

    Since

    3.0.0
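
    The key-wise merge performed by map_zip_with can be modelled on ordinary Scala maps. This is a sketch, not Spark's implementation: MapZipWithSketch is a hypothetical name, and it assumes that a key present in only one map is passed to the function with a missing (null) value for the other side, modelled here as None.

```scala
object MapZipWithSketch {
  // Applies f to every key in the union of both key sets.
  def mapZipWith[K, V1, V2, R](left: Map[K, V1], right: Map[K, V2])(
      f: (K, Option[V1], Option[V2]) => R): Map[K, R] =
    (left.keySet ++ right.keySet).iterator
      .map(k => k -> f(k, left.get(k), right.get(k)))
      .toMap
}
```

    For example, merging Map("a" -> 1, "b" -> 2) with Map("b" -> 10, "c" -> 20) by summing the values, treating a missing value as 0, gives Map("a" -> 1, "b" -> 12, "c" -> 20).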

  255. def max(columnName: String): Column

    Aggregate function: returns the maximum value of the column in a group.

    Since

    1.3.0

  256. def max(e: Column): Column

    Aggregate function: returns the maximum value of the expression in a group.

    Since

    1.3.0

  257. def max_by(e: Column, ord: Column): Column

    Aggregate function: returns the value associated with the maximum value of ord.

    Since

    3.3.0

  258. def md5(e: Column): Column

    Calculates the MD5 digest of a binary column and returns the value as a 32 character hex string.

    Since

    1.5.0

  259. def mean(columnName: String): Column

    Aggregate function: returns the average of the values in a group. Alias for avg.

    Since

    1.4.0

  260. def mean(e: Column): Column

    Aggregate function: returns the average of the values in a group. Alias for avg.

    Since

    1.4.0

  261. def median(e: Column): Column

    Aggregate function: returns the median of the values in a group.

    Since

    3.4.0

  262. def min(columnName: String): Column

    Aggregate function: returns the minimum value of the column in a group.

    Since

    1.3.0

  263. def min(e: Column): Column

    Aggregate function: returns the minimum value of the expression in a group.

    Since

    1.3.0

  264. def min_by(e: Column, ord: Column): Column

    Aggregate function: returns the value associated with the minimum value of ord.

    Since

    3.3.0

  265. def minute(e: Column): Column

    Extracts the minutes as an integer from a given date/timestamp/string.

    returns

    An integer, or null if the input was a string that could not be cast to a date

    Since

    1.5.0

  266. def mode(e: Column): Column

    Aggregate function: returns the most frequent value in a group.

    Since

    3.4.0

  267. def monotonically_increasing_id(): Column

    A column expression that generates monotonically increasing 64-bit integers.

    The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has fewer than 1 billion partitions, and each partition has fewer than 8 billion records.

    As an example, consider a DataFrame with two partitions, each with 3 records. This expression would return the following IDs:

    0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594.
    Since

    1.6.0
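
    The bit layout described for monotonically_increasing_id can be reproduced with plain arithmetic. MonotonicIdSketch and id are hypothetical names used only to illustrate the layout.

```scala
object MonotonicIdSketch {
  // Partition ID goes in the upper 31 bits, the record number within the
  // partition in the lower 33 bits.
  def id(partitionId: Int, recordNumber: Long): Long =
    (partitionId.toLong << 33) + recordNumber
}
```

    The second partition (partitionId = 1) therefore starts at 1L << 33 = 8589934592, matching the example IDs listed for this function.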

  268. def month(e: Column): Column

    Extracts the month as an integer from a given date/timestamp/string.

    returns

    An integer, or null if the input was a string that could not be cast to a date

    Since

    1.5.0

  269. def months(e: Column): Column

    A transform for timestamps and dates to partition data into months.

    Since

    3.0.0

  270. def months_between(end: Column, start: Column, roundOff: Boolean): Column

    Returns number of months between dates end and start. If roundOff is set to true, the result is rounded off to 8 digits; it is not rounded otherwise.

    Since

    2.4.0

  271. def months_between(end: Column, start: Column): Column

    Returns number of months between dates start and end.

    A whole number is returned if both inputs have the same day of month or both are the last day of their respective months. Otherwise, the difference is calculated assuming 31 days per month.

    For example:

    months_between("2017-11-14", "2017-07-14")  // returns 4.0
    months_between("2017-01-01", "2017-01-10")  // returns -0.29032258
    months_between("2017-06-01", "2017-06-16 12:00:00")  // returns -0.5
    end

    A date, timestamp or string. If a string, the data must be in a format that can be cast to a timestamp, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS

    start

    A date, timestamp or string. If a string, the data must be in a format that can be cast to a timestamp, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS

    returns

    A double, or null if either end or start were strings that could not be cast to a timestamp. Negative if end is before start

    Since

    1.5.0
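
    The rule described for months_between (a whole number for matching days of month or two month-ends, otherwise a fraction based on 31-day months) can be sketched with java.time. This is a model under stated assumptions, not Spark's implementation: MonthsBetweenSketch is a hypothetical name, time zones and the 8-digit rounding are ignored, and the time of day is assumed to count only in the fractional case.

```scala
import java.time.LocalDateTime

object MonthsBetweenSketch {
  private val SecondsPerDay = 86400.0

  def monthsBetween(end: LocalDateTime, start: LocalDateTime): Double = {
    val monthDiff =
      (end.getYear * 12 + end.getMonthValue) - (start.getYear * 12 + start.getMonthValue)
    def isLastDay(t: LocalDateTime): Boolean =
      t.getDayOfMonth == t.toLocalDate.lengthOfMonth

    if (end.getDayOfMonth == start.getDayOfMonth || (isLastDay(end) && isLastDay(start)))
      monthDiff.toDouble // whole number of months
    else {
      // Remaining difference in seconds, scaled by a 31-day month.
      val secondsDiff = (end.getDayOfMonth - start.getDayOfMonth) * SecondsPerDay +
        end.toLocalTime.toSecondOfDay - start.toLocalTime.toSecondOfDay
      monthDiff + secondsDiff / (31 * SecondsPerDay)
    }
  }
}
```

    With end = 2017-06-01T00:00 and start = 2017-06-16T12:00, the day difference is -15.5 days, and -15.5 / 31 = -0.5.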

  272. def nanvl(col1: Column, col2: Column): Column

    Returns col1 if it is not NaN, or col2 if col1 is NaN.

    Both inputs should be floating point columns (DoubleType or FloatType).

    Since

    1.5.0

  273. final def ne(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  274. def negate(e: Column): Column

    Unary minus, i.e. negate the expression.

    // Select the amount column and negate all values.
    // Scala:
    df.select( -df("amount") )
    
    // Java:
    df.select( negate(df.col("amount")) );
    Since

    1.3.0

  275. def next_day(date: Column, dayOfWeek: Column): Column

    Returns the first date which is later than the value of the date column that is on the specified day of the week.

    For example, next_day('2015-07-27', "Sunday") returns 2015-08-02 because that is the first Sunday after 2015-07-27.

    date

    A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS

    dayOfWeek

    A column of the day of week. Case insensitive, and accepts: "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"

    returns

    A date, or null if date was a string that could not be cast to a date or if dayOfWeek was an invalid value

    Since

    3.2.0

  276. def next_day(date: Column, dayOfWeek: String): Column

    Returns the first date which is later than the value of the date column that is on the specified day of the week.

    For example, next_day('2015-07-27', "Sunday") returns 2015-08-02 because that is the first Sunday after 2015-07-27.

    date

    A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS

    dayOfWeek

    Case insensitive, and accepts: "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"

    returns

    A date, or null if date was a string that could not be cast to a date or if dayOfWeek was an invalid value

    Since

    1.5.0
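
    The "first date later than" rule of next_day corresponds to a strictly-after adjustment, which java.time can illustrate. NextDaySketch is a hypothetical name; parsing of day-of-week abbreviations and null handling are not modelled.

```scala
import java.time.{DayOfWeek, LocalDate}
import java.time.temporal.TemporalAdjusters

object NextDaySketch {
  // First date strictly after `date` that falls on `day`.
  def nextDay(date: LocalDate, day: DayOfWeek): LocalDate =
    date.`with`(TemporalAdjusters.next(day))
}
```

    For 2015-07-27 (a Monday) and Sunday this gives 2015-08-02, as in the example for this function. Starting from a Sunday, the result is the Sunday one week later, never the input date itself.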

  277. def not(e: Column): Column

    Inversion of boolean expression, i.e. NOT.

    // Scala: select rows that are not active (isActive === false)
    df.filter( !df("isActive") )
    
    // Java:
    df.filter( not(df.col("isActive")) );
    Since

    1.3.0

  278. final def notify(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  279. final def notifyAll(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  280. def nth_value(e: Column, offset: Int): Column

    Window function: returns the value in the offset-th row of the window frame (counting from 1), or null if the window frame has fewer than offset rows.

    This is equivalent to the nth_value function in SQL.

    Since

    3.1.0

  281. def nth_value(e: Column, offset: Int, ignoreNulls: Boolean): Column

    Window function: returns the value in the offset-th row of the window frame (counting from 1), or null if the window frame has fewer than offset rows.

    It returns the offset-th non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned.

    This is equivalent to the nth_value function in SQL.

    Since

    3.1.0

  282. def ntile(n: Int): Column

    Window function: returns the ntile group id (from 1 to n inclusive) in an ordered window partition. For example, if n is 4, the first quarter of the rows will get value 1, the second quarter will get 2, the third quarter will get 3, and the last quarter will get 4.

    This is equivalent to the NTILE function in SQL.

    Since

    1.4.0
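
    The bucket assignment of ntile can be modelled in plain Scala. NtileSketch is a hypothetical name, and the sketch assumes the usual SQL convention that, when the rows do not divide evenly, the earlier buckets each take one extra row.

```scala
object NtileSketch {
  // Bucket ids (1 to n) for rowCount rows in window order.
  def ntile(rowCount: Int, n: Int): Vector[Int] = {
    val base = rowCount / n  // minimum rows per bucket
    val extra = rowCount % n // the first `extra` buckets get one more row
    (1 to n).toVector.flatMap(b => Vector.fill(base + (if (b <= extra) 1 else 0))(b))
  }
}
```

    With 8 rows and n = 4 each quarter of the rows gets one bucket id: 1, 1, 2, 2, 3, 3, 4, 4.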

  283. def octet_length(e: Column): Column

    Calculates the byte length for the specified string column.

    Since

    3.3.0

  284. def overlay(src: Column, replace: Column, pos: Column): Column

    Overlay the specified portion of src with replace, starting from byte position pos of src.

    Since

    3.0.0

  285. def overlay(src: Column, replace: Column, pos: Column, len: Column): Column

    Overlay the specified portion of src with replace, starting from byte position pos of src and proceeding for len bytes.

    Since

    3.0.0

  286. def percent_rank(): Column

    Window function: returns the relative rank (i.e. percentile) of rows within a window partition.

    This is computed by:

    (rank of row in its partition - 1) / (number of rows in the partition - 1)

    This is equivalent to the PERCENT_RANK function in SQL.

    Since

    1.6.0
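
    As a worked illustration of the formula above, the following plain-Scala sketch computes the relative rank for an already-sorted partition. It does not call Spark; PercentRankSketch is a hypothetical name, and returning 0.0 for a single-row partition (where the denominator would be zero) is an assumption.

```scala
object PercentRankSketch {
  // `sorted` is one window partition, already in window order.
  def percentRanks(sorted: Seq[Int]): Seq[Double] = {
    val n = sorted.length
    sorted.map { v =>
      // Tied values share the rank of their first occurrence (1-based).
      val rank = sorted.indexOf(v) + 1
      if (n <= 1) 0.0 else (rank - 1).toDouble / (n - 1)
    }
  }
}
```

    For the partition 10, 20, 20, 30 the ranks are 1, 2, 2, 4, so the relative ranks are 0, 1/3, 1/3 and 1.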

  287. def percentile_approx(e: Column, percentage: Column, accuracy: Column): Column

    Aggregate function: returns the approximate percentile of the numeric column col which is the smallest value in the ordered col values (sorted from least to greatest) such that no more than percentage of col values is less than the value or equal to that value.

    If percentage is an array, each value must be between 0.0 and 1.0. If it is a single floating point value, it must be between 0.0 and 1.0.

    The accuracy parameter is a positive numeric literal that controls approximation accuracy at the cost of memory. A higher value of accuracy yields a better approximation; 1.0/accuracy is the relative error of the approximation.

    Since

    3.1.0

  288. def pmod(dividend: Column, divisor: Column): Column

    Returns the positive value of dividend mod divisor.

    Since

    1.5.0
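
    The difference from the ordinary remainder operator shows up for negative dividends. The sketch below is a plain-Scala model, not Spark's implementation; PmodSketch is a hypothetical name, and the result is only guaranteed to be non-negative when the divisor is positive.

```scala
object PmodSketch {
  // Positive modulus: shifts a negative remainder back into the range [0, divisor).
  def pmod(dividend: Int, divisor: Int): Int = {
    val r = dividend % divisor // in Scala/Java, the sign follows the dividend
    if (r < 0) (r + divisor) % divisor else r
  }
}
```

    For example, -7 % 3 evaluates to -1 in Scala, whereas the positive modulus of -7 and 3 is 2.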

  289. def posexplode(e: Column): Column

    Creates a new row for each element with position in the given array or map column. Uses the default column name pos for position, and col for elements in the array and key and value for elements in the map unless specified otherwise.

    Since

    2.1.0

  290. def posexplode_outer(e: Column): Column

    Creates a new row for each element with position in the given array or map column. Uses the default column name pos for position, and col for elements in the array and key and value for elements in the map unless specified otherwise. Unlike posexplode, if the array/map is null or empty then the row (null, null) is produced.

    Since

    2.2.0
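
    The difference between posexplode and posexplode_outer for array input can be sketched with plain Scala collections. This model does not call Spark; PosexplodeSketch is a hypothetical name, Option stands in for SQL null, and the zero-based position is an assumption about the pos column.

```scala
object PosexplodeSketch {
  // One (pos, col) row per array element.
  def posexplode[A](arr: Seq[A]): Seq[(Int, A)] =
    arr.zipWithIndex.map { case (v, i) => (i, v) }

  // A null (None) or empty array yields a single (null, null) row instead of no rows.
  def posexplodeOuter[A](arr: Option[Seq[A]]): Seq[(Option[Int], Option[A])] =
    arr.filter(_.nonEmpty) match {
      case Some(xs) => posexplode(xs).map { case (i, v) => (Some(i), Some(v)) }
      case None     => Seq((None, None))
    }
}
```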

  291. def pow(l: Double, rightName: String): Column

    Returns the value of the first argument raised to the power of the second argument.

    Since

    1.4.0

  292. def pow(l: Double, r: Column): Column

    Returns the value of the first argument raised to the power of the second argument.

    Since

    1.4.0

  293. def pow(leftName: String, r: Double): Column

    Returns the value of the first argument raised to the power of the second argument.

    Since

    1.4.0

  294. def pow(l: Column, r: Double): Column

    Returns the value of the first argument raised to the power of the second argument.

    Since

    1.4.0

  295. def pow(leftName: String, rightName: String): Column

    Returns the value of the first argument raised to the power of the second argument.

    Since

    1.4.0

  296. def pow(leftName: String, r: Column): Column

    Returns the value of the first argument raised to the power of the second argument.

    Since

    1.4.0

  297. def pow(l: Column, rightName: String): Column

    Returns the value of the first argument raised to the power of the second argument.

    Since

    1.4.0

  298. def pow(l: Column, r: Column): Column

    Returns the value of the first argument raised to the power of the second argument.

    Since

    1.4.0

  299. def product(e: Column): Column

    Aggregate function: returns the product of all numerical elements in a group.

    Since

    3.2.0

  300. def quarter(e: Column): Column

    Extracts the quarter as an integer from a given date/timestamp/string.

    returns

    An integer, or null if the input was a string that could not be cast to a date

    Since

    1.5.0

  301. def radians(columnName: String): Column

    Converts an angle measured in degrees to an approximately equivalent angle measured in radians.

    columnName

    angle in degrees

    returns

    angle in radians, as if computed by java.lang.Math.toRadians

    Since

    2.1.0

  302. def radians(e: Column): Column

    Converts an angle measured in degrees to an approximately equivalent angle measured in radians.

    e

    angle in degrees

    returns

    angle in radians, as if computed by java.lang.Math.toRadians

    Since

    2.1.0

  303. def raise_error(c: Column): Column

    Throws an exception with the provided error message.

    Since

    3.1.0

  304. def rand(): Column

    Generate a random column with independent and identically distributed (i.i.d.) samples uniformly distributed in [0.0, 1.0).

    Since

    1.4.0

    Note

    The function is non-deterministic in the general case.

  305. def rand(seed: Long): Column

    Generate a random column with independent and identically distributed (i.i.d.) samples uniformly distributed in [0.0, 1.0).

    Since

    1.4.0

    Note

    The function is non-deterministic in the general case.

  306. def randn(): Column

    Generate a column with independent and identically distributed (i.i.d.) samples from the standard normal distribution.

    Since

    1.4.0

    Note

    The function is non-deterministic in the general case.

  307. def randn(seed: Long): Column

    Generate a column with independent and identically distributed (i.i.d.) samples from the standard normal distribution.

    Since

    1.4.0

    Note

    The function is non-deterministic in the general case.

  308. def rank(): Column

    Window function: returns the rank of rows within a window partition.

    The difference between rank and dense_rank is that dense_rank leaves no gaps in the ranking sequence when there are ties. That is, if you were ranking a competition using dense_rank and had three people tie for second place, you would say that all three were in second place and that the next person came in third. rank, by contrast, leaves a gap after the ties, so the person who finished after the three tied for second place would register as coming in fifth.

    This is equivalent to the RANK function in SQL.

    Since

    1.4.0
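
    The competition example above can be reproduced with a small plain-Scala model. It does not call Spark; RankSketch is a hypothetical name, and the input is assumed to be already sorted in window order.

```scala
object RankSketch {
  // rank: tied values share the 1-based position of their first occurrence,
  // which leaves gaps after a group of ties.
  def rank[A](sorted: Seq[A]): Seq[Int] =
    sorted.map(v => sorted.indexOf(v) + 1)

  // dense_rank: tied values share a rank and the next distinct value gets the next integer.
  def denseRank[A](sorted: Seq[A]): Seq[Int] = {
    val distinctValues = sorted.distinct
    sorted.map(v => distinctValues.indexOf(v) + 1)
  }
}
```

    For the scores 100, 90, 90, 90, 80 (three people tied for second place), the last competitor gets 5 from rank and 3 from dense_rank.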

  309. def regexp_extract(e: Column, exp: String, groupIdx: Int): Column

    Extract a specific group matched by a Java regex, from the specified string column. If the regex did not match, or the specified group did not match, an empty string is returned. If the specified group index exceeds the group count of the regex, an IllegalArgumentException will be thrown.

    Since

    1.5.0
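
    The three outcomes described above (group matched, no match, group index out of range) can be sketched on a single string with java.util.regex. This is a plain-Scala model rather than Spark's implementation; RegexpExtractSketch is a hypothetical name, and using the first match found in the input is an assumption.

```scala
import java.util.regex.Pattern

object RegexpExtractSketch {
  def regexpExtract(input: String, exp: String, groupIdx: Int): String = {
    val matcher = Pattern.compile(exp).matcher(input)
    // require throws IllegalArgumentException when the index exceeds the group count.
    require(groupIdx >= 0 && groupIdx <= matcher.groupCount(),
      s"group index $groupIdx exceeds group count ${matcher.groupCount()}")
    // An unmatched regex, or an optional group that did not participate, yields "".
    if (matcher.find()) Option(matcher.group(groupIdx)).getOrElse("") else ""
  }
}
```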

  310. def regexp_replace(e: Column, pattern: Column, replacement: Column): Column

    Replace all substrings of the specified string value that match pattern with replacement.

    Since

    2.1.0

  311. def regexp_replace(e: Column, pattern: String, replacement: String): Column

    Replace all substrings of the specified string value that match pattern with replacement.

    Since

    1.5.0

  312. def repeat(str: Column, n: Int): Column

    Repeats a string column n times, and returns it as a new string column.

    Since

    1.5.0

  313. def reverse(e: Column): Column

    Returns a reversed string or an array with reverse order of elements.

    Since

    1.5.0

  314. def rint(columnName: String): Column

    Returns the double value that is closest in value to the argument and is equal to a mathematical integer.

    Since

    1.4.0

  315. def rint(e: Column): Column

    Returns the double value that is closest in value to the argument and is equal to a mathematical integer.

    Since

    1.4.0

  316. def round(e: Column, scale: Int): Column

    Round the value of e to scale decimal places with HALF_UP round mode if scale is greater than or equal to 0 or at integral part when scale is less than 0.

    Since

    1.5.0
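
    HALF_UP rounding and the effect of a negative scale can be illustrated with java.math.BigDecimal. This is a sketch of the rounding mode on decimal values only, not of Spark's implementation; RoundSketch is a hypothetical name.

```scala
import java.math.{BigDecimal => JBigDecimal, RoundingMode}

object RoundSketch {
  // scale >= 0 keeps that many decimal places; scale < 0 rounds within the integral part.
  def roundHalfUp(value: String, scale: Int): JBigDecimal =
    new JBigDecimal(value).setScale(scale, RoundingMode.HALF_UP)
}
```

    Under HALF_UP, ties round away from zero, so 2.5 becomes 3 and -2.5 becomes -3, while a scale of -2 rounds 1250 to 1300.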

  317. def round(e: Column): Column

    Returns the value of the column e rounded to 0 decimal places with HALF_UP round mode.

    Since

    1.5.0

  318. def row_number(): Column

    Window function: returns a sequential number starting at 1 within a window partition.

    Since

    1.6.0

  319. def rpad(str: Column, len: Int, pad: Array[Byte]): Column

    Right-pad the binary column with pad to a byte length of len. If the binary column is longer than len, the return value is shortened to len bytes.

    Since

    3.3.0

  320. def rpad(str: Column, len: Int, pad: String): Column

    Right-pad the string column with pad to a length of len. If the string column is longer than len, the return value is shortened to len characters.

    Since

    1.5.0

  321. def rtrim(e: Column, trimString: String): Column

    Trim the specified character string from right end for the specified string column.

    Since

    2.3.0

  322. def rtrim(e: Column): Column

    Trim the spaces from right end for the specified string value.

    Since

    1.5.0

  323. def schema_of_csv(csv: Column, options: Map[String, String]): Column

    Parses a CSV string and infers its schema in DDL format using options.

    csv

    a foldable string column containing a CSV string.

    options

    options to control how the CSV is parsed. Accepts the same options as the CSV data source. See Data Source Option in the version you use.

    returns

    a column with string literal containing schema in DDL format.

    Since

    3.0.0

  324. def schema_of_csv(csv: Column): Column

    Parses a CSV string and infers its schema in DDL format.

    csv

    a foldable string column containing a CSV string.

    Since

    3.0.0

  325. def schema_of_csv(csv: String): Column

    Parses a CSV string and infers its schema in DDL format.

    csv

    a CSV string.

    Since

    3.0.0

  326. def schema_of_json(json: Column, options: Map[String, String]): Column

    Parses a JSON string and infers its schema in DDL format using options.

    json

    a foldable string column containing JSON data.

    options

    options to control how the JSON is parsed. Accepts the same options as the JSON data source. See Data Source Option in the version you use.

    returns

    a column with string literal containing schema in DDL format.

    Since

    3.0.0

  327. def schema_of_json(json: Column): Column

    Parses a JSON string and infers its schema in DDL format.

    json

    a foldable string column containing a JSON string.

    Since

    2.4.0

  328. def schema_of_json(json: String): Column

    Parses a JSON string and infers its schema in DDL format.

    json

    a JSON string.

    Since

    2.4.0

  329. def sec(e: Column): Column

    e

    angle in radians

    returns

    secant of the angle

    Since

    3.3.0

  330. def second(e: Column): Column

    Extracts the seconds as an integer from a given date/timestamp/string.

    returns

    An integer, or null if the input was a string that could not be cast to a timestamp

    Since

    1.5.0

  331. def sentences(string: Column): Column

    Splits a string into arrays of sentences, where each sentence is an array of words. The default locale is used.

    Since

    3.2.0

  332. def sentences(string: Column, language: Column, country: Column): Column

    Splits a string into arrays of sentences, where each sentence is an array of words.

    Since

    3.2.0

  333. def sequence(start: Column, stop: Column): Column

    Generate a sequence of integers from start to stop, incrementing by 1 if start is less than or equal to stop, otherwise -1.

    Since

    2.4.0

  334. def sequence(start: Column, stop: Column, step: Column): Column

    Generate a sequence of integers from start to stop, incrementing by step.

    Since

    2.4.0
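
    The default step rule and the explicit step variant can be modelled with Scala ranges. This sketch does not call Spark; SequenceSketch is a hypothetical name, and a step whose sign points away from stop simply yields an empty result here, which is a property of Scala ranges and not a statement about Spark's behavior.

```scala
object SequenceSketch {
  // Default step: 1 when start <= stop, otherwise -1.
  def sequence(start: Int, stop: Int): Seq[Int] =
    sequence(start, stop, if (start <= stop) 1 else -1)

  // Explicit step; both start and stop are inclusive when the step lands on stop.
  def sequence(start: Int, stop: Int, step: Int): Seq[Int] = {
    require(step != 0, "step must be non-zero")
    (start to stop by step).toList
  }
}
```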

  335. def session_window(timeColumn: Column, gapDuration: Column): Column

    Generates a session window given a timestamp-specifying column.

    A session window is a dynamic window: its length varies according to the given inputs. For a static gap duration, the end of the session window is defined as "the timestamp of the latest input of the session + gap duration", so when new inputs are bound to the current session window, its end time can be extended accordingly.

    Besides a static gap duration value, users can also provide an expression to specify the gap duration dynamically based on the input row. With a dynamic gap duration, the closing of a session window no longer depends on the latest input. A session window's range is the union of all events' ranges, which are determined by the event start time and the gap duration evaluated during query execution. Note that rows with a negative or zero gap duration are filtered out from the aggregation.

    Windows can support microsecond precision. A gapDuration on the order of months is not supported.

    For a streaming query, you may use the function current_timestamp to generate windows on processing time.

    timeColumn

    The column or the expression to use as the timestamp for windowing by time. The time column must be of TimestampType or TimestampNTZType.

    gapDuration

    A column specifying the timeout of the session. It can be a static value, e.g. 10 minutes, 1 second, or an expression/UDF that specifies the gap duration dynamically based on the input row.

    Since

    3.2.0

  336. def session_window(timeColumn: Column, gapDuration: String): Column

    Generates a session window given a timestamp-specifying column.

    A session window is a dynamic window: its length varies according to the given inputs. The end of the session window is defined as "the timestamp of the latest input of the session + gap duration", so when new inputs are bound to the current session window, its end time can be extended accordingly.

    Windows can support microsecond precision. A gapDuration on the order of months is not supported.

    For a streaming query, you may use the function current_timestamp to generate windows on processing time.

    timeColumn

    The column or the expression to use as the timestamp for windowing by time. The time column must be of TimestampType or TimestampNTZType.

    gapDuration

    A string specifying the timeout of the session, e.g. 10 minutes, 1 second. Check org.apache.spark.unsafe.types.CalendarInterval for valid duration identifiers.

    Since

    3.2.0
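
    The way a static gap duration extends a session as new inputs arrive can be sketched in plain Scala. This model does not call Spark; SessionWindowSketch and Session are hypothetical names, timestamps are plain Long values assumed to be sorted, and treating the window as half-open (an event exactly at the current end starts a new session) is an assumption.

```scala
object SessionWindowSketch {
  final case class Session(start: Long, end: Long, count: Int)

  // `sortedTimes` are event timestamps in ascending order; `gap` uses the same unit.
  def sessionize(sortedTimes: Seq[Long], gap: Long): List[Session] = {
    require(gap > 0, "gap must be positive")
    sortedTimes.foldLeft(List.empty[Session]) {
      // The event falls inside the open session: extend its end to t + gap.
      case (current :: closed, t) if t < current.end =>
        current.copy(end = t + gap, count = current.count + 1) :: closed
      // Otherwise the event opens a new session.
      case (sessions, t) =>
        Session(t, t + gap, 1) :: sessions
    }.reverse
  }
}
```

    With events at 0, 5, 20 and 22 and a gap of 10, the first session covers 0 to 15 (extended by the event at 5) and the second covers 20 to 32.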

  337. def sha1(e: Column): Column

    Calculates the SHA-1 digest of a binary column and returns the value as a 40 character hex string.

    Since

    1.5.0

  338. def sha2(e: Column, numBits: Int): Column

    Calculates the SHA-2 family of hash functions of a binary column and returns the value as a hex string.

    e

    column to compute SHA-2 on.

    numBits

    one of 224, 256, 384, or 512.

    Since

    1.5.0
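
    How numBits selects a member of the SHA-2 family, and how the result is rendered as a hex string, can be sketched with java.security.MessageDigest. This is a plain-Scala model on a single string, not Spark's implementation; Sha2Sketch is a hypothetical name, and encoding the input as UTF-8 is an assumption.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

object Sha2Sketch {
  def sha2Hex(input: String, numBits: Int): String = {
    require(Set(224, 256, 384, 512).contains(numBits),
      "numBits must be one of 224, 256, 384, or 512")
    val digest = MessageDigest
      .getInstance(s"SHA-$numBits")
      .digest(input.getBytes(StandardCharsets.UTF_8))
    // Two lowercase hex characters per byte, so the string has numBits / 4 characters.
    digest.map(b => f"${b & 0xff}%02x").mkString
  }
}
```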

  339. def shiftleft(e: Column, numBits: Int): Column

    Shift the given value numBits left. If the given value is a long value, this function will return a long value else it will return an integer value.

    Since

    3.2.0

  340. def shiftright(e: Column, numBits: Int): Column

    (Signed) shift the given value numBits right. If the given value is a long value, it will return a long value else it will return an integer value.

    Since

    3.2.0

  341. def shiftrightunsigned(e: Column, numBits: Int): Column

    Unsigned shift the given value numBits right. If the given value is a long value, it will return a long value else it will return an integer value.

    Since

    3.2.0
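
    The difference between the signed and unsigned right shifts, and the dependence of the result on integer versus long input, can be seen with Scala's own shift operators. This sketch does not call Spark; ShiftSketch and its method names are hypothetical.

```scala
object ShiftSketch {
  def shiftLeft(value: Int, numBits: Int): Int = value << numBits

  // Signed shift: the sign bit is copied into the vacated high bits.
  def shiftRight(value: Int, numBits: Int): Int = value >> numBits

  // Unsigned shift: the vacated high bits are filled with zeros.
  def shiftRightUnsigned(value: Int, numBits: Int): Int = value >>> numBits

  // The same unsigned shift on a 64-bit value gives a different result.
  def shiftRightUnsignedLong(value: Long, numBits: Int): Long = value >>> numBits
}
```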

  342. def shuffle(e: Column): Column

    Returns a random permutation of the given array.

    Since

    2.4.0

    Note

    The function is non-deterministic.

  343. def signum(columnName: String): Column

    Computes the signum of the given column.

    Since

    1.4.0

  344. def signum(e: Column): Column

    Computes the signum of the given value.

    Since

    1.4.0

  345. def sin(columnName: String): Column

    columnName

    angle in radians

    returns

    sine of the angle, as if computed by java.lang.Math.sin

    Since

    1.4.0

  346. def sin(e: Column): Column

    e

    angle in radians

    returns

    sine of the angle, as if computed by java.lang.Math.sin

    Since

    1.4.0

  347. def sinh(columnName: String): Column

    columnName

    hyperbolic angle

    returns

    hyperbolic sine of the given value, as if computed by java.lang.Math.sinh

    Since

    1.4.0

  348. def sinh(e: Column): Column

    e

    hyperbolic angle

    returns

    hyperbolic sine of the given value, as if computed by java.lang.Math.sinh

    Since

    1.4.0

  349. def size(e: Column): Column

    Returns length of array or map.

    The function returns null for null input if spark.sql.legacy.sizeOfNull is set to false or spark.sql.ansi.enabled is set to true. Otherwise, the function returns -1 for null input. With the default settings, the function returns -1 for null input.

    Since

    1.5.0

  350. def skewness(columnName: String): Column

    Aggregate function: returns the skewness of the values in a group.

    Since

    1.6.0

  351. def skewness(e: Column): Column

    Aggregate function: returns the skewness of the values in a group.

    Since

    1.6.0

  352. def slice(x: Column, start: Column, length: Column): Column

    Returns an array containing all the elements in x from index start (or starting from the end if start is negative) with the specified length.

    x

    the array column to be sliced

    start

    the starting index

    length

    the length of the slice

    Since

    3.1.0

  353. def slice(x: Column, start: Int, length: Int): Column

    Returns an array containing all the elements in x from index start (or starting from the end if start is negative) with the specified length.

    x

    the array column to be sliced

    start

    the starting index

    length

    the length of the slice

    Since

    2.4.0
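
    The indexing convention (positive start counts from the front, negative start counts from the end) can be modelled on a plain Scala sequence. This sketch does not call Spark; SliceSketch is a hypothetical name, and the 1-based positive index and the empty result for a negative start that reaches past the beginning are assumptions.

```scala
object SliceSketch {
  def slice[A](x: Seq[A], start: Int, length: Int): Seq[A] = {
    require(start != 0, "start must not be 0")
    require(length >= 0, "length must not be negative")
    // Positive start is 1-based from the front; negative start counts back from the end.
    val from = if (start > 0) start - 1 else x.length + start
    if (from < 0) Seq.empty else x.slice(from, from + length)
  }
}
```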

  354. def sort_array(e: Column, asc: Boolean): Column

    Sorts the input array for the given column in ascending or descending order, according to the natural ordering of the array elements. NaN is greater than any non-NaN elements for double/float type. Null elements will be placed at the beginning of the returned array in ascending order or at the end of the returned array in descending order.

    Since

    1.5.0
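
    The placement of null elements described above can be modelled with Option values, since Scala's standard ordering on Option places None before any Some. This sketch does not call Spark; SortArraySketch is a hypothetical name, and None stands in for a SQL null element.

```scala
object SortArraySketch {
  def sortArray(xs: Seq[Option[Int]], asc: Boolean = true): Seq[Option[Int]] = {
    val ord = Ordering[Option[Int]] // standard library ordering: None sorts before Some
    // Reversing the ordering for the descending case moves None to the end.
    if (asc) xs.sorted(ord) else xs.sorted(ord.reverse)
  }
}
```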

  355. def sort_array(e: Column): Column

    Sorts the input array for the given column in ascending order, according to the natural ordering of the array elements. Null elements will be placed at the beginning of the returned array.

    Since

    1.5.0

  356. def soundex(e: Column): Column

    Returns the soundex code for the specified expression.

    Since

    1.5.0

  357. def spark_partition_id(): Column

    Partition ID.

    Since

    1.6.0

    Note

    This is non-deterministic because it depends on data partitioning and task scheduling.

  358. def split(str: Column, pattern: String, limit: Int): Column

    Splits str around matches of the given pattern.

    str

    a string expression to split

    pattern

    a string representing a regular expression. The regex string should be a Java regular expression.

    limit

    an integer expression which controls the number of times the regex is applied.

    • limit greater than 0: The resulting array's length will not be more than limit, and the resulting array's last entry will contain all input beyond the last matched regex.
    • limit less than or equal to 0: regex will be applied as many times as possible, and the resulting array can be of any size.
    Since

    3.0.0
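
    The two limit cases listed above can be sketched with java.lang.String.split, which also takes a Java regular expression. This is a plain-Scala model, not Spark's implementation; SplitSketch is a hypothetical name, and mapping every non-positive limit to Java's -1 (so that trailing empty strings are kept) is an assumption.

```scala
object SplitSketch {
  def split(str: String, pattern: String, limit: Int): Seq[String] =
    // limit > 0: at most `limit` entries, with the last entry holding the remaining input.
    // limit <= 0: the regex is applied as many times as possible.
    str.split(pattern, if (limit > 0) limit else -1).toSeq
}
```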

  359. def split(str: Column, pattern: String): Column

    Splits str around matches of the given pattern.

    str

    a string expression to split

    pattern

    a string representing a regular expression. The regex string should be a Java regular expression.

    Since

    1.5.0

  360. def sqrt(colName: String): Column

    Computes the square root of the specified float value.

    Since

    1.5.0

  361. def sqrt(e: Column): Column

    Computes the square root of the specified float value.

    Since

    1.3.0

  362. def stddev(columnName: String): Column

    Aggregate function: alias for stddev_samp.

    Since

    1.6.0

  363. def stddev(e: Column): Column

    Aggregate function: alias for stddev_samp.

    Since

    1.6.0

  364. def stddev_pop(columnName: String): Column

    Aggregate function: returns the population standard deviation of the expression in a group.

    Since

    1.6.0

  365. def stddev_pop(e: Column): Column

    Aggregate function: returns the population standard deviation of the expression in a group.

    Since

    1.6.0

  366. def stddev_samp(columnName: String): Column

    Aggregate function: returns the sample standard deviation of the expression in a group.

    Since

    1.6.0

  367. def stddev_samp(e: Column): Column

    Aggregate function: returns the sample standard deviation of the expression in a group.

    Since

    1.6.0

  368. def struct(colName: String, colNames: String*): Column

    Creates a new struct column that composes multiple input columns.

    Annotations
    @varargs()
    Since

    1.4.0

  369. def struct(cols: Column*): Column

    Creates a new struct column. If the input column is a column in a DataFrame, or a derived column expression that is named (i.e. aliased), its name would be retained as the StructField's name, otherwise, the newly generated StructField's name would be auto generated as col with a suffix index + 1, i.e. col1, col2, col3, ...

    Annotations
    @varargs()
    Since

    1.4.0

  370. def substring(str: Column, pos: Int, len: Int): Column

    Returns the substring that starts at pos and is of length len when str is String type, or the slice of the byte array that starts at byte position pos and is of length len when str is Binary type.

    Since

    1.5.0

    Note

    The position is a 1-based index, not a zero-based one.

  371. def substring_index(str: Column, delim: String, count: Int): Column

    Returns the substring from string str before count occurrences of the delimiter delim. If count is positive, everything to the left of the final delimiter (counting from the left) is returned. If count is negative, everything to the right of the final delimiter (counting from the right) is returned. substring_index performs a case-sensitive match when searching for delim.
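
    The positive and negative count cases can be modelled on a single string as follows. This is a plain-Scala sketch that does not call Spark; SubstringIndexSketch is a hypothetical name, and returning an empty string for a count of 0 or an empty delimiter, and the whole input when there are fewer delimiters than count, are assumptions.

```scala
import java.util.regex.Pattern

object SubstringIndexSketch {
  def substringIndex(str: String, delim: String, count: Int): String =
    if (count == 0 || delim.isEmpty) ""
    else {
      // Quote the delimiter so it is matched literally (and case-sensitively).
      val parts = str.split(Pattern.quote(delim), -1)
      if (count > 0) parts.take(count).mkString(delim) // left of the count-th delimiter
      else parts.takeRight(-count).mkString(delim)     // right of the count-th delimiter from the end
    }
}
```

    For "www.apache.org" with delimiter "." the model gives "www.apache" for count 2 and "apache.org" for count -2.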

  372. def sum(columnName: String): Column

    Aggregate function: returns the sum of all values in the given column.

    Since

    1.3.0

  373. def sum(e: Column): Column

    Aggregate function: returns the sum of all values in the expression.

    Since

    1.3.0

  374. def sum_distinct(e: Column): Column

    Aggregate function: returns the sum of distinct values in the expression.

    Since

    3.2.0

  375. final def synchronized[T0](arg0: ⇒ T0): T0
    Definition Classes
    AnyRef
  376. def tan(columnName: String): Column

    columnName

    angle in radians

    returns

    tangent of the given value, as if computed by java.lang.Math.tan

    Since

    1.4.0

  377. def tan(e: Column): Column

    e

    angle in radians

    returns

    tangent of the given value, as if computed by java.lang.Math.tan

    Since

    1.4.0

  378. def tanh(columnName: String): Column

    columnName

    hyperbolic angle

    returns

    hyperbolic tangent of the given value, as if computed by java.lang.Math.tanh

    Since

    1.4.0

  379. def tanh(e: Column): Column

    e

    hyperbolic angle

    returns

    hyperbolic tangent of the given value, as if computed by java.lang.Math.tanh

    Since

    1.4.0

  380. def timestamp_seconds(e: Column): Column

    Converts the number of seconds from the Unix epoch (1970-01-01T00:00:00Z) to a timestamp.

    Since

    3.1.0

  381. def toString(): String
    Definition Classes
    AnyRef → Any
  382. def to_csv(e: Column): Column

    Converts a column containing a StructType into a CSV string with the specified schema. Throws an exception, in the case of an unsupported type.

    e

    a column containing a struct.

    Since

    3.0.0

  383. def to_csv(e: Column, options: Map[String, String]): Column

    (Java-specific) Converts a column containing a StructType into a CSV string with the specified schema. Throws an exception, in the case of an unsupported type.

    e

    a column containing a struct.

    options

    options to control how the struct column is converted into a CSV string. It accepts the same options as the CSV data source. See Data Source Option in the version you use.

    Since

    3.0.0

  384. def to_date(e: Column, fmt: String): Column

    Converts the column into a DateType with a specified format.

    See Datetime Patterns for valid date and time format patterns

    e

    A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS

    fmt

    A date time pattern detailing the format of e when e is a string

    returns

    A date, or null if e was a string that could not be cast to a date or fmt was an invalid format

    Since

    2.2.0
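
    A minimal sketch of the pattern-based overload, assuming a local SparkSession and ANSI mode switched off so that unparseable strings produce null as described above. The object name ToDateSketch, the column name raw and the sample values are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_date}

object ToDateSketch {
  // Local session for illustration only; master and app name are arbitrary.
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]").appName("to_date-sketch").getOrCreate()
  // Assumption: with ANSI mode off, a string that cannot be parsed yields null.
  spark.conf.set("spark.sql.ansi.enabled", "false")
  import spark.implicits._

  val raw = Seq("19/11/2018", "not a date").toDF("raw")

  // Parse with an explicit pattern, then cast back to string for easy inspection.
  val parsed: Seq[Option[String]] = raw
    .select(to_date(col("raw"), "dd/MM/yyyy").cast("string"))
    .collect()
    .map(row => Option(row.getString(0)))
    .toSeq
}
```

    Under these assumptions the first row parses to 2018-11-19 and the second row is null.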

  385. def to_date(e: Column): Column

    Converts the column into DateType by casting rules to DateType.

    Since

    1.5.0

  386. def to_json(e: Column): Column

    Converts a column containing a StructType, ArrayType or a MapType into a JSON string with the specified schema. Throws an exception, in the case of an unsupported type.

    e

    a column containing a struct, an array or a map.

    Since

    2.1.0

  387. def to_json(e: Column, options: java.util.Map[String, String]): Column

    (Java-specific) Converts a column containing a StructType, ArrayType or a MapType into a JSON string with the specified schema. Throws an exception, in the case of an unsupported type.

    e

    a column containing a struct, an array or a map.

    options

    options to control how the struct column is converted into a JSON string. Accepts the same options as the JSON data source. See Data Source Option in the version you use. Additionally, the function supports the pretty option, which enables pretty JSON generation.

    Since

    2.1.0

  388. def to_json(e: Column, options: Map[String, String]): Column

    (Scala-specific) Converts a column containing a StructType, ArrayType or a MapType into a JSON string with the specified schema. Throws an exception, in the case of an unsupported type.

    e

    a column containing a struct, an array or a map.

    options

    options to control how the struct column is converted into a JSON string. Accepts the same options as the JSON data source. See Data Source Option in the version you use. Additionally, the function supports the pretty option, which enables pretty JSON generation.

    Since

    2.1.0
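
    The following sketch contrasts the default output with the pretty option, using the Scala-specific overload. It assumes a local SparkSession; the object name ToJsonSketch and the columns id and name are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct, to_json}

object ToJsonSketch {
  // Local session for illustration only.
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]").appName("to_json-sketch").getOrCreate()
  import spark.implicits._

  val df = Seq((1, "a")).toDF("id", "name")
  val record = struct(col("id"), col("name"))

  // Default: compact, single-line JSON.
  val compact: String = df.select(to_json(record)).as[String].head()

  // Scala Map overload with the pretty option enabled: multi-line JSON.
  val pretty: String =
    df.select(to_json(record, Map("pretty" -> "true"))).as[String].head()
}
```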

  389. def to_timestamp(s: Column, fmt: String): Column

    Converts a time string with the given pattern to a timestamp.

    See Datetime Patterns for valid date and time format patterns.

    s

    A date, timestamp or string. If a string, the data must be in a format that can be cast to a timestamp, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS

    fmt

    A date time pattern detailing the format of s when s is a string

    returns

    A timestamp, or null if s was a string that could not be cast to a timestamp or fmt was an invalid format

    Since

    2.2.0

  390. def to_timestamp(s: Column): Column

    Converts to a timestamp by casting rules to TimestampType.

    s

    A date, timestamp or string. If a string, the data must be in a format that can be cast to a timestamp, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS

    returns

    A timestamp, or null if the input was a string that could not be cast to a timestamp

    Since

    2.2.0

  391. def to_utc_timestamp(ts: Column, tz: Column): Column

    Given a timestamp like '2017-07-14 02:40:00.0', interprets it as a time in the given time zone, and renders that time as a timestamp in UTC. For example, 'GMT+1' would yield '2017-07-14 01:40:00.0'.

    Since

    2.4.0

  392. def to_utc_timestamp(ts: Column, tz: String): Column

    Given a timestamp like '2017-07-14 02:40:00.0', interprets it as a time in the given time zone, and renders that time as a timestamp in UTC. For example, 'GMT+1' would yield '2017-07-14 01:40:00.0'.

    ts

    A date, timestamp or string. If a string, the data must be in a format that can be cast to a timestamp, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS

    tz

    A string detailing the time zone ID that the input should be adjusted to. It should be in the format of either region-based zone IDs or zone offsets. Region IDs must have the form 'area/city', such as 'America/Los_Angeles'. Zone offsets must be in the format '(+|-)HH:mm', for example '-08:00' or '+01:00'. Also 'UTC' and 'Z' are supported as aliases of '+00:00'. Other short names are not recommended to use because they can be ambiguous.

    returns

    A timestamp, or null if ts was a string that could not be cast to a timestamp or tz was an invalid value

    Since

    1.5.0
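
    As an illustration of the two tz formats described above (zone offset and region-based zone ID), here is a small sketch. It assumes a local SparkSession with the session time zone pinned to UTC so that string parsing and rendering are reproducible; the object name, the helper render and the column name local are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, date_format, to_utc_timestamp}

object ToUtcTimestampSketch {
  // Local session for illustration only.
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]").appName("to_utc_timestamp-sketch").getOrCreate()
  // Assumption: pin the session time zone so parsing and rendering are stable.
  spark.conf.set("spark.sql.session.timeZone", "UTC")
  import spark.implicits._

  val df = Seq("2017-07-14 02:40:00").toDF("local")

  // Interpret the wall-clock time in `tz`, then render the UTC result as text.
  private def render(tz: String): String = df
    .select(date_format(to_utc_timestamp(col("local"), tz), "yyyy-MM-dd HH:mm:ss"))
    .as[String]
    .head()

  val fromOffset: String = render("+01:00")               // zone offset
  val fromRegion: String = render("America/Los_Angeles")  // region-based zone ID
}
```

    With the offset +01:00 the result is one hour earlier (01:40:00); with America/Los_Angeles, which is seven hours behind UTC on that date, the result is 09:40:00.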

  393. def transform(column: Column, f: (Column, Column) ⇒ Column): Column

    Returns an array of elements after applying a transformation to each element in the input array.

    df.select(transform(col("i"), (x, i) => x + i))
    column

    the input array column

    f

    (col, index) => transformed_col, the lambda function to transform the input column given the index. Indices start at 0.

    Since

    3.0.0

  394. def transform(column: Column, f: (Column) ⇒ Column): Column

    Returns an array of elements after applying a transformation to each element in the input array.

    df.select(transform(col("i"), x => x + 1))
    column

    the input array column

    f

    col => transformed_col, the lambda function to transform the input column

    Since

    3.0.0

  395. def transform_keys(expr: Column, f: (Column, Column) ⇒ Column): Column

    Applies a function to every key-value pair in a map and returns a map with the results of those applications as the new keys for the pairs.

    df.select(transform_keys(col("i"), (k, v) => k + v))
    expr

    the input map column

    f

    (key, value) => new_key, the lambda function to transform the key of input map column

    Since

    3.0.0

  396. def transform_values(expr: Column, f: (Column, Column) ⇒ Column): Column

    Applies a function to every key-value pair in a map and returns a map with the results of those applications as the new values for the pairs.

    df.select(transform_values(col("i"), (k, v) => k + v))
    expr

    the input map column

    f

    (key, value) => new_value, the lambda function to transform the value of input map column

    Since

    3.0.0
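
    To make the difference between transform_keys and transform_values concrete, the sketch below applies both to the same map column. It assumes a local SparkSession; the object name MapTransformSketch, the column name m and the sample map are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, transform_keys, transform_values, upper}

object MapTransformSketch {
  // Local session for illustration only.
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]").appName("map-transform-sketch").getOrCreate()
  import spark.implicits._

  val df = Seq(Map("a" -> 1, "b" -> 2)).toDF("m")

  private val row = df.select(
    transform_keys(col("m"), (k, _) => upper(k)),   // new keys, values kept
    transform_values(col("m"), (_, v) => v * 10)    // keys kept, new values
  ).head()

  val upperKeys: Map[String, Int] = row.getMap[String, Int](0).toMap
  val scaledValues: Map[String, Int] = row.getMap[String, Int](1).toMap
}
```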

  397. def translate(src: Column, matchingString: String, replaceString: String): Column

    Translates any character in src that matches a character in matchingString into the corresponding character in replaceString. Characters are paired by position: the first character of matchingString maps to the first character of replaceString, and so on.

    Since

    1.5.0
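
    A short sketch of the positional mapping, assuming a local SparkSession. Here 'l' maps to '0' and 'o' maps to '1'; the object name TranslateSketch and the column name s are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, translate}

object TranslateSketch {
  // Local session for illustration only.
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]").appName("translate-sketch").getOrCreate()
  import spark.implicits._

  // matchingString "lo" and replaceString "01": l -> 0, o -> 1.
  val translated: String = Seq("hello world").toDF("s")
    .select(translate(col("s"), "lo", "01"))
    .as[String]
    .head()
}
```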

  398. def trim(e: Column, trimString: String): Column

    Trim the specified character from both ends for the specified string column.

    Since

    2.3.0

  399. def trim(e: Column): Column

    Trim the spaces from both ends for the specified string column.

    Since

    1.5.0

  400. def trunc(date: Column, format: String): Column

    Returns date truncated to the unit specified by the format.

    For example, trunc("2018-11-19 12:01:19", "year") returns 2018-01-01

    date

    A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS

    returns

    A date, or null if date was a string that could not be cast to a date or format was an invalid value

    Since

    1.5.0

  401. def typedLit[T](literal: T)(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[T]): Column

    Creates a Column of literal value.

    An alias of typedlit, and it is encouraged to use typedlit directly.

    Since

    2.2.0

  402. def typedlit[T](literal: T)(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[T]): Column

    Creates a Column of literal value.

    The passed in object is returned directly if it is already a Column. If the object is a Scala Symbol, it is converted into a Column also. Otherwise, a new Column is created to represent the literal value. The difference between this function and lit is that this function can handle parameterized Scala types, e.g. List, Seq and Map.

    Since

    3.2.0

    Note

    typedlit will call expensive Scala reflection APIs. lit is preferred if parameterized Scala types are not used.
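
    The sketch below shows lit for a plain value next to typedlit for parameterized Scala types, in line with the note above. It assumes a local SparkSession; the object name TypedLitSketch and the literal values are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, typedlit}

object TypedLitSketch {
  // Local session for illustration only.
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]").appName("typedlit-sketch").getOrCreate()

  private val row = spark.range(1).select(
    lit(42),                    // plain value: lit is sufficient
    typedlit(Seq(1, 2, 3)),     // parameterized type: becomes an array column
    typedlit(Map("a" -> 1))     // parameterized type: becomes a map column
  ).head()

  val answer: Int = row.getInt(0)
  val xs: Seq[Int] = row.getSeq[Int](1)
  val m: Map[String, Int] = row.getMap[String, Int](2).toMap
}
```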

  403. def udaf[IN, BUF, OUT](agg: expressions.Aggregator[IN, BUF, OUT], inputEncoder: Encoder[IN]): UserDefinedFunction

    Obtains a UserDefinedFunction that wraps the given Aggregator so that it may be used with untyped Data Frames.

    Aggregator<IN, BUF, OUT> agg = // custom Aggregator
    Encoder<IN> enc = // input encoder
    
    // declare a UDF based on agg
    UserDefinedFunction aggUDF = udaf(agg, enc);
    Dataset<Row> aggData = df.agg(aggUDF.apply(col("colname")));
    
    // register agg as a named function
    spark.udf().register("myAggName", udaf(agg, enc));

    IN

    the aggregator input type

    BUF

    the aggregating buffer type

    OUT

    the finalized output type

    agg

    the typed Aggregator

    inputEncoder

    a specific input encoder to use

    returns

    a UserDefinedFunction that can be used as an aggregating expression

    Note

    This overload takes an explicit input encoder, to support UDAF declarations in Java.

  404. def udaf[IN, BUF, OUT](agg: expressions.Aggregator[IN, BUF, OUT])(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[IN]): UserDefinedFunction

    Obtains a UserDefinedFunction that wraps the given Aggregator so that it may be used with untyped Data Frames.

    val agg = // Aggregator[IN, BUF, OUT]
    
    // declare a UDF based on agg
    val aggUDF = udaf(agg)
    val aggData = df.agg(aggUDF($"colname"))
    
    // register agg as a named function
    spark.udf.register("myAggName", udaf(agg))
    IN

    the aggregator input type

    BUF

    the aggregating buffer type

    OUT

    the finalized output type

    agg

    the typed Aggregator

    returns

    a UserDefinedFunction that can be used as an aggregating expression.

    Note

    The input encoder is inferred from the input type IN.
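
    For a complete picture of what the elided agg might look like, here is a minimal sketch with a hypothetical Aggregator named LongSum that sums Long values. It assumes a local SparkSession; LongSum, UdafSketch and the use of spark.range as input are illustrative choices, not part of the API.

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions.{col, udaf}

// Hypothetical typed aggregator: IN = Long, BUF = Long, OUT = Long.
object LongSum extends Aggregator[Long, Long, Long] {
  def zero: Long = 0L                                          // empty buffer
  def reduce(buffer: Long, value: Long): Long = buffer + value // fold one input
  def merge(b1: Long, b2: Long): Long = b1 + b2                // combine partial buffers
  def finish(reduction: Long): Long = reduction                // buffer to output
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

object UdafSketch {
  // Local session for illustration only.
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]").appName("udaf-sketch").getOrCreate()

  // Wrap the typed Aggregator; the input encoder is inferred from IN = Long.
  val longSum = udaf(LongSum)

  // spark.range(1, 5) holds the ids 1, 2, 3, 4.
  val total: Long = spark.range(1, 5).agg(longSum(col("id"))).head().getLong(0)
}
```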

  405. def udf(f: UDF10[_, _, _, _, _, _, _, _, _, _, _], returnType: DataType): UserDefinedFunction

    Defines a Java UDF10 instance as user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().

    Since

    2.3.0

  406. def udf(f: UDF9[_, _, _, _, _, _, _, _, _, _], returnType: DataType): UserDefinedFunction

    Defines a Java UDF9 instance as user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().

    Since

    2.3.0

  407. def udf(f: UDF8[_, _, _, _, _, _, _, _, _], returnType: DataType): UserDefinedFunction

    Defines a Java UDF8 instance as user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().

    Since

    2.3.0

  408. def udf(f: UDF7[_, _, _, _, _, _, _, _], returnType: DataType): UserDefinedFunction

    Defines a Java UDF7 instance as user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().

    Since

    2.3.0

  409. def udf(f: UDF6[_, _, _, _, _, _, _], returnType: DataType): UserDefinedFunction

    Defines a Java UDF6 instance as user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().

    Since

    2.3.0

  410. def udf(f: UDF5[_, _, _, _, _, _], returnType: DataType): UserDefinedFunction

    Defines a Java UDF5 instance as user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().

    Since

    2.3.0

  411. def udf(f: UDF4[_, _, _, _, _], returnType: DataType): UserDefinedFunction

    Defines a Java UDF4 instance as user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().

    Since

    2.3.0

  412. def udf(f: UDF3[_, _, _, _], returnType: DataType): UserDefinedFunction

    Defines a Java UDF3 instance as user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().

    Since

    2.3.0

  413. def udf(f: UDF2[_, _, _], returnType: DataType): UserDefinedFunction

    Defines a Java UDF2 instance as user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().

    Since

    2.3.0

  414. def udf(f: UDF1[_, _], returnType: DataType): UserDefinedFunction

    Defines a Java UDF1 instance as user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().

    Since

    2.3.0

  415. def udf(f: UDF0[_], returnType: DataType): UserDefinedFunction

    Defines a Java UDF0 instance as user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().

    Since

    2.3.0
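
    The Java UDF overloads above can also be called from Scala. The sketch below wraps a hypothetical UDF1 that returns the length of a string and passes the return type explicitly, since it is not inferred for this family. It assumes a local SparkSession; the names JavaUdfSketch, strLen and the column s are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.api.java.UDF1
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types.DataTypes

object JavaUdfSketch {
  // Local session for illustration only.
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]").appName("java-udf-sketch").getOrCreate()
  import spark.implicits._

  // UDF1[String, Integer]: one input, one output; the output type is given explicitly.
  val strLen = udf(new UDF1[String, Integer] {
    override def call(s: String): Integer = s.length
  }, DataTypes.IntegerType)

  val length: Int = Seq("hello").toDF("s").select(strLen(col("s"))).head().getInt(0)
}
```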

  416. def udf[RT, A1, A2, A3, A4, A5, A6, A7, A8, A9, A10](f: (A1, A2, A3, A4, A5, A6, A7, A8, A9, A10) ⇒ RT)(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[RT], arg1: scala.reflect.api.JavaUniverse.TypeTag[A1], arg2: scala.reflect.api.JavaUniverse.TypeTag[A2], arg3: scala.reflect.api.JavaUniverse.TypeTag[A3], arg4: scala.reflect.api.JavaUniverse.TypeTag[A4], arg5: scala.reflect.api.JavaUniverse.TypeTag[A5], arg6: scala.reflect.api.JavaUniverse.TypeTag[A6], arg7: scala.reflect.api.JavaUniverse.TypeTag[A7], arg8: scala.reflect.api.JavaUniverse.TypeTag[A8], arg9: scala.reflect.api.JavaUniverse.TypeTag[A9], arg10: scala.reflect.api.JavaUniverse.TypeTag[A10]): UserDefinedFunction

    Defines a Scala closure of 10 arguments as user-defined function (UDF). The data types are automatically inferred based on the Scala closure's signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().

    Since

    1.3.0

  417. def udf[RT, A1, A2, A3, A4, A5, A6, A7, A8, A9](f: (A1, A2, A3, A4, A5, A6, A7, A8, A9) ⇒ RT)(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[RT], arg1: scala.reflect.api.JavaUniverse.TypeTag[A1], arg2: scala.reflect.api.JavaUniverse.TypeTag[A2], arg3: scala.reflect.api.JavaUniverse.TypeTag[A3], arg4: scala.reflect.api.JavaUniverse.TypeTag[A4], arg5: scala.reflect.api.JavaUniverse.TypeTag[A5], arg6: scala.reflect.api.JavaUniverse.TypeTag[A6], arg7: scala.reflect.api.JavaUniverse.TypeTag[A7], arg8: scala.reflect.api.JavaUniverse.TypeTag[A8], arg9: scala.reflect.api.JavaUniverse.TypeTag[A9]): UserDefinedFunction

    Defines a Scala closure of 9 arguments as user-defined function (UDF). The data types are automatically inferred based on the Scala closure's signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().

    Since

    1.3.0

  418. def udf[RT, A1, A2, A3, A4, A5, A6, A7, A8](f: (A1, A2, A3, A4, A5, A6, A7, A8) ⇒ RT)(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[RT], arg1: scala.reflect.api.JavaUniverse.TypeTag[A1], arg2: scala.reflect.api.JavaUniverse.TypeTag[A2], arg3: scala.reflect.api.JavaUniverse.TypeTag[A3], arg4: scala.reflect.api.JavaUniverse.TypeTag[A4], arg5: scala.reflect.api.JavaUniverse.TypeTag[A5], arg6: scala.reflect.api.JavaUniverse.TypeTag[A6], arg7: scala.reflect.api.JavaUniverse.TypeTag[A7], arg8: scala.reflect.api.JavaUniverse.TypeTag[A8]): UserDefinedFunction

    Defines a Scala closure of 8 arguments as user-defined function (UDF). The data types are automatically inferred based on the Scala closure's signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().

    Since

    1.3.0

  419. def udf[RT, A1, A2, A3, A4, A5, A6, A7](f: (A1, A2, A3, A4, A5, A6, A7) ⇒ RT)(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[RT], arg1: scala.reflect.api.JavaUniverse.TypeTag[A1], arg2: scala.reflect.api.JavaUniverse.TypeTag[A2], arg3: scala.reflect.api.JavaUniverse.TypeTag[A3], arg4: scala.reflect.api.JavaUniverse.TypeTag[A4], arg5: scala.reflect.api.JavaUniverse.TypeTag[A5], arg6: scala.reflect.api.JavaUniverse.TypeTag[A6], arg7: scala.reflect.api.JavaUniverse.TypeTag[A7]): UserDefinedFunction

    Defines a Scala closure of 7 arguments as user-defined function (UDF). The data types are automatically inferred based on the Scala closure's signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().

    Since

    1.3.0

  420. def udf[RT, A1, A2, A3, A4, A5, A6](f: (A1, A2, A3, A4, A5, A6) ⇒ RT)(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[RT], arg1: scala.reflect.api.JavaUniverse.TypeTag[A1], arg2: scala.reflect.api.JavaUniverse.TypeTag[A2], arg3: scala.reflect.api.JavaUniverse.TypeTag[A3], arg4: scala.reflect.api.JavaUniverse.TypeTag[A4], arg5: scala.reflect.api.JavaUniverse.TypeTag[A5], arg6: scala.reflect.api.JavaUniverse.TypeTag[A6]): UserDefinedFunction

    Defines a Scala closure of 6 arguments as user-defined function (UDF). The data types are automatically inferred based on the Scala closure's signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().

    Since

    1.3.0

  421. def udf[RT, A1, A2, A3, A4, A5](f: (A1, A2, A3, A4, A5) ⇒ RT)(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[RT], arg1: scala.reflect.api.JavaUniverse.TypeTag[A1], arg2: scala.reflect.api.JavaUniverse.TypeTag[A2], arg3: scala.reflect.api.JavaUniverse.TypeTag[A3], arg4: scala.reflect.api.JavaUniverse.TypeTag[A4], arg5: scala.reflect.api.JavaUniverse.TypeTag[A5]): UserDefinedFunction

    Defines a Scala closure of 5 arguments as user-defined function (UDF). The data types are automatically inferred based on the Scala closure's signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().

    Since

    1.3.0

  422. def udf[RT, A1, A2, A3, A4](f: (A1, A2, A3, A4) ⇒ RT)(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[RT], arg1: scala.reflect.api.JavaUniverse.TypeTag[A1], arg2: scala.reflect.api.JavaUniverse.TypeTag[A2], arg3: scala.reflect.api.JavaUniverse.TypeTag[A3], arg4: scala.reflect.api.JavaUniverse.TypeTag[A4]): UserDefinedFunction

    Defines a Scala closure of 4 arguments as user-defined function (UDF). The data types are automatically inferred based on the Scala closure's signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().

    Since

    1.3.0

  423. def udf[RT, A1, A2, A3](f: (A1, A2, A3) ⇒ RT)(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[RT], arg1: scala.reflect.api.JavaUniverse.TypeTag[A1], arg2: scala.reflect.api.JavaUniverse.TypeTag[A2], arg3: scala.reflect.api.JavaUniverse.TypeTag[A3]): UserDefinedFunction

    Defines a Scala closure of 3 arguments as user-defined function (UDF). The data types are automatically inferred based on the Scala closure's signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().

    Since

    1.3.0

  424. def udf[RT, A1, A2](f: (A1, A2) ⇒ RT)(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[RT], arg1: scala.reflect.api.JavaUniverse.TypeTag[A1], arg2: scala.reflect.api.JavaUniverse.TypeTag[A2]): UserDefinedFunction

    Defines a Scala closure of 2 arguments as user-defined function (UDF). The data types are automatically inferred based on the Scala closure's signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().

    Since

    1.3.0

  425. def udf[RT, A1](f: (A1) ⇒ RT)(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[RT], arg1: scala.reflect.api.JavaUniverse.TypeTag[A1]): UserDefinedFunction

    Defines a Scala closure of 1 argument as user-defined function (UDF). The data types are automatically inferred based on the Scala closure's signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().

    Since

    1.3.0

  426. def udf[RT](f: () ⇒ RT)(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[RT]): UserDefinedFunction

    Defines a Scala closure of 0 arguments as user-defined function (UDF). The data types are automatically inferred based on the Scala closure's signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().

    Since

    1.3.0
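
    A brief sketch of the Scala closure overloads, including asNondeterministic() for a closure whose result changes between calls. It assumes a local SparkSession; the names ScalaUdfSketch, plusOne, join, roll and the column names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object ScalaUdfSketch {
  // Local session for illustration only.
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]").appName("scala-udf-sketch").getOrCreate()
  import spark.implicits._

  // Input and output types are inferred from the closure signatures.
  val plusOne = udf((x: Int) => x + 1)
  val join = udf((a: String, b: String) => s"$a-$b")

  // A closure with varying results should be marked nondeterministic.
  val roll = udf(() => scala.util.Random.nextInt(6) + 1).asNondeterministic()

  private val row = Seq((1, "a", "b")).toDF("n", "l", "r")
    .select(plusOne(col("n")), join(col("l"), col("r")))
    .head()

  val incremented: Int = row.getInt(0)
  val joined: String = row.getString(1)
}
```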

  427. def unbase64(e: Column): Column

    Decodes a BASE64 encoded string column and returns it as a binary column. This is the reverse of base64.

    Since

    1.5.0

  428. def unhex(column: Column): Column

    Inverse of hex. Interprets each pair of characters as a hexadecimal number and converts it to the byte representation of that number.

    Since

    1.5.0

  429. def unix_timestamp(s: Column, p: String): Column

    Converts a time string with the given pattern to a Unix timestamp (in seconds).

    See Datetime Patterns for valid date and time format patterns.

    s

    A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS

    p

    A date time pattern detailing the format of s when s is a string

    returns

    A long, or null if s was a string that could not be cast to a date or p was an invalid format

    Since

    1.5.0

  430. def unix_timestamp(s: Column): Column

    Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in seconds), using the default timezone and the default locale.

    s

    A date, timestamp or string. If a string, the data must be in the yyyy-MM-dd HH:mm:ss format

    returns

    A long, or null if the input was a string not of the correct format

    Since

    1.5.0
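
    The sketch below exercises both string-based overloads on the same instant. It assumes a local SparkSession with the session time zone pinned to UTC, because the result depends on the time zone used to interpret the string; the object name UnixTimestampSketch and the columns iso and custom are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, unix_timestamp}

object UnixTimestampSketch {
  // Local session for illustration only.
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]").appName("unix_timestamp-sketch").getOrCreate()
  // Assumption: interpret the strings in UTC so the result is reproducible.
  spark.conf.set("spark.sql.session.timeZone", "UTC")
  import spark.implicits._

  // One day after the epoch, written in two different formats.
  private val row = Seq(("1970-01-02 00:00:00", "02/01/1970")).toDF("iso", "custom")
    .select(
      unix_timestamp(col("iso")),                    // default yyyy-MM-dd HH:mm:ss
      unix_timestamp(col("custom"), "dd/MM/yyyy"))   // explicit pattern
    .head()

  val fromDefault: Long = row.getLong(0)
  val fromPattern: Long = row.getLong(1)
}
```

    Both values are 86400, the number of seconds in one day.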

  431. def unix_timestamp(): Column

    Returns the current Unix timestamp (in seconds) as a long.

    Since

    1.5.0

    Note

    All calls of unix_timestamp within the same query return the same value (i.e. the current timestamp is calculated at the start of query evaluation).

  432. def unwrap_udt(column: Column): Column

    Unwrap UDT data type column into its underlying type.

    Since

    3.4.0

  433. def upper(e: Column): Column

    Converts a string column to upper case.

    Since

    1.3.0

  434. def var_pop(columnName: String): Column

    Aggregate function: returns the population variance of the values in a group.

    Since

    1.6.0

  435. def var_pop(e: Column): Column

    Aggregate function: returns the population variance of the values in a group.

    Since

    1.6.0

  436. def var_samp(columnName: String): Column

    Aggregate function: returns the unbiased variance of the values in a group.

    Since

    1.6.0

  437. def var_samp(e: Column): Column

    Aggregate function: returns the unbiased variance of the values in a group.

    Since

    1.6.0

  438. def variance(columnName: String): Column

    Aggregate function: alias for var_samp.

    Since

    1.6.0

  439. def variance(e: Column): Column

    Aggregate function: alias for var_samp.

    Since

    1.6.0
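
    To illustrate how var_pop, var_samp and variance relate, the sketch below aggregates the values 1, 2, 3, 4. The squared deviations from the mean 2.5 sum to 5.0, so the population variance divides by 4 and the unbiased (sample) variance divides by 3. It assumes a local SparkSession; the object name VarianceSketch and the column name x are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{var_pop, var_samp, variance}

object VarianceSketch {
  // Local session for illustration only.
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]").appName("variance-sketch").getOrCreate()
  import spark.implicits._

  private val row = Seq(1.0, 2.0, 3.0, 4.0).toDF("x")
    .agg(var_pop("x"), var_samp("x"), variance("x"))
    .head()

  val population: Double = row.getDouble(0)  // 5.0 / 4
  val sample: Double = row.getDouble(1)      // 5.0 / 3
  val alias: Double = row.getDouble(2)       // variance is an alias for var_samp
}
```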

  440. final def wait(): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  441. final def wait(arg0: Long, arg1: Int): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  442. final def wait(arg0: Long): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
  443. def weekofyear(e: Column): Column

    Extracts the week number as an integer from a given date/timestamp/string.

    A week is considered to start on a Monday and week 1 is the first week with more than 3 days, as defined by ISO 8601

    returns

    An integer, or null if the input was a string that could not be cast to a date

    Since

    1.5.0
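
    The ISO 8601 rule above means that the first days of January can belong to the last week of the previous year. A small sketch, assuming a local SparkSession: 2021-01-01 is a Friday, so it falls in week 53 of 2020, while Monday 2021-01-04 starts week 1. The object name WeekOfYearSketch and the column name d are hypothetical.

```scala
import java.sql.Date

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, weekofyear}

object WeekOfYearSketch {
  // Local session for illustration only.
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]").appName("weekofyear-sketch").getOrCreate()
  import spark.implicits._

  // A Friday at the start of the year, and the following Monday.
  val weeks: Seq[Int] = Seq(Date.valueOf("2021-01-01"), Date.valueOf("2021-01-04"))
    .toDF("d")
    .select(weekofyear(col("d")))
    .as[Int]
    .collect()
    .toSeq
}
```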

  444. def when(condition: Column, value: Any): Column

    Evaluates a list of conditions and returns one of multiple possible result expressions. If otherwise is not defined at the end, null is returned for unmatched conditions.

    // Example: encoding gender string column into integer.
    
    // Scala:
    people.select(when(people("gender") === "male", 0)
      .when(people("gender") === "female", 1)
      .otherwise(2))
    
    // Java:
    people.select(when(col("gender").equalTo("male"), 0)
      .when(col("gender").equalTo("female"), 1)
      .otherwise(2))
    Since

    1.4.0

  445. def window(timeColumn: Column, windowDuration: String): Column

    Generates tumbling time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05). Windows can support microsecond precision. Windows in the order of months are not supported. Window boundaries are aligned to 1970-01-01 00:00:00 UTC. The following example takes the average stock price for a one minute tumbling window:

    val df = ... // schema => timestamp: TimestampType, stockId: StringType, price: DoubleType
    df.groupBy(window($"timestamp", "1 minute"), $"stockId")
      .agg(mean("price"))

    The windows will look like:

    09:00:00-09:01:00
    09:01:00-09:02:00
    09:02:00-09:03:00 ...

    For a streaming query, you may use the function current_timestamp to generate windows on processing time.

    timeColumn

    The column or the expression to use as the timestamp for windowing by time. The time column must be of TimestampType or TimestampNTZType.

    windowDuration

    A string specifying the width of the window, e.g. 10 minutes, 1 second. Check org.apache.spark.unsafe.types.CalendarInterval for valid duration identifiers.

    Since

    2.0.0
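
    A self-contained variant of the tumbling window example with concrete rows, offered as a sketch. It assumes a local SparkSession with the session time zone pinned to UTC; the object name TumblingWindowSketch, the column names and the sample prices are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col, date_format, to_timestamp, window}

object TumblingWindowSketch {
  // Local session for illustration only.
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]").appName("tumbling-window-sketch").getOrCreate()
  // Assumption: parse and render timestamps in UTC for reproducibility.
  spark.conf.set("spark.sql.session.timeZone", "UTC")
  import spark.implicits._

  val ticks = Seq(
    ("2024-01-01 09:00:30", 10.0),
    ("2024-01-01 09:00:45", 20.0),
    ("2024-01-01 09:01:10", 30.0)
  ).toDF("raw", "price").withColumn("timestamp", to_timestamp(col("raw")))

  // One row per 1-minute window: (window start, window end, average price).
  val averages: Seq[(String, String, Double)] = ticks
    .groupBy(window(col("timestamp"), "1 minute"))
    .agg(avg("price").as("avgPrice"))
    .select(
      date_format(col("window.start"), "HH:mm:ss"),
      date_format(col("window.end"), "HH:mm:ss"),
      col("avgPrice"))
    .as[(String, String, Double)]
    .collect()
    .toSeq
    .sortBy(_._1)
}
```

    The first two ticks share the window [09:00:00, 09:01:00) and average to 15.0; the third falls into [09:01:00, 09:02:00).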

  446. def window(timeColumn: Column, windowDuration: String, slideDuration: String): Column

    Bucketize rows into one or more time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05). Windows can support microsecond precision. Windows in the order of months are not supported. Window boundaries are aligned to 1970-01-01 00:00:00 UTC. The following example takes the average stock price for a one minute window every 10 seconds:

    val df = ... // schema => timestamp: TimestampType, stockId: StringType, price: DoubleType
    df.groupBy(window($"timestamp", "1 minute", "10 seconds"), $"stockId")
      .agg(mean("price"))

    The windows will look like:

    09:00:00-09:01:00
    09:00:10-09:01:10
    09:00:20-09:01:20 ...

    For a streaming query, you may use the function current_timestamp to generate windows on processing time.

    timeColumn

    The column or the expression to use as the timestamp for windowing by time. The time column must be of TimestampType or TimestampNTZType.

    windowDuration

    A string specifying the width of the window, e.g. 10 minutes, 1 second. Check org.apache.spark.unsafe.types.CalendarInterval for valid duration identifiers. Note that the duration is a fixed length of time, and does not vary over time according to a calendar. For example, 1 day always means 86,400,000 milliseconds, not a calendar day.

    slideDuration

    A string specifying the sliding interval of the window, e.g. 1 minute. A new window will be generated every slideDuration. Must be less than or equal to the windowDuration. Check org.apache.spark.unsafe.types.CalendarInterval for valid duration identifiers. This duration is likewise absolute, and does not vary according to a calendar.
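
    Because a new window begins every slideDuration, a single row can fall into several windows. The following plain-Scala sketch enumerates them for one timestamp. It models the behavior described here and is not Spark's implementation; SlidingWindowSketch and slidingWindows are hypothetical names.

```scala
// Hypothetical model of sliding-window assignment; not part of Spark.
object SlidingWindowSketch {
  // All (start, end) windows, in epoch microseconds and earliest first,
  // that contain tsMicros. Starts are inclusive, ends are exclusive.
  def slidingWindows(tsMicros: Long, windowMicros: Long, slideMicros: Long): Seq[(Long, Long)] = {
    require(slideMicros > 0 && slideMicros <= windowMicros,
      "slide duration must be positive and no larger than the window duration")
    // Latest window start that is at or before the timestamp.
    val lastStart = tsMicros - Math.floorMod(tsMicros, slideMicros)
    // Step back one slide at a time while the window still covers the timestamp.
    Iterator.iterate(lastStart)(_ - slideMicros)
      .takeWhile(start => start + windowMicros > tsMicros)
      .map(start => (start, start + windowMicros))
      .toList
      .reverse
  }
}
```

    With a 1 minute window and a 10 second slide, each timestamp lands in six overlapping windows in this model. When the slide equals the window width, the result collapses to a single tumbling window.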

    Since

    2.0.0

  447. def window(timeColumn: Column, windowDuration: String, slideDuration: String, startTime: String): Column

    Bucketizes rows into one or more time windows given a column that specifies a timestamp. Window starts are inclusive and window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05). Windows can support microsecond precision. Windows on the order of months are not supported. The following example takes the average stock price for a one-minute window every 10 seconds, starting 5 seconds after the hour:

    val df = ... // schema => timestamp: TimestampType, stockId: StringType, price: DoubleType
    df.groupBy(window($"timestamp", "1 minute", "10 seconds", "5 seconds"), $"stockId")
      .agg(mean("price"))

    The windows will look like:

    09:00:05-09:01:05
    09:00:15-09:01:15
    09:00:25-09:01:25 ...

    For a streaming query, you may use the function current_timestamp to generate windows on processing time.

    timeColumn

    The column or the expression to use as the timestamp for windowing by time. The time column must be of TimestampType or TimestampNTZType.

    windowDuration

    A string specifying the width of the window, e.g. 10 minutes, 1 second. Check org.apache.spark.unsafe.types.CalendarInterval for valid duration identifiers. Note that the duration is a fixed length of time, and does not vary over time according to a calendar. For example, 1 day always means 86,400,000 milliseconds, not a calendar day.

    slideDuration

    A string specifying the sliding interval of the window, e.g. 1 minute. A new window will be generated every slideDuration. Must be less than or equal to the windowDuration. Check org.apache.spark.unsafe.types.CalendarInterval for valid duration identifiers. This duration is likewise absolute, and does not vary according to a calendar.

    startTime

    The offset with respect to 1970-01-01 00:00:00 UTC with which to start window intervals. For example, to have hourly tumbling windows that start 15 minutes past the hour (12:15-13:15, 13:15-14:15, ...), provide startTime as 15 minutes.
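
    The effect of startTime can be sketched as shifting the alignment grid by the offset before finding the window start. This is a plain-Scala model of the description above, not Spark's code; OffsetWindowSketch and windowStart are hypothetical names.

```scala
// Hypothetical model of the startTime offset; not part of Spark.
object OffsetWindowSketch {
  // Start (epoch microseconds) of the tumbling window of the given width that
  // contains tsMicros, when window boundaries are shifted by startOffsetMicros
  // relative to 1970-01-01 00:00:00 UTC.
  def windowStart(tsMicros: Long, widthMicros: Long, startOffsetMicros: Long): Long =
    tsMicros - Math.floorMod(tsMicros - startOffsetMicros, widthMicros)
}
```

    With a one hour width and a 15 minute offset, a timestamp of 12:30 maps to the window starting at 12:15, and a timestamp of 12:10 maps to the window starting at 11:15.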

    Since

    2.0.0

  448. def window_time(windowColumn: Column): Column

    Extracts the event time from the window column.

    The window column is of StructType { start: Timestamp, end: Timestamp } where start is inclusive and end is exclusive. Since event time can support microsecond precision, window_time(window) = window.end - 1 microsecond.
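
    The relation window_time(window) = window.end - 1 microsecond can be checked with java.time alone. The sketch below is an illustration of that arithmetic, not Spark code; WindowTimeSketch and windowTime are hypothetical names.

```scala
import java.time.Instant
import java.time.temporal.ChronoUnit

// Hypothetical model of window_time; not part of Spark.
object WindowTimeSketch {
  // The window end is exclusive, so the latest event time that can belong to
  // the window is one microsecond before it.
  def windowTime(windowEnd: Instant): Instant = windowEnd.minus(1L, ChronoUnit.MICROS)
}
```

    For a window ending at 09:01:00, this gives 09:00:59.999999, which still lies inside the half-open interval.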

    windowColumn

    The window column (typically produced by window aggregation) of type StructType { start: Timestamp, end: Timestamp }

    Since

    3.4.0

  449. def xxhash64(cols: Column*): Column

    Calculates the hash code of given columns using the 64-bit variant of the xxHash algorithm, and returns the result as a long column. The hash computation uses an initial seed of 42.

    Annotations
    @varargs()
    Since

    3.0.0

  450. def year(e: Column): Column

    Extracts the year as an integer from a given date/timestamp/string.

    returns

    An integer, or null if the input was a string that could not be cast to a date
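
    The null-on-failure contract can be modelled in plain Scala with Option standing in for SQL null. This is a rough sketch only: it accepts just ISO yyyy-MM-dd strings, whereas Spark's string-to-date cast may accept other forms. YearSketch and yearOf are hypothetical names.

```scala
import java.time.LocalDate
import scala.util.Try

// Hypothetical model of year() applied to a string input; not part of Spark.
object YearSketch {
  // None plays the role of SQL null when the string cannot be read as a date.
  // Assumption: only ISO yyyy-MM-dd input is handled in this sketch.
  def yearOf(s: String): Option[Int] =
    Try(LocalDate.parse(s)).toOption.map(_.getYear)
}
```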

    Since

    1.5.0

  451. def years(e: Column): Column

    A transform for timestamps and dates to partition data into years.

    Since

    3.0.0

  452. def zip_with(left: Column, right: Column, f: (Column, Column) ⇒ Column): Column

    Merge two given arrays, element-wise, into a single array using a function. If one array is shorter, nulls are appended at the end to match the length of the longer array, before applying the function.

    df.select(zip_with(df("val1"), df("val2"), (x, y) => x + y))

    left

    the left input array column

    right

    the right input array column

    f

    (lCol, rCol) => col, the lambda function to merge two input columns into one column
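
    The padding rule (the shorter array is extended with nulls before the function is applied) can be modelled on ordinary Scala collections, with None standing in for SQL null. This is a sketch of the semantics described above, not of Spark's implementation; ZipWithSketch and zipWithPadded are hypothetical names.

```scala
// Hypothetical model of zip_with on local collections; not part of Spark.
object ZipWithSketch {
  // Pads the shorter input with None, then merges the inputs element-wise with f.
  def zipWithPadded[A, B, C](left: Seq[A], right: Seq[B])(f: (Option[A], Option[B]) => C): Seq[C] = {
    val l: Seq[Option[A]] = left.map(Some(_))
    val r: Seq[Option[B]] = right.map(Some(_))
    l.zipAll(r, None, None).map { case (x, y) => f(x, y) }
  }

  // Addition that yields None when either side is missing, mirroring how
  // x + y evaluates to null in SQL when an operand is null.
  val sums: Seq[Option[Int]] =
    zipWithPadded(Seq(1, 2, 3), Seq(10, 20)) { (x, y) =>
      for (a <- x; b <- y) yield a + b
    }
}
```

    Here the third element of the left input has no partner, so the merged result ends with a missing value rather than being truncated.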

    Since

    3.0.0

Deprecated Value Members

  1. def approxCountDistinct(columnName: String, rsd: Double): Column

    Annotations
    @deprecated
    Deprecated

    (Since version 2.1.0) Use approx_count_distinct

    Since

    1.3.0

  2. def approxCountDistinct(e: Column, rsd: Double): Column

    Annotations
    @deprecated
    Deprecated

    (Since version 2.1.0) Use approx_count_distinct

    Since

    1.3.0

  3. def approxCountDistinct(columnName: String): Column

    Annotations
    @deprecated
    Deprecated

    (Since version 2.1.0) Use approx_count_distinct

    Since

    1.3.0

  4. def approxCountDistinct(e: Column): Column

    Annotations
    @deprecated
    Deprecated

    (Since version 2.1.0) Use approx_count_distinct

    Since

    1.3.0

  5. def bitwiseNOT(e: Column): Column

    Computes bitwise NOT (~) of a number.

    Annotations
    @deprecated
    Deprecated

    (Since version 3.2.0) Use bitwise_not

    Since

    1.4.0

  6. def callUDF(udfName: String, cols: Column*): Column

    Calls a user-defined function.

    Annotations
    @varargs() @deprecated
    Deprecated

    Use call_udf

    Since

    1.5.0

  7. def monotonicallyIncreasingId(): Column

    A column expression that generates monotonically increasing 64-bit integers.

    The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the DataFrame has fewer than 1 billion partitions, and each partition has fewer than 8 billion records.

    As an example, consider a DataFrame with two partitions, each with 3 records. This expression would return the following IDs:

    0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594.
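
    The bit layout described above can be reproduced with a shift and an addition. The following plain-Scala sketch is only a model of that layout and does not call Spark; MonotonicIdSketch, id, and ids are hypothetical names.

```scala
// Hypothetical model of the ID layout; not part of Spark.
object MonotonicIdSketch {
  // Partition ID in the upper 31 bits, record number in the lower 33 bits.
  def id(partitionId: Int, recordNumber: Long): Long =
    (partitionId.toLong << 33) + recordNumber

  // IDs for a dataset given the number of records in each partition, in order.
  def ids(recordsPerPartition: Seq[Int]): Seq[Long] =
    recordsPerPartition.zipWithIndex.flatMap { case (count, partition) =>
      (0L until count.toLong).map(record => id(partition, record))
    }
}
```

    Calling ids(Seq(3, 3)) in this model reproduces the six values listed above for two partitions of three records each.
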
    Annotations
    @deprecated
    Deprecated

    (Since version 2.0.0) Use monotonically_increasing_id()

    Since

    1.4.0

  8. def shiftLeft(e: Column, numBits: Int): Column

    Shifts the given value numBits left. If the given value is a long value, this function returns a long value; otherwise it returns an integer value.

    Annotations
    @deprecated
    Deprecated

    (Since version 3.2.0) Use shiftleft

    Since

    1.5.0

  9. def shiftRight(e: Column, numBits: Int): Column

    Shifts the given value numBits right, preserving the sign. If the given value is a long value, this function returns a long value; otherwise it returns an integer value.

    Annotations
    @deprecated
    Deprecated

    (Since version 3.2.0) Use shiftright

    Since

    1.5.0

  10. def shiftRightUnsigned(e: Column, numBits: Int): Column

    Shifts the given value numBits right without preserving the sign (unsigned shift). If the given value is a long value, this function returns a long value; otherwise it returns an integer value.
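
    The difference between the signed and unsigned right shifts, and the dependence of the result on integer versus long width, can be seen with Scala's own shift operators. This sketch assumes the column functions follow the same two's-complement semantics as the JVM operators; ShiftSketch is a hypothetical name and no Spark API is called.

```scala
// Plain JVM shift operators, used here as a stand-in for the column functions.
object ShiftSketch {
  val leftInt: Int = 1 << 3            // shift left by three bits
  val signedInt: Int = -8 >> 1         // sign bit is copied in from the left
  val unsignedInt: Int = -8 >>> 1      // zero is shifted in; 32-bit result
  val unsignedLong: Long = -8L >>> 1   // zero is shifted in; 64-bit result
}
```

    The unsigned shift of the same negative value gives different results for the integer and long cases, because the sign bit sits at bit 31 in one and bit 63 in the other.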

    Annotations
    @deprecated
    Deprecated

    (Since version 3.2.0) Use shiftrightunsigned

    Since

    1.5.0

  11. def sumDistinct(columnName: String): Column

    Aggregate function: returns the sum of distinct values in the expression.

    Annotations
    @deprecated
    Deprecated

    (Since version 3.2.0) Use sum_distinct

    Since

    1.3.0

  12. def sumDistinct(e: Column): Column

    Aggregate function: returns the sum of distinct values in the expression.
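
    On a local collection, the aggregate can be modelled as removing duplicates before summing. The sketch below uses Option for SQL null and assumes the usual SQL convention that null inputs are skipped and that an aggregate over no values yields null. SumDistinctSketch and sumDistinct are hypothetical names, and no Spark API is called.

```scala
// Hypothetical model of a distinct sum on local data; not part of Spark.
object SumDistinctSketch {
  // Assumption: None (SQL null) inputs are ignored, and the result is None
  // when no non-null value remains.
  def sumDistinct(values: Seq[Option[Long]]): Option[Long] = {
    val distinctValues = values.flatten.distinct
    if (distinctValues.isEmpty) None else Some(distinctValues.sum)
  }
}
```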

    Annotations
    @deprecated
    Deprecated

    (Since version 3.2.0) Use sum_distinct

    Since

    1.3.0

  13. def toDegrees(columnName: String): Column

    Annotations
    @deprecated
    Deprecated

    (Since version 2.1.0) Use degrees

    Since

    1.4.0

  14. def toDegrees(e: Column): Column

    Annotations
    @deprecated
    Deprecated

    (Since version 2.1.0) Use degrees

    Since

    1.4.0

  15. def toRadians(columnName: String): Column

    Annotations
    @deprecated
    Deprecated

    (Since version 2.1.0) Use radians

    Since

    1.4.0

  16. def toRadians(e: Column): Column

    Annotations
    @deprecated
    Deprecated

    (Since version 2.1.0) Use radians

    Since

    1.4.0

  17. def udf(f: AnyRef, dataType: DataType): UserDefinedFunction

    Defines a deterministic user-defined function (UDF) using a Scala closure. For this variant, the caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().

    Note that although the Scala closure can take primitive-type arguments, such arguments do not work well with null values. Because the Scala closure is passed in as Any type, there is no type information for the function arguments. Without that type information, Spark may blindly pass null to a closure with a primitive-type argument, and the closure will see the default value of the Java type for the null argument. For example, with udf((x: Int) => x, IntegerType), the result is 0 for null input.
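
    The null-to-default behavior comes from how the JVM unboxes values, and it can be reproduced without Spark. In the sketch below, the cast to Any => Any imitates the loss of type information; PrimitiveNullSketch and its members are hypothetical names, and a boxed java.lang.Integer parameter is shown as one possible way to keep null visible.

```scala
// Hypothetical demonstration of primitive arguments and null; not part of Spark.
object PrimitiveNullSketch {
  val identityInt: Int => Int = x => x

  // Erase the argument type, as happens when a closure is handed over untyped.
  val erased: Any => Any = identityInt.asInstanceOf[Any => Any]

  // null is unboxed to the default Int value before the closure body runs.
  val resultForNull: Any = erased(null)

  // A boxed parameter type lets the closure observe null and handle it.
  val identityBoxed: java.lang.Integer => java.lang.Integer = x => x
  val boxedResultForNull: java.lang.Integer = identityBoxed(null)
}
```

    The first result is the default value 0 rather than null, which matches the behavior warned about above; the boxed variant returns null unchanged.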

    f

    A closure in Scala

    dataType

    The output data type of the UDF

    Annotations
    @deprecated
    Deprecated

    (Since version 3.0.0)

    Since

    2.0.0
