pyspark.sql.functions.percentile_approx

pyspark.sql.functions.percentile_approx(col: ColumnOrName, percentage: Union[pyspark.sql.column.Column, float, List[float], Tuple[float]], accuracy: Union[pyspark.sql.column.Column, float] = 10000) → pyspark.sql.column.Column

Returns the approximate percentile of the numeric column col: the smallest value in the ordered col values (sorted from least to greatest) such that no more than percentage of col values is less than or equal to that value.

New in version 3.1.0.

Changed in version 3.4.0: Supports Spark Connect.

Parameters
col : Column or str

input column.

percentage : Column, float, list of floats or tuple of floats

percentage in decimal (must be between 0.0 and 1.0). When percentage is an array, each value of the percentage array must be between 0.0 and 1.0. In this case, returns the approximate percentile array of column col at the given percentage array.

accuracy : Column or float

a positive numeric literal which controls approximation accuracy at the cost of memory. Higher values of accuracy yield better accuracy; 1.0/accuracy is the relative error of the approximation (default: 10000).

Returns
Column

approximate percentile of the numeric column.

Examples

>>> key = (col("id") % 3).alias("key")
>>> value = (randn(42) + key * 10).alias("value")
>>> df = spark.range(0, 1000, 1, 1).select(key, value)
>>> df.select(
...     percentile_approx("value", [0.25, 0.5, 0.75], 1000000).alias("quantiles")
... ).printSchema()
root
 |-- quantiles: array (nullable = true)
 |    |-- element: double (containsNull = false)
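
At a high accuracy the approximation is effectively exact on small inputs, which makes the returned values easy to check by hand against the smallest-value definition above. The following is an illustrative sketch, not part of the upstream example set; the df2 name and its input are assumptions of this note:

>>> df2 = spark.range(100)  # integers 0..99; illustrative sketch only
>>> df2.agg(
...     percentile_approx("id", [0.25, 0.5, 0.75], 1000000).alias("quantiles")
... ).first()["quantiles"]
[24, 49, 74]
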
>>> df.groupBy("key").agg(
...     percentile_approx("value", 0.5, lit(1000000)).alias("median")
... ).printSchema()
root
 |-- key: long (nullable = true)
 |-- median: double (nullable = true)
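
The grouped form can be verified the same way on a tiny hand-built DataFrame. This is likewise an illustrative sketch (df3 and its rows are assumptions of this note); with only a few rows per group, the default accuracy of 10000 makes the result exact, since 1.0/accuracy is the relative error of the approximation:

>>> df3 = spark.createDataFrame(
...     [("a", 1.0), ("a", 2.0), ("a", 3.0), ("b", 10.0)], ["key", "value"]
... )  # per-group medians are known by inspection: 2.0 for "a", 10.0 for "b"
>>> sorted(df3.groupBy("key").agg(
...     percentile_approx("value", 0.5).alias("median")
... ).collect())
[Row(key='a', median=2.0), Row(key='b', median=10.0)]

Lower accuracy values trade precision for memory; for example, accuracy=100 bounds the relative error at 1.0/100 = 1%.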