pyspark.sql.functions.histogram_numeric

pyspark.sql.functions.histogram_numeric(col: ColumnOrName, nBins: ColumnOrName) → pyspark.sql.column.Column

Computes a histogram on numeric ‘col’ using ‘nBins’ bins. The return value is an array of (x, y) pairs, where x is the center of a histogram bin and y is its height. As the value of ‘nBins’ is increased, the histogram approximation gets finer-grained, but may yield artifacts around outliers. In practice, 20-40 histogram bins appear to work well, with more bins being required for skewed or smaller datasets. Note that this function creates a histogram with non-uniform bin widths. It offers no guarantees in terms of the mean-squared-error of the histogram, but in practice is comparable to the histograms produced by the R/S-Plus statistical computing packages.

Note: the output type of the ‘x’ field in the return value is propagated from the input value consumed in the aggregate function.

New in version 3.5.0.

Parameters
col : Column or str

target column to work on.

nBins : Column or str

number of histogram bins.

Returns
Column

a histogram on numeric ‘col’ using ‘nBins’ bins, returned as an array of (x, y) pairs.

Examples

>>> from pyspark.sql.functions import histogram_numeric, lit
>>> df = spark.createDataFrame([("a", 1),
...                             ("a", 2),
...                             ("a", 3),
...                             ("b", 8),
...                             ("b", 2)], ["c1", "c2"])
>>> df.select(histogram_numeric('c2', lit(5))).show()
+------------------------+
|histogram_numeric(c2, 5)|
+------------------------+
|    [{1, 1.0}, {2, 1....|
+------------------------+
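
To give some intuition for the non-uniform bin widths described above, the following is a minimal pure-Python sketch of one way a bounded set of (x, y) bins can be maintained: each new value either increments a bin with the same center or becomes its own bin, and when the bin budget is exceeded the two closest adjacent bins are merged into one at their weighted-average center. This is an illustration only and is not Spark's implementation; the function name `sketch_histogram` and its parameters are hypothetical.

```python
def sketch_histogram(values, n_bins):
    """Illustrative sketch (not Spark's code): keep at most n_bins (center, height) pairs."""
    bins = []  # list of [center, height], kept sorted by center
    for v in values:
        x = float(v)
        for b in bins:
            if b[0] == x:
                # A bin already sits exactly at this value: just raise its height.
                b[1] += 1.0
                break
        else:
            # Otherwise start a new bin of height 1 at this value.
            bins.append([x, 1.0])
            bins.sort(key=lambda b: b[0])
            if len(bins) > n_bins:
                # Over budget: merge the two adjacent bins with the smallest gap.
                i = min(range(len(bins) - 1),
                        key=lambda k: bins[k + 1][0] - bins[k][0])
                (x1, y1), (x2, y2) = bins[i], bins[i + 1]
                merged = [(x1 * y1 + x2 * y2) / (y1 + y2), y1 + y2]
                bins[i:i + 2] = [merged]
    return [(x, y) for x, y in bins]


# Same values as column c2 in the example above.
fine = sketch_histogram([1, 2, 3, 8, 2], 5)    # enough bins: one per distinct value
coarse = sketch_histogram([1, 2, 3, 8, 2], 3)  # too few bins: nearby values are merged
```

With a budget of 5 bins every distinct value keeps its own bin, while a budget of 3 forces the values near 1 and 2 into a single bin and leaves the outlier 8 on its own, so the resulting bins are not evenly spaced. In both cases the heights sum to the number of input values.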