pyspark.sql.functions.count_min_sketch

pyspark.sql.functions.count_min_sketch(col, eps, confidence, seed)

Returns a count-min sketch of a column with the given eps, confidence and seed. The result is an array of bytes, which can be deserialized to a CountMinSketch before use. A count-min sketch is a probabilistic data structure used for frequency estimation in sub-linear space.

New in version 3.5.0.
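
To make the data structure described above concrete, here is a minimal pure-Python sketch of the count-min idea. It is an illustration only: `ToyCountMinSketch`, its hashing scheme, and its parameters are hypothetical and do not mirror Spark's implementation or produce the same bytes.

```python
import hashlib


class ToyCountMinSketch:
    """Toy count-min sketch for illustration; not Spark's implementation."""

    def __init__(self, width, depth, seed=0):
        self.width = width    # counters per row
        self.depth = depth    # number of independent hash rows
        self.seed = seed
        self.table = [[0] * width for _ in range(depth)]
        self.total = 0

    def _bucket(self, item, row):
        # Hypothetical hashing scheme: one keyed digest per row.
        key = f"{self.seed}:{row}:{item!r}".encode("utf-8")
        digest = hashlib.blake2b(key, digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def add(self, item, count=1):
        # Increment one counter in every row.
        self.total += count
        for row in range(self.depth):
            self.table[row][self._bucket(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate counters, so the minimum across
        # rows is the tightest available upper bound on the true count.
        return min(
            self.table[row][self._bucket(item, row)]
            for row in range(self.depth)
        )


sketch = ToyCountMinSketch(width=4, depth=1, seed=1)
for value in [1, 2, 1]:
    sketch.add(value)
```

Because hash collisions only ever add to a counter, `estimate` may overcount but never undercounts, and the table size stays fixed no matter how many rows are added.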

Parameters
col : Column or str

Target column to compute the sketch on.

eps : Column or str

Relative error of the sketch; must be positive.

confidence : Column or str

Confidence of the sketch; must be positive and less than 1.0.

seed : Column or str

Random seed.
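
For intuition about how eps and confidence trade accuracy for space, the snippet below applies the sizing rule used by Spark's Java CountMinSketch implementation as far as I know: width = ceil(2 / eps) and depth = ceil(-ln(1 - confidence) / ln 2). Treat the rule as an assumption rather than a documented guarantee; `sketch_dimensions` is a hypothetical helper and not part of PySpark.

```python
import math


def sketch_dimensions(eps, confidence):
    """Return (width, depth) under the assumed sizing rule."""
    if eps <= 0:
        raise ValueError("eps must be positive")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be in (0, 1)")
    # Smaller relative error -> more counters per row.
    width = math.ceil(2 / eps)
    # Higher confidence -> more independent hash rows.
    depth = math.ceil(-math.log1p(-confidence) / math.log(2))
    return width, depth


width, depth = sketch_dimensions(eps=0.25, confidence=0.99)
```

Under this rule, halving eps doubles the width, while each additional row of depth roughly halves the probability that an estimate exceeds the error bound.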

Returns
Column

count-min sketch of the column

Examples

>>> import pyspark.sql.functions as sf
>>> df = spark.createDataFrame([[1], [2], [1]], ['data'])
>>> df = df.agg(sf.count_min_sketch(df.data, sf.lit(0.5), sf.lit(0.5), sf.lit(1)).alias('sketch'))
>>> df.select(sf.hex(df.sketch).alias('r')).collect()
[Row(r='0000000100000000000000030000000100000004000000005D8D6AB90000000000000000000000000000000200000000000000010000000000000000')]
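
The hex string above can be unpacked to see what the sketch holds. The decoder below is a hedged sketch: it assumes a big-endian layout of version (int32), total count (int64), depth (int32), width (int32), one int64 hash seed per row, then depth × width int64 counters. That layout is inferred from the example output and from Spark's Java serializer, and it is not a documented, stable format. `decode_sketch` is a hypothetical helper; the supported route is to deserialize the bytes into a CountMinSketch.

```python
import struct


def decode_sketch(blob):
    """Decode serialized sketch bytes under the assumed layout."""
    header = struct.Struct(">iqii")  # version, total count, depth, width
    version, total, depth, width = header.unpack_from(blob, 0)
    offset = header.size
    # One 64-bit hash seed per row.
    hash_a = struct.unpack_from(f">{depth}q", blob, offset)
    offset += 8 * depth
    # depth * width 64-bit counters, stored row by row.
    cells = struct.unpack_from(f">{depth * width}q", blob, offset)
    table = [list(cells[r * width:(r + 1) * width]) for r in range(depth)]
    return {
        "version": version,
        "total": total,
        "depth": depth,
        "width": width,
        "hash_a": hash_a,
        "table": table,
    }


hex_output = (
    "0000000100000000000000030000000100000004000000005D8D6AB9"
    "0000000000000000000000000000000200000000000000010000000000000000"
)
decoded = decode_sketch(bytes.fromhex(hex_output))
```

Read this way, the example sketch has one row of four counters and a total count of 3: one counter holds 2 (the two rows with value 1) and another holds 1 (the row with value 2).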