pyspark.sql.functions.count_min_sketch(col: ColumnOrName, eps: ColumnOrName, confidence: ColumnOrName, seed: ColumnOrName) → pyspark.sql.column.Column

Returns a count-min sketch of a column with the given eps, confidence and seed. The result is an array of bytes, which can be deserialized to a CountMinSketch before use. A count-min sketch is a probabilistic data structure that estimates item frequencies using sub-linear space.

New in version 3.5.0.

col : Column or str

target column to compute on.

eps : Column or str

relative error, must be positive.

confidence : Column or str

confidence, must be positive and less than 1.0.

seed : Column or str

random seed.


Returns a Column containing the count-min sketch of the column as binary data.


>>> from pyspark.sql.functions import count_min_sketch, lit
>>> df = spark.createDataFrame([[1], [2], [1]], ['data'])
>>> df = df.agg(count_min_sketch("data", lit(0.5), lit(0.5), lit(1)).alias('sketch'))
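To make the roles of eps, confidence and seed more concrete, the following is a minimal pure-Python sketch of the data structure itself. It is an illustration only and not Spark's implementation: the class name TinyCountMinSketch, the salted use of Python's built-in hash, and the sizing rule (width = ceil(2 / eps), depth = ceil(-ln(1 - confidence) / ln 2)) are assumptions made for this example and are not stated on this page.

```python
import math
import random


class TinyCountMinSketch:
    """Illustrative count-min sketch (hypothetical; not Spark's implementation)."""

    def __init__(self, eps, confidence, seed):
        if eps <= 0:
            raise ValueError("eps must be positive")
        if not 0 < confidence < 1:
            raise ValueError("confidence must be positive and less than 1.0")
        # Assumed sizing rule: a smaller eps gives a wider table,
        # a higher confidence gives more rows.
        self.width = math.ceil(2 / eps)
        self.depth = math.ceil(-math.log(1 - confidence) / math.log(2))
        # The seed fixes one random salt per row; each salt stands in
        # for an independent hash function.
        rng = random.Random(seed)
        self.salts = [rng.getrandbits(32) for _ in range(self.depth)]
        self.table = [[0] * self.width for _ in range(self.depth)]
        self.total = 0

    def _bucket(self, item, row):
        # Map an item to a column index in the given row.
        return hash((self.salts[row], item)) % self.width

    def add(self, item, count=1):
        # Increment one counter per row.
        for row in range(self.depth):
            self.table[row][self._bucket(item, row)] += count
        self.total += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum across
        # rows is the tightest available upper bound on the true count.
        return min(
            self.table[row][self._bucket(item, row)]
            for row in range(self.depth)
        )


sketch = TinyCountMinSketch(eps=0.5, confidence=0.9, seed=1)
for value in [1, 2, 1]:
    sketch.add(value)

# Estimates never fall below the true count, but may exceed it.
print(sketch.estimate(1), sketch.estimate(2), sketch.total)
```

The memory used depends only on width and depth, not on the number of distinct values, which is where the sub-linear space comes from. The price is that estimate can overcount when different values collide in every row.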