pyspark.sql.functions.hll_union

pyspark.sql.functions.hll_union(col1: ColumnOrName, col2: ColumnOrName, allowDifferentLgConfigK: Optional[bool] = None) → pyspark.sql.column.Column

Merges two binary representations of DataSketches HllSketch objects using a DataSketches Union object. Throws an exception if the sketches have different lgConfigK values and allowDifferentLgConfigK is unset or set to false.

New in version 3.5.0.

Parameters
col1 : Column or str
col2 : Column or str
allowDifferentLgConfigK : bool, optional

Allow sketches with different lgConfigK values to be merged (defaults to false).

Returns
Column

The binary representation of the merged HllSketch.

Examples

>>> from pyspark.sql.functions import hll_sketch_agg, hll_sketch_estimate, hll_union
>>> df = spark.createDataFrame([(1,4),(2,5),(2,5),(3,6)], "struct<v1:int,v2:int>")
>>> df = df.agg(hll_sketch_agg("v1").alias("sketch1"), hll_sketch_agg("v2").alias("sketch2"))
>>> df = df.withColumn("distinct_cnt", hll_sketch_estimate(hll_union("sketch1", "sketch2")))
>>> df.drop("sketch1", "sketch2").show()
+------------+
|distinct_cnt|
+------------+
|           6|
+------------+