pyspark.sql.functions.hll_union_agg

pyspark.sql.functions.hll_union_agg(col: ColumnOrName, allowDifferentLgConfigK: Union[bool, pyspark.sql.column.Column, None] = None) → pyspark.sql.column.Column

Aggregate function: returns the updatable binary representation of the Datasketches HllSketch, generated by merging previously created Datasketches HllSketch instances via a Datasketches Union instance. Throws an exception if sketches have different lgConfigK values and allowDifferentLgConfigK is unset or set to false.

New in version 3.5.0.

Parameters
col : Column or str

The column holding the binary HllSketch representations to merge.

allowDifferentLgConfigK : bool or Column, optional

Allow sketches with different lgConfigK values to be merged (defaults to false).

Returns
Column

The binary representation of the merged HllSketch.

Examples

>>> from pyspark.sql.functions import col, hll_sketch_agg, hll_sketch_estimate, hll_union_agg, lit
>>> df1 = spark.createDataFrame([1,2,2,3], "INT")
>>> df1 = df1.agg(hll_sketch_agg("value").alias("sketch"))
>>> df2 = spark.createDataFrame([4,5,5,6], "INT")
>>> df2 = df2.agg(hll_sketch_agg("value").alias("sketch"))
>>> df3 = df1.union(df2).agg(hll_sketch_estimate(
...     hll_union_agg("sketch")
... ).alias("distinct_cnt"))
>>> df3.drop("sketch").show()
+------------+
|distinct_cnt|
+------------+
|           6|
+------------+
>>> df4 = df1.union(df2).agg(hll_sketch_estimate(
...     hll_union_agg("sketch", lit(False))
... ).alias("distinct_cnt"))
>>> df4.drop("sketch").show()
+------------+
|distinct_cnt|
+------------+
|           6|
+------------+
>>> df5 = df1.union(df2).agg(hll_sketch_estimate(
...     hll_union_agg(col("sketch"), lit(False))
... ).alias("distinct_cnt"))
>>> df5.drop("sketch").show()
+------------+
|distinct_cnt|
+------------+
|           6|
+------------+