pyspark.sql.functions.tuple_intersection_agg_integer

pyspark.sql.functions.tuple_intersection_agg_integer(col, mode=None)

Aggregate function: returns the compact binary representation of the DataSketches TupleSketch formed by intersecting all integer TupleSketch objects in the input column.

New in version 4.2.0.

Parameters
col : Column or column name

The column containing binary TupleSketch representations.

mode : Column or str, optional

The summary mode used to combine the integer summaries of matching keys: “sum” (default), “min”, “max”, or “alwaysone”.

Returns
Column

The binary representation of the intersected TupleSketch.

Examples

>>> from pyspark.sql import functions as sf
>>> df1 = spark.createDataFrame([(1, 10), (2, 20), (3, 30)], ["key", "value"])
>>> df1 = df1.agg(sf.tuple_sketch_agg_integer("key", "value").alias("sketch"))
>>> df2 = spark.createDataFrame([(2, 40), (3, 50), (4, 60)], ["key", "value"])
>>> df2 = df2.agg(sf.tuple_sketch_agg_integer("key", "value").alias("sketch"))
>>> df3 = df1.union(df2)
>>> df3.agg(sf.tuple_sketch_estimate_integer(sf.tuple_intersection_agg_integer("sketch"))).show()
+--------------------------------------------------------------------------+
|tuple_sketch_estimate_integer(tuple_intersection_agg_integer(sketch, sum))|
+--------------------------------------------------------------------------+
|                                                                       2.0|
+--------------------------------------------------------------------------+
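
The estimate of 2.0 above reflects the two keys (2 and 3) that appear in both input sketches. As a rough conceptual model only, the following plain-Python sketch reproduces that result with exact dictionaries in place of approximate sketches. The names intersect_exact and COMBINERS are hypothetical and are not part of PySpark or DataSketches, and the treatment of each mode is an assumption based on the mode names rather than a description of the library's internals.

```python
from functools import reduce

# Hypothetical combiners, one per summary mode named in the Parameters section.
COMBINERS = {
    "sum": lambda a, b: a + b,
    "min": min,
    "max": max,
    "alwaysone": lambda a, b: 1,
}


def intersect_exact(tables, mode="sum"):
    """Keep keys present in every table and merge their integer summaries."""
    combine = COMBINERS[mode]

    def step(left, right):
        # Only keys found on both sides survive the intersection.
        shared = left.keys() & right.keys()
        return {k: combine(left[k], right[k]) for k in shared}

    return reduce(step, tables)


# Same key/value pairs as df1 and df2 in the example above.
table1 = {1: 10, 2: 20, 3: 30}
table2 = {2: 40, 3: 50, 4: 60}

result = intersect_exact([table1, table2])
print(len(result))             # 2
print(sorted(result.items()))  # [(2, 60), (3, 80)]
```

A real TupleSketch stores hashed keys and returns an approximate count, so this model only matches exactly on small inputs such as these.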