pyspark.sql.functions.tuple_difference_double

pyspark.sql.functions.tuple_difference_double(col1, col2)

Returns the set difference of two DataSketches TupleSketch objects with double summaries, that is, the entries whose keys are in the first sketch but not in the second.

New in version 4.2.0.

Parameters
col1 : Column or column name

The first TupleSketch column

col2 : Column or column name

The second TupleSketch column

Returns
Column

The binary representation of the difference TupleSketch.

Examples

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([(1, 10.0, 4, 40.0), (2, 20.0, 4, 40.0), (3, 30.0, 5, 50.0), (4, 40.0, 5, 50.0)], ["key1", "v1", "key2", "v2"])  # noqa
>>> df = df.agg(
...     sf.tuple_sketch_agg_double("key1", "v1").alias("sketch1"),
...     sf.tuple_sketch_agg_double("key2", "v2").alias("sketch2")
... )
>>> df.select(sf.tuple_sketch_estimate_double(sf.tuple_difference_double(df.sketch1, "sketch2"))).show()  # noqa
+-----------------------------------------------------------------------+
|tuple_sketch_estimate_double(tuple_difference_double(sketch1, sketch2))|
+-----------------------------------------------------------------------+
|                                                                    3.0|
+-----------------------------------------------------------------------+
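
The estimate of 3.0 matches the exact set difference of the example keys: sketch1 holds keys 1 to 4, sketch2 holds keys 4 and 5, so keys 1, 2 and 3 remain. The following is a minimal plain-Python sketch of that reasoning, not of the sketch implementation itself. It assumes that the difference keeps the entries of the first sketch whose keys are absent from the second, together with the first sketch's summaries. The helper `exact_difference` and the dict model are hypothetical names used only for illustration; they are not part of PySpark or DataSketches.

```python
# Exact, in-memory model of the set difference (no Spark required).
# Each "sketch" is modelled as a dict mapping key -> double summary.
# Assumption: entries surviving the difference keep the first sketch's summary.


def exact_difference(first, second):
    """Hypothetical helper: keep entries of `first` whose keys are not in `second`."""
    return {key: summary for key, summary in first.items() if key not in second}


# Same data as the example above: (key1, v1) pairs and distinct (key2, v2) pairs.
sketch1 = {1: 10.0, 2: 20.0, 3: 30.0, 4: 40.0}
sketch2 = {4: 40.0, 5: 50.0}

diff = exact_difference(sketch1, sketch2)
print(sorted(diff))      # [1, 2, 3]
print(float(len(diff)))  # 3.0
```

A real TupleSketch is a probabilistic structure, so on large inputs the estimate is approximate rather than an exact count as in this model.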