pyspark.sql.functions.collect_set

pyspark.sql.functions.collect_set(col: ColumnOrName) → pyspark.sql.column.Column

Aggregate function: returns a set of objects with duplicate elements eliminated.

New in version 1.6.0.

Changed in version 3.4.0: Supports Spark Connect.

Parameters
col : Column or str

target column to compute on.

Returns
Column

array of the collected objects with duplicates removed.

Notes

This function is non-deterministic: the order of the collected results depends on the order of the rows, which may itself be non-deterministic after a shuffle.

Examples

>>> from pyspark.sql.functions import array_sort, collect_set
>>> df2 = spark.createDataFrame([(2,), (5,), (5,)], ('age',))
>>> df2.agg(array_sort(collect_set('age')).alias('c')).collect()
[Row(c=[2, 5])]