pyspark.sql.functions.listagg_distinct

pyspark.sql.functions.listagg_distinct(col, delimiter=None)

Aggregate function: returns the concatenation of distinct non-null input values, separated by the delimiter.

New in version 4.0.0.

Parameters

col: Column or column name
    Target column to compute on.
delimiter: Column, literal string or bytes, optional
    The delimiter to separate the values. The default value is None, in which case the values are concatenated with no separator.

Returns

Column: the column holding the concatenated result, or NULL when every input value is null.

Examples

Example 1: Using listagg_distinct function

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([('a',), ('b',), (None,), ('c',), ('b',)], ['strings'])
>>> df.select(sf.listagg_distinct('strings')).show()
+-------------------------------+
|listagg(DISTINCT strings, NULL)|
+-------------------------------+
|                            abc|
+-------------------------------+

Example 2: Using listagg_distinct function with a delimiter

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([('a',), ('b',), (None,), ('c',), ('b',)], ['strings'])
>>> df.select(sf.listagg_distinct('strings', ', ')).show()
+-----------------------------+
|listagg(DISTINCT strings, , )|
+-----------------------------+
|                      a, b, c|
+-----------------------------+

Example 3: Using listagg_distinct function with a binary column and delimiter

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([(b'\x01',), (b'\x02',), (None,), (b'\x03',), (b'\x02',)],
...                            ['bytes'])
>>> df.select(sf.listagg_distinct('bytes', b'B')).show()
+------------------------------+
|listagg(DISTINCT bytes, X'42')|
+------------------------------+
|              [01 42 02 42 03]|
+------------------------------+

Example 4: Using listagg_distinct function on a column with all None values

>>> from pyspark.sql import functions as sf
>>> from pyspark.sql.types import StructType, StructField, StringType
>>> schema = StructType([StructField("strings", StringType(), True)])
>>> df = spark.createDataFrame([(None,), (None,), (None,), (None,)], schema=schema)
>>> df.select(sf.listagg_distinct('strings')).show()
+-------------------------------+
|listagg(DISTINCT strings, NULL)|
+-------------------------------+
|                           NULL|
+-------------------------------+
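
For readers who want to reason about the results above without a Spark session, the semantics can be approximated in plain Python. This is only a minimal sketch, not how Spark implements the aggregate: the helper name `listagg_distinct_model` is hypothetical, and it assumes first-seen ordering of the distinct values, whereas the order produced by Spark may differ.

```python
from typing import Iterable, Optional, Union

Value = Union[str, bytes]


def listagg_distinct_model(
    values: Iterable[Optional[Value]],
    delimiter: Optional[Value] = None,
) -> Optional[Value]:
    """Hypothetical pure-Python model of listagg_distinct (not a Spark API)."""
    # Drop nulls, then de-duplicate while keeping first-seen order
    # (an assumption of this sketch; Spark may order values differently).
    distinct = list(dict.fromkeys(v for v in values if v is not None))
    if not distinct:
        # All inputs were null: the aggregate yields NULL (None here).
        return None
    if delimiter is None:
        # No delimiter: join with an empty str or bytes of the matching type.
        delimiter = type(distinct[0])()
    return delimiter.join(distinct)


print(listagg_distinct_model(["a", "b", None, "c", "b"]))
print(listagg_distinct_model(["a", "b", None, "c", "b"], ", "))
print(listagg_distinct_model([None, None]))
```

The three calls mirror Examples 1, 2 and 4: duplicates and nulls are discarded before joining, and an all-null input gives `None` rather than an empty string.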