pyspark.sql.functions.array_distinct
- pyspark.sql.functions.array_distinct(col)
- Array function: removes duplicate values from the array.
- New in version 2.4.0.
- Changed in version 3.4.0: Supports Spark Connect.
- Parameters
- col : Column or str
- name of column or expression
- Returns
- Column
- A new column that is an array of unique values from the input column. 
 
- Examples

- Example 1: Removing duplicate values from a simple array

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([([1, 2, 3, 2],)], ['data'])
>>> df.select(sf.array_distinct(df.data)).show()
+--------------------+
|array_distinct(data)|
+--------------------+
|           [1, 2, 3]|
+--------------------+

- Example 2: Removing duplicate values from multiple arrays

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([([1, 2, 3, 2],), ([4, 5, 5, 4],)], ['data'])
>>> df.select(sf.array_distinct(df.data)).show()
+--------------------+
|array_distinct(data)|
+--------------------+
|           [1, 2, 3]|
|              [4, 5]|
+--------------------+

- Example 3: Removing duplicate values from an array with all identical values

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([([1, 1, 1],)], ['data'])
>>> df.select(sf.array_distinct(df.data)).show()
+--------------------+
|array_distinct(data)|
+--------------------+
|                 [1]|
+--------------------+

- Example 4: Removing duplicate values from an array with no duplicate values

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([([1, 2, 3],)], ['data'])
>>> df.select(sf.array_distinct(df.data)).show()
+--------------------+
|array_distinct(data)|
+--------------------+
|           [1, 2, 3]|
+--------------------+

- Example 5: Removing duplicate values from an empty array

>>> from pyspark.sql import functions as sf
>>> from pyspark.sql.types import ArrayType, IntegerType, StructType, StructField
>>> schema = StructType([
...   StructField("data", ArrayType(IntegerType()), True)
... ])
>>> df = spark.createDataFrame([([],)], schema)
>>> df.select(sf.array_distinct(df.data)).show()
+--------------------+
|array_distinct(data)|
+--------------------+
|                  []|
+--------------------+
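
To reason about the per-row behavior without a Spark session, the results in the examples above can be mimicked by a small pure-Python model. This is only an illustrative sketch, not Spark's implementation: the name `array_distinct_model` is hypothetical, and keeping elements in first-occurrence order is an assumption drawn from the outputs shown above rather than a documented guarantee. Returning `None` for a null input array is likewise an assumption about typical null propagation.

```python
from typing import Any, List, Optional


def array_distinct_model(values: Optional[List[Any]]) -> Optional[List[Any]]:
    """Hypothetical pure-Python model of array_distinct applied to one row.

    Assumptions (not taken from the Spark documentation):
    - a null (None) input array yields None;
    - surviving elements keep the order of their first occurrence.
    """
    if values is None:
        return None
    distinct: List[Any] = []
    for value in values:
        # A list membership test (rather than a set) also handles
        # unhashable elements such as nested lists.
        if value not in distinct:
            distinct.append(value)
    return distinct


# Mirrors the inputs used in Examples 1 to 5 above.
print(array_distinct_model([1, 2, 3, 2]))
print(array_distinct_model([4, 5, 5, 4]))
print(array_distinct_model([1, 1, 1]))
print(array_distinct_model([1, 2, 3]))
print(array_distinct_model([]))
```

A model like this can be handy for writing expected values in unit tests before comparing them with the output of `sf.array_distinct` on a real DataFrame.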