pyspark.sql.functions.count_distinct

pyspark.sql.functions.count_distinct(col: ColumnOrName, *cols: ColumnOrName) → pyspark.sql.column.Column[source]

Returns a new Column for distinct count of col or cols.

New in version 3.2.0.

Changed in version 3.4.0: Supports Spark Connect.

Parameters
col : Column or str

first column to compute on.

cols : Column or str

other columns to compute on.

Returns
Column

the number of distinct values across the given column(s).

Examples

>>> from pyspark.sql import types
>>> from pyspark.sql.functions import count_distinct
>>> df1 = spark.createDataFrame([1, 1, 3], types.IntegerType())
>>> df2 = spark.createDataFrame([1, 2], types.IntegerType())
>>> df1.join(df2).show()
+-----+-----+
|value|value|
+-----+-----+
|    1|    1|
|    1|    2|
|    1|    1|
|    1|    2|
|    3|    1|
|    3|    2|
+-----+-----+
>>> df1.join(df2).select(count_distinct(df1.value, df2.value)).show()
+----------------------------+
|count(DISTINCT value, value)|
+----------------------------+
|                           4|
+----------------------------+