pyspark.sql.functions.approx_count_distinct#
- pyspark.sql.functions.approx_count_distinct(col, rsd=None)[source]#
This aggregate function returns a new Column, which estimates the approximate distinct count of elements in a specified column or a group of columns.

New in version 2.1.0.

Changed in version 3.4.0: Supports Spark Connect.
- Parameters
- col : Column or str
The label of the column to count distinct values in.
- rsd : float, optional
The maximum allowed relative standard deviation (default = 0.05). If rsd < 0.01, it is more efficient to use count_distinct().
- Returns
Column
A new Column object representing the approximate distinct count.
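As a rough guide to what rsd costs in memory: the standard HyperLogLog analysis gives a relative standard deviation of about 1.04/sqrt(m) for a sketch with m registers, so halving rsd roughly quadruples the sketch size. A small sketch of that relationship, assuming the textbook formula (Spark's HyperLogLog++ sizing may differ in detail):

```python
import math

def registers_for_rsd(rsd):
    """Smallest power-of-two register count m with 1.04 / sqrt(m) <= rsd.

    Uses the textbook HyperLogLog error formula; illustrative only.
    """
    m = (1.04 / rsd) ** 2          # solve rsd = 1.04 / sqrt(m) for m
    return 1 << math.ceil(math.log2(m))  # round up to a power of two

# The default rsd of 0.05 needs only a few hundred registers,
# while rsd = 0.01 needs tens of thousands:
registers_for_rsd(0.05)  # → 512
registers_for_rsd(0.01)  # → 16384
```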
Examples
Example 1: Counting distinct values in a single column DataFrame representing integers
>>> from pyspark.sql.functions import approx_count_distinct
>>> df = spark.createDataFrame([1, 2, 2, 3], "int")
>>> df.agg(approx_count_distinct("value").alias('distinct_values')).show()
+---------------+
|distinct_values|
+---------------+
|              3|
+---------------+
Example 2: Counting distinct values in a single column DataFrame representing strings
>>> from pyspark.sql.functions import approx_count_distinct
>>> df = spark.createDataFrame([("apple",), ("orange",), ("apple",), ("banana",)], ['fruit'])
>>> df.agg(approx_count_distinct("fruit").alias('distinct_fruits')).show()
+---------------+
|distinct_fruits|
+---------------+
|              3|
+---------------+
Example 3: Counting distinct values in a DataFrame with multiple columns
>>> from pyspark.sql.functions import approx_count_distinct, struct
>>> df = spark.createDataFrame([("Alice", 1),
...                             ("Alice", 2),
...                             ("Bob", 3),
...                             ("Bob", 3)], ["name", "value"])
>>> df = df.withColumn("combined", struct("name", "value"))
>>> df.agg(approx_count_distinct("combined").alias('distinct_pairs')).show()
+--------------+
|distinct_pairs|
+--------------+
|             3|
+--------------+
Example 4: Counting distinct values with a specified relative standard deviation
>>> from pyspark.sql.functions import approx_count_distinct
>>> df = spark.range(100000)
>>> df.agg(approx_count_distinct("id").alias('with_default_rsd'),
...        approx_count_distinct("id", 0.1).alias('with_rsd_0.1')).show()
+----------------+------------+
|with_default_rsd|with_rsd_0.1|
+----------------+------------+
|           95546|      102065|
+----------------+------------+
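The estimator behind this function is HyperLogLog++. To make the approximation behavior in the examples above concrete, here is a minimal pure-Python HyperLogLog sketch; it is a simplified illustration only (no bias correction or sparse representation, unlike Spark's implementation), and the function name and parameters are hypothetical:

```python
import hashlib
import math

def hll_estimate(values, p=10):
    """Estimate the distinct count of `values` using 2**p HLL registers."""
    m = 1 << p
    registers = [0] * m
    for v in values:
        # 64-bit hash of the value's string form.
        h = int.from_bytes(hashlib.sha256(str(v).encode()).digest()[:8], "big")
        idx = h >> (64 - p)                # first p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)   # remaining 64 - p bits
        # rank = position of the leftmost 1-bit in the remaining bits
        rank = (64 - p) - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    # Small-range correction (linear counting) when many registers are empty.
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        return m * math.log(m / zeros)
    return raw
```

With p = 10 the theoretical relative standard deviation is about 1.04/sqrt(1024), roughly 3%, which matches the kind of deviation seen in Example 4: duplicates do not inflate the estimate, and the result lands near, but not exactly on, the true distinct count.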