pyspark.RDD.countApproxDistinct

RDD.countApproxDistinct(relativeSD=0.05)

Return the approximate number of distinct elements in the RDD.

Parameters:
relativeSD : float, optional

Relative accuracy. Smaller values create counters that require more space. It must be greater than 0.000017.
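To give a rough sense of why smaller values need more space, the sketch below uses the textbook HyperLogLog error bound, in which the relative standard error is about 1.04 / sqrt(m) for m registers. The helper name registers_for is hypothetical, and the constant and rounding rule are assumptions taken from the general HyperLogLog analysis; Spark's own precision calculation may differ in detail.

```python
import math


def registers_for(relative_sd):
    """Estimate how many HyperLogLog registers a target relative error needs.

    Assumes the textbook bound: relative standard error ~= 1.04 / sqrt(m).
    This is an illustration, not Spark's internal formula.
    """
    if relative_sd <= 0:
        raise ValueError("relative_sd must be positive")
    # Solve 1.04 / sqrt(m) <= relative_sd for m.
    needed = (1.04 / relative_sd) ** 2
    # HyperLogLog uses a power-of-two register count, so round up.
    precision_bits = math.ceil(math.log2(needed))
    return 1 << precision_bits


for sd in (0.05, 0.01):
    print(sd, registers_for(sd))
```

Under these assumptions the default of 0.05 corresponds to 512 registers, while 0.01 corresponds to 16384: tightening the error by a factor of 5 costs roughly 25 times more registers, rounded up to a power of two.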

Notes

The algorithm used is based on streamlib’s implementation of “HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm”.
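For intuition about the estimator, here is a minimal pure-Python sketch of the classic HyperLogLog algorithm with a linear-counting correction for small cardinalities. It is not the refined variant described in the paper, and it is not Spark's or streamlib's implementation; the function name hll_estimate, the use of SHA-1 as the hash, and the default of 10 precision bits are all assumptions chosen for illustration.

```python
import hashlib
import math


def hll_estimate(items, p=10):
    """Approximate the number of distinct items with 2**p registers.

    Illustrative only: classic HyperLogLog plus linear counting,
    with the bias constant that is valid for 128 or more registers.
    """
    m = 1 << p
    registers = [0] * m
    for item in items:
        # Derive a 64-bit hash from the item's string form.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # The top p bits select a register.
        idx = h >> (64 - p)
        # The remaining bits give the rank: position of the leftmost 1-bit.
        rest = h & ((1 << (64 - p)) - 1)
        rank = (64 - p) - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)

    alpha = 0.7213 / (1 + 1.079 / m)
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)

    # Small-range correction: fall back to linear counting.
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        return m * math.log(m / zeros)
    return raw


n = hll_estimate(str(i) for i in range(1000))
print(850 < n < 1150)

n = hll_estimate(i % 20 for i in range(1000))
print(16 < n < 24)
```

Each register keeps only the largest rank it has seen, so memory depends on the number of registers rather than on the number of elements. Duplicates hash to the same register and rank, which is why they do not change the estimate.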

Examples

>>> n = sc.parallelize(range(1000)).map(str).countApproxDistinct()
>>> 900 < n < 1100
True
>>> n = sc.parallelize([i % 20 for i in range(1000)]).countApproxDistinct()
>>> 16 < n < 24
True