pyspark.RDD.countApprox

RDD.countApprox(timeout: int, confidence: float = 0.95) → int

Approximate version of count() that returns a potentially incomplete result within a timeout, even if not all tasks have finished.
New in version 1.2.0.
- Parameters
  - timeout : int
    maximum time to wait for the job, in milliseconds
  - confidence : float
    the desired statistical confidence in the result
- Returns
  - int
    a potentially incomplete result, with error bounds
See also

RDD.count()
Examples
>>> rdd = sc.parallelize(range(1000), 10)
>>> rdd.countApprox(1000, 1.0)
1000
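The sketch below (not part of the original examples) illustrates how the timeout and confidence parameters interact. The RDD size, partition count, and timeouts are illustrative assumptions; the values returned depend on how many tasks finish before the timeout, so no outputs are shown.

>>> big_rdd = sc.parallelize(range(10_000_000), 100)
>>> # A generous timeout usually lets every task finish, so the result
>>> # typically matches the exact count.
>>> full = big_rdd.countApprox(timeout=10000, confidence=0.95)
>>> # A very short timeout may leave tasks unfinished, so the returned
>>> # value can be an estimate below the true count.
>>> partial = big_rdd.countApprox(timeout=1, confidence=0.95)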