pyspark.RDD.countApprox#

RDD.countApprox(timeout, confidence=0.95)[source]#

Approximate version of count() that returns a potentially incomplete result within a timeout, even if not all tasks have finished.

New in version 1.2.0.

Parameters
timeout : int

maximum time to wait for the job, in milliseconds

confidence : float

the desired statistical confidence in the result, between 0.0 and 1.0

Returns
int

a potentially incomplete count; if the timeout expires before all tasks finish, the value reflects only the partitions counted so far

See also

RDD.count()

Examples

>>> rdd = sc.parallelize(range(1000), 10)
>>> rdd.countApprox(1000, 1.0)
1000
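The timeout semantics can be sketched in plain Python. This is an illustration of the idea only, not Spark's implementation: a hypothetical `count_approx` helper counts partition by partition and returns whatever partial total it has when the deadline passes.

```python
import time

def count_approx(partitions, timeout_ms):
    """Count elements partition by partition, stopping at the deadline.

    Mirrors the idea behind countApprox: if the timeout (milliseconds)
    expires before every partition is processed, the partial count
    accumulated so far is returned.
    """
    deadline = time.monotonic() + timeout_ms / 1000.0
    total = 0
    for part in partitions:
        if time.monotonic() >= deadline:
            break  # timeout reached: return the incomplete count
        total += len(part)
    return total

# Ten "partitions" of 100 elements each, like parallelize(range(1000), 10).
parts = [list(range(i * 100, (i + 1) * 100)) for i in range(10)]
print(count_approx(parts, 1000))  # ample timeout: full count of 1000
```

With a generous timeout the helper counts every partition, matching the exact `count()`; with a very short one it may return less than the true total, which is the trade-off `countApprox` makes to bound latency.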