pyspark.RDD.countApprox

RDD.countApprox(timeout: int, confidence: float = 0.95) → int

Approximate version of count() that returns a potentially incomplete result within a timeout, even if not all tasks have finished.
New in version 1.2.0.
- Parameters
  - timeout : int
    maximum time to wait for the job, in milliseconds
  - confidence : float
    the desired statistical confidence in the result
- Returns
  - int
    a potentially incomplete result, with error bounds
See also

RDD.count()
Examples
>>> rdd = sc.parallelize(range(1000), 10)
>>> rdd.countApprox(1000, 1.0)
1000
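The sketch below (not part of the original examples) illustrates how the timeout and confidence parameters interact. The RDD size, partition count, and timeouts are illustrative assumptions; the values returned depend on how many tasks finish before the timeout, so no outputs are shown.

>>> big_rdd = sc.parallelize(range(10_000_000), 100)
>>> # A generous timeout usually lets every task finish, so the result
>>> # typically matches the exact count.
>>> full = big_rdd.countApprox(timeout=10000, confidence=0.95)
>>> # A very short timeout may leave tasks unfinished, so the returned
>>> # value can be an estimate below the true count.
>>> partial = big_rdd.countApprox(timeout=1, confidence=0.95)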