pyspark.RDD.sumApprox

RDD.sumApprox(timeout, confidence=0.95)[source]

Approximate operation: return the sum of the RDD's elements within the given timeout (in milliseconds), or as soon as the estimate reaches the requested confidence, whichever comes first. `confidence` is the probability that the true sum lies within the returned estimate's error bounds.

Examples

>>> rdd = sc.parallelize(range(1000), 10)
>>> r = sum(range(1000))
>>> abs(rdd.sumApprox(1000) - r) / r < 0.05
True