public class SamplingUtils
extends Object
| Constructor and Description | 
|---|
| SamplingUtils() | 
| Modifier and Type | Method and Description | 
|---|---|
| static double | computeFractionForSampleSize(int sampleSizeLowerBound,
                            long total,
                            boolean withReplacement)Returns a sampling rate that guarantees a sample of size greater than or equal to
 sampleSizeLowerBound 99.99% of the time. | 
| static <T> scala.Tuple2<Object,Object> | reservoirSampleAndCount(scala.collection.Iterator<T> input,
                       int k,
                       long seed,
                       scala.reflect.ClassTag<T> evidence$1)Reservoir sampling implementation that also returns the input size. | 
public static <T> scala.Tuple2<Object,Object> reservoirSampleAndCount(scala.collection.Iterator<T> input,
                                                                      int k,
                                                                      long seed,
                                                                      scala.reflect.ClassTag<T> evidence$1)
input - input sizek - reservoir sizeseed - random seedevidence$1 - (undocumented)public static double computeFractionForSampleSize(int sampleSizeLowerBound,
                                                  long total,
                                                  boolean withReplacement)
How the sampling rate is determined:
Let p = num / total, where num is the sample size and total is the total number of datapoints in the RDD. We're trying to compute q > p such that - when sampling with replacement, we're drawing each datapoint with prob_i ~ Pois(q), where we want to guarantee Pr[s < num] < 0.0001 for s = sum(prob_i for i from 0 to total), i.e. the failure rate of not having a sufficiently large sample < 0.0001. Setting q = p + 5 * sqrt(p/total) is sufficient to guarantee 0.9999 success rate for num > 12, but we need a slightly larger q (9 empirically determined). - when sampling without replacement, we're drawing each datapoint with prob_i ~ Binomial(total, fraction) and our choice of q guarantees 1-delta, or 0.9999 success rate, where success rate is defined the same as in sampling with replacement.
The smallest sampling rate supported is 1e-10 (in order to avoid running into the limit of the RNG's resolution).
sampleSizeLowerBound - sample sizetotal - size of RDDwithReplacement - whether sampling with replacement