Class SamplingUtils

Constructor Summary

Method Summary
Modifier and TypeMethodDescriptionstatic double
computeFractionForSampleSize
(int sampleSizeLowerBound, long total, boolean withReplacement) Returns a sampling rate that guarantees a sample of size greater than or equal to sampleSizeLowerBound 99.99% of the time.reservoirSampleAndCount
(scala.collection.Iterator<T> input, int k, long seed, scala.reflect.ClassTag<T> evidence$1) Reservoir sampling implementation that also returns the input size.

Constructor Details

SamplingUtils
public SamplingUtils()


Method Details

reservoirSampleAndCount
public static <T> scala.Tuple2<Object,Object> reservoirSampleAndCount(scala.collection.Iterator<T> input, int k, long seed, scala.reflect.ClassTag<T> evidence$1) Reservoir sampling implementation that also returns the input size. Parameters:
input
 input sizek
 reservoir sizeseed
 random seedevidence$1
 (undocumented) Returns:
 (samples, input size)

computeFractionForSampleSize
public static double computeFractionForSampleSize(int sampleSizeLowerBound, long total, boolean withReplacement) Returns a sampling rate that guarantees a sample of size greater than or equal to sampleSizeLowerBound 99.99% of the time.How the sampling rate is determined:
Let p = num / total, where num is the sample size and total is the total number of datapoints in the RDD. We're trying to compute q > p such that  when sampling with replacement, we're drawing each datapoint with prob_i ~ Pois(q), where we want to guarantee Pr[s < num] < 0.0001 for s = sum(prob_i for i from 0 to total), i.e. the failure rate of not having a sufficiently large sample < 0.0001. Setting q = p + 5 * sqrt(p/total) is sufficient to guarantee 0.9999 success rate for num > 12, but we need a slightly larger q (9 empirically determined).  when sampling without replacement, we're drawing each datapoint with prob_i ~ Binomial(total, fraction) and our choice of q guarantees 1delta, or 0.9999 success rate, where success rate is defined the same as in sampling with replacement.
The smallest sampling rate supported is 1e10 (in order to avoid running into the limit of the RNG's resolution).
 Parameters:
sampleSizeLowerBound
 sample sizetotal
 size of RDDwithReplacement
 whether sampling with replacement Returns:
 a sampling rate that guarantees sufficient sample size with 99.99% success rate
