Package org.apache.spark.util.sketch
Class BloomFilter
Object
org.apache.spark.util.sketch.BloomFilter
A Bloom filter is a space-efficient probabilistic data structure that offers an approximate
containment test with one-sided error: if it claims that an item is contained in it, this
might be in error, but if it claims that an item is not contained in it, then this is
definitely true. Currently supported data types include:
The false positive probability (
FPP) of a Bloom filter is defined as the probability that
mightContain(Object) will erroneously return true for an object that has
not actually been put in the BloomFilter.
The implementation is largely based on the BloomFilter class from Guava.-
Nested Class Summary
Nested Classes -
Constructor Summary
Constructors -
Method Summary
Modifier and TypeMethodDescriptionabstract longbitSize()Returns the number of bits in the underlying bit array.longstatic BloomFiltercreate(long expectedNumItems) Creates aBloomFilterwith the expected number of insertions and a default expected false positive probability of 3%.static BloomFiltercreate(long expectedNumItems, double fpp) Creates aBloomFilterwith the expected number of insertions and expected false positive probability.static BloomFiltercreate(long expectedNumItems, long numBits) Creates aBloomFilterwith givenexpectedNumItemsandnumBits, it will pick an optimalnumHashFunctionswhich can minimizefppfor the bloom filter.abstract doubleReturns the probability that mightContain(Object) erroneously returntruefor an object that has not actually been put in theBloomFilter.abstract BloomFilterintersectInPlace(BloomFilter other) Combines this bloom filter with another bloom filter by performing a bitwise AND of the underlying data.abstract booleanisCompatible(BloomFilter other) Determines whether a given bloom filter is compatible with this bloom filter.abstract BloomFiltermergeInPlace(BloomFilter other) Combines this bloom filter with another bloom filter by performing a bitwise OR of the underlying data.abstract booleanmightContain(Object item) Returnstrueif the element might have been put in this Bloom filter,falseif this is definitely not the case.abstract booleanmightContainBinary(byte[] item) A specialized variant ofmightContain(Object)that only tests byte array items.abstract booleanmightContainLong(long item) A specialized variant ofmightContain(Object)that only testslongitems.abstract booleanmightContainString(String item) A specialized variant ofmightContain(Object)that only testsStringitems.static longoptimalNumOfBits(long n, double p) Computes m (total bits of Bloom filter) which is expected to achieve, for the specified expected insertions, the required false positive probability.static longoptimalNumOfBits(long expectedNumItems, long maxNumItems, long maxNumOfBits) Computes m (total bits of Bloom filter) which is expected to achieve.abstract booleanPuts an item into thisBloomFilter.abstract booleanputBinary(byte[] item) A specialized variant ofput(Object)that only supports byte array items.abstract booleanputLong(long item) A specialized variant ofput(Object)that only supportslongitems.abstract booleanA specialized variant ofput(Object)that only supportsStringitems.static BloomFilterreadFrom(byte[] bytes) Reads in aBloomFilterfrom a byte array.static BloomFilterreadFrom(InputStream in) Reads in aBloomFilterfrom an input stream.abstract voidwriteTo(OutputStream out) Writes out thisBloomFilterto an output stream in binary format.
-
Constructor Details
-
BloomFilter
public BloomFilter()
-
-
Method Details
-
expectedFpp
public abstract double expectedFpp()Returns the probability that mightContain(Object) erroneously returntruefor an object that has not actually been put in theBloomFilter. Ideally, this number should be close to thefppparameter passed in create(long, double), or smaller. If it is significantly higher, it is usually the case that too many items (more than expected) have been put in theBloomFilter, degenerating it. -
bitSize
public abstract long bitSize()Returns the number of bits in the underlying bit array. -
put
Puts an item into thisBloomFilter. Ensures that subsequent invocations of mightContain(Object) with the same item will always returntrue.- Returns:
- true if the bloom filter's bits changed as a result of this operation. If the bits
changed, this is definitely the first time
objecthas been added to the filter. If the bits haven't changed, this might be the first timeobjecthas been added to the filter. Note thatput(t)always returns the opposite result to whatmightContain(t)would have returned at the time it is called.
-
putString
A specialized variant ofput(Object)that only supportsStringitems. -
putLong
public abstract boolean putLong(long item) A specialized variant ofput(Object)that only supportslongitems. -
putBinary
public abstract boolean putBinary(byte[] item) A specialized variant ofput(Object)that only supports byte array items. -
isCompatible
Determines whether a given bloom filter is compatible with this bloom filter. For two bloom filters to be compatible, they must have the same bit size.- Parameters:
other- The bloom filter to check for compatibility.
-
mergeInPlace
Combines this bloom filter with another bloom filter by performing a bitwise OR of the underlying data. The mutations happen to this instance. Callers must ensure the bloom filters are appropriately sized to avoid saturating them.- Parameters:
other- The bloom filter to combine this bloom filter with. It is not mutated.- Throws:
IncompatibleMergeException- ifisCompatible(other) == false
-
intersectInPlace
Combines this bloom filter with another bloom filter by performing a bitwise AND of the underlying data. The mutations happen to this instance. Callers must ensure the bloom filters are appropriately sized to avoid saturating them.- Parameters:
other- The bloom filter to combine this bloom filter with. It is not mutated.- Throws:
IncompatibleMergeException- ifisCompatible(other) == false
-
mightContain
Returnstrueif the element might have been put in this Bloom filter,falseif this is definitely not the case. -
mightContainString
A specialized variant ofmightContain(Object)that only testsStringitems. -
mightContainLong
public abstract boolean mightContainLong(long item) A specialized variant ofmightContain(Object)that only testslongitems. -
mightContainBinary
public abstract boolean mightContainBinary(byte[] item) A specialized variant ofmightContain(Object)that only tests byte array items. -
writeTo
Writes out thisBloomFilterto an output stream in binary format. It is the caller's responsibility to close the stream.- Throws:
IOException
-
cardinality
public long cardinality()- Returns:
- the number of set bits in this
BloomFilter.
-
readFrom
Reads in aBloomFilterfrom an input stream. It is the caller's responsibility to close the stream.- Throws:
IOException
-
readFrom
Reads in aBloomFilterfrom a byte array.- Throws:
IOException
-
optimalNumOfBits
public static long optimalNumOfBits(long n, double p) Computes m (total bits of Bloom filter) which is expected to achieve, for the specified expected insertions, the required false positive probability. See http://en.wikipedia.org/wiki/Bloom_filter#Probability_of_false_positives for the formula.- Parameters:
n- expected insertions (must be positive)p- false positive rate (must be 0 < p < 1)
-
optimalNumOfBits
public static long optimalNumOfBits(long expectedNumItems, long maxNumItems, long maxNumOfBits) Computes m (total bits of Bloom filter) which is expected to achieve. The smaller the expectedNumItems, the smaller the fpp. -
create
Creates aBloomFilterwith the expected number of insertions and a default expected false positive probability of 3%. Note that overflowing aBloomFilterwith significantly more elements than specified, will result in its saturation, and a sharp deterioration of its false positive probability. -
create
Creates aBloomFilterwith the expected number of insertions and expected false positive probability. Note that overflowing aBloomFilterwith significantly more elements than specified, will result in its saturation, and a sharp deterioration of its false positive probability. -
create
Creates aBloomFilterwith givenexpectedNumItemsandnumBits, it will pick an optimalnumHashFunctionswhich can minimizefppfor the bloom filter.
-