Package org.apache.spark.util.sketch
Class BloomFilter
Object
org.apache.spark.util.sketch.BloomFilter
A Bloom filter is a space-efficient probabilistic data structure that offers an approximate
containment test with one-sided error: if it claims that an item is contained in it, this
might be in error, but if it claims that an item is not contained in it, then this is
definitely true. Currently supported data types include:
The false positive probability (
FPP
) of a Bloom filter is defined as the probability that
mightContain(Object) will erroneously return true
for an object that has
not actually been put in the BloomFilter
.
The implementation is largely based on the BloomFilter
class from Guava.-
Nested Class Summary
-
Constructor Summary
-
Method Summary
Modifier and TypeMethodDescriptionabstract long
bitSize()
Returns the number of bits in the underlying bit array.long
static BloomFilter
create
(long expectedNumItems) Creates aBloomFilter
with the expected number of insertions and a default expected false positive probability of 3%.static BloomFilter
create
(long expectedNumItems, double fpp) Creates aBloomFilter
with the expected number of insertions and expected false positive probability.static BloomFilter
create
(long expectedNumItems, long numBits) Creates aBloomFilter
with givenexpectedNumItems
andnumBits
, it will pick an optimalnumHashFunctions
which can minimizefpp
for the bloom filter.abstract double
Returns the probability that mightContain(Object) erroneously returntrue
for an object that has not actually been put in theBloomFilter
.abstract BloomFilter
intersectInPlace
(BloomFilter other) Combines this bloom filter with another bloom filter by performing a bitwise AND of the underlying data.abstract boolean
isCompatible
(BloomFilter other) Determines whether a given bloom filter is compatible with this bloom filter.abstract BloomFilter
mergeInPlace
(BloomFilter other) Combines this bloom filter with another bloom filter by performing a bitwise OR of the underlying data.abstract boolean
mightContain
(Object item) Returnstrue
if the element might have been put in this Bloom filter,false
if this is definitely not the case.abstract boolean
mightContainBinary
(byte[] item) A specialized variant ofmightContain(Object)
that only tests byte array items.abstract boolean
mightContainLong
(long item) A specialized variant ofmightContain(Object)
that only testslong
items.abstract boolean
mightContainString
(String item) A specialized variant ofmightContain(Object)
that only testsString
items.static long
optimalNumOfBits
(long n, double p) Computes m (total bits of Bloom filter) which is expected to achieve, for the specified expected insertions, the required false positive probability.static long
optimalNumOfBits
(long expectedNumItems, long maxNumItems, long maxNumOfBits) Computes m (total bits of Bloom filter) which is expected to achieve.abstract boolean
Puts an item into thisBloomFilter
.abstract boolean
putBinary
(byte[] item) A specialized variant ofput(Object)
that only supports byte array items.abstract boolean
putLong
(long item) A specialized variant ofput(Object)
that only supportslong
items.abstract boolean
A specialized variant ofput(Object)
that only supportsString
items.static BloomFilter
readFrom
(InputStream in) Reads in aBloomFilter
from an input stream.abstract void
writeTo
(OutputStream out) Writes out thisBloomFilter
to an output stream in binary format.
-
Constructor Details
-
BloomFilter
public BloomFilter()
-
-
Method Details
-
expectedFpp
public abstract double expectedFpp()Returns the probability that mightContain(Object) erroneously returntrue
for an object that has not actually been put in theBloomFilter
. Ideally, this number should be close to thefpp
parameter passed in create(long, double), or smaller. If it is significantly higher, it is usually the case that too many items (more than expected) have been put in theBloomFilter
, degenerating it. -
bitSize
public abstract long bitSize()Returns the number of bits in the underlying bit array. -
put
Puts an item into thisBloomFilter
. Ensures that subsequent invocations of mightContain(Object) with the same item will always returntrue
.- Returns:
- true if the bloom filter's bits changed as a result of this operation. If the bits
changed, this is definitely the first time
object
has been added to the filter. If the bits haven't changed, this might be the first timeobject
has been added to the filter. Note thatput(t)
always returns the opposite result to whatmightContain(t)
would have returned at the time it is called.
-
putString
A specialized variant ofput(Object)
that only supportsString
items. -
putLong
public abstract boolean putLong(long item) A specialized variant ofput(Object)
that only supportslong
items. -
putBinary
public abstract boolean putBinary(byte[] item) A specialized variant ofput(Object)
that only supports byte array items. -
isCompatible
Determines whether a given bloom filter is compatible with this bloom filter. For two bloom filters to be compatible, they must have the same bit size.- Parameters:
other
- The bloom filter to check for compatibility.
-
mergeInPlace
Combines this bloom filter with another bloom filter by performing a bitwise OR of the underlying data. The mutations happen to this instance. Callers must ensure the bloom filters are appropriately sized to avoid saturating them.- Parameters:
other
- The bloom filter to combine this bloom filter with. It is not mutated.- Throws:
IncompatibleMergeException
- ifisCompatible(other) == false
-
intersectInPlace
Combines this bloom filter with another bloom filter by performing a bitwise AND of the underlying data. The mutations happen to this instance. Callers must ensure the bloom filters are appropriately sized to avoid saturating them.- Parameters:
other
- The bloom filter to combine this bloom filter with. It is not mutated.- Throws:
IncompatibleMergeException
- ifisCompatible(other) == false
-
mightContain
Returnstrue
if the element might have been put in this Bloom filter,false
if this is definitely not the case. -
mightContainString
A specialized variant ofmightContain(Object)
that only testsString
items. -
mightContainLong
public abstract boolean mightContainLong(long item) A specialized variant ofmightContain(Object)
that only testslong
items. -
mightContainBinary
public abstract boolean mightContainBinary(byte[] item) A specialized variant ofmightContain(Object)
that only tests byte array items. -
writeTo
Writes out thisBloomFilter
to an output stream in binary format. It is the caller's responsibility to close the stream.- Throws:
IOException
-
cardinality
public long cardinality()- Returns:
- the number of set bits in this
BloomFilter
.
-
readFrom
Reads in aBloomFilter
from an input stream. It is the caller's responsibility to close the stream.- Throws:
IOException
-
optimalNumOfBits
public static long optimalNumOfBits(long n, double p) Computes m (total bits of Bloom filter) which is expected to achieve, for the specified expected insertions, the required false positive probability. See http://en.wikipedia.org/wiki/Bloom_filter#Probability_of_false_positives for the formula.- Parameters:
n
- expected insertions (must be positive)p
- false positive rate (must be 0 < p < 1)
-
optimalNumOfBits
public static long optimalNumOfBits(long expectedNumItems, long maxNumItems, long maxNumOfBits) Computes m (total bits of Bloom filter) which is expected to achieve. The smaller the expectedNumItems, the smaller the fpp. -
create
Creates aBloomFilter
with the expected number of insertions and a default expected false positive probability of 3%. Note that overflowing aBloomFilter
with significantly more elements than specified, will result in its saturation, and a sharp deterioration of its false positive probability. -
create
Creates aBloomFilter
with the expected number of insertions and expected false positive probability. Note that overflowing aBloomFilter
with significantly more elements than specified, will result in its saturation, and a sharp deterioration of its false positive probability. -
create
Creates aBloomFilter
with givenexpectedNumItems
andnumBits
, it will pick an optimalnumHashFunctions
which can minimizefpp
for the bloom filter.
-