Class BloomFilter

Object
org.apache.spark.util.sketch.BloomFilter

public abstract class BloomFilter extends Object
A Bloom filter is a space-efficient probabilistic data structure that offers an approximate containment test with one-sided error: if it claims that an item is contained in it, this might be in error, but if it claims that an item is not contained in it, then this is definitely true. Currently supported data types include: The false positive probability (FPP) of a Bloom filter is defined as the probability that mightContain(Object) will erroneously return true for an object that has not actually been put in the BloomFilter. The implementation is largely based on the BloomFilter class from Guava.
  • Nested Class Summary

    Nested Classes
    Modifier and Type
    Class
    Description
    static enum 
     
  • Constructor Summary

    Constructors
    Constructor
    Description
     
  • Method Summary

    Modifier and Type
    Method
    Description
    abstract long
    Returns the number of bits in the underlying bit array.
    long
     
    create(long expectedNumItems)
    Creates a BloomFilter with the expected number of insertions and a default expected false positive probability of 3%.
    create(long expectedNumItems, double fpp)
    Creates a BloomFilter with the expected number of insertions and expected false positive probability.
    create(long expectedNumItems, long numBits)
    Creates a BloomFilter with given expectedNumItems and numBits, it will pick an optimal numHashFunctions which can minimize fpp for the bloom filter.
    abstract double
    Returns the probability that mightContain(Object) erroneously return true for an object that has not actually been put in the BloomFilter.
    abstract BloomFilter
    Combines this bloom filter with another bloom filter by performing a bitwise AND of the underlying data.
    abstract boolean
    Determines whether a given bloom filter is compatible with this bloom filter.
    abstract BloomFilter
    Combines this bloom filter with another bloom filter by performing a bitwise OR of the underlying data.
    abstract boolean
    Returns true if the element might have been put in this Bloom filter, false if this is definitely not the case.
    abstract boolean
    mightContainBinary(byte[] item)
    A specialized variant of mightContain(Object) that only tests byte array items.
    abstract boolean
    mightContainLong(long item)
    A specialized variant of mightContain(Object) that only tests long items.
    abstract boolean
    A specialized variant of mightContain(Object) that only tests String items.
    static long
    optimalNumOfBits(long n, double p)
    Computes m (total bits of Bloom filter) which is expected to achieve, for the specified expected insertions, the required false positive probability.
    static long
    optimalNumOfBits(long expectedNumItems, long maxNumItems, long maxNumOfBits)
    Computes m (total bits of Bloom filter) which is expected to achieve.
    abstract boolean
    put(Object item)
    Puts an item into this BloomFilter.
    abstract boolean
    putBinary(byte[] item)
    A specialized variant of put(Object) that only supports byte array items.
    abstract boolean
    putLong(long item)
    A specialized variant of put(Object) that only supports long items.
    abstract boolean
    A specialized variant of put(Object) that only supports String items.
    Reads in a BloomFilter from an input stream.
    abstract void
    Writes out this BloomFilter to an output stream in binary format.

    Methods inherited from class java.lang.Object

    equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Constructor Details

    • BloomFilter

      public BloomFilter()
  • Method Details

    • expectedFpp

      public abstract double expectedFpp()
      Returns the probability that mightContain(Object) erroneously return true for an object that has not actually been put in the BloomFilter. Ideally, this number should be close to the fpp parameter passed in create(long, double), or smaller. If it is significantly higher, it is usually the case that too many items (more than expected) have been put in the BloomFilter, degenerating it.
    • bitSize

      public abstract long bitSize()
      Returns the number of bits in the underlying bit array.
    • put

      public abstract boolean put(Object item)
      Puts an item into this BloomFilter. Ensures that subsequent invocations of mightContain(Object) with the same item will always return true.
      Returns:
      true if the bloom filter's bits changed as a result of this operation. If the bits changed, this is definitely the first time object has been added to the filter. If the bits haven't changed, this might be the first time object has been added to the filter. Note that put(t) always returns the opposite result to what mightContain(t) would have returned at the time it is called.
    • putString

      public abstract boolean putString(String item)
      A specialized variant of put(Object) that only supports String items.
    • putLong

      public abstract boolean putLong(long item)
      A specialized variant of put(Object) that only supports long items.
    • putBinary

      public abstract boolean putBinary(byte[] item)
      A specialized variant of put(Object) that only supports byte array items.
    • isCompatible

      public abstract boolean isCompatible(BloomFilter other)
      Determines whether a given bloom filter is compatible with this bloom filter. For two bloom filters to be compatible, they must have the same bit size.
      Parameters:
      other - The bloom filter to check for compatibility.
    • mergeInPlace

      public abstract BloomFilter mergeInPlace(BloomFilter other) throws IncompatibleMergeException
      Combines this bloom filter with another bloom filter by performing a bitwise OR of the underlying data. The mutations happen to this instance. Callers must ensure the bloom filters are appropriately sized to avoid saturating them.
      Parameters:
      other - The bloom filter to combine this bloom filter with. It is not mutated.
      Throws:
      IncompatibleMergeException - if isCompatible(other) == false
    • intersectInPlace

      public abstract BloomFilter intersectInPlace(BloomFilter other) throws IncompatibleMergeException
      Combines this bloom filter with another bloom filter by performing a bitwise AND of the underlying data. The mutations happen to this instance. Callers must ensure the bloom filters are appropriately sized to avoid saturating them.
      Parameters:
      other - The bloom filter to combine this bloom filter with. It is not mutated.
      Throws:
      IncompatibleMergeException - if isCompatible(other) == false
    • mightContain

      public abstract boolean mightContain(Object item)
      Returns true if the element might have been put in this Bloom filter, false if this is definitely not the case.
    • mightContainString

      public abstract boolean mightContainString(String item)
      A specialized variant of mightContain(Object) that only tests String items.
    • mightContainLong

      public abstract boolean mightContainLong(long item)
      A specialized variant of mightContain(Object) that only tests long items.
    • mightContainBinary

      public abstract boolean mightContainBinary(byte[] item)
      A specialized variant of mightContain(Object) that only tests byte array items.
    • writeTo

      public abstract void writeTo(OutputStream out) throws IOException
      Writes out this BloomFilter to an output stream in binary format. It is the caller's responsibility to close the stream.
      Throws:
      IOException
    • cardinality

      public long cardinality()
      Returns:
      the number of set bits in this BloomFilter.
    • readFrom

      public static BloomFilter readFrom(InputStream in) throws IOException
      Reads in a BloomFilter from an input stream. It is the caller's responsibility to close the stream.
      Throws:
      IOException
    • optimalNumOfBits

      public static long optimalNumOfBits(long n, double p)
      Computes m (total bits of Bloom filter) which is expected to achieve, for the specified expected insertions, the required false positive probability. See http://en.wikipedia.org/wiki/Bloom_filter#Probability_of_false_positives for the formula.
      Parameters:
      n - expected insertions (must be positive)
      p - false positive rate (must be 0 < p < 1)
    • optimalNumOfBits

      public static long optimalNumOfBits(long expectedNumItems, long maxNumItems, long maxNumOfBits)
      Computes m (total bits of Bloom filter) which is expected to achieve. The smaller the expectedNumItems, the smaller the fpp.
    • create

      public static BloomFilter create(long expectedNumItems)
      Creates a BloomFilter with the expected number of insertions and a default expected false positive probability of 3%. Note that overflowing a BloomFilter with significantly more elements than specified, will result in its saturation, and a sharp deterioration of its false positive probability.
    • create

      public static BloomFilter create(long expectedNumItems, double fpp)
      Creates a BloomFilter with the expected number of insertions and expected false positive probability. Note that overflowing a BloomFilter with significantly more elements than specified, will result in its saturation, and a sharp deterioration of its false positive probability.
    • create

      public static BloomFilter create(long expectedNumItems, long numBits)
      Creates a BloomFilter with given expectedNumItems and numBits, it will pick an optimal numHashFunctions which can minimize fpp for the bloom filter.