Class BloomFilter

Object
org.apache.spark.util.sketch.BloomFilter

public abstract class BloomFilter extends Object
A Bloom filter is a space-efficient probabilistic data structure that offers an approximate containment test with one-sided error: if it claims that an item is contained in it, this might be in error, but if it claims that an item is not contained in it, then this is definitely true. Currently supported data types include: The false positive probability (FPP) of a Bloom filter is defined as the probability that mightContain(Object) will erroneously return true for an object that has not actually been put in the BloomFilter. The implementation is largely based on the BloomFilter class from Guava.
  • Constructor Details

    • BloomFilter

      public BloomFilter()
  • Method Details

    • expectedFpp

      public abstract double expectedFpp()
      Returns the probability that mightContain(Object) erroneously return true for an object that has not actually been put in the BloomFilter. Ideally, this number should be close to the fpp parameter passed in create(long, double), or smaller. If it is significantly higher, it is usually the case that too many items (more than expected) have been put in the BloomFilter, degenerating it.
    • bitSize

      public abstract long bitSize()
      Returns the number of bits in the underlying bit array.
    • put

      public abstract boolean put(Object item)
      Puts an item into this BloomFilter. Ensures that subsequent invocations of mightContain(Object) with the same item will always return true.
      Returns:
      true if the bloom filter's bits changed as a result of this operation. If the bits changed, this is definitely the first time object has been added to the filter. If the bits haven't changed, this might be the first time object has been added to the filter. Note that put(t) always returns the opposite result to what mightContain(t) would have returned at the time it is called.
    • putString

      public abstract boolean putString(String item)
      A specialized variant of put(Object) that only supports String items.
    • putLong

      public abstract boolean putLong(long item)
      A specialized variant of put(Object) that only supports long items.
    • putBinary

      public abstract boolean putBinary(byte[] item)
      A specialized variant of put(Object) that only supports byte array items.
    • isCompatible

      public abstract boolean isCompatible(BloomFilter other)
      Determines whether a given bloom filter is compatible with this bloom filter. For two bloom filters to be compatible, they must have the same bit size.
      Parameters:
      other - The bloom filter to check for compatibility.
    • mergeInPlace

      public abstract BloomFilter mergeInPlace(BloomFilter other) throws IncompatibleMergeException
      Combines this bloom filter with another bloom filter by performing a bitwise OR of the underlying data. The mutations happen to this instance. Callers must ensure the bloom filters are appropriately sized to avoid saturating them.
      Parameters:
      other - The bloom filter to combine this bloom filter with. It is not mutated.
      Throws:
      IncompatibleMergeException - if isCompatible(other) == false
    • intersectInPlace

      public abstract BloomFilter intersectInPlace(BloomFilter other) throws IncompatibleMergeException
      Combines this bloom filter with another bloom filter by performing a bitwise AND of the underlying data. The mutations happen to this instance. Callers must ensure the bloom filters are appropriately sized to avoid saturating them.
      Parameters:
      other - The bloom filter to combine this bloom filter with. It is not mutated.
      Throws:
      IncompatibleMergeException - if isCompatible(other) == false
    • mightContain

      public abstract boolean mightContain(Object item)
      Returns true if the element might have been put in this Bloom filter, false if this is definitely not the case.
    • mightContainString

      public abstract boolean mightContainString(String item)
      A specialized variant of mightContain(Object) that only tests String items.
    • mightContainLong

      public abstract boolean mightContainLong(long item)
      A specialized variant of mightContain(Object) that only tests long items.
    • mightContainBinary

      public abstract boolean mightContainBinary(byte[] item)
      A specialized variant of mightContain(Object) that only tests byte array items.
    • writeTo

      public abstract void writeTo(OutputStream out) throws IOException
      Writes out this BloomFilter to an output stream in binary format. It is the caller's responsibility to close the stream.
      Throws:
      IOException
    • cardinality

      public long cardinality()
      Returns:
      the number of set bits in this BloomFilter.
    • readFrom

      public static BloomFilter readFrom(InputStream in) throws IOException
      Reads in a BloomFilter from an input stream. It is the caller's responsibility to close the stream.
      Throws:
      IOException
    • readFrom

      public static BloomFilter readFrom(byte[] bytes) throws IOException
      Reads in a BloomFilter from a byte array.
      Throws:
      IOException
    • optimalNumOfBits

      public static long optimalNumOfBits(long n, double p)
      Computes m (total bits of Bloom filter) which is expected to achieve, for the specified expected insertions, the required false positive probability. See http://en.wikipedia.org/wiki/Bloom_filter#Probability_of_false_positives for the formula.
      Parameters:
      n - expected insertions (must be positive)
      p - false positive rate (must be 0 < p < 1)
    • optimalNumOfBits

      public static long optimalNumOfBits(long expectedNumItems, long maxNumItems, long maxNumOfBits)
      Computes m (total bits of Bloom filter) which is expected to achieve. The smaller the expectedNumItems, the smaller the fpp.
    • create

      public static BloomFilter create(long expectedNumItems)
      Creates a BloomFilter with the expected number of insertions and a default expected false positive probability of 3%. Note that overflowing a BloomFilter with significantly more elements than specified, will result in its saturation, and a sharp deterioration of its false positive probability.
    • create

      public static BloomFilter create(long expectedNumItems, double fpp)
      Creates a BloomFilter with the expected number of insertions and expected false positive probability. Note that overflowing a BloomFilter with significantly more elements than specified, will result in its saturation, and a sharp deterioration of its false positive probability.
    • create

      public static BloomFilter create(long expectedNumItems, long numBits)
      Creates a BloomFilter with given expectedNumItems and numBits, it will pick an optimal numHashFunctions which can minimize fpp for the bloom filter.