Packages

c

org.apache.spark.util.sketch

CountMinSketch

abstract class CountMinSketch extends AnyRef

A Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space. Currently, supported data types include:

  • Byte
  • Short
  • Integer
  • Long
  • String

A CountMinSketch is initialized with a random seed, and a pair of parameters:

  • relative error (or eps), and
  • confidence (or delta)

Suppose you want to estimate the number of times an element x has appeared in a data stream so far. With probability delta, the estimate of this frequency is within the range true frequency <= estimate <= true frequency + eps * N, where N is the total count of items have appeared the data stream so far.

Under the cover, a CountMinSketch is essentially a two-dimensional long array with depth d and width w, where

  • d = ceil(2 / eps)
  • w = ceil(-log(1 - confidence) / log(2))

This implementation is largely based on the CountMinSketch class from stream-lib.

Source
CountMinSketch.java
Linear Supertypes
AnyRef, Any
Ordering
  1. Alphabetic
  2. By Inheritance
Inherited
  1. CountMinSketch
  2. AnyRef
  3. Any
  1. Hide All
  2. Show All
Visibility
  1. Public
  2. Protected

Instance Constructors

  1. new CountMinSketch()

Abstract Value Members

  1. abstract def add(item: AnyRef, count: Long): Unit

    Increments item's count by count.

  2. abstract def add(item: AnyRef): Unit

    Increments item's count by one.

  3. abstract def addBinary(item: Array[Byte], count: Long): Unit

    Increments item's count by count.

  4. abstract def addBinary(item: Array[Byte]): Unit

    Increments item's count by one.

  5. abstract def addLong(item: Long, count: Long): Unit

    Increments item's count by count.

  6. abstract def addLong(item: Long): Unit

    Increments item's count by one.

  7. abstract def addString(item: String, count: Long): Unit

    Increments item's count by count.

  8. abstract def addString(item: String): Unit

    Increments item's count by one.

  9. abstract def confidence(): Double

    Returns the confidence (or delta) of this CountMinSketch.

  10. abstract def depth(): Int

    Depth of this CountMinSketch.

  11. abstract def estimateCount(item: AnyRef): Long

    Returns the estimated frequency of item.

  12. abstract def mergeInPlace(other: CountMinSketch): CountMinSketch

    Merges another CountMinSketch with this one in place.

    Merges another CountMinSketch with this one in place.

    Note that only Count-Min sketches with the same depth, width, and random seed can be merged.

  13. abstract def relativeError(): Double

    Returns the relative error (or eps) of this CountMinSketch.

  14. abstract def toByteArray(): Array[Byte]

    Serializes this CountMinSketch and returns the serialized form.

  15. abstract def totalCount(): Long

    Total count of items added to this CountMinSketch so far.

  16. abstract def width(): Int

    Width of this CountMinSketch.

  17. abstract def writeTo(out: OutputStream): Unit

    Writes out this CountMinSketch to an output stream in binary format.

    Writes out this CountMinSketch to an output stream in binary format. It is the caller's responsibility to close the stream.