org.apache.spark.sql.util

NumericHistogram

class NumericHistogram extends AnyRef

A generic, reusable histogram class that supports partial aggregations. The algorithm is a heuristic adapted from the following paper: Yael Ben-Haim and Elad Tom-Tov, "A streaming parallel decision tree algorithm", J. Machine Learning Research 11 (2010), pp. 849-872. Although it offers no approximation guarantees, it appears to work well given adequate data and a large number of histogram bins (e.g., 20-80).
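The streaming update from the cited paper can be sketched in a few lines. The class below is a hypothetical, simplified stand-in written only for illustration (`StreamingHistogramSketch`, `Bin`, `maxBins`, `usedBins`, and `bin` are assumed names, not part of Spark or Hive). It keeps bins sorted by centre, increments the count on an exact match, and otherwise inserts a new bin and merges the two closest neighbours once the bin budget is exceeded. Tie-breaking here simply takes the first smallest gap, which may differ from the real class.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; this is not Spark's NumericHistogram. */
class StreamingHistogramSketch {
    /** One bin: x is the bin centre, y is the number of points it represents. */
    static final class Bin {
        double x;
        double y;
        Bin(double x, double y) { this.x = x; this.y = y; }
    }

    private final int maxBins;
    private final List<Bin> bins = new ArrayList<>(); // always sorted by x

    StreamingHistogramSketch(int maxBins) {
        if (maxBins < 1) throw new IllegalArgumentException("maxBins must be >= 1");
        this.maxBins = maxBins;
    }

    /** Adds one point: bump an exact match, otherwise insert a bin and trim. */
    void add(double v) {
        int i = 0;
        while (i < bins.size() && bins.get(i).x < v) i++;
        if (i < bins.size() && bins.get(i).x == v) {
            bins.get(i).y += 1;
            return;
        }
        bins.add(i, new Bin(v, 1));
        if (bins.size() > maxBins) mergeClosestPair();
    }

    /** Replaces the two adjacent bins with the smallest gap by their weighted mean. */
    private void mergeClosestPair() {
        int best = 0;
        double bestGap = Double.POSITIVE_INFINITY;
        for (int i = 0; i + 1 < bins.size(); i++) {
            double gap = bins.get(i + 1).x - bins.get(i).x;
            if (gap < bestGap) { bestGap = gap; best = i; }
        }
        Bin a = bins.get(best);
        Bin b = bins.remove(best + 1);
        double total = a.y + b.y;
        a.x = (a.x * a.y + b.x * b.y) / total; // count-weighted centre
        a.y = total;
    }

    int usedBins() { return bins.size(); }

    Bin bin(int i) { return bins.get(i); }

    public static void main(String[] args) {
        StreamingHistogramSketch h = new StreamingHistogramSketch(2);
        for (double v : new double[] {1.0, 2.0, 10.0, 10.0}) h.add(v);
        for (int i = 0; i < h.usedBins(); i++) {
            System.out.println(h.bin(i).x + " -> " + h.bin(i).y);
        }
    }
}
```

With a budget of two bins, adding 1, 2, and 10 forces the points 1 and 2 into a single bin centred at 1.5 with count 2; adding 10 again only increments the existing bin at 10.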

Adapted from Hive's NumericHistogram; see https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumericHistogram.java

Differences:

  1. Coord and its fields are declared public for easy access from the HistogramNumeric class.
  2. The method getNumBins() was added so that NumericHistogramSerializer can serialize a NumericHistogram.
  3. The method addBin() was added so that NumericHistogramSerializer can deserialize a NumericHistogram.
  4. In Hive's code, merge() is passed a serialized histogram; in Spark, it is passed a deserialized histogram, so the bin-merging code was changed accordingly.
Source
NumericHistogram.java
Since

3.3.0

Linear Supertypes
AnyRef, Any

Instance Constructors

  1. new NumericHistogram()

Value Members

  1. def add(v: Double): Unit

    Adds a new data point to the histogram approximation. Make sure you have called either allocate() or merge() first. This method implements Algorithm #1 from Ben-Haim and Tom-Tov, "A Streaming Parallel Decision Tree Algorithm", JMLR 2010.

    v

    The data point to add to the histogram approximation.

  2. def addBin(x: Double, y: Double, b: Int): Unit

    Sets the histogram bin at index b to the given x and y values.

  3. def allocate(num_bins: Int): Unit

    Sets the number of histogram bins to use for approximating data.

    num_bins

    Number of non-uniform-width histogram bins to use

  4. def getBin(b: Int): Coord

    Returns a particular histogram bin.

  5. def getNumBins(): Int

    Returns the number of bins.

  6. def getUsedBins(): Int

    Returns the number of bins currently being used by the histogram.

  7. def isReady(): Boolean

    Returns true if this histogram object has been initialized by calling merge() or allocate().

  8. def merge(other: NumericHistogram): Unit

    Takes a histogram and merges it with the current histogram object.

  9. def reset(): Unit

    Resets a histogram object to its initial state. allocate() or merge() must be called again before use.

  10. def setUsedBins(nusedBins: Int): Unit

    Sets the number of bins currently being used by the histogram.
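To illustrate what merging two histograms involves, here is a minimal sketch under stated assumptions; it is not the code behind merge(). It follows the general approach of the cited paper: pool the bins of both histograms, sort them by centre, and repeatedly merge the closest adjacent pair until the bin budget is met. The class name `HistogramMergeSketch`, the `{x, y}` array layout for a bin, and the `maxBins` parameter are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical illustration only; this is not Spark's NumericHistogram.merge(). */
class HistogramMergeSketch {
    /** Each bin is a two-element array {x, y}: x = centre, y = count. */
    static List<double[]> merge(List<double[]> left, List<double[]> right, int maxBins) {
        // Pool copies of all bins so the inputs are left untouched.
        List<double[]> bins = new ArrayList<>();
        for (double[] b : left) bins.add(b.clone());
        for (double[] b : right) bins.add(b.clone());
        bins.sort(Comparator.comparingDouble((double[] b) -> b[0]));

        // Merge the closest adjacent pair until the budget is met.
        while (bins.size() > maxBins && bins.size() > 1) {
            int best = 0;
            double bestGap = Double.POSITIVE_INFINITY;
            for (int i = 0; i + 1 < bins.size(); i++) {
                double gap = bins.get(i + 1)[0] - bins.get(i)[0];
                if (gap < bestGap) { bestGap = gap; best = i; }
            }
            double[] a = bins.get(best);
            double[] b = bins.remove(best + 1);
            double total = a[1] + b[1];
            a[0] = (a[0] * a[1] + b[0] * b[1]) / total; // count-weighted centre
            a[1] = total;
        }
        return bins;
    }

    public static void main(String[] args) {
        List<double[]> left = Arrays.asList(new double[] {1, 1}, new double[] {5, 2});
        List<double[]> right = Arrays.asList(new double[] {2, 3}, new double[] {20, 1});
        for (double[] bin : merge(left, right, 2)) {
            System.out.println(bin[0] + " -> " + bin[1]);
        }
    }
}
```

In this example the four pooled bins collapse to two: one centred near 2.83 holding 6 points and one at 20 holding 1 point. The total count of 7 is preserved, which is the property that makes partial aggregation across workers possible.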