Class NumericHistogram

Object
org.apache.spark.sql.util.NumericHistogram

public class NumericHistogram extends Object
A generic, reusable histogram class that supports partial aggregations. The algorithm is a heuristic adapted from the following paper: Yael Ben-Haim and Elad Tom-Tov, "A streaming parallel decision tree algorithm", J. Machine Learning Research 11 (2010), pp. 849--872. Although there are no approximation guarantees, it appears to work well with adequate data and a large (e.g., 20-80) number of histogram bins.

Adapted from Hive's NumericHistogram; see https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumericHistogram.java

Differences from the Hive version:
1. [[Coord]] and its fields are declared public for easy access from the HistogramNumeric class.
2. Adds the method [[getNumBins()]] so that [[NumericHistogramSerializer]] can serialize a [[NumericHistogram]].
3. Adds the method [[addBin()]] so that [[NumericHistogramSerializer]] can deserialize a [[NumericHistogram]].
4. In Hive, [[merge()]] takes a serialized histogram; in Spark, it takes a deserialized histogram, so the bin-merging code is changed accordingly.
Since:
3.3.0
  • Nested Class Summary

    Nested Classes
    Modifier and Type
    Class
    Description
    static class 
    NumericHistogram.Coord
    The Coord class defines a histogram bin, which is just an (x,y) pair.
  • Constructor Summary

    Constructors
    Constructor
    Description
    NumericHistogram()
    Creates a new histogram object.
  • Method Summary

    Modifier and Type
    Method
    Description
    void
    add(double v)
    Adds a new data point to the histogram approximation.
    void
    addBin(double x, double y, int b)
    Sets the histogram bin at the given index.
    void
    allocate(int num_bins)
    Sets the number of histogram bins to use for approximating data.
    NumericHistogram.Coord
    getBin(int b)
    Returns a particular histogram bin.
    int
    getNumBins()
    Returns the number of bins.
    int
    getUsedBins()
    Returns the number of bins currently being used by the histogram.
    boolean
    isReady()
    Returns true if this histogram object has been initialized by calling merge() or allocate().
    void
    merge(NumericHistogram other)
    Takes a histogram and merges it with the current histogram object.
    void
    reset()
    Resets a histogram object to its initial state.
    void
    setUsedBins(int nusedBins)
    Sets the number of bins currently being used by the histogram.

    Methods inherited from class java.lang.Object

    equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Constructor Details

    • NumericHistogram

      public NumericHistogram()
      Creates a new histogram object. Note that the allocate() or merge() method must be called before the histogram can be used.
  • Method Details

    • reset

      public void reset()
      Resets a histogram object to its initial state. allocate() or merge() must be called again before use.
    • getNumBins

      public int getNumBins()
      Returns the number of bins.
    • getUsedBins

      public int getUsedBins()
      Returns the number of bins currently being used by the histogram.
    • setUsedBins

      public void setUsedBins(int nusedBins)
      Sets the number of bins currently being used by the histogram.
    • isReady

      public boolean isReady()
      Returns true if this histogram object has been initialized by calling merge() or allocate().
    • getBin

      public NumericHistogram.Coord getBin(int b)
      Returns a particular histogram bin.
    • addBin

      public void addBin(double x, double y, int b)
      Sets the histogram bin at the given index.
    • allocate

      public void allocate(int num_bins)
      Sets the number of histogram bins to use for approximating data.
      Parameters:
      num_bins - Number of non-uniform-width histogram bins to use
    • merge

      public void merge(NumericHistogram other)
      Takes a histogram and merges it with the current histogram object.
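      As a rough illustration of what merging two bin-based histograms can look like, the stand-alone sketch below concatenates two bin lists, sorts them by x, and collapses the closest adjacent pair until at most maxBins remain. This is not Spark's implementation: the class HistogramMergeSketch, the double[]{x, y} bin representation, and the rule of picking the first smallest gap are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-alone sketch; NOT Spark's NumericHistogram.merge().
class HistogramMergeSketch {
    // Each bin is a double[]{x, y}: x is the centroid, y is the point count.
    static List<double[]> merge(List<double[]> left, List<double[]> right, int maxBins) {
        if (maxBins < 1) {
            throw new IllegalArgumentException("maxBins must be >= 1");
        }
        // Copy every bin so the inputs are left untouched.
        List<double[]> out = new ArrayList<>();
        for (double[] b : left) out.add(b.clone());
        for (double[] b : right) out.add(b.clone());
        out.sort(Comparator.comparingDouble((double[] b) -> b[0]));

        // Collapse the closest adjacent pair until the bin budget is met.
        while (out.size() > maxBins) {
            int best = 0;
            double bestGap = Double.POSITIVE_INFINITY;
            for (int i = 0; i + 1 < out.size(); i++) {
                double gap = out.get(i + 1)[0] - out.get(i)[0];
                if (gap < bestGap) {
                    bestGap = gap;
                    best = i;
                }
            }
            double[] a = out.get(best);
            double[] b = out.remove(best + 1);
            double total = a[1] + b[1];
            a[0] = (a[0] * a[1] + b[0] * b[1]) / total; // count-weighted centroid
            a[1] = total;
        }
        return out;
    }
}
```

      For example, merging the bins (1,1), (5,2) with (2,3), (9,1) under a budget of three bins collapses (1,1) and (2,3) into (1.75,4), leaving (1.75,4), (5,2), (9,1).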
    • add

      public void add(double v)
      Adds a new data point to the histogram approximation. Make sure you have called either allocate() or merge() first. This method implements Algorithm #1 from Ben-Haim and Tom-Tov, "A Streaming Parallel Decision Tree Algorithm", JMLR 2010.
      Parameters:
      v - The data point to add to the histogram approximation.
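      The update step from the cited paper can be sketched in plain Java as follows: insert the point as a new bin of count 1 (or bump the count of a bin with the same x), then, if the bin budget is exceeded, replace the two closest bins with their count-weighted average. This is a minimal sketch under stated assumptions, not the code in this class: StreamingHistogramSketch, its Bin type, the linear scan, and the first-smallest-gap tie-break are all hypothetical choices for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; NOT Spark's NumericHistogram.
class StreamingHistogramSketch {
    // One bin: x is the centroid, y is the number of points it represents.
    static final class Bin {
        double x;
        double y;
        Bin(double x, double y) { this.x = x; this.y = y; }
    }

    private final int maxBins;
    private final List<Bin> bins = new ArrayList<>(); // kept sorted by x

    StreamingHistogramSketch(int maxBins) {
        if (maxBins < 1) {
            throw new IllegalArgumentException("maxBins must be >= 1");
        }
        this.maxBins = maxBins;
    }

    void add(double v) {
        // Linear scan for the first bin whose centroid is >= v.
        int i = 0;
        while (i < bins.size() && bins.get(i).x < v) i++;
        if (i < bins.size() && bins.get(i).x == v) {
            bins.get(i).y += 1; // exact match: just count it
            return;
        }
        bins.add(i, new Bin(v, 1));
        if (bins.size() > maxBins) mergeClosestPair();
    }

    // Replaces the two adjacent bins with the smallest gap by one bin
    // located at their count-weighted average.
    private void mergeClosestPair() {
        int best = 0;
        double bestGap = Double.POSITIVE_INFINITY;
        for (int i = 0; i + 1 < bins.size(); i++) {
            double gap = bins.get(i + 1).x - bins.get(i).x;
            if (gap < bestGap) {
                bestGap = gap;
                best = i;
            }
        }
        Bin a = bins.get(best);
        Bin b = bins.remove(best + 1);
        double total = a.y + b.y;
        a.x = (a.x * a.y + b.x * b.y) / total;
        a.y = total;
    }

    List<Bin> bins() { return bins; }
}
```

      With a budget of three bins, adding 1, 2, 3, and 10 in that order leaves the bins (1.5,2), (3,1), (10,1): the fourth insert overflows the budget, and 1 and 2 form the first closest pair.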