Package org.apache.spark.sql.util
Class NumericHistogram
Object
org.apache.spark.sql.util.NumericHistogram
A generic, re-usable histogram class that supports partial aggregations.
 The algorithm is a heuristic adapted from the following paper:
 Yael Ben-Haim and Elad Tom-Tov, "A streaming parallel decision tree algorithm",
 J. Machine Learning Research 11 (2010), pp. 849--872. Although there are no approximation
 guarantees, it appears to work well with adequate data and a large (e.g., 20-80) number
 of histogram bins.
 Adapted from Hive's NumericHistogram; see
 https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumericHistogram.java
 Differences:
   1. Declares [[Coord]] and its variables as public so that they are
      easy to access from the HistogramNumeric class.
   2. Adds the method [[getNumBins()]], used to serialize a [[NumericHistogram]]
      in [[NumericHistogramSerializer]].
   3. Adds the method [[addBin()]], used to deserialize a [[NumericHistogram]]
      in [[NumericHistogramSerializer]].
   4. In Hive's code, the method [[merge()]] is passed a serialized histogram;
      in Spark, it is passed a deserialized histogram, so the code that
      merges bins is changed accordingly.
Since: 3.3.0
- 
Nested Class Summary
Nested Classes
Modifier and Type | Class | Description
static class | Coord | The Coord class defines a histogram bin, which is just an (x,y) pair.
- 
Constructor Summary
Constructors
Constructor | Description
NumericHistogram() | Creates a new histogram object.
- 
Method Summary
Modifier and Type | Method | Description
void | add(double v) | Adds a new data point to the histogram approximation.
void | addBin(double x, double y, int b) | Sets the histogram bin at the given index.
void | allocate(int num_bins) | Sets the number of histogram bins to use for approximating data.
 | getBin(int b) | Returns a particular histogram bin.
int | getNumBins() | Returns the number of bins.
int | getUsedBins() | Returns the number of bins currently being used by the histogram.
boolean | isReady() | Returns true if this histogram object has been initialized by calling merge() or allocate().
void | merge(NumericHistogram other) | Takes a histogram and merges it with the current histogram object.
void | reset() | Resets a histogram object to its initial state.
void | setUsedBins(int nusedBins) | Sets the number of bins currently being used by the histogram.
- 
Constructor Details
NumericHistogram
public NumericHistogram()
Creates a new histogram object. Note that the allocate() or merge() method must be called before the histogram can be used.
 
- 
- 
Method Details
reset
public void reset()
Resets a histogram object to its initial state. allocate() or merge() must be called again before use.
- 
getNumBins
public int getNumBins()
Returns the number of bins.
- 
getUsedBins
public int getUsedBins()
Returns the number of bins currently being used by the histogram.
- 
setUsedBins
public void setUsedBins(int nusedBins)
Sets the number of bins currently being used by the histogram.
- 
isReady
public boolean isReady()
Returns true if this histogram object has been initialized by calling merge() or allocate().
- 
getBin
Returns a particular histogram bin.
- 
addBin
public void addBin(double x, double y, int b)
Sets the histogram bin at the given index.
- 
allocate
public void allocate(int num_bins)
Sets the number of histogram bins to use for approximating data.
Parameters:
num_bins - Number of non-uniform-width histogram bins to use
 
- 
merge
Takes a histogram and merges it with the current histogram object.
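As a rough illustration of what merging two histograms can involve, the stand-alone sketch below pools the bins of both inputs, sorts them by position, and repeatedly collapses the two closest bins into their count-weighted average until at most maxBins remain. It follows the general merge idea from the Ben-Haim and Tom-Tov paper and is not Spark's implementation: the class HistogramMergeSketch, the method mergeBins, and the double[]{x, y} bin layout are hypothetical names chosen for this example, and all counts are assumed to be positive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone sketch; not the Spark NumericHistogram class.
final class HistogramMergeSketch {

    // Each bin is a double[]{x, y}: x is the bin position, y is its count.
    static List<double[]> mergeBins(List<double[]> a, List<double[]> b, int maxBins) {
        if (maxBins < 1) {
            throw new IllegalArgumentException("maxBins must be >= 1");
        }
        // Pool copies of all bins so the inputs are left untouched.
        List<double[]> all = new ArrayList<>();
        for (double[] bin : a) all.add(bin.clone());
        for (double[] bin : b) all.add(bin.clone());
        all.sort((double[] p, double[] q) -> Double.compare(p[0], q[0]));

        // Collapse the closest adjacent pair until the bin budget is met.
        while (all.size() > maxBins) {
            int best = 0;
            double bestGap = Double.POSITIVE_INFINITY;
            for (int i = 0; i + 1 < all.size(); i++) {
                double gap = all.get(i + 1)[0] - all.get(i)[0];
                if (gap < bestGap) {
                    bestGap = gap;
                    best = i;
                }
            }
            double[] left = all.get(best);
            double[] right = all.remove(best + 1);
            double total = left[1] + right[1];
            // The merged bin sits at the count-weighted mean of the pair.
            left[0] = (left[0] * left[1] + right[0] * right[1]) / total;
            left[1] = total;
        }
        return all;
    }

    public static void main(String[] args) {
        List<double[]> a = Arrays.asList(new double[]{1, 1}, new double[]{2, 1});
        List<double[]> b = Arrays.asList(new double[]{1.5, 2}, new double[]{10, 1});
        for (double[] bin : mergeBins(a, b, 2)) {
            System.out.println("x=" + bin[0] + " y=" + bin[1]);
        }
    }
}
```

Because the merged result is itself just a list of bins, partial histograms built on separate partitions can be combined pairwise in any order, which is what makes this structure suitable for partial aggregation.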
- 
add
public void add(double v)
Adds a new data point to the histogram approximation. Make sure you have called either allocate() or merge() first. This method implements Algorithm #1 from Ben-Haim and Tom-Tov, "A Streaming Parallel Decision Tree Algorithm", JMLR 2010.
Parameters:
v - The data point to add to the histogram approximation.
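To make the update step more concrete, here is a minimal stand-alone sketch of the idea behind the paper's update procedure: a new point either increments a bin at exactly that position or becomes its own bin with count 1, and if the bin budget is then exceeded, the two closest bins are replaced by their count-weighted average. This is an illustration under stated assumptions, not Spark's code: the class StreamingHistogramSketch and its accessors usedBins, binX, and binY are hypothetical, and details such as tie-breaking between equally close pairs may differ from the real implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; not the Spark NumericHistogram class.
final class StreamingHistogramSketch {

    // One bin: x is the position, y is the number of points it represents.
    private static final class Bin {
        double x;
        double y;
        Bin(double x, double y) { this.x = x; this.y = y; }
    }

    private final int maxBins;
    private final List<Bin> bins = new ArrayList<>(); // kept sorted by x

    StreamingHistogramSketch(int maxBins) {
        if (maxBins < 1) {
            throw new IllegalArgumentException("maxBins must be >= 1");
        }
        this.maxBins = maxBins;
    }

    void add(double v) {
        // Locate the first bin whose position is not smaller than v.
        int i = 0;
        while (i < bins.size() && bins.get(i).x < v) i++;
        if (i < bins.size() && bins.get(i).x == v) {
            bins.get(i).y += 1.0; // exact match: just bump the count
            return;
        }
        bins.add(i, new Bin(v, 1.0));
        if (bins.size() > maxBins) {
            mergeClosestPair();
        }
    }

    private void mergeClosestPair() {
        int best = 0;
        double bestGap = Double.POSITIVE_INFINITY;
        for (int i = 0; i + 1 < bins.size(); i++) {
            double gap = bins.get(i + 1).x - bins.get(i).x;
            if (gap < bestGap) {
                bestGap = gap;
                best = i;
            }
        }
        Bin left = bins.get(best);
        Bin right = bins.remove(best + 1);
        double total = left.y + right.y;
        // Replace the pair with one bin at the count-weighted mean.
        left.x = (left.x * left.y + right.x * right.y) / total;
        left.y = total;
    }

    int usedBins() { return bins.size(); }
    double binX(int b) { return bins.get(b).x; }
    double binY(int b) { return bins.get(b).y; }

    public static void main(String[] args) {
        StreamingHistogramSketch h = new StreamingHistogramSketch(3);
        for (double v : new double[]{1.0, 2.0, 3.0, 10.0}) h.add(v);
        for (int b = 0; b < h.usedBins(); b++) {
            System.out.println("x=" + h.binX(b) + " y=" + h.binY(b));
        }
    }
}
```

The bin count never exceeds the configured budget, so memory use stays constant no matter how many points are added; the price is that bin positions become approximations once merging starts.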
 
 
-