Package org.apache.spark.sql.util
Class NumericHistogram
Object
org.apache.spark.sql.util.NumericHistogram
A generic, re-usable histogram class that supports partial aggregations.
The algorithm is a heuristic adapted from the following paper:
Yael Ben-Haim and Elad Tom-Tov, "A streaming parallel decision tree algorithm",
J. Machine Learning Research 11 (2010), pp. 849-872. Although there are no approximation
guarantees, it appears to work well with adequate data and a large (e.g., 20-80) number
of histogram bins.
Adapted from Hive's NumericHistogram; see
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumericHistogram.java
Differences from the Hive version:
1. Coord and its fields are declared public for easy access from the HistogramNumeric class.
2. Adds the method getNumBins(), used to serialize a NumericHistogram in NumericHistogramSerializer.
3. Adds the method addBin(), used to deserialize a NumericHistogram in NumericHistogramSerializer.
4. In Hive, merge() takes a serialized histogram; in Spark, it takes a deserialized histogram, so the bin-merging code is changed accordingly.
Since:
3.3.0
Nested Class Summary
Modifier and Type | Class | Description
static class | NumericHistogram.Coord | The Coord class defines a histogram bin, which is just an (x,y) pair.
Constructor Summary
Constructor | Description
NumericHistogram() | Creates a new histogram object.
Method Summary
Modifier and Type | Method | Description
void | add(double v) | Adds a new data point to the histogram approximation.
void | addBin(double x, double y, int b) | Sets the histogram bin at the given index.
void | allocate(int num_bins) | Sets the number of histogram bins to use for approximating data.
NumericHistogram.Coord | getBin(int b) | Returns a particular histogram bin.
int | getNumBins() | Returns the number of bins.
int | getUsedBins() | Returns the number of bins currently being used by the histogram.
boolean | isReady() | Returns true if this histogram object has been initialized by calling merge() or allocate().
void | merge(NumericHistogram other) | Takes a histogram and merges it with the current histogram object.
void | reset() | Resets a histogram object to its initial state.
void | setUsedBins(int nusedBins) | Sets the number of bins currently being used by the histogram.
Constructor Details
NumericHistogram
public NumericHistogram()
Creates a new histogram object. Note that the allocate() or merge() method must be called before the histogram can be used.
Method Details
reset
public void reset()
Resets a histogram object to its initial state. allocate() or merge() must be called again before use.
getNumBins
public int getNumBins()
Returns the number of bins.
getUsedBins
public int getUsedBins()
Returns the number of bins currently being used by the histogram.
setUsedBins
public void setUsedBins(int nusedBins)
Sets the number of bins currently being used by the histogram.
isReady
public boolean isReady()
Returns true if this histogram object has been initialized by calling merge() or allocate().
getBin
public NumericHistogram.Coord getBin(int b)
Returns a particular histogram bin.
addBin
public void addBin(double x, double y, int b)
Sets the histogram bin at the given index.
allocate
public void allocate(int num_bins)
Sets the number of histogram bins to use for approximating data.
Parameters:
num_bins - Number of non-uniform-width histogram bins to use
merge
public void merge(NumericHistogram other)
Takes a histogram and merges it with the current histogram object.
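To illustrate what merging two deserialized histograms can look like, the following is a minimal, self-contained sketch in the spirit of the merge procedure in the Ben-Haim and Tom-Tov paper: pool the bins of both histograms, sort them by x, and repeatedly fuse the two closest bins until the bin budget is met. The names HistogramMergeSketch, Bin and maxBins are hypothetical and are not part of Spark; the actual NumericHistogram.merge() may differ in its details.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration only; this is NOT Spark's NumericHistogram.
final class HistogramMergeSketch {

    /** One bin: x is the bin centre, y is the number of points it represents. */
    static final class Bin {
        final double x;
        final double y;

        Bin(double x, double y) {
            this.x = x;
            this.y = y;
        }
    }

    /** Pools both bin lists, then fuses the closest neighbours until at most maxBins remain. */
    static List<Bin> merge(List<Bin> left, List<Bin> right, int maxBins) {
        if (maxBins < 1) {
            throw new IllegalArgumentException("maxBins must be >= 1");
        }
        List<Bin> all = new ArrayList<>(left);
        all.addAll(right);
        // Keep bins ordered by centre so that "closest pair" means adjacent entries.
        all.sort((Bin a, Bin b) -> Double.compare(a.x, b.x));

        while (all.size() > maxBins) {
            // Find the adjacent pair with the smallest gap between centres.
            int best = 0;
            double bestGap = Double.POSITIVE_INFINITY;
            for (int i = 0; i + 1 < all.size(); i++) {
                double gap = all.get(i + 1).x - all.get(i).x;
                if (gap < bestGap) {
                    bestGap = gap;
                    best = i;
                }
            }
            // Replace the pair with one bin at the count-weighted mean of the two centres.
            Bin a = all.get(best);
            Bin b = all.get(best + 1);
            double y = a.y + b.y;
            all.set(best, new Bin((a.x * a.y + b.x * b.y) / y, y));
            all.remove(best + 1);
        }
        return all;
    }
}
```

In this sketch, merging the bins (1, 1) and (10, 1) with the single bin (2, 3) under a budget of two bins fuses the bins at 1 and 2 into (1.75, 4) and leaves (10, 1) untouched.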
add
public void add(double v)
Adds a new data point to the histogram approximation. Make sure you have called either allocate() or merge() first. This method implements Algorithm #1 from Ben-Haim and Tom-Tov, "A Streaming Parallel Decision Tree Algorithm", JMLR 2010.
Parameters:
v - The data point to add to the histogram approximation.
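As a rough illustration of the update step that add() is described as implementing, here is a minimal, self-contained sketch of the streaming update from the Ben-Haim and Tom-Tov paper: insert the new point as a bin of weight one, and if the bin budget is exceeded, fuse the two closest bins into one at their count-weighted mean. The class StreamingHistogramSketch and its members are hypothetical names chosen for this example; it is not the Spark implementation, which may handle searching and edge cases differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration only; this is NOT Spark's NumericHistogram.
final class StreamingHistogramSketch {

    /** One bin: x is the bin centre, y is the number of points it represents. */
    static final class Bin {
        double x;
        double y;

        Bin(double x, double y) {
            this.x = x;
            this.y = y;
        }
    }

    private final int maxBins;
    private final List<Bin> bins = new ArrayList<>(); // always sorted by x

    StreamingHistogramSketch(int maxBins) {
        if (maxBins < 1) {
            throw new IllegalArgumentException("maxBins must be >= 1");
        }
        this.maxBins = maxBins;
    }

    /** Adds one data point, fusing the two closest bins if the budget is exceeded. */
    void add(double v) {
        // Linear scan for the first bin whose centre is >= v (kept simple for clarity).
        int i = 0;
        while (i < bins.size() && bins.get(i).x < v) {
            i++;
        }
        // An exact match only increases that bin's count.
        if (i < bins.size() && bins.get(i).x == v) {
            bins.get(i).y += 1;
            return;
        }
        bins.add(i, new Bin(v, 1));
        if (bins.size() > maxBins) {
            mergeClosestPair();
        }
    }

    /** Fuses the adjacent pair with the smallest gap into a count-weighted mean. */
    private void mergeClosestPair() {
        int best = 0;
        double bestGap = Double.POSITIVE_INFINITY;
        for (int i = 0; i + 1 < bins.size(); i++) {
            double gap = bins.get(i + 1).x - bins.get(i).x;
            if (gap < bestGap) {
                bestGap = gap;
                best = i;
            }
        }
        Bin a = bins.get(best);
        Bin b = bins.get(best + 1);
        double y = a.y + b.y;
        a.x = (a.x * a.y + b.x * b.y) / y;
        a.y = y;
        bins.remove(best + 1);
    }

    int usedBins() {
        return bins.size();
    }

    Bin bin(int i) {
        return bins.get(i);
    }
}
```

With a budget of two bins, adding 1, 2 and 10 in that order leaves the sketch with the bins (1.5, 2) and (10, 1): the points at 1 and 2 are the closest pair, so they are fused once the third point arrives.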