org.apache.spark.sql.util

NumericHistogram

class NumericHistogram extends AnyRef

A generic, re-usable histogram class that supports partial aggregations. The algorithm is a heuristic adapted from the following paper: Yael Ben-Haim and Elad Tom-Tov, "A streaming parallel decision tree algorithm", J. Machine Learning Research 11 (2010), pp. 849--872. Although there are no approximation guarantees, it appears to work well with adequate data and a large (e.g., 20-80) number of histogram bins.
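
The update heuristic described above can be sketched as follows. This is a minimal, self-contained illustration and not Spark's or Hive's implementation: the class `StreamingHistogramSketch`, its nested `Bin` type, and every method name are hypothetical. It assumes a fixed bin budget, inserts each new point as its own bin, and merges the two closest bins into their weighted centroid whenever the budget is exceeded.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified sketch of a Ben-Haim/Tom-Tov style streaming
// histogram. It is NOT Spark's NumericHistogram; all names are illustrative.
final class StreamingHistogramSketch {

    // One bin: x is the centroid, y is the number of points it represents.
    static final class Bin {
        double x;
        double y;

        Bin(double x, double y) {
            this.x = x;
            this.y = y;
        }
    }

    private final int maxBins;
    private final List<Bin> bins = new ArrayList<>(); // kept sorted by x

    StreamingHistogramSketch(int maxBins) {
        if (maxBins < 1) {
            throw new IllegalArgumentException("maxBins must be >= 1");
        }
        this.maxBins = maxBins;
    }

    // Adds one data point, then restores the bin budget if it was exceeded.
    void add(double v) {
        int pos = 0;
        while (pos < bins.size() && bins.get(pos).x < v) {
            pos++;
        }
        if (pos < bins.size() && bins.get(pos).x == v) {
            bins.get(pos).y += 1; // exact centroid match: only bump the count
            return;
        }
        bins.add(pos, new Bin(v, 1));
        if (bins.size() > maxBins) {
            mergeClosestPair();
        }
    }

    // Replaces the two adjacent bins with the smallest gap by one bin at
    // their count-weighted mean, carrying the combined count.
    private void mergeClosestPair() {
        int best = 0;
        double bestGap = Double.POSITIVE_INFINITY;
        for (int i = 0; i + 1 < bins.size(); i++) {
            double gap = bins.get(i + 1).x - bins.get(i).x;
            if (gap < bestGap) {
                bestGap = gap;
                best = i;
            }
        }
        Bin a = bins.get(best);
        Bin b = bins.remove(best + 1);
        double total = a.y + b.y;
        a.x = (a.x * a.y + b.x * b.y) / total;
        a.y = total;
    }

    int usedBins() {
        return bins.size();
    }

    Bin bin(int i) {
        return bins.get(i);
    }
}
```

With a budget of three bins, adding 1, 2, 10 and 11 in that order would leave three bins in this sketch, with the points 1 and 2 collapsed into a single bin centred at 1.5 with count 2.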

Adapted from Hive's NumericHistogram; see https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumericHistogram.java

Differences:

  1. Declares Coord and its variables as public for easy access from the HistogramNumeric class.
  2. Adds the method getNumBins(), used by NumericHistogramSerializer to serialize a NumericHistogram.
  3. Adds the method addBin(), used by NumericHistogramSerializer to deserialize a NumericHistogram.
  4. In Hive's code, the merge() method is passed a serialized histogram; in Spark, it is passed a deserialized histogram. The bin-merging code is changed accordingly.
Source
NumericHistogram.java
Since

3.3.0

Linear Supertypes
AnyRef, Any

Instance Constructors

  1. new NumericHistogram()

Value Members

  1. final def !=(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  2. final def ##: Int
    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  4. def add(v: Double): Unit

    Adds a new data point to the histogram approximation. Make sure you have called either allocate() or merge() first. This method implements Algorithm #1 from Ben-Haim and Tom-Tov, "A Streaming Parallel Decision Tree Algorithm", JMLR 2010.

    v

    The data point to add to the histogram approximation.

  5. def addBin(x: Double, y: Double, b: Int): Unit

    Sets the histogram bin at index b to the values x and y.

  6. def allocate(num_bins: Int): Unit

    Sets the number of histogram bins to use for approximating data.

    num_bins

    Number of non-uniform-width histogram bins to use

  7. final def asInstanceOf[T0]: T0
    Definition Classes
    Any
  8. def clone(): AnyRef
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.CloneNotSupportedException]) @IntrinsicCandidate() @native()
  9. final def eq(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  10. def equals(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef → Any
  11. def getBin(b: Int): Coord

    Returns a particular histogram bin.

  12. final def getClass(): Class[_ <: AnyRef]
    Definition Classes
    AnyRef → Any
    Annotations
    @IntrinsicCandidate() @native()
  13. def getNumBins(): Int

    Returns the number of bins.

  14. def getUsedBins(): Int

    Returns the number of bins currently being used by the histogram.

  15. def hashCode(): Int
    Definition Classes
    AnyRef → Any
    Annotations
    @IntrinsicCandidate() @native()
  16. final def isInstanceOf[T0]: Boolean
    Definition Classes
    Any
  17. def isReady(): Boolean

    Returns true if this histogram object has been initialized by calling merge() or allocate().

  18. def merge(other: NumericHistogram): Unit

    Takes a histogram and merges it with the current histogram object.

  19. final def ne(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  20. final def notify(): Unit
    Definition Classes
    AnyRef
    Annotations
    @IntrinsicCandidate() @native()
  21. final def notifyAll(): Unit
    Definition Classes
    AnyRef
    Annotations
    @IntrinsicCandidate() @native()
  22. def reset(): Unit

    Resets a histogram object to its initial state. allocate() or merge() must be called again before use.

  23. def setUsedBins(nusedBins: Int): Unit

    Sets the number of bins currently being used by the histogram.

  24. final def synchronized[T0](arg0: => T0): T0
    Definition Classes
    AnyRef
  25. def toString(): String
    Definition Classes
    AnyRef → Any
  26. final def wait(arg0: Long, arg1: Int): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.InterruptedException])
  27. final def wait(arg0: Long): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.InterruptedException]) @native()
  28. final def wait(): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.InterruptedException])
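
The merge() member listed above is what makes partial aggregation possible: histograms built independently on separate slices of the data are combined into one. A minimal sketch of that idea follows, assuming the merge procedure from the cited paper (pool the bins of both histograms, sort by centroid, then repeatedly merge the closest pair until the bin budget is met). It is not Spark's implementation; `HistogramMergeSketch`, `Bin`, and the static `merge` helper are hypothetical names, and unlike the real merge(other: NumericHistogram), which updates the receiver, this helper returns a new list.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of merging two partial histograms.
// It is NOT Spark's NumericHistogram.merge; all names are illustrative.
final class HistogramMergeSketch {

    // One bin: x is the centroid, y is the number of points it represents.
    static final class Bin {
        double x;
        double y;

        Bin(double x, double y) {
            this.x = x;
            this.y = y;
        }
    }

    // Combines two bin lists into a new list of at most maxBins bins.
    static List<Bin> merge(List<Bin> left, List<Bin> right, int maxBins) {
        if (maxBins < 1) {
            throw new IllegalArgumentException("maxBins must be >= 1");
        }
        // Copy the bins so that neither input histogram is modified.
        List<Bin> all = new ArrayList<>();
        for (Bin b : left) {
            all.add(new Bin(b.x, b.y));
        }
        for (Bin b : right) {
            all.add(new Bin(b.x, b.y));
        }
        all.sort(Comparator.comparingDouble((Bin b) -> b.x));

        // Collapse the closest adjacent pair until the budget is met.
        while (all.size() > maxBins) {
            int best = 0;
            double bestGap = Double.POSITIVE_INFINITY;
            for (int i = 0; i + 1 < all.size(); i++) {
                double gap = all.get(i + 1).x - all.get(i).x;
                if (gap < bestGap) {
                    bestGap = gap;
                    best = i;
                }
            }
            Bin a = all.get(best);
            Bin b = all.remove(best + 1);
            double total = a.y + b.y;
            a.x = (a.x * a.y + b.x * b.y) / total; // count-weighted centroid
            a.y = total;
        }
        return all;
    }
}
```

Because each merged bin keeps the combined count of the bins it replaces, the total number of points represented is preserved across merges; only the positions of the centroids are approximated.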

Deprecated Value Members

  1. def finalize(): Unit
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.Throwable]) @Deprecated
    Deprecated

    (Since version 9)
