org.apache.hadoop.hive.ql.udf.generic
Class NumericHistogram

java.lang.Object
  extended by org.apache.hadoop.hive.ql.udf.generic.NumericHistogram

public class NumericHistogram
extends Object

A generic, reusable histogram class that supports partial aggregations. The algorithm is a heuristic adapted from the following paper: Yael Ben-Haim and Elad Tom-Tov, "A Streaming Parallel Decision Tree Algorithm", J. Machine Learning Research 11 (2010), pp. 849-872. Although it offers no approximation guarantees, it appears to work well given adequate data and a large number of histogram bins (e.g., 20-80).


Constructor Summary
NumericHistogram()
          Creates a new histogram object.
 
Method Summary
 void add(double v)
          Adds a new data point to the histogram approximation.
 void allocate(int num_bins)
          Sets the number of histogram bins to use for approximating data.
 org.apache.hadoop.hive.ql.udf.generic.NumericHistogram.Coord getBin(int b)
          Returns a particular histogram bin.
 int getNumBins()
           
 int getUsedBins()
          Returns the number of bins currently being used by the histogram.
 boolean isReady()
          Returns true if this histogram object has been initialized by calling merge() or allocate().
 void merge(List<DoubleWritable> other)
          Takes a serialized histogram created by the serialize() method and merges it with the current histogram object.
 double quantile(double q)
          Gets an approximate quantile value from the current histogram.
 void reset()
          Resets a histogram object to its initial state.
 ArrayList<DoubleWritable> serialize()
          In preparation for a Hive merge() call, serializes the current histogram object into an ArrayList of DoubleWritable objects.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

NumericHistogram

public NumericHistogram()
Creates a new histogram object. Note that the allocate() or merge() method must be called before the histogram can be used.

Method Detail

reset

public void reset()
Resets a histogram object to its initial state. allocate() or merge() must be called again before use.


getUsedBins

public int getUsedBins()
Returns the number of bins currently being used by the histogram.


isReady

public boolean isReady()
Returns true if this histogram object has been initialized by calling merge() or allocate().


getBin

public org.apache.hadoop.hive.ql.udf.generic.NumericHistogram.Coord getBin(int b)
Returns a particular histogram bin.


allocate

public void allocate(int num_bins)
Sets the number of histogram bins to use for approximating data.

Parameters:
num_bins - Number of non-uniform-width histogram bins to use

merge

public void merge(List<DoubleWritable> other)
Takes a serialized histogram created by the serialize() method and merges it with the current histogram object.

Parameters:
other - A serialized histogram created by the serialize() method
See Also:
serialize()
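
Combining two partial histograms can be pictured as pooling their bins and then collapsing the closest neighbours until the bin budget is met again. The sketch below illustrates that idea on plain arrays. It is not Hive's code: the class HistogramMergeSketch, the {centroid, count} array layout, and the explicit maxBins argument are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative only; names and data layout are hypothetical, not Hive's. */
class HistogramMergeSketch {

    /**
     * Merges two bin lists. Each bin is a double[2] holding {centroid, count}.
     * The inputs are copied, so neither argument is modified.
     */
    static List<double[]> merge(List<double[]> a, List<double[]> b, int maxBins) {
        if (maxBins < 1) {
            throw new IllegalArgumentException("maxBins must be >= 1");
        }
        // Pool the bins of both histograms and sort them by centroid.
        List<double[]> out = new ArrayList<>();
        for (double[] bin : a) out.add(bin.clone());
        for (double[] bin : b) out.add(bin.clone());
        out.sort(Comparator.comparingDouble((double[] bin) -> bin[0]));

        // Collapse the two closest neighbours until the budget is respected.
        while (out.size() > maxBins) {
            int best = 0;
            double bestGap = Double.POSITIVE_INFINITY;
            for (int i = 0; i + 1 < out.size(); i++) {
                double gap = out.get(i + 1)[0] - out.get(i)[0];
                if (gap < bestGap) {
                    bestGap = gap;
                    best = i;
                }
            }
            double[] left = out.get(best);
            double[] right = out.remove(best + 1);
            double total = left[1] + right[1];
            // The merged centroid is the count-weighted mean of the pair.
            left[0] = (left[0] * left[1] + right[0] * right[1]) / total;
            left[1] = total;
        }
        return out;
    }

    public static void main(String[] args) {
        List<double[]> a = new ArrayList<>();
        a.add(new double[] {1, 1});
        a.add(new double[] {5, 2});
        List<double[]> b = new ArrayList<>();
        b.add(new double[] {2, 3});
        b.add(new double[] {9, 1});
        for (double[] bin : merge(a, b, 3)) {
            System.out.println(bin[0] + " -> " + bin[1]);
        }
    }
}
```

Weighting the merged centroid by count keeps the total number of represented points unchanged, which is what allows partial results to be combined in any order.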

add

public void add(double v)
Adds a new data point to the histogram approximation. Make sure you have called either allocate() or merge() first. This method implements Algorithm #1 from Ben-Haim and Tom-Tov, "A Streaming Parallel Decision Tree Algorithm", JMLR 2010.

Parameters:
v - The data point to add to the histogram approximation.
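
The update step can be sketched in a few lines: insert the point as its own bin, and if the bin budget is exceeded, merge the two bins whose centroids are closest. The following is a minimal, self-contained illustration of that insert-then-merge idea from the cited paper, not the Hive implementation. The class StreamingHistogramSketch, its Bin type, and every method name are hypothetical, and details such as tie-breaking are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative streaming histogram; all names here are hypothetical. */
class StreamingHistogramSketch {

    /** One bin: x is the centroid, y is the number of points it represents. */
    static final class Bin {
        double x;
        double y;

        Bin(double x, double y) {
            this.x = x;
            this.y = y;
        }
    }

    private final int maxBins;
    private final List<Bin> bins = new ArrayList<>(); // kept sorted by x

    StreamingHistogramSketch(int maxBins) {
        if (maxBins < 1) {
            throw new IllegalArgumentException("maxBins must be >= 1");
        }
        this.maxBins = maxBins;
    }

    /** Adds one point, then merges the closest pair of bins if over budget. */
    void add(double v) {
        // Locate the first bin whose centroid is not smaller than v.
        int i = 0;
        while (i < bins.size() && bins.get(i).x < v) {
            i++;
        }
        if (i < bins.size() && bins.get(i).x == v) {
            bins.get(i).y += 1; // exact match: just bump the count
            return;
        }
        bins.add(i, new Bin(v, 1));
        if (bins.size() > maxBins) {
            mergeClosestPair();
        }
    }

    private void mergeClosestPair() {
        int best = 0;
        double bestGap = Double.POSITIVE_INFINITY;
        for (int i = 0; i + 1 < bins.size(); i++) {
            double gap = bins.get(i + 1).x - bins.get(i).x;
            if (gap < bestGap) {
                bestGap = gap;
                best = i;
            }
        }
        Bin left = bins.get(best);
        Bin right = bins.remove(best + 1);
        double total = left.y + right.y;
        // Replace the pair with one bin at their count-weighted mean.
        left.x = (left.x * left.y + right.x * right.y) / total;
        left.y = total;
    }

    int usedBins() {
        return bins.size();
    }

    Bin bin(int i) {
        return bins.get(i);
    }

    double totalCount() {
        double sum = 0;
        for (Bin b : bins) {
            sum += b.y;
        }
        return sum;
    }

    public static void main(String[] args) {
        StreamingHistogramSketch h = new StreamingHistogramSketch(3);
        for (double v : new double[] {23, 19, 10, 16, 36, 2, 9}) {
            h.add(v);
        }
        for (int i = 0; i < h.usedBins(); i++) {
            System.out.println(h.bin(i).x + " -> " + h.bin(i).y);
        }
    }
}
```

Because bins are merged rather than dropped, the sum of all bin counts always equals the number of points added, while memory stays bounded by the bin budget.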

quantile

public double quantile(double q)
Gets an approximate quantile value from the current histogram. Some popular quantiles are 0.5 (median), 0.95, and 0.98.

Parameters:
q - The requested quantile; must lie strictly within the open interval (0, 1).
Returns:
The quantile value.
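
One way to read a quantile off a set of (centroid, count) bins is to walk the bins in order, accumulate counts until the target rank q * total is reached, and interpolate linearly between the two neighbouring centroids. The sketch below shows that approach on plain arrays. The class HistogramQuantileSketch and the interpolation rule are assumptions for illustration, not a statement of how this class computes its result.

```java
/** Illustrative only; the interpolation rule is an assumption. */
class HistogramQuantileSketch {

    /**
     * Estimates the q-quantile from bins sorted by centroid.
     * Each bin is a double[2] holding {centroid, count}.
     */
    static double quantile(double[][] bins, double q) {
        if (bins.length == 0) {
            throw new IllegalArgumentException("histogram has no bins");
        }
        if (!(q > 0.0 && q < 1.0)) {
            throw new IllegalArgumentException("q must be strictly within (0, 1)");
        }
        double total = 0;
        for (double[] b : bins) {
            total += b[1];
        }
        double target = q * total; // rank of the requested quantile
        double seen = 0;           // count accumulated before bin i
        for (int i = 0; i < bins.length; i++) {
            if (seen + bins[i][1] >= target) {
                if (i == 0) {
                    return bins[0][0]; // nothing to interpolate against
                }
                // Fraction of bin i's count needed to reach the target rank.
                double fraction = (target - seen) / bins[i][1];
                return bins[i - 1][0] + fraction * (bins[i][0] - bins[i - 1][0]);
            }
            seen += bins[i][1];
        }
        return bins[bins.length - 1][0]; // only reachable through rounding
    }

    public static void main(String[] args) {
        double[][] bins = {{1, 2}, {3, 2}, {5, 4}};
        System.out.println(quantile(bins, 0.5));
        System.out.println(quantile(bins, 0.75));
    }
}
```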

serialize

public ArrayList<DoubleWritable> serialize()
In preparation for a Hive merge() call, serializes the current histogram object into an ArrayList of DoubleWritable objects. The resulting list is deserialized and merged by the merge() method.

Returns:
An ArrayList of Hadoop DoubleWritable objects that represents the current histogram.
See Also:
merge(java.util.List)
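
To make the idea of a flat, list-based encoding concrete, the sketch below round-trips a histogram through a list of doubles. The layout used here (bin budget first, then alternating centroid and count values) is an assumption chosen for illustration, as are the class HistogramSerializationSketch and its method names. Plain Double values stand in for Hadoop's DoubleWritable so that the example has no external dependencies.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only; the flat layout shown here is an assumed format. */
class HistogramSerializationSketch {

    /** Flattens bins of {centroid, count} into [maxBins, x0, y0, x1, y1, ...]. */
    static List<Double> serialize(int maxBins, double[][] bins) {
        List<Double> out = new ArrayList<>(1 + 2 * bins.length);
        out.add((double) maxBins);
        for (double[] b : bins) {
            out.add(b[0]);
            out.add(b[1]);
        }
        return out;
    }

    /** Reads the bin budget stored in the first slot. */
    static int maxBins(List<Double> flat) {
        return (int) flat.get(0).doubleValue();
    }

    /** Rebuilds the {centroid, count} pairs that follow the first slot. */
    static double[][] deserialize(List<Double> flat) {
        if (flat.isEmpty() || flat.size() % 2 == 0) {
            throw new IllegalArgumentException("expected 1 + 2 * n values");
        }
        int n = (flat.size() - 1) / 2;
        double[][] bins = new double[n][2];
        for (int i = 0; i < n; i++) {
            bins[i][0] = flat.get(1 + 2 * i);
            bins[i][1] = flat.get(2 + 2 * i);
        }
        return bins;
    }

    public static void main(String[] args) {
        double[][] bins = {{1.5, 2}, {10, 1}};
        List<Double> flat = serialize(4, bins);
        System.out.println(flat);
        System.out.println(deserialize(flat).length + " bins restored");
    }
}
```

A flat numeric list like this is convenient for partial aggregation because each worker's result can be shipped as an ordinary array value and folded into another histogram on arrival.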

getNumBins

public int getNumBins()


Copyright © 2014 The Apache Software Foundation. All rights reserved.