org.apache.hadoop.hive.ql.udf.generic
Class NGramEstimator

java.lang.Object
  extended by org.apache.hadoop.hive.ql.udf.generic.NGramEstimator

public class NGramEstimator
extends Object

A generic, reusable n-gram estimation class that supports partial aggregations. The algorithm is based on the heuristic from the following paper: Yael Ben-Haim and Elad Tom-Tov, "A streaming parallel decision tree algorithm", J. Machine Learning Research 11 (2010), pp. 849-872. In particular, frequencies are guaranteed to be under-counted. With large data sets and a reasonable precision factor, this under-counting appears to be on the order of 5%.
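
The under-counting behaviour can be illustrated with a minimal sketch. It assumes a buffer that holds at most k * pf distinct n-grams and is trimmed to its most frequent entries whenever it overflows. The class BoundedNGramCounter, its capacity rule, and its trimming policy are hypothetical stand-ins for illustration and are not taken from the Hive implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for illustration; this is NOT the Hive NGramEstimator.
class BoundedNGramCounter {
    private final int capacity; // assumed buffer limit: k * pf distinct n-grams
    private final Map<List<String>, Long> counts = new HashMap<>();

    BoundedNGramCounter(int k, int pf) {
        if (k <= 0 || pf <= 0) {
            throw new IllegalArgumentException("k and pf must be positive");
        }
        this.capacity = k * pf;
    }

    // Counts one occurrence of the n-gram and trims the buffer if it overflows.
    void add(List<String> ngram) {
        counts.merge(new ArrayList<>(ngram), 1L, Long::sum);
        if (counts.size() > capacity) {
            trim();
        }
    }

    // Keeps only the `capacity` most frequent n-grams. An evicted n-gram loses
    // its count, so a later occurrence starts again from 1: counts can only be
    // too low, never too high.
    private void trim() {
        List<Map.Entry<List<String>, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        Map<List<String>, Long> kept = new HashMap<>();
        for (Map.Entry<List<String>, Long> e : entries.subList(0, capacity)) {
            kept.put(e.getKey(), e.getValue());
        }
        counts.clear();
        counts.putAll(kept);
    }

    // Estimated frequency; 0 if the n-gram is not (or no longer) in the buffer.
    long count(List<String> ngram) {
        return counts.getOrDefault(ngram, 0L);
    }

    int size() {
        return counts.size();
    }

    public static void main(String[] args) {
        BoundedNGramCounter c = new BoundedNGramCounter(1, 2); // capacity = 2
        for (int i = 0; i < 3; i++) c.add(Arrays.asList("a", "b"));
        for (int i = 0; i < 2; i++) c.add(Arrays.asList("b", "c"));
        for (int i = 0; i < 2; i++) c.add(Arrays.asList("c", "d")); // evicted twice
        System.out.println(c.count(Arrays.asList("a", "b"))); // prints 3
        System.out.println(c.count(Arrays.asList("c", "d"))); // prints 0 (true count is 2)
    }
}
```

In this sketch a larger pf enlarges the buffer, so fewer n-grams are evicted and the estimate moves closer to the true counts, at the cost of more memory.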


Constructor Summary
NGramEstimator()
          Creates a new n-gram estimator object.
 
Method Summary
 void add(ArrayList<String> ng)
          Adds a new n-gram to the estimation.
 ArrayList<Object[]> getNGrams()
          Returns the final top-k n-grams in a format suitable for returning to Hive.
 void initialize(int pk, int ppf, int pn)
          Sets the 'k', 'pf', and 'n' parameters (pk, ppf, and pn respectively).
 boolean isInitialized()
          Returns true if the 'k' and 'pf' parameters have been set.
 void merge(List<org.apache.hadoop.io.Text> other)
          Takes a serialized n-gram estimator object created by the serialize() method and merges it with the current n-gram object.
 void reset()
          Resets an n-gram estimator object to its initial state.
 ArrayList<org.apache.hadoop.io.Text> serialize()
          In preparation for a Hive merge() call, serializes the current n-gram estimator object into an ArrayList of Text objects.
 int size()
          Returns the number of n-grams currently held in the estimator's buffer.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

NGramEstimator

public NGramEstimator()
Creates a new n-gram estimator object. The 'n' for n-grams is computed dynamically when data is fed to the object.

Method Detail

isInitialized

public boolean isInitialized()
Returns true if the 'k' and 'pf' parameters have been set.


initialize

public void initialize(int pk,
                       int ppf,
                       int pn)
                throws HiveException
Sets the 'k', 'pf', and 'n' parameters (pk, ppf, and pn respectively).

Throws:
HiveException

reset

public void reset()
Resets an n-gram estimator object to its initial state.


getNGrams

public ArrayList<Object[]> getNGrams()
                              throws HiveException
Returns the final top-k n-grams in a format suitable for returning to Hive.

Throws:
HiveException
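
The top-k selection can be sketched as follows. The layout of each Object[] is not specified on this page, so the sketch assumes a two-element array of {n-gram, estimated frequency}. The class TopKNGrams and the method topK are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of the Hive API.
class TopKNGrams {
    // Returns up to k rows, most frequent first. Each row is assumed to be
    // {List<String> ngram, Double estimatedFrequency}.
    static ArrayList<Object[]> topK(Map<List<String>, Double> counts, int k) {
        List<Map.Entry<List<String>, Double>> entries = new ArrayList<>(counts.entrySet());
        // Sort by estimated frequency, descending.
        entries.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        ArrayList<Object[]> result = new ArrayList<>();
        int limit = Math.min(k, entries.size());
        for (int i = 0; i < limit; i++) {
            Map.Entry<List<String>, Double> e = entries.get(i);
            result.add(new Object[] { new ArrayList<String>(e.getKey()), e.getValue() });
        }
        return result;
    }

    public static void main(String[] args) {
        Map<List<String>, Double> counts = new HashMap<>();
        counts.put(Arrays.asList("a", "b"), 5.0);
        counts.put(Arrays.asList("b", "c"), 2.0);
        counts.put(Arrays.asList("c", "d"), 9.0);
        for (Object[] row : topK(counts, 2)) {
            System.out.println(row[0] + " " + row[1]);
        }
    }
}
```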

size

public int size()
Returns the number of n-grams currently held in the estimator's buffer.


add

public void add(ArrayList<String> ng)
         throws HiveException
Adds a new n-gram to the estimation.

Parameters:
ng - The n-gram to add to the estimation
Throws:
HiveException

merge

public void merge(List<org.apache.hadoop.io.Text> other)
           throws HiveException
Takes a serialized n-gram estimator object created by the serialize() method and merges it with the current n-gram object.

Parameters:
other - A serialized n-gram object created by the serialize() method
Throws:
HiveException

serialize

public ArrayList<org.apache.hadoop.io.Text> serialize()
                                               throws HiveException
In preparation for a Hive merge() call, serializes the current n-gram estimator object into an ArrayList of Text objects. This list is deserialized and merged by the merge method.

Returns:
An ArrayList of Hadoop Text objects that represents the current n-gram estimation.
Throws:
HiveException
See Also:
merge(java.util.List)
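
The serialize/merge round trip used for partial aggregation can be sketched as follows. To stay self-contained, the sketch uses String in place of Hadoop Text, and it assumes a flat layout of [n, w1..wn, count, w1..wn, count, ...]. Both the class MergeableNGramCounts and this layout are hypothetical and are not confirmed by this page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for illustration; this is NOT the Hive NGramEstimator.
class MergeableNGramCounts {
    private final int n; // n-gram length
    private final Map<List<String>, Double> counts = new HashMap<>();

    MergeableNGramCounts(int n) {
        if (n <= 0) throw new IllegalArgumentException("n must be positive");
        this.n = n;
    }

    void add(List<String> ngram) {
        if (ngram.size() != n) throw new IllegalArgumentException("expected " + n + " tokens");
        counts.merge(new ArrayList<>(ngram), 1.0, Double::sum);
    }

    // Assumed layout: n first, then for every n-gram its n tokens and its count.
    ArrayList<String> serialize() {
        ArrayList<String> out = new ArrayList<>();
        out.add(Integer.toString(n));
        for (Map.Entry<List<String>, Double> e : counts.entrySet()) {
            out.addAll(e.getKey());
            out.add(Double.toString(e.getValue()));
        }
        return out;
    }

    // Reads a list produced by serialize() and adds its counts to this object.
    void merge(List<String> other) {
        if (other == null || other.isEmpty()) return;
        if (Integer.parseInt(other.get(0)) != n) {
            throw new IllegalArgumentException("n-gram length mismatch");
        }
        // Each record occupies n token slots followed by one count slot.
        for (int i = 1; i + n < other.size(); i += n + 1) {
            List<String> ngram = new ArrayList<>(other.subList(i, i + n));
            double c = Double.parseDouble(other.get(i + n));
            counts.merge(ngram, c, Double::sum);
        }
    }

    double count(List<String> ngram) {
        return counts.getOrDefault(ngram, 0.0);
    }

    public static void main(String[] args) {
        MergeableNGramCounts a = new MergeableNGramCounts(2);
        MergeableNGramCounts b = new MergeableNGramCounts(2);
        a.add(Arrays.asList("a", "b"));
        b.add(Arrays.asList("a", "b"));
        a.merge(b.serialize()); // combine the two partial results
        System.out.println(a.count(Arrays.asList("a", "b"))); // prints 2.0
    }
}
```

Each partial result is built independently, serialized, and then folded into a single object, which is the pattern that serialize() and merge() support.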


Copyright © 2014 The Apache Software Foundation. All rights reserved.