Managing Apache ImpalaPDF version

Using KLL Datasketch Algorithms in Impala

You can use the stochastic streaming algorithm (KLL) that uses the percentile/quantile functions to statistically analyze the approximate distribution of comparable data from a very large stream.

Use the ds_kll_quantiles() or its inverse functions the Probability Mass Function ds_kll_PMF() and the Cumulative Distribution Function ds_kll_CDF() to obtain the analysis. Query results produced by this KLL algorithm are approximate but within well defined error bounds.

KLL function currently supports only floats. If you sketch an int data the KLL function converts it into float.

You can sketch data from any file formats. However it is recommended to save the sketches into parquet tables. It is not recommended to save the sketches to text format.

This datasketch algorithm works in two phases.

  • First it creates a sketch from the dataset provided for the algorithm.
  • As the second step in the process you can read the estimate from the sketch.

Using this approach you can create granular sketches for a specific dataset (e.g. one sketch per partition) and later use them to provide answers for different quantile queries without reading the actual data as they can also be merged together. Using KLL you can create sketches by Hive and read it with Impala for getting the estimate and vice versa.

Creating a sketch

The following example shows how to create a sketch using the ds_kll_sketch function from a column (float_col) containing the supported data type. This created sketch is a data structure containing an approximate calculation of some facet of the data and which are compatible regardless of which system reads or writes them.

select ds_kll_sketch(float_col) from tableName;

where ds_kll_sketch() is an aggregate function that receives a float parameter (e.g. a float column of a table) and returns a serialized Apache DataSketches KLL sketch of the input data set wrapped into STRING type. Since the serialized data might contain unsupported characters this query may error out. So another approach to this method is to insert the derived sketch into a table or a view and later use for quantile approximations.

Inserting the derived sketch into a table

If you save the sketches to a table or view with the same granularity (e.g. one sketch for each partition in a table) then later you can simply read the sketch for quantile approximation.

insert into table_name select ds_kll_sketch(float_col) from tableName;

Debugging the created sketch

To troubleshoot the sketch created by the ds_kll_sketch function, use ds_kll_stringify on the sketch. ds_kll_stringify() receives a string that is a serialized Apache DataSketches sketch and returns its stringified format by invoking the related function on the sketch's interface.

select ds_kll_stringify(ds_kll_sketch(float_col)) from tableName;
    +-------------------------------------------+
    | ds_kll_stringify(ds_kll_sketch(float_col))|
    +-------------------------------------------+
    | ### KLL sketch summary:                   |
    |    K                     : 200            |
    |    min K                 : 200            |
    |    M                     : 8              |
    |    N                     : 100            |
    |    Epsilon               : 1.33%          |
    |    Epsilon PMF           : 1.65%          |
    |    Empty                 : false          |
    |    Estimation mode       : false          |
    |    Levels                : 1              |
    |    Sorted                : false          |
    |    Capacity items        : 200            |
    |    Retained items        : 100            |
    |    Storage bytes         : 432            |
    |    Min value             : 0              |
    |    Max value             : 9              |
    | ### End sketch summary                    |
    +-------------------------------------------+
   

Determining the number of datasets sketched

To determine the number of records sketched into this sketch, use ds_kll_n function.

select ds_kll_n(ds_kll_sketch(float_col)) from tableName;
    +------------------------------------+
    | ds_kll_n(ds_kll_sketch(float_col)) |
    +------------------------------------+
    | 100                                |
    +------------------------------------+  
   

This function ds_kll_quantile() function receives two parameters, a STRING parameter that contains a serialized KLL sketch and a DOUBLE (0.5) that represents the rank of the quantile in the range of [0,1]. E.g. rank=0.5 means the approximate median of all the sketched data.

    +------------------------------------------------+
    | ds_kll_quantile(ds_kll_sketch(float_col), 0.5) |
    +------------------------------------------------+
    | 4.0                                            |
    +------------------------------------------------+
   

This query returns a data (4.0) that is bigger than the 50% of the data.

Calculating quantiles

To calculate the quantiles for multiple rank parameters, use the function ds_kll_quantiles_as_string () that is very similar to ds_kll_quantile() but this function receives any number of rank parameters and returns a comma separated string that holds the results for all of the given ranks.

    select ds_kll_quantiles_as_string(ds_kll_sketch(float_col), 0.5, 0.6, 1) from tableName;
   
    +-------------------------------------------------------------------+
    | ds_kll_quantiles_as_string(ds_kll_sketch(float_col), 0.5, 0.6, 1) |
    +-------------------------------------------------------------------+
    | 4, 5, 9                                                           |
    +-------------------------------------------------------------------+
   

This rank function ds_kll_rank() receives two parameters, a STRING that represents a serialized DataSketches KLL sketch and a float to provide a probing value in the sketch.

    select ds_kll_rank(ds_kll_sketch(float_col), 4) from tableName;
   
    +------------------------------------------------+
    | ds_kll_rank(ds_kll_sketch(float_col), 4)       |
    +------------------------------------------------+
    | 0.48                                           |
    +------------------------------------------------+
   

This query returns a DOUBLE that is the rank of the given probing value in the range of [0,1]. E.g. a return value of 0.48 means that the probing value given as parameter is greater than 48% of all the values in the sketch.

This Probabilistic Mass Function (PMF) ds_kll_pmf_as_string()receives a serialized KLL sketch and one or more float values to represent ranges in the sketched values. In the following example, the float values [1, 3, 4, 8] represent the following ranges: (-inf, 1), [1, 3), [3, 4), [4, 8) [8, +inf)

select ds_kll_pmf_as_string(ds_kll_sketch(float_col), 1, 3, 4, 8) from tableName
    +------------------------------------------------------------+
    | ds_kll_pmf_as_string(ds_kll_sketch(float_col), 1, 3, 4, 8) |
    +------------------------------------------------------------+
    | 0.12, 0.24, 0.12, 0.36, 0.16                               |
    +------------------------------------------------------------+
   

This query returns a comma separated string where each value in the string is a number in the range of [0,1] and shows what percentage of the data is in the particular ranges.

This Cumulative Distribution Function (CDF) ds_kll_cmf_as_string()receives a serialized KLL sketch and one or more float values to represent ranges in the sketched values. In the following example, the float values [1, 3, 4, 8] represents the following ranges: (-inf, 1), (-inf, 3), (-inf, 4), (-inf, 8), (-inf, +inf)

select ds_kll_cdf_as_string(ds_kll_sketch(float_col), 1, 3, 4, 8) from tableName;
    +------------------------------------------------------------+
    | ds_kll_cdf_as_string(ds_kll_sketch(float_col), 1, 3, 4, 8) |
    +------------------------------------------------------------+
    | 0.12, 0.36, 0.48, 0.84, 1                                  |
    +------------------------------------------------------------+
   

This query returns a comma separated string where each value in the string is a number in the range of [0,1] and shows what percentage of the data is in the particular ranges.

To take advantage of the UNION function, create granular sketches for a specific dataset (one sketch per partition), write these sketches to a separate table and based on the partition you are interested in, you can UNION the relevant sketches together to get an estimate.

insert into sketch_tbl select ds_kll_sketch(float_col) from tableName;
select ds_kll_quantile(ds_kll_union(sketch_col), 0.5) from sketch_tbl
    where partition_col=1 OR partition_col=5;
   

This function ds_kll_union() receives a set of serialized Apache DataSketches KLL sketches produced by ds_kll_sketch() and merges them into a single sketch.