NDV function
An aggregate function that returns an approximate value similar to the result of
COUNT(DISTINCT col)
, the number of distinct values
.
It is much faster than the combination of COUNT
and DISTINCT
,
and uses a constant amount of memory and thus is less memory-intensive for columns with high
cardinality.
Syntax:
NDV([DISTINCT | ALL] expression [,scale])
precision = scale + 8
Therefore a scale of 1 is mapped to a precision of 9 and a scale of 10 is mapped to a precision of 18.
Without the optional argument, the precision which determines the total number of different estimators in the HLL algorithm will be still 10.
A large precision value generally produces a better estimation with less error than a small precision value. This is due to the extra number of estimators involved. The expense is at the need of extra memory. For a given precision p, the amount of memory used by the HLL algorithm is in the order of 2^p bytes.
When provided a scale of 10 against a total of 22 distinct data sets loaded into external Impala tables, the error will be computed as abs(<true_unique_value> - <estimated_unique_value>) / <true_unique_value>
The scale of 10, mapped to the precision of 18, yielded the worst estimation error at 0.42% (for one set of 10 million integers), and average error no more than 0.17%. This was at the cost of 256Kb of memory for the internal data structure per evaluation of the HLL algorithm.
Usage notes:
This is the mechanism used internally by the COMPUTE STATS
statement for computing the
number of distinct values in a column.
Because this number is an estimate, it might not reflect the precise number of different values in the column, especially if the cardinality is very low or very high. If the estimated number is higher than the number of rows in the table, Impala adjusts the value internally during query planning.
Return type: DOUBLE
in Impala 2.0 and higher; STRING
in earlier
releases
Complex type considerations:
To access a column with a complex
type (ARRAY
, STRUCT
, or MAP
) in an
aggregation function, you unpack the individual elements using join notation in the query,
and then apply the function to the final scalar item, field, key, or value at the bottom of
any nested type hierarchy in the column.
ARRAY
of STRUCT
items).
The array is unpacked inside the query using join notation.
The array elements are referenced using the ITEM
pseudocolumn, and the structure fields inside the array elements
are referenced using dot notation.
Numeric values such as SUM()
and AVG()
are computed using the numeric R_NATIONKEY
field, and
the general-purpose MAX()
and MIN()
values are computed from the string N_NAME
field.
describe region;
+-------------+-------------------------+---------+
| name | type | comment |
+-------------+-------------------------+---------+
| r_regionkey | smallint | |
| r_name | string | |
| r_comment | string | |
| r_nations | array<struct< | |
| | n_nationkey:smallint, | |
| | n_name:string, | |
| | n_comment:string | |
| | >> | |
+-------------+-------------------------+---------+
select r_name, r_nations.item.n_nationkey
from region, region.r_nations as r_nations
order by r_name, r_nations.item.n_nationkey;
+-------------+------------------+
| r_name | item.n_nationkey |
+-------------+------------------+
| AFRICA | 0 |
| AFRICA | 5 |
| AFRICA | 14 |
| AFRICA | 15 |
| AFRICA | 16 |
| AMERICA | 1 |
| AMERICA | 2 |
| AMERICA | 3 |
| AMERICA | 17 |
| AMERICA | 24 |
| ASIA | 8 |
| ASIA | 9 |
| ASIA | 12 |
| ASIA | 18 |
| ASIA | 21 |
| EUROPE | 6 |
| EUROPE | 7 |
| EUROPE | 19 |
| EUROPE | 22 |
| EUROPE | 23 |
| MIDDLE EAST | 4 |
| MIDDLE EAST | 10 |
| MIDDLE EAST | 11 |
| MIDDLE EAST | 13 |
| MIDDLE EAST | 20 |
+-------------+------------------+
select
r_name,
count(r_nations.item.n_nationkey) as count,
sum(r_nations.item.n_nationkey) as sum,
avg(r_nations.item.n_nationkey) as avg,
min(r_nations.item.n_name) as minimum,
max(r_nations.item.n_name) as maximum,
ndv(r_nations.item.n_nationkey) as distinct_vals
from
region, region.r_nations as r_nations
group by r_name
order by r_name;
+-------------+-------+-----+------+-----------+----------------+---------------+
| r_name | count | sum | avg | minimum | maximum | distinct_vals |
+-------------+-------+-----+------+-----------+----------------+---------------+
| AFRICA | 5 | 50 | 10 | ALGERIA | MOZAMBIQUE | 5 |
| AMERICA | 5 | 47 | 9.4 | ARGENTINA | UNITED STATES | 5 |
| ASIA | 5 | 68 | 13.6 | CHINA | VIETNAM | 5 |
| EUROPE | 5 | 77 | 15.4 | FRANCE | UNITED KINGDOM | 5 |
| MIDDLE EAST | 5 | 58 | 11.6 | EGYPT | SAUDI ARABIA | 5 |
+-------------+-------+-----+------+-----------+----------------+---------------+
Restrictions:
This function cannot be used in an analytic context. That is, the OVER()
clause is not allowed at all with this function.
Examples:
The following example queries a billion-row table to illustrate the relative performance of
COUNT(DISTINCT)
and NDV()
. It shows how COUNT(DISTINCT)
gives a precise answer, but is inefficient for large-scale data where an approximate result is sufficient.
The NDV()
function gives an approximate result but is much faster.
select count(distinct col1) from sample_data;
+---------------------+
| count(distinct col1)|
+---------------------+
| 100000 |
+---------------------+
Fetched 1 row(s) in 20.13s
select cast(ndv(col1) as bigint) as col1 from sample_data;
+----------+
| col1 |
+----------+
| 139017 |
+----------+
Fetched 1 row(s) in 8.91s
The following example shows how you can code multiple NDV()
calls in a single query, to
easily learn which columns have substantially more or fewer distinct values. This technique is faster than
running a sequence of queries with COUNT(DISTINCT)
calls.
select cast(ndv(col1) as bigint) as col1, cast(ndv(col2) as bigint) as col2,
cast(ndv(col3) as bigint) as col3, cast(ndv(col4) as bigint) as col4
from sample_data;
+----------+-----------+------------+-----------+
| col1 | col2 | col3 | col4 |
+----------+-----------+------------+-----------+
| 139017 | 282 | 46 | 145636240 |
+----------+-----------+------------+-----------+
Fetched 1 row(s) in 34.97s
select count(distinct col1) from sample_data;
+---------------------+
| count(distinct col1)|
+---------------------+
| 100000 |
+---------------------+
Fetched 1 row(s) in 20.13s
select count(distinct col2) from sample_data;
+----------------------+
| count(distinct col2) |
+----------------------+
| 278 |
+----------------------+
Fetched 1 row(s) in 20.09s
select count(distinct col3) from sample_data;
+-----------------------+
| count(distinct col3) |
+-----------------------+
| 46 |
+-----------------------+
Fetched 1 row(s) in 19.12s
select count(distinct col4) from sample_data;
+----------------------+
| count(distinct col4) |
+----------------------+
| 147135880 |
+----------------------+
Fetched 1 row(s) in 266.95s