CUME_DIST

Returns the cumulative distribution of a value. The value for each row in the result set is greater than 0 and less than or equal to 1.

Syntax:

CUME_DIST (expr)
  OVER ([partition_by_clause] order_by_clause)

The ORDER BY clause is required. The PARTITION BY clause is optional. The window clause is not allowed.

Usage notes:

Within each partition of the result set, the CUME_DIST() value represents an ascending sequence that ends at 1. Each value represents the proportion of rows in the partition whose values are less than or equal to the value in the current row.

If the sequence of input values contains ties, the CUME_DIST() results are identical for the tied values.

Impala only supports the CUME_DIST() function in an analytic context, not as a regular aggregate function.

Examples:

This example uses a table with 9 rows. The CUME_DIST() function evaluates the entire table because there is no PARTITION BY clause, with the rows ordered by the weight of the animal. the sequence of values shows that 1/9 of the values are less than or equal to the lightest animal (mouse), 2/9 of the values are less than or equal to the second-lightest animal, and so on up to the heaviest animal (elephant), where 9/9 of the rows are less than or equal to its weight.

create table animals (name string, kind string, kilos decimal(9,3));
insert into animals values
  ('Elephant', 'Mammal', 4000), ('Giraffe', 'Mammal', 1200), ('Mouse', 'Mammal', 0.020),
  ('Condor', 'Bird', 15), ('Horse', 'Mammal', 500), ('Owl', 'Bird', 2.5),
  ('Ostrich', 'Bird', 145), ('Polar bear', 'Mammal', 700), ('Housecat', 'Mammal', 5);

select name, cume_dist() over (order by kilos) from animals;
+------------+-----------------------+
| name       | cume_dist() OVER(...) |
+------------+-----------------------+
| Elephant   | 1                     |
| Giraffe    | 0.8888888888888888    |
| Polar bear | 0.7777777777777778    |
| Horse      | 0.6666666666666666    |
| Ostrich    | 0.5555555555555556    |
| Condor     | 0.4444444444444444    |
| Housecat   | 0.3333333333333333    |
| Owl        | 0.2222222222222222    |
| Mouse      | 0.1111111111111111    |
+------------+-----------------------+

Using a PARTITION BY clause produces a separate sequence for each partition group, in this case one for mammals and one for birds. Because there are 3 birds and 6 mammals, the sequence illustrates how 1/3 of the Bird rows have a kilos value that is less than or equal to the lightest bird, 1/6 of the Mammal rows have a kilos value that is less than or equal to the lightest mammal, and so on until both the heaviest bird and heaviest mammal have a CUME_DIST() value of 1.

select name, kind, cume_dist() over (partition by kind order by kilos) from animals
+------------+--------+-----------------------+
| name       | kind   | cume_dist() OVER(...) |
+------------+--------+-----------------------+
| Ostrich    | Bird   | 1                     |
| Condor     | Bird   | 0.6666666666666666    |
| Owl        | Bird   | 0.3333333333333333    |
| Elephant   | Mammal | 1                     |
| Giraffe    | Mammal | 0.8333333333333334    |
| Polar bear | Mammal | 0.6666666666666666    |
| Horse      | Mammal | 0.5                   |
| Housecat   | Mammal | 0.3333333333333333    |
| Mouse      | Mammal | 0.1666666666666667    |
+------------+--------+-----------------------+

We can reverse the ordering within each partition group by using an ORDER BY ... DESC clause within the OVER() clause. Now the lightest (smallest value of kilos) animal of each kind has a CUME_DIST() value of 1.

select name, kind, cume_dist() over (partition by kind order by kilos desc) from animals
+------------+--------+-----------------------+
| name       | kind   | cume_dist() OVER(...) |
+------------+--------+-----------------------+
| Owl        | Bird   | 1                     |
| Condor     | Bird   | 0.6666666666666666    |
| Ostrich    | Bird   | 0.3333333333333333    |
| Mouse      | Mammal | 1                     |
| Housecat   | Mammal | 0.8333333333333334    |
| Horse      | Mammal | 0.6666666666666666    |
| Polar bear | Mammal | 0.5                   |
| Giraffe    | Mammal | 0.3333333333333333    |
| Elephant   | Mammal | 0.1666666666666667    |
+------------+--------+-----------------------+

The following example manufactures some rows with identical values in the kilos column, to demonstrate how the results look in case of tie values. For simplicity, it only shows the CUME_DIST() sequence for the Bird rows. Now with 3 rows all with a value of 15, all of those rows have the same CUME_DIST() value. 4/5 of the rows have a value for kilos that is less than or equal to 15.

insert into animals values ('California Condor', 'Bird', 15), ('Andean Condor', 'Bird', 15)

select name, kind, cume_dist() over (order by kilos) from animals where kind = 'Bird';
+-------------------+------+-----------------------+
| name              | kind | cume_dist() OVER(...) |
+-------------------+------+-----------------------+
| Ostrich           | Bird | 1                     |
| Condor            | Bird | 0.8                   |
| California Condor | Bird | 0.8                   |
| Andean Condor     | Bird | 0.8                   |
| Owl               | Bird | 0.2                   |
+-------------------+------+-----------------------+

The following example shows how to use an ORDER BY clause in the outer block to order the result set in case of ties. Here, all the Bird rows are together, then in descending order by the result of the CUME_DIST() function, and all tied CUME_DIST() values are ordered by the animal name.

select name, kind, cume_dist() over (partition by kind order by kilos) as ordering
  from animals
where
  kind = 'Bird'
order by kind, ordering desc, name;
+-------------------+------+----------+
| name              | kind | ordering |
+-------------------+------+----------+
| Ostrich           | Bird | 1        |
| Andean Condor     | Bird | 0.8      |
| California Condor | Bird | 0.8      |
| Condor            | Bird | 0.8      |
| Owl               | Bird | 0.2      |
+-------------------+------+----------+