Impala Analytic Functions
Analytic functions (also known as window functions) are a special category of built-in functions. Like aggregate functions, they examine the contents of multiple input rows to compute each output value. However, rather than being limited to one result value per GROUP BY group, they operate on windows where the input rows are ordered and grouped using flexible conditions expressed through an OVER() clause.
Added in: CDH 5.2.0 (Impala 2.0.0)
Some functions, such as LAG() and RANK(), can only be used in this analytic context. Some aggregate functions do double duty: when you call the aggregation functions such as MAX(), SUM(), AVG(), and so on with an OVER() clause, they produce an output value for each row, based on computations across other rows in the window.
Although analytic functions often compute the same value you would see from an aggregate function in a GROUP BY query, the analytic functions produce a value for each row in the result set rather than a single value for each group. This flexibility lets you include additional columns in the SELECT list, offering more opportunities for organizing and filtering the result set.
Analytic function calls are only allowed in the SELECT list and in the outermost ORDER BY clause of the query. During query processing, analytic functions are evaluated after other query stages such as joins, WHERE, and GROUP BY,
The rows that are part of each partition are analyzed by computations across an ordered or unordered set of rows. For example, COUNT() and SUM() might be applied to all the rows in the partition, in which case the order of analysis does not matter. The ORDER BY clause might be used inside the OVER() clause to defines the ordering that applies to functions such as LAG() and FIRST_VALUE().
Analytic functions are frequently used in fields such as finance and science to provide trend, outlier, and bucketed analysis for large data sets. You might also see the term "window functions" in database literature, referring to the sequence of rows (the "window") that the function call applies to, particularly when the OVER clause includes a ROWS or RANGE keyword.
The following sections describe the analytic query clauses and the pure analytic functions provided by Impala. For usage information about aggregate functions in an analytic context, see Impala Aggregate Functions.
Continue reading:
- OVER Clause
- Window Clause
- AVG() Function - Analytic Context
- COUNT() Function - Analytic Context
- DENSE_RANK() Function
- FIRST_VALUE() Function
- LAG() Function
- LAST_VALUE() Function
- LEAD() Function
- MAX() Function - Analytic Context
- MIN() Function - Analytic Context
- RANK() Function
- ROW_NUMBER() Function
- SUM() Function - Analytic Context
OVER Clause
The OVER clause is required for calls to pure analytic functions such as LEAD(), RANK(), and FIRST_VALUE(). When you include an OVER clause with calls to aggregate functions such as MAX(), COUNT(), or SUM(), they operate as analytic functions.
Syntax:
function(args) OVER([partition_by_clause] [order_by_clause [window_clause]]) partition_by_clause ::= PARTITION BY expr [, expr ...] order_by_clause ::= ORDER BY expr [ASC | DESC] [NULLS FIRST | NULLS LAST] [, expr [ASC | DESC] [NULLS FIRST | NULLS LAST] ...] window_clause: See Window Clause
PARTITION BY clause:
The PARTITION BY clause acts much like the GROUP BY clause in the outermost block of a query. It divides the rows into groups containing identical values in one or more columns. These logical groups are known as partitions. Throughout the discussion of analytic functions, "partitions" refers to the groups produced by the PARTITION BY clause, not to partitioned tables. However, note the following limitation that applies specifically to analytic function calls involving partitioned tables.
In queries involving both analytic functions and partitioned tables, partition pruning only occurs for columns named in the PARTITION BY clause of the analytic function call. For example, if an analytic function query has a clause such as WHERE year=2016, the way to make the query prune all other YEAR partitions is to include PARTITION BY year in the analytic function call; for example, OVER (PARTITION BY year,other_columns other_analytic_clauses).
The sequence of results from an analytic function "resets" for each new partition in the result set. That is, the set of preceding or following rows considered by the analytic function always come from a single partition. Any MAX(), SUM(), ROW_NUMBER(), and so on apply to each partition independently. Omit the PARTITION BY clause to apply the analytic operation to all the rows in the table.
ORDER BY clause:
The ORDER BY clause works much like the ORDER BY clause in the outermost block of a query. It defines the order in which rows are evaluated for the entire input set, or for each group produced by a PARTITION BY clause. You can order by one or multiple expressions, and for each expression optionally choose ascending or descending order and whether nulls come first or last in the sort order. Because this ORDER BY clause only defines the order in which rows are evaluated, if you want the results to be output in a specific order, also include an ORDER BY clause in the outer block of the query.
When the ORDER BY clause is omitted, the analytic function applies to all items in the group produced by the PARTITION BY clause. When the ORDER BY clause is included, the analysis can apply to all or a subset of the items in the group, depending on the optional window clause.
The order in which the rows are analyzed is only defined for those columns specified in ORDER BY clauses.
One difference between the analytic and outer uses of the ORDER BY clause: inside the OVER clause, ORDER BY 1 or other integer value is interpreted as a constant sort value (effectively a no-op) rather than referring to column 1.
Window clause:
The window clause is only allowed in combination with an ORDER BY clause. If the ORDER BY clause is specified but the window clause is not, the default window is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. See Window Clause for full details.
HBase considerations:
Because HBase tables are optimized for single-row lookups rather than full scans, analytic functions using the OVER() clause are not recommended for HBase tables. Although such queries work, their performance is lower than on comparable tables using HDFS data files.
Parquet considerations:
Analytic functions are very efficient for Parquet tables. The data that is examined during evaluation of the OVER() clause comes from a specified set of columns, and the values for each column are arranged sequentially within each data file.
Text table considerations:
Analytic functions are convenient to use with text tables for exploratory business intelligence. When the volume of data is substantial, prefer to use Parquet tables for performance-critical analytic queries.
Added in: CDH 5.2.0 (Impala 2.0.0)
Examples:
The following example shows how to synthesize a numeric sequence corresponding to all the rows in a table. The new table has the same columns as the old one, plus an additional column ID containing the integers 1, 2, 3, and so on, corresponding to the order of a TIMESTAMP column in the original table.
CREATE TABLE events_with_id AS SELECT row_number() OVER (ORDER BY date_and_time) AS id, c1, c2, c3, c4 FROM events;
The following example shows how to determine the number of rows containing each value for a column. Unlike a corresponding GROUP BY query, this one can analyze a single column and still return all values (not just the distinct ones) from the other columns.
SELECT x, y, z, count() OVER (PARTITION BY x) AS how_many_x FROM t1;
Restrictions:
You cannot directly combine the DISTINCT operator with analytic function calls. You can put the analytic function call in a WITH clause or an inline view, and apply the DISTINCT operator to its result set.
WITH t1 AS (SELECT x, sum(x) OVER (PARTITION BY x) AS total FROM t1) SELECT DISTINCT x, total FROM t1;
Window Clause
Certain analytic functions accept an optional window clause, which makes the function analyze only certain rows "around" the current row rather than all rows in the partition. For example, you can get a moving average by specifying some number of preceding and following rows, or a running count or running total by specifying all rows up to the current position. This clause can result in different analytic results for rows within the same partition.
The window clause is supported with the AVG(), COUNT(), FIRST_VALUE(), LAST_VALUE(), and SUM() functions. For MAX() and MIN(), the window clause only allowed if the start bound is UNBOUNDED PRECEDING
Syntax:
ROWS BETWEEN [ { m | UNBOUNDED } PRECEDING | CURRENT ROW] [ AND [CURRENT ROW | { UNBOUNDED | n } FOLLOWING] ] RANGE BETWEEN [ {m | UNBOUNDED } PRECEDING | CURRENT ROW] [ AND [CURRENT ROW | { UNBOUNDED | n } FOLLOWING] ]
ROWS BETWEEN defines the size of the window in terms of the indexes of the rows in the result set. The size of the window is predictable based on the clauses the position within the result set.
RANGE BETWEEN does not currently support numeric arguments to define a variable-size sliding window.
Currently, Impala supports only some combinations of arguments to the RANGE clause:
- RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW (the default when ORDER BY is specified and the window clause is omitted)
- RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
- RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
When RANGE is used, CURRENT ROW includes not just the current row but all rows that are tied with the current row based on the ORDER BY expressions.
Added in: CDH 5.2.0 (Impala 2.0.0)
Examples:
The following examples show financial data for a fictional stock symbol JDR. The closing price moves up and down each day.
create table stock_ticker (stock_symbol string, closing_price decimal(8,2), closing_date timestamp); ...load some data... select * from stock_ticker order by stock_symbol, closing_date +--------------+---------------+---------------------+ | stock_symbol | closing_price | closing_date | +--------------+---------------+---------------------+ | JDR | 12.86 | 2014-10-02 00:00:00 | | JDR | 12.89 | 2014-10-03 00:00:00 | | JDR | 12.94 | 2014-10-04 00:00:00 | | JDR | 12.55 | 2014-10-05 00:00:00 | | JDR | 14.03 | 2014-10-06 00:00:00 | | JDR | 14.75 | 2014-10-07 00:00:00 | | JDR | 13.98 | 2014-10-08 00:00:00 | +--------------+---------------+---------------------+
The queries use analytic functions with window clauses to compute moving averages of the closing price. For example, ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING produces an average of the value from a 3-day span, producing a different value for each row. The first row, which has no preceding row, only gets averaged with the row following it. If the table contained more than one stock symbol, the PARTITION BY clause would limit the window for the moving average to only consider the prices for a single stock.
select stock_symbol, closing_date, closing_price, avg(closing_price) over (partition by stock_symbol order by closing_date rows between 1 preceding and 1 following) as moving_average from stock_ticker; +--------------+---------------------+---------------+----------------+ | stock_symbol | closing_date | closing_price | moving_average | +--------------+---------------------+---------------+----------------+ | JDR | 2014-10-02 00:00:00 | 12.86 | 12.87 | | JDR | 2014-10-03 00:00:00 | 12.89 | 12.89 | | JDR | 2014-10-04 00:00:00 | 12.94 | 12.79 | | JDR | 2014-10-05 00:00:00 | 12.55 | 13.17 | | JDR | 2014-10-06 00:00:00 | 14.03 | 13.77 | | JDR | 2014-10-07 00:00:00 | 14.75 | 14.25 | | JDR | 2014-10-08 00:00:00 | 13.98 | 14.36 | +--------------+---------------------+---------------+----------------+
The clause ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW produces a cumulative moving average, from the earliest data up to the value for each day.
select stock_symbol, closing_date, closing_price, avg(closing_price) over (partition by stock_symbol order by closing_date rows between unbounded preceding and current row) as moving_average from stock_ticker; +--------------+---------------------+---------------+----------------+ | stock_symbol | closing_date | closing_price | moving_average | +--------------+---------------------+---------------+----------------+ | JDR | 2014-10-02 00:00:00 | 12.86 | 12.86 | | JDR | 2014-10-03 00:00:00 | 12.89 | 12.87 | | JDR | 2014-10-04 00:00:00 | 12.94 | 12.89 | | JDR | 2014-10-05 00:00:00 | 12.55 | 12.81 | | JDR | 2014-10-06 00:00:00 | 14.03 | 13.05 | | JDR | 2014-10-07 00:00:00 | 14.75 | 13.33 | | JDR | 2014-10-08 00:00:00 | 13.98 | 13.42 | +--------------+---------------------+---------------+----------------+
AVG() Function - Analytic Context
You can include an OVER clause with a call to this function to use it as an analytic function. See AVG Function for details and examples.
COUNT() Function - Analytic Context
You can include an OVER clause with a call to this function to use it as an analytic function. See COUNT Function for details and examples.
DENSE_RANK() Function
Returns an ascending sequence of integers, starting with 1. The output sequence produces duplicate integers for duplicate values of the ORDER BY expressions. After generating duplicate output values for the "tied" input values, the function continues the sequence with the next higher integer. Therefore, the sequence contains duplicates but no gaps when the input contains duplicates. Starts the sequence over for each group produced by the PARTITIONED BY clause.
Syntax:
DENSE_RANK() OVER([partition_by_clause] order_by_clause)
The PARTITION BY clause is optional. The ORDER BY clause is required. The window clause is not allowed.
Usage notes:
Often used for top-N and bottom-N queries. For example, it could produce a "top 10" report including all the items with the 10 highest values, even if several items tied for 1st place.
Similar to ROW_NUMBER and RANK. These functions differ in how they treat duplicate combinations of values.
Added in: CDH 5.2.0 (Impala 2.0.0)
Examples:
The following example demonstrates how the DENSE_RANK() function identifies where each value "places" in the result set, producing the same result for duplicate values, but with a strict sequence from 1 to the number of groups. For example, when results are ordered by the X column, both 1 values are tied for first; both 2 values are tied for second; and so on.
select x, dense_rank() over(order by x) as rank, property from int_t; +----+------+----------+ | x | rank | property | +----+------+----------+ | 1 | 1 | square | | 1 | 1 | odd | | 2 | 2 | even | | 2 | 2 | prime | | 3 | 3 | prime | | 3 | 3 | odd | | 4 | 4 | even | | 4 | 4 | square | | 5 | 5 | odd | | 5 | 5 | prime | | 6 | 6 | even | | 6 | 6 | perfect | | 7 | 7 | lucky | | 7 | 7 | lucky | | 7 | 7 | lucky | | 7 | 7 | odd | | 7 | 7 | prime | | 8 | 8 | even | | 9 | 9 | square | | 9 | 9 | odd | | 10 | 10 | round | | 10 | 10 | even | +----+------+----------+
The following examples show how the DENSE_RANK() function is affected by the PARTITION property within the ORDER BY clause.
Partitioning by the PROPERTY column groups all the even, odd, and so on values together, and DENSE_RANK() returns the place of each value within the group, producing several ascending sequences.
select x, dense_rank() over(partition by property order by x) as rank, property from int_t; +----+------+----------+ | x | rank | property | +----+------+----------+ | 2 | 1 | even | | 4 | 2 | even | | 6 | 3 | even | | 8 | 4 | even | | 10 | 5 | even | | 7 | 1 | lucky | | 7 | 1 | lucky | | 7 | 1 | lucky | | 1 | 1 | odd | | 3 | 2 | odd | | 5 | 3 | odd | | 7 | 4 | odd | | 9 | 5 | odd | | 6 | 1 | perfect | | 2 | 1 | prime | | 3 | 2 | prime | | 5 | 3 | prime | | 7 | 4 | prime | | 10 | 1 | round | | 1 | 1 | square | | 4 | 2 | square | | 9 | 3 | square | +----+------+----------+
Partitioning by the X column groups all the duplicate numbers together and returns the place each value within the group; because each value occurs only 1 or 2 times, DENSE_RANK() designates each X value as either first or second within its group.
select x, dense_rank() over(partition by x order by property) as rank, property from int_t; +----+------+----------+ | x | rank | property | +----+------+----------+ | 1 | 1 | odd | | 1 | 2 | square | | 2 | 1 | even | | 2 | 2 | prime | | 3 | 1 | odd | | 3 | 2 | prime | | 4 | 1 | even | | 4 | 2 | square | | 5 | 1 | odd | | 5 | 2 | prime | | 6 | 1 | even | | 6 | 2 | perfect | | 7 | 1 | lucky | | 7 | 1 | lucky | | 7 | 1 | lucky | | 7 | 2 | odd | | 7 | 3 | prime | | 8 | 1 | even | | 9 | 1 | odd | | 9 | 2 | square | | 10 | 1 | even | | 10 | 2 | round | +----+------+----------+
The following example shows how DENSE_RANK() produces a continuous sequence while still allowing for ties. In this case, Croesus and Midas both have the second largest fortune, while Crassus has the third largest. (In RANK() Function, you see a similar query with the RANK() function that shows that while Crassus has the third largest fortune, he is the fourth richest person.)
select dense_rank() over (order by net_worth desc) as placement, name, net_worth from wealth order by placement, name; +-----------+---------+---------------+ | placement | name | net_worth | +-----------+---------+---------------+ | 1 | Solomon | 2000000000.00 | | 2 | Croesus | 1000000000.00 | | 2 | Midas | 1000000000.00 | | 3 | Crassus | 500000000.00 | | 4 | Scrooge | 80000000.00 | +-----------+---------+---------------+
Related information:
FIRST_VALUE() Function
Returns the expression value from the first row in the window. The return value is NULL if the input expression is NULL.
Syntax:
FIRST_VALUE(expr) OVER([partition_by_clause] order_by_clause [window_clause])
The PARTITION BY clause is optional. The ORDER BY clause is required. The window clause is optional.
Usage notes:
If any duplicate values occur in the tuples evaluated by the ORDER BY clause, the result of this function is not deterministic. Consider adding additional ORDER BY columns to ensure consistent ordering.
Added in: CDH 5.2.0 (Impala 2.0.0)
Examples:
The following example shows a table with a wide variety of country-appropriate greetings. For consistency, we want to standardize on a single greeting for each country. The FIRST_VALUE() function helps to produce a mail merge report where every person from the same country is addressed with the same greeting.
select name, country, greeting from mail_merge +---------+---------+--------------+ | name | country | greeting | +---------+---------+--------------+ | Pete | USA | Hello | | John | USA | Hi | | Boris | Germany | Guten tag | | Michael | Germany | Guten morgen | | Bjorn | Sweden | Hej | | Mats | Sweden | Tja | +---------+---------+--------------+ select country, name, first_value(greeting) over (partition by country order by name, greeting) as greeting from mail_merge; +---------+---------+-----------+ | country | name | greeting | +---------+---------+-----------+ | Germany | Boris | Guten tag | | Germany | Michael | Guten tag | | Sweden | Bjorn | Hej | | Sweden | Mats | Hej | | USA | John | Hi | | USA | Pete | Hi | +---------+---------+-----------+
Changing the order in which the names are evaluated changes which greeting is applied to each group.
select country, name, first_value(greeting) over (partition by country order by name desc, greeting) as greeting from mail_merge; +---------+---------+--------------+ | country | name | greeting | +---------+---------+--------------+ | Germany | Michael | Guten morgen | | Germany | Boris | Guten morgen | | Sweden | Mats | Tja | | Sweden | Bjorn | Tja | | USA | Pete | Hello | | USA | John | Hello | +---------+---------+--------------+
Related information:
LAG() Function
This function returns the value of an expression using column values from a preceding row. You specify an integer offset, which designates a row position some number of rows previous to the current row. Any column references in the expression argument refer to column values from that prior row. Typically, the table contains a time sequence or numeric sequence column that clearly distinguishes the ordering of the rows.
Syntax:
LAG (expr [, offset] [, default]) OVER ([partition_by_clause] order_by_clause)
The ORDER BY clause is required. The PARTITION BY clause is optional. The window clause is not allowed.
Usage notes:
Sometimes used an an alternative to doing a self-join.
Added in: CDH 5.2.0 (Impala 2.0.0)
Examples:
The following example uses the same stock data created in Window Clause. For each day, the query prints the closing price alongside the previous day's closing price. The first row for each stock symbol has no previous row, so that LAG() value is NULL.
select stock_symbol, closing_date, closing_price, lag(closing_price,1) over (partition by stock_symbol order by closing_date) as "yesterday closing" from stock_ticker order by closing_date; +--------------+---------------------+---------------+-------------------+ | stock_symbol | closing_date | closing_price | yesterday closing | +--------------+---------------------+---------------+-------------------+ | JDR | 2014-09-13 00:00:00 | 12.86 | NULL | | JDR | 2014-09-14 00:00:00 | 12.89 | 12.86 | | JDR | 2014-09-15 00:00:00 | 12.94 | 12.89 | | JDR | 2014-09-16 00:00:00 | 12.55 | 12.94 | | JDR | 2014-09-17 00:00:00 | 14.03 | 12.55 | | JDR | 2014-09-18 00:00:00 | 14.75 | 14.03 | | JDR | 2014-09-19 00:00:00 | 13.98 | 14.75 | +--------------+---------------------+---------------+-------------------+
The following example does an arithmetic operation between the current row and a value from the previous row, to produce a delta value for each day. This example also demonstrates how ORDER BY works independently in the different parts of the query. The ORDER BY closing_date in the OVER clause makes the query analyze the rows in chronological order. Then the outer query block uses ORDER BY closing_date DESC to present the results with the most recent date first.
select stock_symbol, closing_date, closing_price, cast( closing_price - lag(closing_price,1) over (partition by stock_symbol order by closing_date) as decimal(8,2) ) as "change from yesterday" from stock_ticker order by closing_date desc; +--------------+---------------------+---------------+-----------------------+ | stock_symbol | closing_date | closing_price | change from yesterday | +--------------+---------------------+---------------+-----------------------+ | JDR | 2014-09-19 00:00:00 | 13.98 | -0.76 | | JDR | 2014-09-18 00:00:00 | 14.75 | 0.72 | | JDR | 2014-09-17 00:00:00 | 14.03 | 1.47 | | JDR | 2014-09-16 00:00:00 | 12.55 | -0.38 | | JDR | 2014-09-15 00:00:00 | 12.94 | 0.04 | | JDR | 2014-09-14 00:00:00 | 12.89 | 0.03 | | JDR | 2014-09-13 00:00:00 | 12.86 | NULL | +--------------+---------------------+---------------+-----------------------+
Related information:
This function is the converse of LEAD() Function.
LAST_VALUE() Function
Returns the expression value from the last row in the window. This same value is repeated for all result rows for the group. The return value is NULL if the input expression is NULL.
Syntax:
LAST_VALUE(expr) OVER([partition_by_clause] order_by_clause [window_clause])
The PARTITION BY clause is optional. The ORDER BY clause is required. The window clause is optional.
Usage notes:
If any duplicate values occur in the tuples evaluated by the ORDER BY clause, the result of this function is not deterministic. Consider adding additional ORDER BY columns to ensure consistent ordering.
Added in: CDH 5.2.0 (Impala 2.0.0)
Examples:
The following example uses the same MAIL_MERGE table as in the example for FIRST_VALUE() Function. Because the default window when ORDER BY is used is BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, the query requires the UNBOUNDED FOLLOWING to look ahead to subsequent rows and find the last value for each country.
select country, name, last_value(greeting) over ( partition by country order by name, greeting rows between unbounded preceding and unbounded following ) as greeting from mail_merge +---------+---------+--------------+ | country | name | greeting | +---------+---------+--------------+ | Germany | Boris | Guten morgen | | Germany | Michael | Guten morgen | | Sweden | Bjorn | Tja | | Sweden | Mats | Tja | | USA | John | Hello | | USA | Pete | Hello | +---------+---------+--------------+
Related information:
LEAD() Function
This function returns the value of an expression using column values from a following row. You specify an integer offset, which designates a row position some number of rows after to the current row. Any column references in the expression argument refer to column values from that later row. Typically, the table contains a time sequence or numeric sequence column that clearly distinguishes the ordering of the rows.
Syntax:
LEAD (expr [, offset] [, default]) OVER ([partition_by_clause] order_by_clause)
The ORDER BY clause is required. The PARTITION BY clause is optional. The window clause is not allowed.
Usage notes:
Sometimes used an an alternative to doing a self-join.
Added in: CDH 5.2.0 (Impala 2.0.0)
Examples:
The following example uses the same stock data created in Window Clause. The query analyzes the closing price for a stock symbol, and for each day evaluates if the closing price for the following day is higher or lower.
select stock_symbol, closing_date, closing_price, case (lead(closing_price,1) over (partition by stock_symbol order by closing_date) - closing_price) > 0 when true then "higher" when false then "flat or lower" end as "trending" from stock_ticker order by closing_date; +--------------+---------------------+---------------+---------------+ | stock_symbol | closing_date | closing_price | trending | +--------------+---------------------+---------------+---------------+ | JDR | 2014-09-13 00:00:00 | 12.86 | higher | | JDR | 2014-09-14 00:00:00 | 12.89 | higher | | JDR | 2014-09-15 00:00:00 | 12.94 | flat or lower | | JDR | 2014-09-16 00:00:00 | 12.55 | higher | | JDR | 2014-09-17 00:00:00 | 14.03 | higher | | JDR | 2014-09-18 00:00:00 | 14.75 | flat or lower | | JDR | 2014-09-19 00:00:00 | 13.98 | NULL | +--------------+---------------------+---------------+---------------+
Related information:
This function is the converse of LAG() Function.
MAX() Function - Analytic Context
You can include an OVER clause with a call to this function to use it as an analytic function. See MAX Function for details and examples.
MIN() Function - Analytic Context
You can include an OVER clause with a call to this function to use it as an analytic function. See MIN Function for details and examples.
RANK() Function
Returns an ascending sequence of integers, starting with 1. The output sequence produces duplicate integers for duplicate values of the ORDER BY expressions. After generating duplicate output values for the "tied" input values, the function increments the sequence by the number of tied values. Therefore, the sequence contains both duplicates and gaps when the input contains duplicates. Starts the sequence over for each group produced by the PARTITIONED BY clause.
Syntax:
RANK() OVER([partition_by_clause] order_by_clause)
The PARTITION BY clause is optional. The ORDER BY clause is required. The window clause is not allowed.
Usage notes:
Often used for top-N and bottom-N queries. For example, it could produce a "top 10" report including several items that were tied for 10th place.
Similar to ROW_NUMBER and DENSE_RANK. These functions differ in how they treat duplicate combinations of values.
Added in: CDH 5.2.0 (Impala 2.0.0)
Examples:
The following example demonstrates how the RANK() function identifies where each value "places" in the result set, producing the same result for duplicate values, and skipping values in the sequence to account for the number of duplicates. For example, when results are ordered by the X column, both 1 values are tied for first; both 2 values are tied for third; and so on.
select x, rank() over(order by x) as rank, property from int_t; +----+------+----------+ | x | rank | property | +----+------+----------+ | 1 | 1 | square | | 1 | 1 | odd | | 2 | 3 | even | | 2 | 3 | prime | | 3 | 5 | prime | | 3 | 5 | odd | | 4 | 7 | even | | 4 | 7 | square | | 5 | 9 | odd | | 5 | 9 | prime | | 6 | 11 | even | | 6 | 11 | perfect | | 7 | 13 | lucky | | 7 | 13 | lucky | | 7 | 13 | lucky | | 7 | 13 | odd | | 7 | 13 | prime | | 8 | 18 | even | | 9 | 19 | square | | 9 | 19 | odd | | 10 | 21 | round | | 10 | 21 | even | +----+------+----------+
The following examples show how the RANK() function is affected by the PARTITION property within the ORDER BY clause.
Partitioning by the PROPERTY column groups all the even, odd, and so on values together, and RANK() returns the place of each value within the group, producing several ascending sequences.
select x, rank() over(partition by property order by x) as rank, property from int_t; +----+------+----------+ | x | rank | property | +----+------+----------+ | 2 | 1 | even | | 4 | 2 | even | | 6 | 3 | even | | 8 | 4 | even | | 10 | 5 | even | | 7 | 1 | lucky | | 7 | 1 | lucky | | 7 | 1 | lucky | | 1 | 1 | odd | | 3 | 2 | odd | | 5 | 3 | odd | | 7 | 4 | odd | | 9 | 5 | odd | | 6 | 1 | perfect | | 2 | 1 | prime | | 3 | 2 | prime | | 5 | 3 | prime | | 7 | 4 | prime | | 10 | 1 | round | | 1 | 1 | square | | 4 | 2 | square | | 9 | 3 | square | +----+------+----------+
Partitioning by the X column groups all the duplicate numbers together and returns the place each value within the group; because each value occurs only 1 or 2 times, RANK() designates each X value as either first or second within its group.
select x, rank() over(partition by x order by property) as rank, property from int_t; +----+------+----------+ | x | rank | property | +----+------+----------+ | 1 | 1 | odd | | 1 | 2 | square | | 2 | 1 | even | | 2 | 2 | prime | | 3 | 1 | odd | | 3 | 2 | prime | | 4 | 1 | even | | 4 | 2 | square | | 5 | 1 | odd | | 5 | 2 | prime | | 6 | 1 | even | | 6 | 2 | perfect | | 7 | 1 | lucky | | 7 | 1 | lucky | | 7 | 1 | lucky | | 7 | 4 | odd | | 7 | 5 | prime | | 8 | 1 | even | | 9 | 1 | odd | | 9 | 2 | square | | 10 | 1 | even | | 10 | 2 | round | +----+------+----------+
The following example shows how a magazine might prepare a list of history's wealthiest people. Croesus and Midas are tied for second, then Crassus is fourth.
select rank() over (order by net_worth desc) as rank, name, net_worth from wealth order by rank, name; +------+---------+---------------+ | rank | name | net_worth | +------+---------+---------------+ | 1 | Solomon | 2000000000.00 | | 2 | Croesus | 1000000000.00 | | 2 | Midas | 1000000000.00 | | 4 | Crassus | 500000000.00 | | 5 | Scrooge | 80000000.00 | +------+---------+---------------+
Related information:
ROW_NUMBER() Function
Returns an ascending sequence of integers, starting with 1. Starts the sequence over for each group produced by the PARTITIONED BY clause. The output sequence includes different values for duplicate input values. Therefore, the sequence never contains any duplicates or gaps, regardless of duplicate input values.
Syntax:
ROW_NUMBER() OVER([partition_by_clause] order_by_clause)
The ORDER BY clause is required. The PARTITION BY clause is optional. The window clause is not allowed.
Usage notes:
Often used for top-N and bottom-N queries where the input values are known to be unique, or precisely N rows are needed regardless of duplicate values.
Because its result value is different for each row in the result set (when used without a PARTITION BY clause), ROW_NUMBER() can be used to synthesize unique numeric ID values, for example for result sets involving unique values or tuples.
Similar to RANK and DENSE_RANK. These functions differ in how they treat duplicate combinations of values.
Added in: CDH 5.2.0 (Impala 2.0.0)
Examples:
The following example demonstrates how ROW_NUMBER() produces a continuous numeric sequence, even though some values of X are repeated.
select x, row_number() over(order by x, property) as row_number, property from int_t; +----+------------+----------+ | x | row_number | property | +----+------------+----------+ | 1 | 1 | odd | | 1 | 2 | square | | 2 | 3 | even | | 2 | 4 | prime | | 3 | 5 | odd | | 3 | 6 | prime | | 4 | 7 | even | | 4 | 8 | square | | 5 | 9 | odd | | 5 | 10 | prime | | 6 | 11 | even | | 6 | 12 | perfect | | 7 | 13 | lucky | | 7 | 14 | lucky | | 7 | 15 | lucky | | 7 | 16 | odd | | 7 | 17 | prime | | 8 | 18 | even | | 9 | 19 | odd | | 9 | 20 | square | | 10 | 21 | even | | 10 | 22 | round | +----+------------+----------+
The following example shows how a financial institution might assign customer IDs to some of history's wealthiest figures. Although two of the people have identical net worth figures, unique IDs are required for this purpose. ROW_NUMBER() produces a sequence of five different values for the five input rows.
select row_number() over (order by net_worth desc) as account_id, name, net_worth from wealth order by account_id, name; +------------+---------+---------------+ | account_id | name | net_worth | +------------+---------+---------------+ | 1 | Solomon | 2000000000.00 | | 2 | Croesus | 1000000000.00 | | 3 | Midas | 1000000000.00 | | 4 | Crassus | 500000000.00 | | 5 | Scrooge | 80000000.00 | +------------+---------+---------------+
Related information:
SUM() Function - Analytic Context
You can include an OVER clause with a call to this function to use it as an analytic function. See SUM Function for details and examples.