OVER
The OVER
clause is required for calls to pure
analytic functions such as LEAD()
,
RANK()
, and FIRST_VALUE()
.
When you include an OVER
clause with calls to
aggregate functions such as MAX()
,
COUNT()
, or SUM()
, they
operate as analytic functions.
Syntax:
function(args) OVER([partition_by_clause] [order_by_clause [window_clause]])
partition_by_clause ::= PARTITION BY expr [, expr ...]
order_by_clause ::= ORDER BY expr [ASC | DESC] [NULLS FIRST | NULLS LAST]
[, expr [ASC | DESC] [NULLS FIRST | NULLS LAST] ...]
window_clause: See Window clause
PARTITION BY clause:
The PARTITION BY
clause acts much like the
GROUP BY
clause in the outermost block of a
query. It divides the rows into groups containing identical values
in one or more columns. These logical groups are known as
partitions. Throughout the discussion of analytic
functions, partitions
refers to the groups produced by the
PARTITION BY
clause, not to partitioned tables.
However, note the following limitation that applies specifically
to analytic function calls involving partitioned tables.
In queries involving both analytic functions and partitioned tables, partition pruning only occurs for columns named in the PARTITION BY
clause of the analytic function call. For example, if an analytic function query has a clause such as WHERE year=2016
,
the way to make the query prune all other YEAR
partitions is to include PARTITION BY year
in the analytic function call;
for example, OVER (PARTITION BY year,other_columns other_analytic_clauses)
.
The sequence of results from an analytic function resets
for each new partition in the result set. That is, the set of
preceding or following rows considered by the analytic function
always come from a single partition. Any MAX()
,
SUM()
, ROW_NUMBER()
, and so on
apply to each partition independently. Omit the PARTITION
BY
clause to apply the analytic operation to all the
rows in the table.
ORDER BY clause:
The ORDER BY
clause works much like the
ORDER BY
clause in the outermost block of a
query. It defines the order in which rows are evaluated for the
entire input set, or for each group produced by a
PARTITION BY
clause. You can order by one or
multiple expressions, and for each expression optionally choose
ascending or descending order and whether nulls come first or last
in the sort order. Because this ORDER BY
clause
only defines the order in which rows are evaluated, if you want
the results to be output in a specific order, also include an
ORDER BY
clause in the outer block of the
query.
When the ORDER BY
clause is omitted, the
analytic function applies to all items in the group produced by
the PARTITION BY
clause. When the ORDER
BY
clause is included, the analysis can apply to all or
a subset of the items in the group, depending on the optional
window clause.
The order in which the rows are analyzed is only defined for
those columns specified in ORDER BY
clauses.
One difference between the analytic and outer uses of the
ORDER BY
clause: inside the
OVER
clause, ORDER BY 1
or
other integer value is interpreted as a constant sort value
(effectively a no-op) rather than referring to column 1.
Window clause:
The window clause is only allowed in combination with an
ORDER BY
clause. If the ORDER
BY
clause is specified but the window clause is not,
the default window is RANGE BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW
.
HBase considerations:
Because HBase tables are optimized for single-row lookups rather
than full scans, analytic functions using the
OVER()
clause are not recommended for HBase
tables. Although such queries work, their performance is lower
than on comparable tables using HDFS data files.
Parquet considerations:
Analytic functions are very efficient for Parquet tables. The
data that is examined during evaluation of the
OVER()
clause comes from a specified set of
columns, and the values for each column are arranged sequentially
within each data file.
Text table considerations:
Analytic functions are convenient to use with text tables for exploratory business intelligence. When the volume of data is substantial, prefer to use Parquet tables for performance-critical analytic queries.
Examples:
The following example shows how to synthesize a numeric sequence
corresponding to all the rows in a table. The new table has the
same columns as the old one, plus an additional column
ID
containing the integers 1, 2, 3, and so on,
corresponding to the order of a TIMESTAMP
column
in the original table.
CREATE TABLE events_with_id AS
SELECT
row_number() OVER (ORDER BY date_and_time) AS id,
c1, c2, c3, c4
FROM events;
The following example shows how to determine the number of rows
containing each value for a column. Unlike a corresponding
GROUP BY
query, this one can analyze a single
column and still return all values (not just the distinct ones)
from the other columns.
SELECT x, y, z,
count() OVER (PARTITION BY x) AS how_many_x
FROM t1;
Restrictions:
You cannot directly combine the DISTINCT
operator with analytic function calls. You can put the analytic
function call in a WITH
clause or an inline view,
and apply the DISTINCT
operator to its result
set.
WITH t1 AS (SELECT x, sum(x) OVER (PARTITION BY x) AS total FROM t1)
SELECT DISTINCT x, total FROM t1;