UNION clause
The UNION
clause lets you combine the result sets of multiple queries.
By default, the result sets are combined as if the DISTINCT
operator was
applied.
Syntax:
query_1 UNION [DISTINCT | ALL] query_2
Usage notes:
The UNION
keyword by itself is the same as UNION DISTINCT
. Because
eliminating duplicates can be a memory-intensive process for a large result set, prefer UNION
ALL
where practical. (That is, when you know the different queries in the union will not produce any
duplicates, or where the duplicate values are acceptable.)
When an ORDER BY
clause applies to a UNION ALL
or UNION
query, in Impala 1.4 and higher, the LIMIT
clause is no longer required. To make the
ORDER BY
and LIMIT
clauses apply to the entire result set, turn the
UNION
query into a subquery, SELECT
from the subquery, and put the
ORDER BY
clause at the end, outside the subquery.
Examples:
First, set up some sample data, including duplicate 1
values:
[localhost:21000] > create table few_ints (x int);
[localhost:21000] > insert into few_ints values (1), (1), (2), (3);
[localhost:21000] > set default_order_by_limit=1000;
This example shows how UNION ALL
returns all rows from both queries, without any additional
filtering to eliminate duplicates. For the large result sets common with Impala queries, this is the most
memory-efficient technique.
[localhost:21000] > select x from few_ints order by x;
+---+
| x |
+---+
| 1 |
| 1 |
| 2 |
| 3 |
+---+
Returned 4 row(s) in 0.41s
[localhost:21000] > select x from few_ints union all select x from few_ints;
+---+
| x |
+---+
| 1 |
| 1 |
| 2 |
| 3 |
| 1 |
| 1 |
| 2 |
| 3 |
+---+
Returned 8 row(s) in 0.42s
[localhost:21000] > select * from (select x from few_ints union all select x from few_ints) as t1 order by x;
+---+
| x |
+---+
| 1 |
| 1 |
| 1 |
| 1 |
| 2 |
| 2 |
| 3 |
| 3 |
+---+
Returned 8 row(s) in 0.53s
[localhost:21000] > select x from few_ints union all select 10;
+----+
| x |
+----+
| 10 |
| 1 |
| 1 |
| 2 |
| 3 |
+----+
Returned 5 row(s) in 0.38s
This example shows how the UNION
clause without the ALL
keyword condenses
the result set to eliminate all duplicate values, making the query take more time and potentially more
memory. The extra processing typically makes this technique not recommended for queries that return result
sets with millions or billions of values.
[localhost:21000] > select x from few_ints union select x+1 from few_ints;
+---+
| x |
+---+
| 3 |
| 4 |
| 1 |
| 2 |
+---+
Returned 4 row(s) in 0.51s
[localhost:21000] > select x from few_ints union select 10;
+----+
| x |
+----+
| 2 |
| 10 |
| 1 |
| 3 |
+----+
Returned 4 row(s) in 0.49s
[localhost:21000] > select * from (select x from few_ints union select x from few_ints) as t1 order by x;
+---+
| x |
+---+
| 1 |
| 2 |
| 3 |
+---+
Returned 3 row(s) in 0.53s