Joins in Impala SELECT statements
A join query is a SELECT
statement that combines data from two or more
tables, and returns a result set containing items from some or all of those tables. It is a way
to cross-reference and correlate related data that is organized into multiple tables, typically
using identifiers that are repeated in each of the joined tables.
Syntax:
Impala supports a wide variety of JOIN
clauses. Left, right, semi, full, and outer joins
are supported in all Impala versions. The CROSS JOIN
operator is available in Impala 1.2.2
and higher. During performance tuning, you can override the reordering of join clauses that Impala does
internally by including the keyword STRAIGHT_JOIN
immediately after the
SELECT
and any DISTINCT
or ALL
keywords.
SELECT select_list FROM
table_or_subquery1 [INNER] JOIN table_or_subquery2 |
table_or_subquery1 {LEFT [OUTER] | RIGHT [OUTER] | FULL [OUTER]} JOIN table_or_subquery2 |
table_or_subquery1 {LEFT | RIGHT} SEMI JOIN table_or_subquery2 |
table_or_subquery1 {LEFT | RIGHT} ANTI JOIN table_or_subquery2 |
[ ON col1 = col2 [AND col3 = col4 ...] |
USING (col1 [, col2 ...]) ]
[other_join_clause ...]
[ WHERE where_clauses ]
SELECT select_list FROM
table_or_subquery1, table_or_subquery2 [, table_or_subquery3 ...]
[other_join_clause ...]
WHERE
col1 = col2 [AND col3 = col4 ...]
SELECT select_list FROM
table_or_subquery1 CROSS JOIN table_or_subquery2
[other_join_clause ...]
[ WHERE where_clauses ]
SQL-92 and SQL-89 Joins:
Queries with the explicit JOIN
keywords are known as SQL-92 style joins, referring to the
level of the SQL standard where they were introduced. The corresponding ON
or
USING
clauses clearly show which columns are used as the join keys in each case:
SELECT t1.c1, t2.c2 FROM t1 JOIN t2
ON t1.id = t2.id and t1.type_flag = t2.type_flag
WHERE t1.c1 > 100;
SELECT t1.c1, t2.c2 FROM t1 JOIN t2
USING (id, type_flag)
WHERE t1.c1 > 100;
The ON
clause is a general way to compare columns across the two tables, even if the column
names are different. The USING
clause is a shorthand notation for specifying the join
columns, when the column names are the same in both tables. You can code equivalent WHERE
clauses that compare the columns, instead of ON
or USING
clauses, but that
practice is not recommended because mixing the join comparisons with other filtering clauses is typically
less readable and harder to maintain.
Queries with a comma-separated list of tables and subqueries are known as SQL-89 style joins. In these
queries, the equality comparisons between columns of the joined tables go in the WHERE
clause alongside other kinds of comparisons. This syntax is easy to learn, but it is also easy to
accidentally remove a WHERE
clause needed for the join to work correctly.
SELECT t1.c1, t2.c2 FROM t1, t2
WHERE
t1.id = t2.id AND t1.type_flag = t2.type_flag
AND t1.c1 > 100;
Self-joins:
Impala can do self-joins, for example to join on two different columns in the same table to represent parent-child relationships or other tree-structured data. There is no explicit syntax for this; just use the same table name for both the left-hand and right-hand table, and assign different table aliases to use when referring to the fully qualified column names:
-- Combine fields from both parent and child rows.
SELECT lhs.id, rhs.parent, lhs.c1, rhs.c2 FROM tree_data lhs, tree_data rhs WHERE lhs.id = rhs.parent;
Inner and outer joins:
An inner join is the most common and familiar type: rows in the result set contain the requested columns from the appropriate tables, for all combinations of rows where the join columns of the tables have identical values. If a column with the same name occurs in both tables, use a fully qualified name or a column alias to refer to the column in the select list or other clauses. Impala performs inner joins by default for both SQL-89 and SQL-92 join syntax:
-- The following 3 forms are all equivalent.
SELECT t1.id, c1, c2 FROM t1, t2 WHERE t1.id = t2.id;
SELECT t1.id, c1, c2 FROM t1 JOIN t2 ON t1.id = t2.id;
SELECT t1.id, c1, c2 FROM t1 INNER JOIN t2 ON t1.id = t2.id;
An outer join retrieves all rows from the left-hand table, or the right-hand table, or both; wherever there
is no matching data in the table on the other side of the join, the corresponding columns in the result set
are set to NULL
. To perform an outer join, include the OUTER
keyword in the
join operator, along with either LEFT
, RIGHT
, or FULL
:
SELECT * FROM t1 LEFT OUTER JOIN t2 ON t1.id = t2.id;
SELECT * FROM t1 RIGHT OUTER JOIN t2 ON t1.id = t2.id;
SELECT * FROM t1 FULL OUTER JOIN t2 ON t1.id = t2.id;
For outer joins, Impala requires SQL-92 syntax; that is, the JOIN
keyword instead of
comma-separated table names. Impala does not support vendor extensions such as (+)
or
*=
notation for doing outer joins with SQL-89 query syntax.
Equijoins and Non-Equijoins:
By default, Impala requires an equality comparison between the left-hand and right-hand tables, either
through ON
, USING
, or WHERE
clauses. These types of
queries are classified broadly as equijoins. Inner, outer, full, and semi joins can all be equijoins based on
the presence of equality tests between columns in the left-hand and right-hand tables.
In Impala 1.2.2 and higher, non-equijoin queries are also possible, with comparisons such as
!=
or <
between the join columns. These kinds of queries require care to
avoid producing huge result sets that could exceed resource limits. Once you have planned a non-equijoin
query that produces a result set of acceptable size, you can code the query using the CROSS
JOIN
operator, and add the extra comparisons in the WHERE
clause:
SELECT * FROM t1 CROSS JOIN t2 WHERE t1.total > t2.maximum_price;
In Impala 2.3 and higher, additional non-equijoin queries are possible due to the
addition of nested loop joins. These queries typically involve SEMI JOIN
,
ANTI JOIN
, or FULL OUTER JOIN
clauses. Impala sometimes
also uses nested loop joins internally when evaluating OUTER JOIN
queries
involving complex type columns. Query phases involving nested loop joins do not use the
spill-to-disk mechanism if they exceed the memory limit. Impala decides internally when to
use each join mechanism; you cannot specify any query hint to choose between the nested loop
join or the original hash join algorithm.
SELECT * FROM t1 LEFT OUTER JOIN t2 ON t1.int_col < t2.int_col;
Semi-joins:
Semi-joins are a relatively rarely used variation. With the left semi-join, only data from the left-hand
table is returned, for rows where there is matching data in the right-hand table, based on comparisons
between join columns in ON
or WHERE
clauses. Only one instance of each row
from the left-hand table is returned, regardless of how many matching rows exist in the right-hand table.
A right semi-join (available in Impala 2.0 and higher) reverses the comparison and returns
data from the right-hand table.
SELECT t1.c1, t1.c2, t1.c2 FROM t1 LEFT SEMI JOIN t2 ON t1.id = t2.id;
Natural joins (not supported):
Impala does not support the NATURAL JOIN
operator, again to avoid inconsistent or huge
result sets. Natural joins do away with the ON
and USING
clauses, and
instead automatically join on all columns with the same names in the left-hand and right-hand tables. This
kind of query is not recommended for rapidly evolving data structures such as are typically used in Hadoop.
Thus, Impala does not support the NATURAL JOIN
syntax, which can produce different query
results as columns are added to or removed from tables.
If you do have any queries that use NATURAL JOIN
, make sure to rewrite them with explicit
USING
clauses, because Impala could interpret the NATURAL
keyword as a
table alias:
-- 'NATURAL' is interpreted as an alias for 't1' and Impala attempts an inner join,
-- resulting in an error because inner joins require explicit comparisons between columns.
SELECT t1.c1, t2.c2 FROM t1 NATURAL JOIN t2;
ERROR: NotImplementedException: Join with 't2' requires at least one conjunctive equality predicate.
To perform a Cartesian product between two tables, use a CROSS JOIN.
-- If you expect the tables to have identically named columns with matching values,
-- list the corresponding column names in a USING clause.
SELECT t1.c1, t2.c2 FROM t1 JOIN t2 USING (id, type_flag, name, address);
Anti-joins (Impala 2.0 and higher only):
Impala supports the LEFT ANTI JOIN
and RIGHT ANTI
JOIN
clauses in Impala 2.0 and higher. The LEFT
or
RIGHT
keyword is required for this kind of join. For LEFT ANTI
JOIN
, this clause returns those values from the left-hand table that have no
matching value in the right-hand table. RIGHT ANTI JOIN
reverses the
comparison and returns values from the right-hand table. You can express this negative
relationship either through the ANTI JOIN
clause or through a NOT
EXISTS
operator with a subquery.
Complex type considerations:
When referring to a column with a complex type (STRUCT
,
ARRAY
, or MAP
) in a query, you use join notation to
unpack
the scalar fields of the struct, the elements of the array, or the key-value
pairs of the map. (The join notation is not required for aggregation operations, such as
COUNT()
or SUM()
for array elements.) Because Impala
recognizes which complex type elements are associated with which row of the result set, you
use the same syntax as for a cross or cartesian join, without an explicit join
condition.
Usage notes:
You typically use join queries in situations like these:
- When related data arrives from different sources, with each data set physically residing in a separate table. For example, you might have address data from business records that you cross-check against phone listings or census data.
-
When data is normalized, a technique for reducing data duplication by dividing it across multiple tables.
This kind of organization is often found in data that comes from traditional relational database systems.
For example, instead of repeating some long string such as a customer name in multiple tables, each table
might contain a numeric customer ID. Queries that need to display the customer name could
join
the table that specifies which customer ID corresponds to which name. -
When certain columns are rarely needed for queries, so they are moved into separate tables to reduce
overhead for common queries. For example, a
biography
field might be rarely needed in queries on employee data. Putting that field in a separate table reduces the amount of I/O for common queries on employee addresses or phone numbers. Queries that do need thebiography
column can retrieve it by performing a join with that separate table. - In Impala 2.3 or higher, when referring to complex type columns in queries.
When comparing columns with the same names in ON
or WHERE
clauses, use the
fully qualified names such as db_name.table_name
, or
assign table aliases, column aliases, or both to make the code more compact and understandable:
select t1.c1 as first_id, t2.c2 as second_id from
t1 join t2 on first_id = second_id;
select fact.custno, dimension.custno from
customer_data as fact join customer_address as dimension
using (custno)
To control the result set from a join query, include the names of corresponding column names in both tables
in an ON
or USING
clause, or by coding equality comparisons for those
columns in the WHERE
clause.
[localhost:21000] > select c_last_name, ca_city from customer join customer_address where c_customer_sk = ca_address_sk;
+-------------+-----------------+
| c_last_name | ca_city |
+-------------+-----------------+
| Lewis | Fairfield |
| Moses | Fairview |
| Hamilton | Pleasant Valley |
| White | Oak Ridge |
| Moran | Glendale |
...
| Richards | Lakewood |
| Day | Lebanon |
| Painter | Oak Hill |
| Bentley | Greenfield |
| Jones | Stringtown |
+-------------+------------------+
Returned 50000 row(s) in 9.82s
One potential downside of joins is the possibility of excess resource usage in poorly constructed queries.
Impala imposes restrictions on join queries to guard against such issues. To minimize the chance of runaway
queries on large data sets, Impala requires every join query to contain at least one equality predicate
between the columns of the various tables. For example, if T1
contains 1000 rows and
T2
contains 1,000,000 rows, a query SELECT columns FROM t1 JOIN
t2
could return up to 1 billion rows (1000 * 1,000,000); Impala requires that the query include a
clause such as ON t1.c1 = t2.c2
or WHERE t1.c1 = t2.c2
.
Because even with equality clauses, the result set can still be large, as we saw in the previous example, you
might use a LIMIT
clause to return a subset of the results:
[localhost:21000] > select c_last_name, ca_city from customer, customer_address where c_customer_sk = ca_address_sk limit 10;
+-------------+-----------------+
| c_last_name | ca_city |
+-------------+-----------------+
| Lewis | Fairfield |
| Moses | Fairview |
| Hamilton | Pleasant Valley |
| White | Oak Ridge |
| Moran | Glendale |
| Sharp | Lakeview |
| Wiles | Farmington |
| Shipman | Union |
| Gilbert | New Hope |
| Brunson | Martinsville |
+-------------+-----------------+
Returned 10 row(s) in 0.63s
Or you might use additional comparison operators or aggregation functions to condense a large result set into a smaller set of values:
[localhost:21000] > -- Find the names of customers who live in one particular town.
[localhost:21000] > select distinct c_last_name from customer, customer_address where
c_customer_sk = ca_address_sk
and ca_city = "Green Acres";
+---------------+
| c_last_name |
+---------------+
| Hensley |
| Pearson |
| Mayer |
| Montgomery |
| Ricks |
...
| Barrett |
| Price |
| Hill |
| Hansen |
| Meeks |
+---------------+
Returned 332 row(s) in 0.97s
[localhost:21000] > -- See how many different customers in this town have names starting with "A".
[localhost:21000] > select count(distinct c_last_name) from customer, customer_address where
c_customer_sk = ca_address_sk
and ca_city = "Green Acres"
and substr(c_last_name,1,1) = "A";
+-----------------------------+
| count(distinct c_last_name) |
+-----------------------------+
| 12 |
+-----------------------------+
Returned 1 row(s) in 1.00s
Because a join query can involve reading large amounts of data from disk, sending large amounts of data across the network, and loading large amounts of data into memory to do the comparisons and filtering, you might do benchmarking, performance analysis, and query tuning to find the most efficient join queries for your data set, hardware capacity, network configuration, and cluster workload.
The two categories of joins in Impala are known as partitioned joins and broadcast joins. If inaccurate table or column statistics, or some quirk of the data distribution, causes Impala to choose the wrong mechanism for a particular join, consider using query hints as a temporary workaround.
Handling NULLs in join columns:
By default, join key columns do not match if either one contains a
NULL
value. To treat such columns as equal if both contain
NULL
, you can use an expression such as A = B OR (A IS NULL AND B
IS NULL)
. In Impala 2.5 and higher, the <=>
operator
(shorthand for IS NOT DISTINCT FROM
) performs the same comparison in a
concise and efficient form. The <=>
operator is more efficient in for
comparing join keys in a NULL
-safe manner, because the operator can use a
hash join while the OR
expression cannot.
Examples:
[localhost:21000] > create table t1 (x int);
[localhost:21000] > insert into t1 values (1), (2), (3), (4), (5), (6);
[localhost:21000] > create table t2 (y int);
[localhost:21000] > insert into t2 values (2), (4), (6);
[localhost:21000] > create table t3 (z int);
[localhost:21000] > insert into t3 values (1), (3), (5);
The following example demonstrates an anti-join, returning the values from T1
that do not
exist in T2
(in this case, the odd numbers 1, 3, and 5):
[localhost:21000] > select x from t1 left anti join t2 on (t1.x = t2.y);
+---+
| x |
+---+
| 1 |
| 3 |
| 5 |
+---+