Previous versions of Hive allowed subqueries only in FROM clauses of SQL statements. A subquery is a SQL expression that returns a set of rows. The subquery is evaluated and its query result set is used to evaluate the parent query, the outer query that contains the child subquery. Version 0.13 of Hive expands the use of subqueries to include WHERE clauses, as shown in the following example:
SELECT state, net_payments FROM transfer_payments WHERE transfer_payments.year IN (SELECT year FROM us_census);
No configuration is required to enable execution of subqueries in Hive; the feature is available by default. However, several restrictions exist for the use of subqueries in WHERE clauses.
Understanding Subqueries in SQL
SQL adheres to syntax rules like any programming language. The syntax governing the use of subqueries in WHERE clauses in SQL depends on the following concepts:
Query Predicates and Predicate Operators
Aggregrated and Correlated Queries
Conjuncts and Disjuncts
Query Predicates and Predicate Operators
A predicate in SQL is a condition that evaluates to a Boolean value. For example, the predicate in the preceeding example returns true for a row of the transfer_payments
table if at least one row exists in the us_census
table with the same year as the transfer_payments
row. The predicate starts with the first WHERE keyword.
... WHERE transfer_payments.year IN (SELECT year FROM us_census);
A SQL predicate in a subquery must also contain a predicate operator. Predicate operators specify the relationship tested in a predicate query. For example, the predicate operator in the above example is the IN keyword.
Aggregated and Correlated Queries
Aggregated queries combine one or more aggregate functions, such as AVG, SUM, and MAX, with the GROUP BY statement to group query results by one or more table columns. In the following example, the AVG aggregate function returns the average salary of all employees in the engineering department grouped by year:
SELECT year, AVG(salary) FROM Employees WHERE department = 'engineering' GROUP BY year
Note | |
---|---|
The GROUP BY statement may be either explicit or implicit. |
Correlated queries contain a query predicate with the equals (=) operator. One side of the operator must reference at least one column from the parent query and the other side must reference at least one column from the subquery. The following query is a revised and correlated version of the example query that is shown at the beginning of this section. It is a correlated query because one side of the equals predicate operator in the subquery references the state
column in the transfer_payments
table in the parent query and the other side of the operator references the state
column in the us_census
table.
SELECT state, net_payments FROM transfer_payments WHERE EXISTS (SELECT year FROM us_census WHERE transfer_payments.state = us_census.state);
In contrast, an uncorrelated query does not reference any columns in the parent query.
Conjuncts and Disjuncts
A conjunct is equivalent to the AND condition, while a disjunct is the equivalent of the OR condition. The following subquery contains a conjunct:
... WHERE transfer_payments.year = "2010" AND us_census.state = "california"
The following subquery contains a disjunct:
... WHERE transfer_payments.year = "2010" OR us_census.state = "california"
Restrictions on Subqueries in WHERE Clauses
Subqueries in WHERE clauses have the following limitations:
Subqueries must appear on the right hand side of an expression.
Nested subqueries are not supported.
Only one subquery expression is allowed for a single query.
Subquery predicates must appear as top level conjuncts.
Subqueries support four logical operators in query predicates: IN, NOT IN, EXISTS, and NOT EXISTS.
The IN and NOT IN logical operators may select only one column in a WHERE clause subquery.
The EXISTS and NOT EXISTS operators must have at least one correlated predicate.
The left side of a subquery must qualify all references to table columns.
References to columns in the parent query are allowed only in the WHERE clause of the subquery.
Subquery predicates that reference a column in a parent query must use the equals (=) predicate operator.
Subquery predicates may not refer only to columns in the parent query.
Correlated subqueries with an implied GROUP BY statement may return only one row.
All unqualified references to columns in a subquery must resolve to tables in the subquery.
Correlated subqueries cannot contain windowing clauses.