HDP Data Services
Also available as:
PDF

Subqueries

Hive supports subqueries in FROM clauses and in WHERE clauses of SQL statements. A subquery is a SQL expression that is evaluated and returns a result set. Then that result set is used to evaluate the parent query. The parent query is the outer query that contains the child subquery. Subqueries in WHERE clauses are supported in Hive 0.13 and later. The following example shows a subquery inserted into a WHERE clause:

SELECT state, net_payments
FROM transfer_payments
WHERE transfer_payments.year IN (SELECT year FROM us_census); 

No configuration is required to enable execution of subqueries in Hive. The feature is available by default. However, several restrictions exist for the use of subqueries in WHERE clauses.

Understanding Subqueries in SQL

SQL adheres to syntax rules like any programming language. The syntax governing the use of subqueries in WHERE clauses in SQL depends on the following concepts:

  • Query Predicates and Predicate Operators

    A predicate in SQL is a condition that evaluates to a Boolean value. For example, the predicate in the preceeding example returns true for a row of the transfer_payments table if at least one row exists in the us_census table with the same year as the transfer_payments row. The predicate starts with the first WHERE keyword.

    ... WHERE transfer_payments.year IN (SELECT year FROM us_census); 

    A SQL predicate in a subquery must also contain a predicate operator. Predicate operators specify the relationship tested in a predicate query. For example, the predicate operator in the above example is the IN keyword.

  • Aggregrated and Correlated Queries

    Aggregated queries combine one or more aggregate functions, such as AVG, SUM, and MAX, with the GROUP BY statement to group query results by one or more table columns. In the following example, the AVG aggregate function returns the average salary of all employees in the engineering department grouped by year:

    SELECT year, AVG(salary)
    FROM Employees
    WHERE department = 'engineering' GROUP BY year 
    [Note]Note

    The GROUP BY statement may be either explicit or implicit.

    Correlated queries contain a query predicate with the equals (=) operator. One side of the operator must reference at least one column from the parent query and the other side must reference at least one column from the subquery. The following query is a revised and correlated version of the example query that is shown at the beginning of this section. It is a correlated query because one side of the equals predicate operator in the subquery references the state column in the transfer_payments table in the parent query and the other side of the operator references the state column in the us_census table.

    SELECT state, net_payments
    FROM transfer_payments
    WHERE EXISTS 
       (SELECT year 
       FROM us_census 
       WHERE transfer_payments.state = us_census.state); 

    In contrast, an uncorrelated query does not reference any columns in the parent query.

  • Conjuncts and Disjuncts

    A conjunct is equivalent to the AND condition, while a disjunct is the equivalent of the OR condition. The following subquery contains a conjunct:

    ... WHERE transfer_payments.year = "2010" AND us_census.state = "california"

    The following subquery contains a disjunct:

    ... WHERE transfer_payments.year = "2010" OR us_census.state = "california"

Restrictions on Subqueries in WHERE Clauses

Subqueries in WHERE clauses have the following limitations:

  • Subqueries must appear on the right hand side of an expression.

  • Nested subqueries are not supported.

  • Only one subquery expression is allowed for a single query.

  • Subquery predicates must appear as top level conjuncts.

  • Subqueries support four logical operators in query predicates: IN, NOT IN, EXISTS, and NOT EXISTS.

  • The IN and NOT IN logical operators may select only one column in a WHERE clause subquery.

  • The EXISTS and NOT EXISTS operators must have at least one correlated predicate.

  • The left side of a subquery must qualify all references to table columns.

  • References to columns in the parent query are allowed only in the WHERE clause of the subquery.

  • Subquery predicates that reference a column in a parent query must use the equals (=) predicate operator.

  • Subquery predicates may not refer only to columns in the parent query.

  • Correlated subqueries with an implied GROUP BY statement may return only one row.

  • All unqualified references to columns in a subquery must resolve to tables in the subquery.

  • Correlated subqueries cannot contain windowing clauses.