SELECT statement
The SELECT
statement performs queries, retrieving data from one or
more tables and producing result sets consisting of rows and columns.
The Impala INSERT
statement also typically ends with a SELECT
statement, to define data to copy from one table to another.
Syntax:
[WITH name AS (select_expression) [, ...] ]
SELECT
[ALL | DISTINCT]
[STRAIGHT_JOIN]
expression [, expression ...]
FROM table_reference [, table_reference ...]
[[FULL | [LEFT | RIGHT] INNER | [LEFT | RIGHT] OUTER | [LEFT | RIGHT] SEMI | [LEFT | RIGHT] ANTI | CROSS]
JOIN table_reference
[ON join_equality_clauses | USING (col1[, col2 ...]] ...
WHERE conditions
GROUP BY { column | expression [, ...] }
HAVING conditions
ORDER BY { column | expression [ASC | DESC] [NULLS FIRST | NULLS LAST] [, ...] }
LIMIT expression [OFFSET expression]
[UNION [ALL] select_statement] ...]
table_reference := { table_name | (subquery) }
[ TABLESAMPLE SYSTEM(percentage) [REPEATABLE(seed)] ]
Impala SELECT
queries support:
- SQL scalar data types:
BOOLEAN
,TINYINT
,SMALLINT
,INT
,BIGINT
,DECIMAL
FLOAT
,DOUBLE
,TIMESTAMP
,STRING
,VARCHAR
,CHAR
. - The complex data types
ARRAY
,STRUCT
, andMAP
, are available in Impala 2.3 and higher. Queries involving these types typically involve special qualified names using dot notation for referring to the complex column fields, and join clauses for bringing the complex columns into the result set. - An optional
WITH
clause before theSELECT
keyword, to define a subquery whose name or column names can be referenced from later in the main query. This clause lets you abstract repeated clauses, such as aggregation functions, that are referenced multiple times in the same query. - Subqueries in a
FROM
clause. In Impala 2.0 and higher, subqueries can also go in theWHERE
clause, for example with theIN()
,EXISTS
, andNOT EXISTS
operators. -
WHERE
,GROUP BY
,HAVING
clauses. -
ORDER BY
. Prior to Impala 1.4.0, Impala required that queries using anORDER BY
clause also include aLIMIT
clause. In Impala 1.4.0 and higher, this restriction is lifted; sort operations that would exceed the Impala memory limit automatically use a temporary disk work area to perform the sort. - You can refer to
SELECT
-list items by their ordinal position. Impala supports ordinals in theGROUP BY
,HAVING
, andORDER BY
clauses. From Impala 3.0, ordinals can only be used at the top level. For example, the following statements are allowed:
Numbers in subexpressions are not interpreted as ordinals:SELECT int_col / 2, sum(x) FROM t GROUP BY 1; SELECT int_col / 2 FROM t ORDER BY 1; SELECT NOT bool_col FROM t GROUP BY 1 HAVING 1;
SELECT int_col / 2, sum(x) FROM t GROUP BY 1 * 2; The above parses OK, however GROUP BY 1 * 2 has no effect. SELECT int_col / 2 FROM t ORDER BY 1 + 2; The above parses OK, however ORDER BY 1 + 2 has no effect. SELECT NOT bool_col FROM t GROUP BY 1 HAVING not 1; The above raises an error at parse-time.
- Impala supports a wide variety of
JOIN
clauses. Left, right, semi, full, and outer joins are supported in all Impala versions. TheCROSS JOIN
operator is available in Impala 1.2.2 and higher. During performance tuning, you can override the reordering of join clauses that Impala does internally by including the keywordSTRAIGHT_JOIN
immediately after theSELECT
and anyDISTINCT
orALL
keywords. -
UNION ALL
. -
LIMIT
. - External tables.
- Relational operators such as greater than, less than, or equal to.
- Arithmetic operators such as addition or subtraction.
- Logical/Boolean operators
AND
,OR
, andNOT
. Impala does not support the corresponding symbols&&
,||
, and!
. - Common SQL built-in functions such as
COUNT
,SUM
,CAST
,LIKE
,IN
,BETWEEN
, andCOALESCE
. - On optional
TABLESAMPLE
clause immediately after a table reference, to specify that the query only processes a specified percentage of the table data.
Impala queries ignore files with extensions commonly used for temporary work files by Hadoop tools. Any
files with extensions .tmp
or .copying
are not considered part of the
Impala table. The suffix matching is case-insensitive, so for example Impala ignores both
.copying
and .COPYING
suffixes.
Security considerations:
If these statements in your environment contain sensitive literal values such as credit card numbers or tax identifiers, Impala can redact this sensitive information when displaying the statements in log files and other administrative contexts.
Amazon S3 considerations:
In Impala 2.6 and higher, Impala queries are optimized for files stored in Amazon S3.
For Impala tables that use the file formats Parquet, ORC, RCFile, SequenceFile,
Avro, and uncompressed text, the setting fs.s3a.block.size
in the core-site.xml configuration file determines
how Impala divides the I/O work of reading the data files. This configuration
setting is specified in bytes. By default, this
value is 33554432 (32 MB), meaning that Impala parallelizes S3 read operations on the files
as if they were made up of 32 MB blocks. For example, if your S3 queries primarily access
Parquet files written by MapReduce or Hive, increase fs.s3a.block.size
to 134217728 (128 MB) to match the row group size of those files. If most S3 queries involve
Parquet files written by Impala, increase fs.s3a.block.size
to 268435456 (256 MB) to match the row group size produced by Impala.
Cancellation: Can be cancelled. To cancel this statement, use Ctrl-C from the impala-shell interpreter, the Cancel button from the Watch page in Hue, Actions > Cancel from the Queries list in Cloudera Manager, or Cancel from the list of in-flight queries (for a particular node) on the Queries tab in the Impala web UI (port 25000).
HDFS permissions:
The user ID that the impalad daemon runs under,
typically the impala
user, must have read
permissions for the files in all applicable directories in all source tables,
and read and execute permissions for the relevant data directories.
(A SELECT
operation could read files from multiple different HDFS directories
if the source table is partitioned.)
If a query attempts to read a data file and is unable to because of an HDFS permission error,
the query halts and does not return any further results.
Related information:
The SELECT
syntax is so extensive that it forms its own category of statements:
queries. The other major classifications of SQL statements are data definition language and
data manipulation language.
Because the focus of Impala is on fast queries with interactive response times over huge data sets, query performance and scalability are important considerations.