Impala with Amazon S3
You can use Impala to query data residing on the Amazon S3 object store. This capability allows convenient access to a storage system that is remotely managed, accessible from anywhere, and integrated with various cloud-based services.
Impala can query files in any supported file format from S3. The S3 storage location can be for an entire table, or individual partitions in a partitioned table.
The default Impala tables use data files stored on HDFS, which are ideal for bulk loads and queries using full-table scans. In contrast, queries against S3 data are less performant, making S3 suitable for holding cold data that is only queried occasionally, while more frequently accessed hot data resides in HDFS. In a partitioned table, you can set the LOCATION attribute for individual partitions to put some partitions on HDFS and others on S3, typically depending on the age of the data.
Impala requires that the default filesystem for the cluster be HDFS. You cannot use S3 as the only filesystem in the cluster.
Creating Impala Databases, Tables, and Partitions for Data Stored on S3
To create a table that resides on S3, run the CREATE TABLE or ALTER TABLE statement with the LOCATION clause. ALTER TABLE can set the LOCATION property for an individual partition, so that some data in a table resides on S3 and other data in the same table resides on HDFS.
The syntax for the LOCATION clause is:
LOCATION 's3a://bucket_name/path/to/file'
The file system prefix is always s3a://. Impala does not support the s3:// or s3n:// prefixes.
For a partitioned table, either specify a separate LOCATION clause for each new partition, or specify a base LOCATION for the table and set up a directory structure in S3 to mirror the way Impala partitioned tables are structured in HDFS.
You point a nonpartitioned table or an individual partition at S3 by specifying a single directory path in S3, which could be any arbitrary directory. To replicate the structure of an entire Impala partitioned table or database in S3 requires more care, with directories and subdirectories nested and named to match the equivalent directory tree in HDFS. Consider setting up an empty staging area if necessary in HDFS, and recording the complete directory structure so that you can replicate it in S3.
When working with multiple tables with data files stored in S3, you can create a database with a LOCATION attribute pointing to an S3 path. Specify a URL of the form s3a://bucket/root/path/for/database for the LOCATION attribute of the database. Any tables created inside that database automatically create directories underneath the one specified by the database LOCATION attribute.
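As a minimal sketch (the database and table names here are hypothetical; the impala-demo bucket matches the example below):
CREATE DATABASE s3_sales LOCATION 's3a://impala-demo/warehouse/s3_sales';
-- Tables created in this database get directories underneath the database LOCATION.
CREATE TABLE s3_sales.orders (id BIGINT, amount DOUBLE);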
For example, the following session creates a partitioned table where only a single partition resides on S3. The partitions for years 2013 and 2014 are located on HDFS. The partition for year 2015 includes a LOCATION attribute with an s3a:// URL, and so refers to data residing on S3, under a specific path underneath the bucket impala-demo.
CREATE TABLE mostly_on_hdfs (x INT) PARTITIONED BY (year INT);
ALTER TABLE mostly_on_hdfs ADD PARTITION (year=2013);
ALTER TABLE mostly_on_hdfs ADD PARTITION (year=2014);
ALTER TABLE mostly_on_hdfs ADD PARTITION (year=2015)
LOCATION 's3a://impala-demo/dir1/dir2/dir3/t1';
The CREATE DATABASE and CREATE TABLE statements create the associated directory paths if they do not already exist. You can specify multiple levels of directories, and the CREATE statement creates all appropriate levels, similar to using mkdir -p.
Use the standard S3 file upload methods to actually put the data files into the right locations. You can also put the directory paths and data files in place before creating the associated Impala databases or tables, and Impala automatically uses the data from the appropriate location after the associated databases and tables are created.
Use the ALTER TABLE statement with the LOCATION clause to switch whether an existing table or partition points to data in HDFS or S3. For example, if you have an Impala table or partition pointing to data files in HDFS or S3, and you later transfer those data files to the other filesystem, use the ALTER TABLE statement to adjust the LOCATION attribute of the corresponding table or partition to reflect that change.
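For example, a minimal sketch using the mostly_on_hdfs table from the example above (the directory paths are hypothetical):
ALTER TABLE mostly_on_hdfs PARTITION (year=2014)
  SET LOCATION 's3a://impala-demo/dir1/dir2/dir3/t1_2014';
-- If the files are later moved back, point the partition at HDFS again:
ALTER TABLE mostly_on_hdfs PARTITION (year=2014)
  SET LOCATION '/user/impala/warehouse/mostly_on_hdfs/year=2014';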
Internal and External Tables Located on S3
Just as with tables located on HDFS storage, you can designate S3-based tables as either internal (managed by Impala) or external, by using the syntax CREATE TABLE or CREATE EXTERNAL TABLE respectively.
When you drop an internal table, the files associated with the table are removed, even if they are on S3 storage. When you drop an external table, the files associated with the table are left alone, and are still available for access by other tools or components.
If the data on S3 is intended to be long-lived and accessed by other tools in addition to Impala, create any associated S3 tables with the CREATE EXTERNAL TABLE syntax, so that the files are not deleted from S3 when the table is dropped.
If the data on S3 is only needed for querying by Impala and can be safely discarded once the Impala workflow is complete, create the associated S3 tables using the CREATE TABLE syntax, so that dropping the table also deletes the corresponding data files on S3.
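The following sketch illustrates the difference (the table names and paths are hypothetical):
-- External: dropping the table leaves the S3 data files in place.
CREATE EXTERNAL TABLE shared_logs (ts TIMESTAMP, msg STRING)
  LOCATION 's3a://impala-demo/shared/logs';
-- Internal: dropping the table also removes the S3 data files.
CREATE TABLE scratch_results (ts TIMESTAMP, msg STRING)
  LOCATION 's3a://impala-demo/tmp/scratch_results';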
You cannot use the ALTER TABLE ... SET CACHED statement for tables or partitions that are located in S3.
Loading Data into S3 for Impala Queries
If your ETL pipeline involves moving data into S3 and then querying through Impala, you can either use Impala DML statements to create, move, or copy the data, or use the same data loading techniques as you would for non-Impala data.
Using Impala DML Statements for S3 Data:
Impala DML statements (INSERT, LOAD DATA, and CREATE TABLE AS SELECT) can write data into a table or partition that resides in S3.
Because of differences between S3 and traditional filesystems, DML operations for S3 tables can take longer than for tables on HDFS. For example, both the LOAD DATA statement and the final stage of the INSERT and CREATE TABLE AS SELECT statements involve moving files from one directory to another. (In the case of INSERT and CREATE TABLE AS SELECT, the files are moved from a temporary staging directory to the final destination directory.) Because S3 does not support a rename operation for existing objects, in these cases Impala actually copies the data files from one location to another and then removes the original files.
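For example, the following hypothetical statements write data into the S3-backed partition of the mostly_on_hdfs table shown earlier (the staging_2015 source table and the staging path are placeholders):
INSERT INTO mostly_on_hdfs PARTITION (year=2015) SELECT x FROM staging_2015;
-- Or move data files already uploaded to an S3 staging path into the partition directory:
LOAD DATA INPATH 's3a://impala-demo/staging/2015' INTO TABLE mostly_on_hdfs PARTITION (year=2015);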
Manually Loading Data into Impala Tables on S3:
You can use the Amazon-provided methods to bring data files into S3 for querying through Impala.
After you upload data files to a location already mapped to an Impala table or partition, or if you delete files in S3 from such a location outside of Impala, issue the REFRESH statement to make Impala aware of the new set of data files.
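For example, after adding or removing files under the year=2015 partition's S3 path outside of Impala:
REFRESH mostly_on_hdfs;
-- In Impala versions that support partition-level refresh, you can limit the reload:
REFRESH mostly_on_hdfs PARTITION (year=2015);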
Running and Tuning Impala Queries for Data Stored on S3
Once a table or partition is designated as residing on S3, the SELECT statement transparently accesses the data files from the appropriate storage layer.
- Queries against S3 data support all the same file formats as for HDFS data.
- Tables can be unpartitioned or partitioned. For partitioned tables, either manually construct paths in S3 corresponding to the HDFS directories representing partition key values, or use ALTER TABLE ... ADD PARTITION to set up the appropriate paths in S3.
- HDFS and HBase tables can be joined to S3 tables, or S3 tables can be joined with each other.
- Authorization to control access to databases, tables, or columns works the same whether the data is in HDFS or in S3.
- The catalogd daemon caches metadata for both HDFS and S3 tables. Use REFRESH and INVALIDATE METADATA for S3 tables in the same situations where you would issue those statements for HDFS tables.
- Queries against S3 tables are subject to the same kinds of admission control and resource management as HDFS tables.
- Metadata about S3 tables is stored in the same Metastore database as for HDFS tables.
- You can set up views referring to S3 tables, the same as for HDFS tables.
- The COMPUTE STATS, SHOW TABLE STATS, and SHOW COLUMN STATS statements work for S3 tables also.
Query Performance for S3
Although Impala queries for data stored in S3 might be less performant than queries against the equivalent data stored in HDFS, you can still do some tuning. Here are techniques you can use to interpret explain plans and profiles for queries against S3 data, and tips to achieve the best performance possible for such queries.
All else being equal, performance is expected to be lower for queries running against data on S3 rather than HDFS. The actual mechanics of the SELECT statement are somewhat different when the data is in S3. Although the work is still distributed across the datanodes of the cluster, Impala might parallelize the work for a distributed query differently for data on HDFS and S3. S3 does not have the same block notion as HDFS, so Impala uses heuristics to determine how to split up large S3 files for processing in parallel. Because all hosts can access any S3 data file with equal efficiency, the distribution of work might be different than for HDFS data, where the data blocks are physically read using short-circuit local reads by hosts that contain the appropriate block replicas. Although the I/O to read the S3 data might be spread evenly across the hosts of the cluster, the fact that all data is initially retrieved across the network means that the overall query performance is likely to be lower for S3 data than for HDFS data.
Impala queries are optimized for files stored in Amazon S3. For Impala tables that use the file formats Parquet, ORC, RCFile, SequenceFile, Avro, and uncompressed text, the setting fs.s3a.block.size in the core-site.xml configuration file determines how Impala divides the I/O work of reading the data files. This configuration setting is specified in bytes. By default, this value is 33554432 (32 MB), meaning that Impala parallelizes S3 read operations on the files as if they were made up of 32 MB blocks. For example, if your S3 queries primarily access Parquet files written by MapReduce or Hive, increase fs.s3a.block.size to 134217728 (128 MB) to match the row group size of those files. If most S3 queries involve Parquet files written by Impala, increase fs.s3a.block.size to 268435456 (256 MB) to match the row group size produced by Impala.
As noted earlier, DML operations for S3 tables can take longer than for tables on HDFS: the LOAD DATA statement and the final stage of the INSERT and CREATE TABLE AS SELECT statements involve moving files from one directory to another, and because S3 does not support a rename operation for existing objects, Impala actually copies the data files to the destination and then removes the originals. The S3_SKIP_INSERT_STAGING query option provides a way to speed up INSERT statements for S3 tables and partitions, with the tradeoff that a problem during statement execution could leave data in an inconsistent state. It does not apply to INSERT OVERWRITE or LOAD DATA statements.
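For example, in impala-shell the option can be toggled per session (the table and source names are the placeholders used earlier):
SET S3_SKIP_INSERT_STAGING=true;   -- the default: faster INSERTs, weaker failure guarantees
INSERT INTO mostly_on_hdfs PARTITION (year=2015) SELECT x FROM staging_2015;
SET S3_SKIP_INSERT_STAGING=false;  -- safer if a failure partway through an INSERT is a concern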
When optimizing aspects of complex queries such as the join order, Impala treats tables on HDFS and S3 the same way. Therefore, follow all the same tuning recommendations for S3 tables as for HDFS tables, such as using the COMPUTE STATS statement to help Impala construct accurate estimates of row counts and cardinality.
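For example, using the table from the earlier example:
COMPUTE STATS mostly_on_hdfs;
SHOW TABLE STATS mostly_on_hdfs;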
In query profile reports, the numbers for BytesReadLocal, BytesReadShortCircuit, BytesReadDataNodeCached, and BytesReadRemoteUnexpected are blank because those metrics come from HDFS. If you do see any indications that a query against an S3 table performed remote read operations, do not be alarmed. That is expected because, by definition, all the I/O for S3 tables involves remote reads.
Best Practices for Using Impala with S3
The following guidelines represent best practices derived from testing and field experience with Impala on S3:
- Any reference to an S3 location must be fully qualified. (This rule applies when S3 is not designated as the default filesystem.)
- Set the safety valve fs.s3a.connection.maximum to 1500 for impalad.
- Set the safety valve fs.s3a.block.size to 134217728 (128 MB in bytes) if most Parquet files queried by Impala were written by Hive or ParquetMR jobs. Set the block size to 268435456 (256 MB in bytes) if most Parquet files queried by Impala were written by Impala.
- DROP TABLE ... PURGE is much faster than the default DROP TABLE. The same applies to ALTER TABLE ... DROP PARTITION PURGE versus the default DROP PARTITION operation. However, due to the eventually consistent nature of S3, the files for that table or partition could remain for some unbounded time when using PURGE. The default DROP TABLE/PARTITION is slow because Impala copies the files to the HDFS trash folder, and Impala waits until all the data is moved. DROP TABLE/PARTITION ... PURGE is a fast delete operation, and the Impala statement finishes quickly even though the change might not have propagated fully throughout S3. (See the sketch after this list.)
- INSERT statements are faster than INSERT OVERWRITE for S3. The query option S3_SKIP_INSERT_STAGING, which is set to true by default, skips the staging step for regular INSERT (but not INSERT OVERWRITE). This makes the operation much faster, but consistency is not guaranteed: if a node fails during execution, the table could end up with inconsistent data. Set this option to false if stronger consistency is required; however, that setting makes INSERT operations slower.
- Too many files in a table can make metadata loading and updating slow on S3. If too many requests are made to S3, S3 has a back-off mechanism and responds slower than usual. You might have many small files because of:
  - Too many partitions due to over-granular partitioning. Prefer partitions with many megabytes of data, so that even a query against a single partition can be parallelized effectively.
  - Many small INSERT queries. Prefer bulk INSERTs so that more data is written to fewer files.
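As a sketch of the fast-delete best practice (scratch_results is the hypothetical internal S3 table from the earlier sketch; mostly_on_hdfs is the table from the earlier example):
DROP TABLE scratch_results PURGE;
ALTER TABLE mostly_on_hdfs DROP PARTITION (year=2013) PURGE;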