Impala with Ozone
You can use Impala to query data files that reside on Apache Ozone distributed storage, rather than in HDFS.
The typical use case for Impala and Ozone together is to use Ozone for the default filesystem,
replacing HDFS entirely. In this configuration, when you create a database, table, or partition,
the data always resides on Ozone storage and you do not need to specify any special
LOCATION attribute. If you do specify a LOCATION attribute,
its value refers to a path within the Ozone filesystem.
If the default filesystem is Ozone, all Impala data resides there and all Impala databases and tables are also located there.
You can specify LOCATION for a database, table, or partition, using paths within the Ozone filesystem. For example:
CREATE TABLE t1 (x INT, s STRING);
CREATE DATABASE d1 LOCATION '/some/path/on/ozone/server/d1.db';
CREATE TABLE d1.t2 (a TINYINT, b BOOLEAN);
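The statements above cover the database and table cases. The following is a minimal sketch of the partition case, assuming Ozone is the default filesystem and the d1 database from the previous example exists. The table name t3, its columns, and the paths are hypothetical and chosen only for illustration.

```sql
-- Hypothetical partitioned table; the path is illustrative and resolves
-- against the default (Ozone) filesystem.
CREATE TABLE d1.t3 (id BIGINT)
  PARTITIONED BY (year INT)
  LOCATION '/some/path/on/ozone/server/d1.db/t3';

-- Attach a partition whose data directory is given explicitly.
ALTER TABLE d1.t3 ADD PARTITION (year = 2024)
  LOCATION '/some/path/on/ozone/server/d1.db/t3/year=2024';
```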
Impala can write to, delete, and rename data files and database, table, and partition
directories on Ozone storage. Therefore, Impala statements such as
ALTER TABLE and
INSERT work the same with Ozone storage as with HDFS.
Ozone supports multiple protocols, including ofs, o3fs, and s3a. Impala supports reading
data through ofs, and can also read it through
s3a (see Impala with Amazon S3 for more information).
ofs is Ozone's newer protocol, and the only one Impala supports as a
default filesystem. We recommend using it for DDL statements to avoid access limitations, and for
DML statements and SELECT statements for performance.
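As an illustration of using ofs explicitly, the sketch below points a table at a fully qualified ofs URI. The host:port, volume, and bucket segments are placeholders to replace with your own values, and the table name ext_events, its columns, and the events key are hypothetical.

```sql
-- Hypothetical external table; replace host:port/volume/bucket with the
-- values for your Ozone deployment.
CREATE EXTERNAL TABLE ext_events (id BIGINT, payload STRING)
  STORED AS PARQUET
  LOCATION 'ofs://host:port/volume/bucket/events';

-- Queries then read the data files through the ofs protocol.
SELECT COUNT(*) FROM ext_events;
```

A fully qualified URI like this should only be needed when the location is not on the default filesystem; otherwise a plain path is enough.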
Because Apache Ozone storage buckets use a global value for the block size rather than a
configurable value for each file, the
PARQUET_FILE_SIZE query option has no
effect when Impala inserts data into a table or partition residing on Ozone storage.
Impala's spill-to-disk feature may be configured to use Ozone storage by specifying a full URI
(ofs://host:port/volume/bucket/key) for the spill location.
See Spill to remote storage for details on configuring remote spill-to-disk.