Impala with Ozone
You can use Impala to query data files that reside on Apache Ozone distributed storage, rather than in HDFS.
The typical use case for Impala and Ozone together is to use Ozone for the default filesystem,
replacing HDFS entirely. In this configuration, when you create a database, table, or partition,
the data always resides on Ozone storage and you do not need to specify any special LOCATION attribute.
If you do specify a LOCATION attribute, its value refers to a path within the Ozone filesystem.
For example, if the default filesystem is Ozone, all Impala databases and tables, and the data they contain, reside there:
CREATE TABLE t1 (x INT, s STRING);
You can specify LOCATION for a database, table, or partition, using a path within the Ozone filesystem:
CREATE DATABASE d1 LOCATION '/some/path/on/ozone/server/d1.db';
CREATE TABLE d1.t2 (a TINYINT, b BOOLEAN);
Impala can write to, delete, and rename data files and database, table, and partition
directories on Ozone storage. Therefore, Impala statements such as CREATE TABLE, DROP TABLE,
CREATE DATABASE, DROP DATABASE, ALTER TABLE, and INSERT work the same with Ozone storage as
with HDFS.
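As a minimal sketch of the statements listed above, assuming Ozone is the default filesystem: the database name ozone_demo, the table names, and the partition column yr are hypothetical and chosen only for illustration. Nothing in these statements is Ozone-specific, which is the point.

```sql
-- Hypothetical names throughout; assumes Ozone is the default filesystem.
CREATE DATABASE IF NOT EXISTS ozone_demo;

-- The table directory is created under the database directory on Ozone.
CREATE TABLE ozone_demo.events (id BIGINT, msg STRING)
  PARTITIONED BY (yr INT)
  STORED AS PARQUET;

-- Writes a new data file into the yr=2024 partition directory.
INSERT INTO ozone_demo.events PARTITION (yr=2024) VALUES (1, 'first row');

-- Adds another partition to the table.
ALTER TABLE ozone_demo.events ADD IF NOT EXISTS PARTITION (yr=2025);

-- Renames the table.
ALTER TABLE ozone_demo.events RENAME TO ozone_demo.events_archive;

-- Removes the table, and then the now-empty database.
DROP TABLE ozone_demo.events_archive;
DROP DATABASE ozone_demo;
```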
Ozone supports multiple protocols: ofs, o3fs, and s3a. Impala supports reading ofs and o3fs.
Impala can also read s3a (see Impala with Amazon S3 for more information).
However, ofs is Ozone's newer protocol, and the only one Impala supports as a
default filesystem. We recommend using ofs for DDL statements to avoid access limitations, and for
DML and SELECT statements for better performance.
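When Ozone is not the default filesystem, a table can still point at Ozone through a full ofs URI in its LOCATION clause. The sketch below assumes the ofs://host:port/volume/bucket/key layout; the authority ozone1, the volume vol1, the bucket bucket1, and the table name logs_ofs are hypothetical placeholders to replace with values from your own cluster.

```sql
-- Hypothetical authority, volume, bucket, and table name.
-- Layout assumed: ofs://<host:port or service id>/<volume>/<bucket>/<key>
CREATE EXTERNAL TABLE logs_ofs (ts TIMESTAMP, line STRING)
  STORED AS TEXTFILE
  LOCATION 'ofs://ozone1/vol1/bucket1/logs';

-- Queries read the files through the ofs protocol.
SELECT COUNT(*) FROM logs_ofs;
```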
Because Apache Ozone storage buckets use a global value for the block size rather than a
configurable value for each file, the PARQUET_FILE_SIZE query option has no
effect when Impala inserts data into a table or partition residing on Ozone storage.
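To illustrate, the following sketch sets the option before an insert into a Parquet table. The table names sales_parquet and sales_staging are hypothetical, and sales_parquet is assumed to reside on Ozone. The statements run normally, but per the limitation described above the requested size is not applied.

```sql
-- Hypothetical tables; sales_parquet is assumed to reside on Ozone.
SET PARQUET_FILE_SIZE=256m;

-- The insert succeeds, but the 256 MB target is not applied on Ozone,
-- where the block size is a global setting rather than a per-file one.
INSERT INTO sales_parquet SELECT * FROM sales_staging;
```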
Impala's spill-to-disk feature may be configured to use Ozone storage by specifying a full URI
(e.g. ofs://host:port/volume/bucket/key) for the spill location.
See Spill to remote storage for details on configuring remote spill-to-disk.