Using ORC Data Files
Impala can read ORC data files.
Creating ORC Tables and Loading Data
To create a table in the ORC format, use the STORED AS
ORC
clause in the CREATE TABLE
statement.
Because Impala can query ORC tables but cannot currently write to, after creating ORC tables, use the Hive shell to load the data.
After loading data into a table through Hive or other mechanism outside
of Impala, issue a REFRESH
table_name
statement the next time you
connect to the Impala node, before querying the table, to make Impala
recognize the new data.
For example, here is how you might create some ORC tables in Impala (by specifying the columns explicitly, or cloning the structure of another table), load data through Hive, and query them through Impala:
$ impala-shell -i localhost
[localhost:21000] default> CREATE TABLE orc_table (x INT) STORED AS ORC;
[localhost:21000] default> CREATE TABLE orc_clone LIKE some_other_table STORED AS ORC;
[localhost:21000] default> quit;
$ hive
hive> INSERT INTO TABLE orc_table SELECT x FROM some_other_table;
3 Rows loaded to orc_table
Time taken: 4.169 seconds
hive> quit;
$ impala-shell -i localhost
[localhost:21000] default> -- Make Impala recognize the data loaded through Hive;
[localhost:21000] default> REFRESH orc_table;
[localhost:21000] default> SELECT * FROM orc_table;
Fetched 3 row(s) in 0.11s
Data Type Considerations for ORC Tables
The ORC format defines a set of data types whose names differ from the names of the corresponding Impala data types. The following figure lists the ORC-defined types and the equivalent types in Impala.
Primitive types:
ORC type | Impala type |
---|---|
BINARY | STRING |
BOOLEAN | BOOLEAN |
DOUBLE | DOUBLE |
FLOAT | FLOAT |
TINYINT | TINYINT |
SMALLINT | SMALLINT |
INT | INT |
BIGINT | BIGINT |
TIMESTAMP | TIMESTAMP |
DATE | Not supported |
Complex types:
Impala supports the complex types
ARRAY
, STRUCT
, and
MAP
in ORC files. Because Impala has better
performance on Parquet than ORC, if you plan to use complex types,
become familiar with the performance and storage aspects of Parquet
first.
Enabling Compression for ORC Tables
ORC tables are in ZLIB (Deflate in Impala) compression by default. The supported compressions for ORC tables are NONE, ZLIB, and SNAPPY.
Set the compression when you load ORC tables in Hive.