Using ORC Data Files

Impala can read ORC data files as an experimental feature.

Creating ORC Tables and Loading Data

To enable the support of ORC data files:
  1. In Cloudera Manager, navigate to Clusters > Impala.
  2. In the Configuration tab, set --enable_orc_scanner=true in the Impala Command Line Argument Advanced Configuration Snippet (Safety Valve)field.
  3. Restart the cluster.

To create a table in the ORC format, use the STORED AS ORC clause in the CREATE TABLE statement.

Because Impala can query ORC tables but cannot currently write to, after creating ORC tables, use the Hive shell to load the data.

After loading data into a table through Hive or other mechanism outside of Impala, issue a REFRESH table_name statement the next time you connect to the Impala node, before querying the table, to make Impala recognize the new data.

For example, here is how you might create some ORC tables in Impala (by specifying the columns explicitly, or cloning the structure of another table), load data through Hive, and query them through Impala:

$ impala-shell -i localhost
[localhost:21000] default> CREATE TABLE orc_table (x INT) STORED AS ORC;
[localhost:21000] default> CREATE TABLE orc_clone LIKE some_other_table STORED AS ORC;
[localhost:21000] default> quit;

$ hive
hive> INSERT INTO TABLE orc_table SELECT x FROM some_other_table;
3 Rows loaded to orc_table
Time taken: 4.169 seconds
hive> quit;

$ impala-shell -i localhost
[localhost:21000] default> -- Make Impala recognize the data loaded through Hive;
[localhost:21000] default> REFRESH orc_table;
[localhost:21000] default> SELECT * FROM orc_table;

Fetched 3 row(s) in 0.11s

Data Type Considerations for ORC Tables

The ORC format defines a set of data types whose names differ from the names of the corresponding Impala data types. If you are preparing ORC files using other Hadoop components such as Pig or MapReduce, you might need to work with the type names defined by ORC. The following figure lists the ORC-defined types and the equivalent types in Impala.

Primitive types:

ORC type Impala type
BINARY STRING
BOOLEAN BOOLEAN
DOUBLE DOUBLE
FLOAT FLOAT
TINYINT TINYINT
SMALLINT SMALLINT
INT INT
BIGINT BIGINT
TIMESTAMP TIMESTAMP
DATE Not supported

Complex types:

Impala supports the complex types ARRAY, STRUCT, and MAP in ORC files. Because Impala has better performance on Parquet than ORC, if you plan to use complex types, become familiar with the performance and storage aspects of Parquet first.

Enabling Compression for ORC Tables

ORC tables are in ZLIB (Deflate in Impala) compression by default. The supported compressions for ORC tables are NONE, ZLIB, SNAPPY and LZO.

Set the compression when you load ORC tables in Hive.