Using RCFile Data Files
Impala supports creating and querying RCFile tables.
Creating RCFile Tables and Loading Data
To create a table in the RCFile format, use the STORED AS
RCFILE
clause in the CREATE TABLE
statement.
Because Impala can query RCFile tables but cannot currently write to, after creating RCFile tables, use the Hive shell to load the data.
After loading data into a table through Hive or other mechanism outside
of Impala, issue a REFRESH
table_name
statement the next time you
connect to the Impala node, before querying the table, to make Impala
recognize the new data.
Query Performance for RCFile Tables
In general, expect query performance with RCFile tables to be faster than with tables using text data, but slower than with Parquet tables.
Enabling Compression for RCFile Tables
You may want to enable compression on existing tables. Enabling compression provides performance gains in most cases and is supported for RCFile tables.
The compression type is specified in the SET
mapred.output.compression.codec
command.
For example, to enable Snappy compression, you would specify the following additional settings when loading data through the Hive shell.
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;