Appropriate file formats
Following are recommendations for which file formats provide the best performance in Impala.
By: Manish Maheshwari, Data Architect and Data Scientist at Cloudera, Inc.
- For BI queries, the Parquet file format performs best because of its
combination of columnar storage layout, compression, and encoding.
The default setting for COMPRESSION_CODEC is Snappy compression, but GZip compression is also supported.
- Impala also supports reading the ORC file format from version 2.12 onward. However, expect query performance with ORC tables to be slower than with Parquet tables.
- Text formats can be used when all columns are retrieved from a table. However, because text compresses less effectively, HDFS I/O could take longer than when you use the Parquet file format.
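
As a minimal sketch of the Parquet recommendation above, the following statements convert a text table to Parquet with GZip compression in place of the default Snappy. The table names sales_text and sales_parquet are hypothetical and used only for illustration; the sketch assumes sales_text already exists as a text-format table.

```sql
-- Hypothetical tables: sales_text (existing text table), sales_parquet (new).

-- Switch the codec for this session only; Snappy is the default.
SET COMPRESSION_CODEC=gzip;

-- Rewrite the text data as a Parquet table using the codec set above.
CREATE TABLE sales_parquet STORED AS PARQUET
AS SELECT * FROM sales_text;

-- Return to Snappy for later statements in the session.
SET COMPRESSION_CODEC=snappy;

-- Compare the file format and on-disk size of the two tables.
SHOW TABLE STATS sales_text;
SHOW TABLE STATS sales_parquet;
```

GZip generally produces smaller files than Snappy at a higher CPU cost, so whether it helps depends on the workload; comparing the sizes reported by SHOW TABLE STATS and timing representative queries is one way to decide.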