Appropriate file formats

Following are recommendations for which file formats provide the best performance in Impala.

By: Manish Maheshwari, Data Architect and Data Scientist at Cloudera, Inc.

  • For BI queries, the Parquet file format performs best because of its combination of columnar storage layout, compression, and encoding. The default setting for COMPRESSION_CODEC is snappy compression, but GZip compression is also supported.
  • Impala also supports reading ORC file formats from version 2.12 and onwards, however expect query performance with ORC tables to be slower than it is with Parquet tables.
  • Text formats can be used when all columns are retrieved from a table. However, because compression on text is lower, HDFS I/O could be longer than when you use the Parquet file format.