ORC vs Parquet in CDP

Understanding the differences between the Optimized Row Columnar (ORC) file format, used for storing Hive data, and the Parquet format, used for storing Impala data, helps you choose the right one. Query performance improves when you use the appropriate format for your application.
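As an illustration of matching the format to the engine, the following minimal sketch builds a CREATE TABLE statement with a STORED AS clause. It assumes, as the paragraph above suggests, ORC for Hive tables and Parquet for Impala tables. The helper name `create_table_ddl`, the `RECOMMENDED_FORMAT` mapping, and the `sales` table are hypothetical and not part of any CDP API.

```python
# Assumption: ORC for Hive and Parquet for Impala, per the introduction above.
# RECOMMENDED_FORMAT and create_table_ddl are hypothetical names for illustration.
RECOMMENDED_FORMAT = {"hive": "ORC", "impala": "PARQUET"}


def create_table_ddl(engine, table, columns):
    """Return a CREATE TABLE statement using the format assumed for the engine.

    engine  -- "hive" or "impala" (case-insensitive)
    table   -- table name
    columns -- list of (column_name, sql_type) pairs
    """
    fmt = RECOMMENDED_FORMAT[engine.lower()]
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    return f"CREATE TABLE {table} ({cols}) STORED AS {fmt}"


if __name__ == "__main__":
    schema = [("id", "INT"), ("amount", "DOUBLE")]
    print(create_table_ddl("hive", "sales", schema))
    print(create_table_ddl("impala", "sales", schema))
```

The sketch only produces the statement text. Running it against a cluster would require a Hive or Impala client, which is outside the scope of this example.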

ORC and Parquet capabilities comparison

The following table compares Hive and Impala support for ORC and Parquet in CDP Public Cloud and CDP Private Cloud Base. The Runtime Services column shows the supported services:
  • Hive-on-Tez
  • HiveLLAP, supported on CDP Public Cloud only
  • Hive metastore (HMS)
  • Impala
  • Spark
  • JDBC
Table 1.
Capability                       Data Warehouse  ORC  Parquet  Runtime Services
Read non-transactional data      Apache Hive                   (Hive-on-Tez | HiveLLAP) & HMS
Read non-transactional data      Apache Impala                 Impala & HMS
Full ACID transactions           Apache Hive                   (Hive-on-Tez | HiveLLAP) & HMS
Read Insert-only transactions    Apache Impala                 Impala & HMS
Hive Warehouse Connector reads   Apache Hive                   ((Hive-on-Tez & JDBC) | HiveLLAP) & Spark & HMS
Hive Warehouse Connector writes  Apache Hive                   ((Hive-on-Tez & JDBC) | HiveLLAP) & Spark & HMS
Column index                     Apache Hive                   (Hive-on-Tez | HiveLLAP) & HMS
Column index                     Apache Impala                 Impala & HMS
CBO uses column metadata         Apache Hive                   (Hive-on-Tez | HiveLLAP) & HMS
Recommended format               Apache Hive                   (Hive-on-Tez | HiveLLAP) & HMS
Recommended format               Apache Impala                 Impala & HMS
Vectorized reader                Apache Hive                   (Hive-on-Tez | HiveLLAP) & HMS
Read complex types               Apache Impala                 Impala & HMS
Read/write complex types         Apache Hive                   (Hive-on-Tez | HiveLLAP) & HMS
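
The Runtime Services entries in the table combine service names with `|`, `&`, and parentheses. The sketch below shows one way to check whether a set of running services satisfies such an entry. It assumes that `|` means "either service" and `&` means "both services", with `&` binding more tightly than `|`. That reading is an assumption, and the function names `tokenize` and `services_satisfied` are hypothetical.

```python
import re

# One token is a parenthesis, an operator, or a service name such as Hive-on-Tez.
_TOKEN = re.compile(r"\s*([()|&]|[A-Za-z][A-Za-z0-9-]*)")


def tokenize(expr):
    """Split a Runtime Services expression into a list of tokens."""
    expr = expr.strip()
    tokens, pos = [], 0
    while pos < len(expr):
        match = _TOKEN.match(expr, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def services_satisfied(expr, available):
    """Return True if the services in `available` satisfy `expr`.

    Assumption: '|' is read as OR and '&' as AND.
    """
    tokens = tokenize(expr)
    pos = 0

    def parse_or():
        nonlocal pos
        value = parse_and()
        while pos < len(tokens) and tokens[pos] == "|":
            pos += 1
            rhs = parse_and()
            value = value or rhs
        return value

    def parse_and():
        nonlocal pos
        value = parse_atom()
        while pos < len(tokens) and tokens[pos] == "&":
            pos += 1
            rhs = parse_atom()
            value = value and rhs
        return value

    def parse_atom():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        token = tokens[pos]
        pos += 1
        if token == "(":
            value = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return value
        if token in (")", "|", "&"):
            raise ValueError(f"unexpected token {token!r}")
        return token in available

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result


if __name__ == "__main__":
    hwc = "((Hive-on-Tez & JDBC) | HiveLLAP) & Spark & HMS"
    print(services_satisfied(hwc, {"HiveLLAP", "Spark", "HMS"}))
    print(services_satisfied(hwc, {"Hive-on-Tez", "Spark", "HMS"}))
```

Under the stated reading, the first call returns True because HiveLLAP alone satisfies the parenthesized group, and the second returns False because Hive-on-Tez without JDBC does not.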