Apache Hive Performance Tuning

Maximizing storage resources using ORC

You can conserve storage in a number of ways, but using the Optimized Row Columnar (ORC) file format is most effective.

The ORC file format for data storage is recommended for the following reasons:

  • Efficient compression: Stored as columns and compressed, which leads to smaller disk reads. The columnar format is also ideal for vectorization optimizations in Tez.
  • Fast reads: ORC has a built-in index, min/max values, and other aggregates that allow entire stripes to be skipped during reads. In addition, predicate pushdown pushes filters into reads so that minimal rows are read, and Bloom filters further reduce the number of rows that are returned.
  • Proven in large-scale deployments: Facebook uses the ORC file format for a 300+ PB deployment.

ORC provides the best Hive performance overall. In addition to specifying the storage format, you can also specify a compression algorithm for the table, as shown in the following example:

CREATE TABLE addresses (
 name string,
 street string,
 city string,
 state string,
 zip int
 ) STORED AS orc TBLPROPERTIES ("orc.compress"="ZLIB");

Setting the compression algorithm is usually not required because your Hive settings include a default algorithm. Using ORC advanced properties, you can create Bloom filters for columns frequently used in point lookups.
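For example, Bloom filters can be enabled per column through table properties. The following sketch assumes a hypothetical customers table; the orc.bloom.filter.columns and orc.bloom.filter.fpp (false-positive probability) properties are standard ORC settings:

CREATE TABLE customers (
 id bigint,
 name string,
 email string
 ) STORED AS orc TBLPROPERTIES (
   "orc.bloom.filter.columns"="id,email",
   "orc.bloom.filter.fpp"="0.05");

With this table definition, point-lookup queries such as WHERE email = '...' can skip row groups whose Bloom filter rules out the value.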

You can read a table and create a copy in ORC using the following command:

CREATE TABLE a_orc STORED AS ORC AS SELECT * FROM A; 

A common practice is to store data in HDFS as text, create a Hive external table over it, and then store the data as ORC inside Hive where it becomes a Hive-managed table.
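This pattern can be sketched as follows. The addresses_text external table, the delimiter, and the HDFS path are illustrative assumptions:

-- External table over raw text files already in HDFS
CREATE EXTERNAL TABLE addresses_text (
 name string,
 street string,
 city string,
 state string,
 zip int
 ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
 STORED AS TEXTFILE
 LOCATION '/data/addresses';

-- Hive-managed copy stored as ORC
CREATE TABLE addresses_orc STORED AS orc
 AS SELECT * FROM addresses_text;

Dropping the external table afterward removes only its metadata; the managed ORC table holds the converted data.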

Hive supports Parquet and other formats. You can also write your own SerDes (Serializers, Deserializers) interface to support custom file formats.
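Choosing another format follows the same DDL shape. The example below is a minimal sketch: the first table uses the built-in Parquet support, and the second shows how a SerDe class is attached (OpenCSVSerde ships with Hive; the table names are illustrative):

CREATE TABLE addresses_parquet (
 name string,
 street string
 ) STORED AS parquet;

CREATE TABLE addresses_csv (
 name string,
 street string
 ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
 STORED AS TEXTFILE;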

Storage layer

While a Hive enterprise data warehouse (EDW) can run on one of a variety of storage layers, HDFS and Amazon S3 are among the most frequently used for data analytics in the Hadoop stack. Amazon S3 is commonly used in public cloud infrastructures.

A Hive EDW can store data on other file systems, including WASB and ADLS.

Depending on your environment, you can tune the file system to optimize Hive performance by configuring compression format, stripe size, partitions, and buckets.
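These tuning knobs can all be expressed in the table DDL. The following sketch assumes a hypothetical sales table; orc.stripe.size is the stripe size in bytes (64 MB here), and the partition column, bucket column, and bucket count are illustrative choices:

CREATE TABLE sales (
 id bigint,
 amount double
 )
 PARTITIONED BY (sale_date string)
 CLUSTERED BY (id) INTO 32 BUCKETS
 STORED AS orc TBLPROPERTIES (
   "orc.compress"="SNAPPY",
   "orc.stripe.size"="67108864");

Partitioning prunes whole directories at query time, bucketing clusters rows for joins and sampling, and the stripe size trades off read parallelism against per-stripe overhead.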