Cloud Data Access

Accelerating ORC and Parquet Reads

Use Random Read Policy

When reading binary ORC and Parquet datasets, you should configure Spark to use the S3A connector's random IO read policy, as described in Optimizing HTTP GET Requests for S3. With fs.s3a.experimental.input.fadvise set to random, rather than asking for the entire file in one HTTPS request (the "normal" operation), the S3A connector only asks for part of a file at a time. If it needs to seek backwards, the remaining data in this part is discarded, and then a new request is made on the same HTTPS connection. This reduces the time wasted on closing and opening new HTTPS connections.

This setting dramatically speeds up random access, but it actually reduces performance on queries that read sequentially through an entire file, so do not use the random setting for such jobs.
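For example, the policy can be set when the SparkSession is built by passing the property through Spark's spark.hadoop. prefix, which copies it into the Hadoop configuration used by the S3A connector. A minimal Scala sketch; the application name and S3A path are hypothetical:

import org.apache.spark.sql.SparkSession

// Any property prefixed with "spark.hadoop." is copied into the Hadoop
// configuration, so the S3A connector sees fs.s3a.experimental.input.fadvise.
val spark = SparkSession.builder()
  .appName("orc-random-read")  // hypothetical application name
  .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")
  .getOrCreate()

// Columnar reads now request byte ranges of the object rather than the
// whole file, which suits ORC and Parquet access patterns.
val df = spark.read.orc("s3a://example-bucket/datasets/events")  // hypothetical path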

Minimize Read and Write Operations for ORC

For optimal performance when reading files saved in the ORC format, read and write operations must be minimized. To achieve this, set the following options:

spark.sql.orc.filterPushdown true
spark.sql.hive.metastorePartitionPruning true

The spark.sql.orc.filterPushdown option enables the ORC library to skip unneeded columns and to use index information to filter out parts of the file where it can be determined that no rows match the predicate.

With the spark.sql.hive.metastorePartitionPruning option enabled, predicates are pushed down into the Hive metastore to eliminate unmatched partitions.
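As a sketch, both options can be set when the session is created and then exercised with a query against a Hive table; the table and filter below are hypothetical:

import org.apache.spark.sql.SparkSession

// Enable ORC filter pushdown and Hive metastore partition pruning.
val spark = SparkSession.builder()
  .appName("orc-pushdown-example")  // hypothetical application name
  .config("spark.sql.orc.filterPushdown", "true")
  .config("spark.sql.hive.metastorePartitionPruning", "true")
  .enableHiveSupport()  // partition pruning applies to Hive metastore tables
  .getOrCreate()

// The predicate on the partition column (year) is evaluated in the
// metastore; the predicate on the data column (value) is pushed into
// the ORC reader, which uses its indexes to skip non-matching stripes.
val rows = spark.sql(
  "SELECT id, value FROM sales WHERE year = 2017 AND value > 100")  // hypothetical table
rows.show()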

Minimize Read and Write Operations for Parquet

For optimal performance when reading files saved in the Parquet format, read and write operations must be minimized, including the generation of summary metadata and the coalescing of metadata from multiple files. The predicate pushdown option enables the Parquet library to use file and row-group statistics to skip data that cannot match the predicate, saving bandwidth. To achieve this, set the following options:

spark.hadoop.parquet.enable.summary-metadata false
spark.sql.parquet.mergeSchema false
spark.sql.parquet.filterPushdown true
spark.sql.hive.metastorePartitionPruning true
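A corresponding Scala sketch for Parquet, with all four options applied at session level; the dataset path and filter are hypothetical:

import org.apache.spark.sql.SparkSession

// Disable summary metadata and schema merging; enable predicate pushdown
// and metastore partition pruning.
val spark = SparkSession.builder()
  .appName("parquet-pushdown-example")  // hypothetical application name
  .config("spark.hadoop.parquet.enable.summary-metadata", "false")
  .config("spark.sql.parquet.mergeSchema", "false")
  .config("spark.sql.parquet.filterPushdown", "true")
  .config("spark.sql.hive.metastorePartitionPruning", "true")
  .getOrCreate()

// The filter is pushed into the Parquet reader, so row groups whose
// statistics rule out a match are never read from S3.
val df = spark.read
  .parquet("s3a://example-bucket/datasets/sales-parquet")  // hypothetical path
  .filter("value > 100")
df.show()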