Accelerating ORC and Parquet Reads
Use Random Read Policy
When reading binary ORC and Parquet datasets, you should configure Spark to use the S3A connector's random IO read policy, as described in Optimizing HTTP GET Requests for S3. With fs.s3a.experimental.input.fadvise set to random, rather than asking for the entire file in one HTTPS request (the "normal" operation), the S3A connector asks for only part of a file at a time. If it needs to seek backwards, the remaining data in this part is discarded, and then a new request is made on the same HTTPS connection. This reduces the time wasted on closing and opening new HTTPS connections.
This setting dramatically speeds up random access, but actually reduces performance on queries performing sequential reads through an entire file, so do not use the random setting for such jobs.
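As an illustration, the policy can be applied per application by passing the option through Spark's spark.hadoop.* configuration prefix. The sketch below assumes a SparkSession with the S3A connector on the classpath; the application name, bucket, and path are placeholders.

// Sketch: enable S3A random IO for a job doing columnar (ORC/Parquet) reads.
// The bucket and path below are placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("random-io-read")
  // Hadoop/S3A options are passed through with the spark.hadoop. prefix
  .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")
  .getOrCreate()

// Columnar reads benefit from random IO; avoid this setting for jobs
// that stream whole files sequentially.
val df = spark.read.orc("s3a://example-bucket/warehouse/events_orc/")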
Minimize Read and Write Operations for ORC
For optimal performance when reading files saved in the ORC format, read and write operations must be minimized. To achieve this, set the following options:
spark.sql.orc.filterPushdown true
spark.sql.hive.metastorePartitionPruning true
The spark.sql.orc.filterPushdown option enables the ORC library to skip unneeded columns and to use index information to filter out parts of the file where it can be determined that no rows match the predicate.
With the spark.sql.hive.metastorePartitionPruning option enabled, predicates are pushed down into the Hive metastore to eliminate unmatched partitions.
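As a sketch, both options can be set when the session is created. The example below assumes Spark with Hive support and a Hive-backed table named sales, partitioned by year; the table, column names, and filter are hypothetical.

// Sketch: ORC reads with filter pushdown and metastore partition pruning.
// Assumes a partitioned Hive table named "sales" (hypothetical).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("orc-read")
  .config("spark.sql.orc.filterPushdown", "true")
  .config("spark.sql.hive.metastorePartitionPruning", "true")
  .enableHiveSupport() // required for metastore-based partition pruning
  .getOrCreate()

// The partition predicate (year) is resolved against the metastore, and the
// remaining predicate is evaluated using ORC index data inside matching files.
val recent = spark.table("sales").filter("year = 2017 AND amount > 100")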
Minimize Read and Write Operations for Parquet
For optimal performance when reading files saved in the Parquet format, read and write operations must be minimized, including the generation of summary metadata and the coalescing of metadata from multiple files. The predicate pushdown option enables the Parquet library to skip unneeded columns, saving bandwidth. To achieve this, set the following options:
spark.hadoop.parquet.enable.summary-metadata false
spark.sql.parquet.mergeSchema false
spark.sql.parquet.filterPushdown true
spark.sql.hive.metastorePartitionPruning true
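One way to apply these settings, sketched below, is at session creation time; they can equally be placed in spark-defaults.conf. The path, column name, and filter value are placeholders.

// Sketch: Parquet reads with summary metadata and schema merging disabled,
// and predicate/partition pushdown enabled. The path is a placeholder.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-read")
  .config("spark.hadoop.parquet.enable.summary-metadata", "false")
  .config("spark.sql.parquet.mergeSchema", "false")
  .config("spark.sql.parquet.filterPushdown", "true")
  .config("spark.sql.hive.metastorePartitionPruning", "true")
  .getOrCreate()

// With filter pushdown, this predicate is checked against Parquet row-group
// statistics, so non-matching row groups are never read.
val events = spark.read
  .parquet("s3a://example-bucket/warehouse/events_parquet/")
  .filter("event_date = '2017-06-01'")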