Improving Hive Performance with S3/ADLS/WASB
Tune the following parameters to improve Hive performance when working with S3, ADLS, or WASB.
Table 6.1. Improving General Performance
Parameter | Recommended Setting |
---|---|
yarn.scheduler.capacity.node-locality-delay | Set this to "0". |
hive.warehouse.subdir.inherit.perms | Set this to "false" to reduce the number of file permission checks. |
hive.metastore.pre.event.listeners | Set this to an empty value to reduce the number of directory permission checks. |
You can set these parameters in hive-site.xml.
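As a sketch, the Table 6.1 settings above could be expressed in hive-site.xml as follows (the empty value for hive.metastore.pre.event.listeners disables the pre-event listeners, which skips the extra directory permission checks):

```xml
<!-- hive-site.xml: general performance settings for S3/ADLS/WASB -->
<property>
  <name>yarn.scheduler.capacity.node-locality-delay</name>
  <value>0</value>
</property>
<property>
  <name>hive.warehouse.subdir.inherit.perms</name>
  <value>false</value>
</property>
<property>
  <name>hive.metastore.pre.event.listeners</name>
  <value></value>
</property>
```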
Table 6.2. Accelerating ORC Reads in Hive
Parameter | Recommended Setting |
---|---|
hive.orc.compute.splits.num.threads | If you are using the ORC format and want to improve split computation time, set the value of this parameter to match the number of available processors. By default, it is set to 10. This parameter controls the number of parallel threads involved in computing splits. For Parquet, computing splits is still single-threaded, so split computation can take longer with Parquet and S3/ADLS/WASB. |
hive.orc.splits.include.file.footer | If you are using the ORC format with the ETL file split strategy, you can set this parameter to "true" to reuse existing file footer information in the split payload. |
You can set these parameters using the --hiveconf option in the Hive CLI or using the set command in Beeline.
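For example, the ORC settings above could be applied per session in either client; the value 16 below is an illustrative processor count, not a recommendation from this guide:

```
$ hive --hiveconf hive.orc.compute.splits.num.threads=16 \
       --hiveconf hive.orc.splits.include.file.footer=true

0: jdbc:hive2://...> set hive.orc.compute.splits.num.threads=16;
0: jdbc:hive2://...> set hive.orc.splits.include.file.footer=true;
```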
Table 6.3. Accelerating ETL Jobs
Parameter | Recommended Setting |
---|---|
 | Query launches can be slightly slower if there are no stats available. |
fs.trash.interval | Drop table can be slow in object stores such as S3 because the action involves moving files to the trash (a copy plus a delete). To remedy this, you can set fs.trash.interval=0 to skip the trash entirely. |
You can set these parameters using the --hiveconf option in the Hive CLI or using the set command in Beeline.
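As a sketch, trash can be disabled for a single Beeline session before dropping a large table; my_s3_table is a placeholder name:

```
0: jdbc:hive2://...> set fs.trash.interval=0;
0: jdbc:hive2://...> DROP TABLE my_s3_table;
```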
Accelerating Inserts in Hive
When inserting data, Hive moves data from a temporary folder to the final location. This move operation is actually a copy-plus-delete action, which is expensive in object stores such as S3; the more data being written to the object store, the more expensive the operation becomes.
To accelerate the process, you can tune hive.mv.files.thread depending on the size of your dataset (the default is 15). You can set it in hive-site.xml.
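As a sketch, a larger thread pool for a big dataset might look like this in hive-site.xml; the value 40 is an illustrative choice, not a recommendation from this guide:

```xml
<!-- Number of threads used to move files from the temporary
     (scratch) directory to the final table location; default is 15 -->
<property>
  <name>hive.mv.files.thread</name>
  <value>40</value>
</property>
```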