Hive Auto Convert Join Noconditional Size |
If Hive auto convert join is on, and the sum of the sizes of n-1 of the tables/partitions in an n-way join is smaller than the specified size, the join is directly converted to a MapJoin (there is no conditional task). |
hive.auto.convert.join.noconditionaltask.size
|
20 MiB |
hiveserver2_auto_convert_join_noconditionaltask_size
|
false |
Store Intermediate Data on Blobstore |
When writing data to a table on a blobstore (such as S3), whether or not the blobstore should be used to store intermediate data during Hive query execution. Setting this to true can degrade performance for queries that spawn multiple MR / Spark jobs, but is useful for queries whose intermediate data cannot fit in the allocated HDFS cluster. |
hive.blobstore.use.blobstore.as.scratchdir
|
false |
hiveserver2_blobstore_use_blobstore_as_scratchdir
|
false |
Enable Stats Optimization |
Enable optimization that checks if a query can be answered using statistics. If so, answers the query using only statistics stored in metastore. |
hive.compute.query.using.stats
|
false |
hiveserver2_compute_query_using_stats
|
false |
Hive on Spark Dynamic Partition Pruning for MapJoins |
Enables Dynamic Partition Pruning (DPP) for Hive on Spark jobs. DPP prunes partitions at runtime; it is triggered when a filter on a partitioned column cannot be evaluated at compile time. Only enables DPP for MapJoins where the join is on the partitioned column, and the partitioned table is treated as the big table. |
hive.spark.dynamic.partition.pruning.map.join.only
|
false |
hiveserver2_dynamic_partition_pruning_map_join_only
|
false |
Enable Cost-Based Optimizer for Hive |
Enable the Calcite-based Cost-Based Optimizer for HiveServer2. |
hive.cbo.enable
|
true |
hiveserver2_enable_cbo
|
false |
Enable MapJoin Optimization |
Enable optimization that converts common join into MapJoin based on input file size. |
hive.auto.convert.join
|
true |
hiveserver2_enable_mapjoin
|
false |
Fetch Task Query Conversion |
Some select queries can be converted to a single FETCH task instead of a MapReduce task, minimizing latency. A value of none disables all conversion, minimal converts simple queries such as SELECT * and filter on partition columns, and more converts SELECT queries including FILTERS. |
hive.fetch.task.conversion
|
minimal |
hiveserver2_fetch_task_conversion
|
false |
Fetch Task Query Conversion Threshold |
Input size threshold for fetch task conversion. Queries whose input exceeds this size are not converted to fetch tasks. |
hive.fetch.task.conversion.threshold
|
256 MiB |
hiveserver2_fetch_task_conversion_threshold
|
false |
Input Listing Max Threads |
Maximum number of threads that Hive uses to list input files. Increasing this value can improve performance when there are a lot of partitions being read, or when running on blobstores. |
hive.exec.input.listing.max.threads
|
15 |
hiveserver2_input_listing_max_threads
|
false |
Maximum ReduceSink Top-K Memory Usage |
The maximum fraction of heap to be used for the hash in the ReduceSink operator for Top-K selection. 0 means the optimization is disabled. Accepted values are between 0 and 1. |
hive.limit.pushdown.memory.usage
|
0.1 |
hiveserver2_limit_pushdown_memory_usage
|
false |
Load Dynamic Partitions Thread Count |
Number of threads used to load dynamically generated partitions. Loading requires renaming the file to its final location and updating some metadata about the new partition. Increasing this can improve performance when many partitions are dynamically generated. |
hive.load.dynamic.partitions.thread
|
15 |
hiveserver2_load_dynamic_partitions_thread_count
|
false |
Enable Map-Side Aggregation |
Enable map-side partial aggregation, which causes the mapper to generate fewer rows. This reduces the data to be sorted and distributed to reducers. |
hive.map.aggr
|
true |
hiveserver2_map_aggr
|
false |
Ratio of Memory Usage for Map-Side Aggregation |
Portion of total memory used in map-side partial aggregation. When exceeded, the partially aggregated results will be flushed from the map task to the reducers. |
hive.map.aggr.hash.percentmemory
|
0.5 |
hiveserver2_map_aggr_hash_memory_ratio
|
false |
Enable Merging Small Files - Map-Only Job |
Merge small files at the end of a map-only job. When enabled, a map-only job is created to merge the files in the destination table/partitions. |
hive.merge.mapfiles
|
true |
hiveserver2_merge_mapfiles
|
false |
Enable Merging Small Files - Map-Reduce Job |
Merge small files at the end of a map-reduce job. When enabled, a map-only job is created to merge the files in the destination table/partitions. |
hive.merge.mapredfiles
|
false |
hiveserver2_merge_mapredfiles
|
false |
Desired File Size After Merging |
The desired file size after merging. This should be larger than hive.merge.smallfiles.avgsize. |
hive.merge.size.per.task
|
256 MiB |
hiveserver2_merge_size_per_task
|
false |
Small File Average Size Merge Threshold |
When the average output file size of a job is less than the value of this property, Hive will start an additional map-only job to merge the output files into bigger files. This is only done for map-only jobs if hive.merge.mapfiles is true, for map-reduce jobs if hive.merge.mapredfiles is true, and for Spark jobs if hive.merge.sparkfiles is true. |
hive.merge.smallfiles.avgsize
|
16 MiB |
hiveserver2_merge_smallfiles_avgsize
|
false |
Enable Merging Small Files - Spark Job |
Merge small files at the end of a Spark job. When enabled, a map-only job is created to merge the files in the destination table/partitions. |
hive.merge.sparkfiles
|
true |
hiveserver2_merge_sparkfiles
|
false |
MSCK Repair Batch Size |
Batch size for the msck repair command (recover partitions command). If the value is greater than zero, new partition information will be sent from HiveServer2 to the Metastore in batches, which can potentially improve memory usage in the Metastore and avoid client read timeout exceptions. If this value is 0, all partition information will be sent in a single Thrift call. |
hive.msck.repair.batch.size
|
0 |
hiveserver2_msck_repair_batch_size
|
false |
Move Files Thread Count |
The number of threads used by HiveServer2 to move data from the staging directory to another location (typically to the final table location). A separate thread pool of workers of this size is used for each query, which means this configuration can be set on a per-query basis too. |
hive.mv.files.thread
|
15 |
hiveserver2_mv_files_thread
|
false |
Hive Optimize Sorted Merge Bucket Join |
Whether to try sorted merge bucket (SMB) join. |
hive.optimize.bucketmapjoin.sortedmerge
|
false |
hiveserver2_optimize_bucketmapjoin_sortedmerge
|
false |
Enable Automatic Use of Indexes |
Whether to use the indexing optimization for all queries. |
hive.optimize.index.filter
|
true |
hiveserver2_optimize_index_filter
|
false |
Enable ReduceDeDuplication Optimization |
Remove extra map-reduce jobs if the data is already clustered by the same key, eliminating the need to repartition the dataset again. |
hive.optimize.reducededuplication
|
true |
hiveserver2_optimize_reducededuplication
|
false |
Minimum Reducers for ReduceDeDuplication Optimization |
When the number of ReduceSink operators after merging is less than this number, the ReduceDeDuplication optimization will be disabled. |
hive.optimize.reducededuplication.min.reducer
|
4 |
hiveserver2_optimize_reducededuplication_min_reducer
|
false |
Enable Sorted Dynamic Partition Optimizer |
When dynamic partition is enabled, reducers keep only one record writer at all times, which lowers the memory pressure on reducers. |
hive.optimize.sort.dynamic.partition
|
false |
hiveserver2_optimize_sort_dynamic_partition
|
false |
Enable Parallel Compilation of Queries |
When activated, individual sessions can compile queries simultaneously. Within each session, queries compile one at a time. |
hive.driver.parallel.compilation
|
false |
hiveserver2_parallel_compilation_enabled
|
false |
Query Compilation Degree of Parallelism |
Determines the maximum number of queries that can compile in parallel on a HiveServer2 instance. Use negative values or zero to set unlimited parallelism. Use a positive value to set the number of queries that can compile simultaneously. This setting can be fine-tuned based on the current cluster load. Monitor cluster load using the 'waiting_compile_ops' metric and the 'Waiting Compile Operations' graph in the HiveServer2 graph library. |
hive.driver.parallel.compilation.global.limit
|
3 |
hiveserver2_parallel_compilation_global_limit
|
false |
Hive SMB Join Cache Rows |
The number of rows with the same key value to be cached in memory per SMB-joined table. |
hive.smbjoin.cache.rows
|
10000 |
hiveserver2_smbjoin_cache_rows
|
false |
Load Column Statistics |
Whether column stats for a table are fetched during explain. |
hive.stats.fetch.column.stats
|
true |
hiveserver2_stats_fetch_column_stats
|
false |
Vectorized Adapter Usage Mode |
Vectorized Adaptor Usage Mode specifies the extent to which the vectorization engine tries to vectorize UDFs that do not have native vectorized versions available. Selecting the "none" option specifies that only queries using native vectorized UDFs are vectorized. Selecting the "chosen" option specifies that Hive chooses to vectorize a subset of the UDFs based on performance benefits using the Vectorized Adaptor. Selecting the "all" option specifies that the Vectorized Adaptor be used for all UDFs even when native vectorized versions are not available. |
hive.vectorized.adaptor.usage.mode
|
chosen |
hiveserver2_vectorized_adaptor_usage_mode
|
false |
Enable Vectorization Optimization |
Enable optimization that vectorizes query execution by processing a block of 1024 rows at a time. |
hive.vectorized.execution.enabled
|
true |
hiveserver2_vectorized_enabled
|
false |
Vectorized GroupBy Check Interval |
In vectorized group-by, the number of row entries added to the hash table before re-checking average variable size for memory usage estimation. |
hive.vectorized.groupby.checkinterval
|
4096 |
hiveserver2_vectorized_groupby_checkinterval
|
false |
Vectorized GroupBy Flush Ratio |
Fraction (between 0.0 and 1.0) of the entries in the vectorized group-by aggregation hash table that are flushed when the memory threshold is exceeded. |
hive.vectorized.groupby.flush.percent
|
0.1 |
hiveserver2_vectorized_groupby_flush_ratio
|
false |
Enable Vectorized Input Format |
If enabled, Hive uses the native vectorized input format for vectorized query execution when it is available. |
hive.vectorized.use.vectorized.input.format
|
true |
hiveserver2_vectorized_input_format_enabled
|
false |
Exclude Vectorized Input Formats |
Specifies a list of file input format classnames to exclude from vectorized query execution using the vectorized input format. Note that vectorized execution can still occur for an excluded input format based on whether row SerDes or vector SerDes are enabled. |
hive.vectorized.input.format.excludes
|
org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat |
hiveserver2_vectorized_input_format_excludes
|
false |
Enable Reduce-Side Vectorization |
Whether to vectorize the reduce side of query execution. |
hive.vectorized.execution.reduce.enabled
|
true |
hiveserver2_vectorized_reduce_enabled
|
false |
Enable Overflow-checked Vector Expressions |
To enhance performance, vectorized expressions operate using wide data types like long and double. When wide data types are used, numeric overflows can occur during expression evaluation in a different manner for vectorized expressions than they do for non-vectorized expressions. Consequently, different query results can be returned for vectorized expressions compared to results returned for non-vectorized expressions. When this configuration is enabled, Hive uses vectorized expressions that handle numeric overflows in the same way as non-vectorized expressions are handled. |
hive.vectorized.use.checked.expressions
|
true |
hiveserver2_vectorized_use_checked_expressions
|
false |
Vectorize Using Vector SerDes |
If enabled, Hive uses built-in vector SerDes to process text and sequencefile tables for vectorized query execution. |
hive.vectorized.use.vector.serde.deserialize
|
false |
hiveserver2_vectorized_use_vector_serde_deserialize
|
false |
Maximum Process File Descriptors |
If configured, overrides the process soft and hard rlimits (also called ulimits) for file descriptors to the configured value. |
|
|
rlimit_fds
|
false |
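
The rule described for hive.auto.convert.join.noconditionaltask.size can be illustrated with a small sketch. This is a simplified model, not Hive's implementation: it assumes the largest input is the streamed big table, takes input sizes as plain byte counts, and uses the 20 MiB default from the table. The function name `converts_directly_to_mapjoin` is hypothetical.

```python
MIB = 1024 * 1024


def converts_directly_to_mapjoin(input_sizes_bytes, threshold_bytes=20 * MIB):
    """Return True if the n-1 smaller inputs of an n-way join fit under the threshold.

    Simplified assumption: the largest input is the streamed "big" table,
    and all other inputs must be held in memory for the MapJoin.
    """
    if len(input_sizes_bytes) < 2:
        return False  # a join needs at least two inputs
    small_side_total = sum(input_sizes_bytes) - max(input_sizes_bytes)
    # "smaller than the specified size" is modeled as a strict comparison
    return small_side_total < threshold_bytes


if __name__ == "__main__":
    # 8 MiB + 4 MiB of small tables: under the 20 MiB default
    print(converts_directly_to_mapjoin([500 * MIB, 8 * MIB, 4 * MIB]))
    # 15 MiB + 10 MiB of small tables: over the 20 MiB default
    print(converts_directly_to_mapjoin([500 * MIB, 15 * MIB, 10 * MIB]))
```

Raising the threshold lets more joins skip the conditional task, at the cost of holding more data in memory per task.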
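
The interaction between hive.merge.smallfiles.avgsize and the three hive.merge.* flags can be sketched as follows. This is an illustrative model of the description in the table, not Hive's code: the job-type labels, the `needs_merge_job` function, and the dictionary layout are assumptions, while the default values are the ones listed above.

```python
MIB = 1024 * 1024

# Hypothetical mapping from job type to the flag that gates merging for it.
MERGE_FLAG_BY_JOB_TYPE = {
    "map-only": "hive.merge.mapfiles",
    "map-reduce": "hive.merge.mapredfiles",
    "spark": "hive.merge.sparkfiles",
}

# Default values as listed in the table above.
DEFAULTS = {
    "hive.merge.mapfiles": True,
    "hive.merge.mapredfiles": False,
    "hive.merge.sparkfiles": True,
    "hive.merge.smallfiles.avgsize": 16 * MIB,
}


def needs_merge_job(job_type, output_file_sizes, conf=DEFAULTS):
    """Return True if an additional merge job would be started for this output."""
    if not output_file_sizes:
        return False  # nothing was written, so there is nothing to merge
    if not conf[MERGE_FLAG_BY_JOB_TYPE[job_type]]:
        return False  # merging is disabled for this kind of job
    average = sum(output_file_sizes) / len(output_file_sizes)
    return average < conf["hive.merge.smallfiles.avgsize"]


if __name__ == "__main__":
    small_outputs = [4 * MIB, 8 * MIB]  # average of 6 MiB
    for job_type in MERGE_FLAG_BY_JOB_TYPE:
        print(job_type, needs_merge_job(job_type, small_outputs))
```

With the defaults, the same small outputs trigger a merge for map-only and Spark jobs but not for map-reduce jobs, because hive.merge.mapredfiles defaults to false.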
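
The batching behavior described for hive.msck.repair.batch.size can be pictured with a short sketch. It only models how a list of partitions would be grouped into calls. The function `partition_batches` and the partition strings are hypothetical, and values below zero are treated like 0 here as an assumption.

```python
def partition_batches(partitions, batch_size=0):
    """Group partitions into the batches that would each be sent in one call.

    A batch_size of 0 (the default in the table) sends everything in one call.
    """
    partitions = list(partitions)
    if not partitions:
        return []
    if batch_size <= 0:
        return [partitions]  # single call carrying all partition information
    return [
        partitions[start:start + batch_size]
        for start in range(0, len(partitions), batch_size)
    ]


if __name__ == "__main__":
    new_partitions = [f"dt=2024-01-{day:02d}" for day in range(1, 8)]
    for batch in partition_batches(new_partitions, batch_size=3):
        print(batch)
```

Smaller batches mean more calls to the Metastore, but each call carries less partition information.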