Configuring Hive on Spark
This topic explains the configuration properties you set up to run Hive on Spark.
Installation Considerations
For Hive to work on Spark, you must deploy Spark gateway roles on the same machine that hosts HiveServer2. Otherwise, Hive on Spark cannot read from Spark configurations and cannot submit Spark jobs. For more information about gateway roles, see Managing Roles.
After installation, run the following command in Hive so that Hive uses Spark as the back-end execution engine for all subsequent queries in that session.
set hive.execution.engine=spark;
Enabling Hive on Spark
By default, Hive on Spark is not enabled. To enable Hive on Spark, perform the following steps in Cloudera Manager.
- Go to the Hive service.
- Click the Configuration tab.
- Enter Enable Hive on Spark in the Search field.
- Check the box for Enable Hive on Spark (Unsupported).
- Locate the Spark On YARN Service property and select SPARK_ON_YARN.
- Click Save Changes to commit the changes.
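Once the change has taken effect (a Hive service restart is typically required), one way to confirm the engine from a Beeline session is to run set on the property with no value, which prints its current setting:
set hive.execution.engine;
The output should show hive.execution.engine=spark. An individual session can switch back to MapReduce with set hive.execution.engine=mr; if needed.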
Configuration Properties
Property | Description |
---|---|
hive.stats.collect.rawdatasize | Hive on Spark uses statistics to determine the threshold for converting a common join to a map join. Hive keeps two types of statistics about the size of data: totalSize, the approximate size of the data on disk, and rawDataSize, the approximate size of the data in memory. When both metrics are available, Hive chooses rawDataSize. Default: true |
hive.auto.convert.join.noconditionaltask.size | The threshold for the combined size (by default, rawDataSize) of all the small tables in a join, below which a common join is converted to a map join. You can increase the value to convert more common joins to map joins and improve performance. However, if you set this value too high, tasks may fail because too much memory is being used by data from the small tables. Default: 20 MB |
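If you need to raise the map join threshold for a particular workload, the property can also be set per session in Beeline. The value is in bytes, and the 50 MB figure below is only an illustration, not a recommendation:
set hive.auto.convert.join.noconditionaltask.size=52428800;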
Configuring Hive
For improved performance, Cloudera recommends that you configure the following additional properties for Hive. Set these properties in Cloudera Manager, in the HiveServer2 Advanced Configuration Snippet (Safety Valve) for hive-site.xml.
- hive.stats.fetch.column.stats=true
- hive.optimize.index.filter=true
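In that snippet, the properties take the usual hive-site.xml form; a minimal sketch:
<property>
  <name>hive.stats.fetch.column.stats</name>
  <value>true</value>
</property>
<property>
  <name>hive.optimize.index.filter</name>
  <value>true</value>
</property>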
Configuring Spark
Configure the following Spark properties to suit your cluster. During initial deployment, Cloudera Manager configuration rules tune these properties according to your cluster environment.
Property | Description |
---|---|
spark.executor.cores | The number of cores per Spark executor. |
spark.executor.memory | The maximum size of each Spark executor's Java heap memory when Hive is running on Spark. |
spark.yarn.executor.memoryOverhead | The amount of extra off-heap memory that can be requested from YARN, per executor process. Combined with spark.executor.memory, this is the total memory YARN can use to create a JVM for an executor process. |
spark.driver.memory | The maximum size of each Spark driver's Java heap memory when Hive is running on Spark. |
spark.yarn.driver.memoryOverhead | The amount of extra off-heap memory that can be requested from YARN per driver. Combined with spark.driver.memory, this is the total memory that YARN can use to create a JVM for a driver process. |
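Cloudera Manager normally manages these values for the cluster, but they can also be overridden for an individual Hive session; the settings take effect when that session's Spark application starts. The sizes below are placeholders for illustration, not tuning advice:
set spark.executor.cores=4;
set spark.executor.memory=8g;
set spark.yarn.executor.memoryOverhead=1024;
set spark.driver.memory=4g;
set spark.yarn.driver.memoryOverhead=400;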
Enabling Dynamic Executor Allocation
Spark can dynamically scale the set of cluster resources allocated to your application up and down, based on the workload. Dynamic allocation is useful when multiple applications share resources in a Spark cluster. When an application becomes idle, its resources are released to the resource pool and can be acquired by other applications. Cloudera recommends that you enable dynamic allocation by setting spark.dynamicAllocation.enabled to true; this is the default value in Cloudera Manager.
When you enable dynamic allocation, Spark adds and removes executors dynamically to Hive jobs, based on workload. The following table describes additional properties.
Property | Description |
---|---|
spark.dynamicAllocation.initialExecutors | The initial number of executors for a Spark application when dynamic allocation is enabled. The default is 1. |
spark.dynamicAllocation.minExecutors | The lower bound for the number of executors. The default is 1. |
spark.dynamicAllocation.maxExecutors | The upper bound for the number of executors. The default is Integer.MAX_VALUE. |
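For example, a session could bound dynamic allocation like this (the numbers are arbitrary examples; dynamic allocation also depends on the Spark external shuffle service, spark.shuffle.service.enabled=true, which Cloudera Manager typically configures for the Spark on YARN service):
set spark.dynamicAllocation.enabled=true;
set spark.dynamicAllocation.initialExecutors=2;
set spark.dynamicAllocation.minExecutors=1;
set spark.dynamicAllocation.maxExecutors=20;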
If you disable dynamic allocation, configure the following property instead:
Property | Description |
---|---|
spark.executor.instances | The total number of executors used for the Spark application. |
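With dynamic allocation disabled, a fixed executor count might be set per session as follows (the count of 10 is only an example):
set spark.dynamicAllocation.enabled=false;
set spark.executor.instances=10;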
Configuring Executor Memory Size
Executor memory size affects Hive in several ways. Increasing executor memory increases the number of queries for which Hive can enable map join optimization. However, if executor memory is too large, garbage collection takes longer. Also, experiments show that HDFS does not handle many concurrent writers well, so performance can suffer if an executor has too many cores.
Cloudera recommends that you set the value of spark.executor.cores to 5, 6, or 7, depending on which divides the number of cores on the host most evenly. All executors must have the same number of cores. For example, if yarn.nodemanager.resource.cpu-vcores is 19, then 6 is the better choice: with a value of 5, only three executors fit on the host and four cores are left unused; with a value of 7, only two executors fit and five cores are unused. If the host has 20 cores, set the value to 5 so that four executors fit and no cores are left unused.
- Compute a memory size equal to yarn.nodemanager.resource.memory-mb * (spark.executor.cores / yarn.nodemanager.resource.cpu-vcores) and then split that between spark.executor.memory and spark.yarn.executor.memoryOverhead.
- Set spark.yarn.executor.memoryOverhead to 15-20% of that total, and give the remainder to spark.executor.memory, as in the sketch after this list.
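A worked sketch with assumed values: a host with yarn.nodemanager.resource.cpu-vcores=20 and yarn.nodemanager.resource.memory-mb=102400 (100 GB), with spark.executor.cores=5. The per-executor share is 102400 * (5 / 20) = 25600 MB; reserving 20% (5120 MB) as overhead leaves 20480 MB (20 GB) for the heap:
set spark.executor.cores=5;
set spark.yarn.executor.memoryOverhead=5120;
set spark.executor.memory=20g;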