Configuring Hive on Spark

This topic explains the configuration properties you set up to run Hive on Spark.

Installation Considerations

For Hive to work on Spark, you must deploy Spark gateway roles on the same machine that hosts HiveServer2. Otherwise, Hive on Spark cannot read from Spark configurations and cannot submit Spark jobs. For more information about gateway roles, see Managing Roles.

After installation, run the following command in Hive so that it uses Spark as the back-end engine for all subsequent queries.

set hive.execution.engine=spark;

Enabling Hive on Spark

By default, Hive on Spark is not enabled. To enable Hive on Spark, perform the following steps in Cloudera Manager.

  1. Go to the Hive service.
  2. Click the Configuration tab.
  3. Enter Enable Hive on Spark in the Search field.
  4. Check the box for Enable Hive on Spark (Unsupported).
  5. Locate the Spark On YARN Service and click SPARK_ON_YARN.
  6. Click Save Changes to commit the changes.

Configuration Properties

Property Description
hive.stats.collect.rawdatasize Hive on Spark uses statistics to determine the threshold for converting a common join to a map join. There are two types of statistics about the size of data:
  • totalSize: The approximate size of the data on disk
  • rawDataSize: The approximate size of the data in memory

When both metrics are available, Hive chooses rawDataSize.

Default: true
hive.auto.convert.join.noconditionaltask.size The threshold for the sum of the sizes of all small tables (by default, their rawDataSize) for map join conversion. You can increase the value to convert more common joins to map joins for better performance. However, if you set this value too high, tasks may fail because the data from the small tables uses too much memory.

Default: 20MB
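The document already shows session-level `set` commands; these statistics-related settings can be adjusted the same way. The following is a sketch, and the 50 MB value is an illustrative assumption, not a recommendation:

```sql
-- Illustrative session-level overrides.
-- Collect in-memory sizes so map-join decisions can use rawDataSize:
set hive.stats.collect.rawdatasize=true;
-- Map-join conversion threshold, in bytes (here ~50 MB instead of the 20 MB default):
set hive.auto.convert.join.noconditionaltask.size=52428800;
```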

Configuring Hive

For improved performance, Cloudera recommends that you configure the following additional properties for Hive. Set these properties in the Cloudera Manager Safety Valve for HiveServer2.

  • hive.stats.fetch.column.stats=true
  • hive.optimize.index.filter=true
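In the HiveServer2 safety valve, these properties take the usual hive-site.xml form (a sketch):

```xml
<property>
  <name>hive.stats.fetch.column.stats</name>
  <value>true</value>
</property>
<property>
  <name>hive.optimize.index.filter</name>
  <value>true</value>
</property>
```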

Configuring Spark

Configure the following Spark properties to suit your cluster. During initial deployment, Cloudera Manager tuning rules set starting values for these properties based on your cluster environment.

Property Description
spark.executor.cores The number of cores per Spark executor.
spark.executor.memory The maximum size of each Spark executor's Java heap memory when Hive is running on Spark.
spark.yarn.executor.memoryOverhead The amount of extra off-heap memory that can be requested from YARN, per executor process. Combined with spark.executor.memory, this is the total memory YARN can use to create a JVM for an executor process.
spark.driver.memory The maximum size of each Spark driver's Java heap memory when Hive is running on Spark.
spark.yarn.driver.memoryOverhead The amount of extra off-heap memory that can be requested from YARN per driver. Combined with spark.driver.memory, this is the total memory that YARN can use to create a JVM for a driver process.
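For example, on a mid-sized node these properties might look like the following. The values are purely illustrative assumptions, not recommendations; the Configuring Executor Memory Size section below describes how to derive suitable values:

```properties
spark.executor.cores=6
spark.executor.memory=26g
spark.yarn.executor.memoryOverhead=5700m
spark.driver.memory=4g
spark.yarn.driver.memoryOverhead=400m
```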

Enabling Spark Executor Allocation

Spark can dynamically scale the set of cluster resources allocated to your application up and down, based on the workload. Dynamic allocation is useful when multiple applications share resources in a Spark cluster. When an application becomes idle, its resources can be released to the resource pool and acquired by other applications. Cloudera recommends that you enable dynamic allocation by setting spark.dynamicAllocation.enabled to true. This is the default value in Cloudera Manager.

When you enable dynamic allocation, Spark adds and removes executors dynamically to Hive jobs, based on workload. The following table describes additional properties.

Property Description
spark.dynamicAllocation.initialExecutors The initial number of executors for a Spark application when dynamic allocation is enabled. The default is 1.
spark.dynamicAllocation.minExecutors The lower bound for the number of executors. The default is 1.
spark.dynamicAllocation.maxExecutors The upper bound for the number of executors. The default is Integer.MAX_VALUE.
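As a configuration fragment, with an illustrative cap on the executor count (note that these settings live under Spark's spark.dynamicAllocation.* namespace):

```properties
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.initialExecutors=1
spark.dynamicAllocation.minExecutors=1
spark.dynamicAllocation.maxExecutors=64
```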

If you disable dynamic allocation, configure the following property instead:

Property Description
spark.executor.instances The total number of executors used for the Spark application.

Configuring Executor Memory Size

Executor memory size can have a number of effects on Hive. Increasing executor memory increases the number of queries for which Hive can enable map-join optimization. However, too much executor memory makes garbage collection take longer. Also, experiments show that HDFS does not handle many concurrent writers well, so it may face a race condition if there are too many executor cores.

Cloudera recommends that you set the value of spark.executor.cores to 5, 6, or 7, depending on the number of vcores on the host. All executors must have the same number of cores, so choose a value that divides the host's vcores with little remainder. For example, if yarn.nodemanager.resource.cpu-vcores is 19, set the value to 6: three executors use 18 cores, and only one core is left unused. A value of 5 would also allow only three executors, leaving four cores unused, and a value of 7 would allow only two executors, leaving five cores unused. If the number of cores is 20, set the value to 5 so that four executors use all 20 cores and none are unused.
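The core-allocation arithmetic above can be sketched as follows (a simple illustrative helper, not a Cloudera tool; the function name is my own):

```python
# Sketch of the spark.executor.cores trade-off: for a given number of
# node vcores, count how many vcores same-sized executors leave unused.

def wasted_cores(node_vcores: int, executor_cores: int) -> int:
    """Vcores left over after packing same-sized executors onto one node."""
    executors = node_vcores // executor_cores
    return node_vcores - executors * executor_cores

# A 19-vcore node: 6 cores per executor wastes only 1 vcore
# (3 executors x 6 cores = 18), while 5 wastes 4 and 7 wastes 5.
for cores in (5, 6, 7):
    print(cores, wasted_cores(19, cores))

# A 20-vcore node: 5 cores per executor wastes nothing (4 executors x 5 cores).
print(5, wasted_cores(20, 5))
```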

Cloudera also recommends the following:
  • Compute a memory size equal to yarn.nodemanager.resource.memory-mb * (spark.executor.cores / yarn.nodemanager.resource.cpu-vcores) and then split that between spark.executor.memory and spark.yarn.executor.memoryOverhead.
  • Set spark.yarn.executor.memoryOverhead to 15-20% of that total memory size.
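The recommendation above can be sketched numerically. This is a hypothetical helper; the 100 GB / 19-vcore node and the 17.5% overhead fraction (the middle of the 15-20% range) are illustrative assumptions:

```python
# Sketch of the recommended memory split: give each executor its
# proportional share of the node's YARN container memory, then carve
# 15-20% of that share out as spark.yarn.executor.memoryOverhead.

def split_executor_memory(node_mem_mb, node_vcores, executor_cores,
                          overhead_fraction=0.175):
    """Return (spark.executor.memory, spark.yarn.executor.memoryOverhead) in MB."""
    total = int(node_mem_mb * executor_cores / node_vcores)  # per-executor share
    overhead = int(total * overhead_fraction)
    heap = total - overhead
    return heap, overhead

# 100 GB of container memory, 19 vcores, 6 cores per executor:
heap_mb, overhead_mb = split_executor_memory(102400, 19, 6)
print(heap_mb, overhead_mb)
```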