Configuring Hive on Spark

This topic explains the configuration properties you set up to run Hive on Spark.

Installation Considerations

For Hive to work on Spark, you must deploy Spark gateway roles on the same machine that hosts HiveServer2. Otherwise, Hive on Spark cannot read from Spark configurations and cannot submit Spark jobs. For more information about gateway roles, see Managing Roles.

After installation, run the following command in Hive so that Hive will use Spark as the back-end engine for all subsequent queries.

set hive.execution.engine=spark;

Enabling Hive on Spark

By default, Hive on Spark is not enabled. To enable Hive on Spark, perform the following steps in Cloudera Manager.

  1. Go to the Hive service.
  2. Click the Configuration tab.
  3. Enter Enable Hive on Spark in the Search field.
  4. Check the box for Enable Hive on Spark (Unsupported).
  5. Locate the Spark On YARN Service and click SPARK_ON_YARN.
  6. Click Save Changes to commit the changes.

Configuration Properties

Property Description
hive.stats.collect.rawdatasize Hive on Spark uses statistics to determine the threshold for converting common join to map join. There are two types of statistics about the size of data:
  • totalSize: The approximate size of data on the disk
  • rawDataSize: The approximate size of data in memory

When both metrics are available, Hive chooses rawDataSize.

Default: True

hive.auto.convert.join.noconditionaltask.size The threshold for the sum of all the small table size (by default, rawDataSize), for map join conversion. You can increase the value if you want better performance by converting more common joins to map joins. However, if you set this value too high, tasks may fail because too much memory is being used by data from small tables.

Default: 20MB

Configuring Hive

For improved performance, Cloudera recommends that you configure the following additional properties for Hive. In Cloudera Manager, set these properties in the advanced configuration snippet for HiveServer2.

  • hive.stats.fetch.column.stats=true
  • hive.optimize.index.filter=true

Configuring Executor Memory Size

For general Spark configuration recommendations, see Configuring Spark on YARN Applications.

Executor memory size can have a number of effects on Hive. Increasing executor memory increases the number of queries for which Hive can enable mapjoin optimization. However, if there's too much executor memory, it takes longer to perform garbage collection. Also, some experiments shows that HDFS doesn’t handle concurrent writers well, so it may face a race condition if there are too many executor cores.

Cloudera recommends that you set the value for spark.executor.cores to 5, 6, or 7, depending on what the host is divisible by. For example, if yarn.nodemanager.resource.cpu-vcores is 19, then you would set the value to 6. Executors must have the same number of cores. If you set the value to 5, three executors with 5 cores each can be launched, leaving four cores unused. If you set the value to 7, only two executors are used, and five cores are unused. If the number of cores is 20, set the value to 5 so that each executor gets four cores, and no cores are unused.

Cloudera also recommends the following:
  • Compute a memory size equal to yarn.nodemanager.resource.memory-mb * (spark.executor.cores / yarn.nodemanager.resource.cpu-vcores) and then split that between spark.executor.memory and spark.yarn.executor.memoryOverhead.
  • spark.yarn.executor.memoryOverhead is 15-20% of the total memory size.