Running Hive on Spark

This section explains how to run Hive using the Spark execution engine. It assumes that the cluster is managed by Cloudera Manager.

Configuring Hive on Spark

Minimum Required Role: Configurator (also provided by Cluster Administrator, Full Administrator)

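To configure Hive to run on Spark, complete both of the following tasks: configure the Hive dependency on a Spark service, and configure Hive on Spark for performance.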

Configuring the Hive Dependency on a Spark Service

By default, if a Spark service is available, the Hive dependency on the Spark service is configured. To change this configuration, do the following:

  1. Go to the Hive service.
  2. Click the Configuration tab.
  3. Search for the Spark On YARN Service. To configure the Spark service, select the Spark service name. To remove the dependency, select none.
  4. Click Save Changes to commit the changes.
  5. Go to the Spark service.
  6. Add a Spark gateway role to the host running HiveServer2.
  7. Return to the Home page by clicking the Cloudera Manager logo.
  8. Click to invoke the cluster restart wizard.
  9. Click Restart Stale Services.
  10. Click Restart Now.
  11. Click Finish.
  12. In the Hive client, configure the Spark execution engine.
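The final step can be done per session from Beeline or another Hive client. A minimal sketch, assuming the standard Hive property hive.execution.engine and the sample_07 table that this section uses in its UDF example:

```sql
-- Switch the current session to the Spark execution engine.
SET hive.execution.engine=spark;

-- Print the current value to confirm that the setting took effect.
SET hive.execution.engine;

-- Later queries in this session run on Spark. The first one may be slow
-- while the Spark application starts on YARN.
SELECT COUNT(*) FROM sample_07;
```

A SET statement issued this way applies only to the current session, so other users and sessions keep whatever engine is configured for them.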

Configuring Hive on Spark for Performance

For the configuration automatically applied by Cloudera Manager when the Hive on Spark service is added to a cluster, see Hive on Spark Autoconfiguration.

For information on configuring Hive on Spark for performance, see Tuning Hive on Spark.

Using Hive UDFs with Hive on Spark

When the execution engine is set to Spark, use Hive UDFs the same way that you use them when the execution engine is set to MapReduce. To apply a custom UDF to a column of a Hive table, use the following syntax:

SELECT <custom_UDF_name>(<column_name>) FROM <table_name>;

For example, to apply the custom UDF addfunc10 to the salary column of the sample_07 table in the default database that ships with CDH, run the following statement:

SELECT addfunc10(salary) FROM sample_07 LIMIT 10;

The above HiveQL statement returns only 10 rows from the sample_07 table.
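A custom UDF such as addfunc10 must be registered before it can be called. The following is only a sketch of one common way to do that in a session: the JAR path /tmp/udfs/addfunc10.jar and the class name com.example.hive.udf.AddFunc10 are hypothetical placeholders, not names taken from this document, and your cluster's security settings may require a different registration method.

```sql
-- Hypothetical JAR path; replace with the location of your compiled UDF.
ADD JAR /tmp/udfs/addfunc10.jar;

-- Hypothetical class name; registers it under the function name used above.
CREATE TEMPORARY FUNCTION addfunc10 AS 'com.example.hive.udf.AddFunc10';

-- Run the same query with Spark as the execution engine.
SET hive.execution.engine=spark;
SELECT addfunc10(salary) FROM sample_07 LIMIT 10;
```

A temporary function lasts only for the current session, which makes it convenient for checking that a UDF behaves the same under Spark as under MapReduce.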

To use Hive built-in UDFs, see the LanguageManual UDF on the Apache wiki. To create custom UDFs in Hive, see Managing Apache Hive User-Defined Functions.

Troubleshooting Hive on Spark

Delayed result from the first query after starting a new Hive on Spark session

Symptom: The first query after starting a new Hive on Spark session might be delayed due to the start-up time for the Spark on YARN cluster.

Cause: The query waits for YARN containers to initialize.

Resolution: No action required. Subsequent queries will be faster.

Exception in HiveServer2 log and HiveServer2 is down

Symptom: The HiveServer2 log contains the following exception: Error: org.apache.thrift.transport.TTransportException (state=08S01,code=0)

Cause: The HiveServer2 memory is set too low. For more information, see the stdout log for HiveServer2.

Resolution: Increase the HiveServer2 Java heap size:
  1. Go to the Hive service.
  2. Click the Configuration tab.
  3. Search for Java Heap Size of HiveServer2 in Bytes, and increase the value. Cloudera recommends a minimum value of 2 GB.
  4. Click Save Changes to commit the changes.
  5. Restart HiveServer2.

Out-of-memory error

Symptom: The log contains an out-of-memory error similar to the following:
15/03/19 03:43:17 WARN channel.DefaultChannelPipeline:
An exception was thrown by a user handler while handling an exception event ([id: 0x9e79a9b1, / => /]
      EXCEPTION: java.lang.OutOfMemoryError: Java heap space)
      java.lang.OutOfMemoryError: Java heap space

Cause: The Spark driver does not have enough off-heap memory.

Resolution: Increase the driver memory (spark.driver.memory) and ensure that spark.yarn.driver.memoryOverhead is at least 20% of the driver memory.
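As a rough illustration of the 20% guideline, the two properties can be set together from a Hive session. The 4 GB driver size below is an assumed example value, not a recommendation from this document, and the overhead property is assumed to take a value in megabytes, as it does in the Spark on YARN releases that use this property name.

```sql
-- Assumed example value: give the Spark driver a 4 GB heap.
SET spark.driver.memory=4g;

-- Overhead is given in MB; 20% of 4096 MB is about 819 MB, rounded up here.
SET spark.yarn.driver.memoryOverhead=820;
```

Spark properties changed this way are generally picked up when Hive starts a new Spark application for the session. To apply the values to all sessions, set them in the Hive service configuration in Cloudera Manager instead.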

Spark applications stay alive forever

Symptom: Cluster resources are consumed by Spark applications.

Cause: This can occur if you run multiple Hive on Spark sessions concurrently.

Resolution: Manually terminate the Hive on Spark applications:
  1. Go to the YARN service.
  2. Click the Applications tab.
  3. In the row containing the Hive on Spark application, select > Kill.