Configure Hive Virtual Warehouses to cache UDF JAR files in HiveServer2

After you write and compile your User Defined Function (UDF) code into a Java Archive (JAR) file, you can configure Hive Virtual Warehouses to cache the UDF JAR in HiveServer2. This task applies to Cloudera Data Warehouse (CDW) running in AWS environments.

UDFs enable you to create custom functions to process records or groups of records. Although Hive provides a comprehensive library of functions, there are gaps for which UDFs are a good solution.

After you complete the prerequisite tasks listed in Before you begin, follow the steps below to configure the Hive Virtual Warehouse to cache the UDF JAR file. With caching configured, the UDF JAR is downloaded from its S3 location the first time the UDF is called and is then cached in HiveServer2. All subsequent calls load the JAR from the HiveServer2 cache instead of S3.

Caching significantly improves performance for queries that use these UDFs. Without caching, loading a very large UDF JAR of several hundred MB from S3 can take up to several minutes for each query. With caching, only the first call to the UDF incurs that download time.
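The download-once behavior described above can be pictured with a small sketch. This is only an illustrative model, not HiveServer2's actual implementation: the `UdfJarCache` class, the `fake_s3_download` function, and the bucket path are all hypothetical names chosen for the example.

```python
from typing import Callable, Dict


class UdfJarCache:
    """Toy model of a fetch-once cache keyed by JAR location (not HiveServer2 code)."""

    def __init__(self, download: Callable[[str], bytes]) -> None:
        self._download = download            # stands in for the slow read from S3
        self._cached: Dict[str, bytes] = {}
        self.download_count = 0

    def get(self, jar_uri: str) -> bytes:
        # Only the first request for a given URI triggers a download.
        if jar_uri not in self._cached:
            self._cached[jar_uri] = self._download(jar_uri)
            self.download_count += 1
        return self._cached[jar_uri]


def fake_s3_download(jar_uri: str) -> bytes:
    # Hypothetical placeholder for transferring a large JAR from S3.
    return b"jar-bytes-for:" + jar_uri.encode()


cache = UdfJarCache(fake_s3_download)
for _ in range(3):                            # three queries that call the same UDF
    cache.get("s3a://example-bucket/udfs/my-udf.jar")
```

In this model, `cache.download_count` is 1 after the three calls. Without a cache, each of the three queries would pay the download cost separately.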

Required role: DWAdmin

Before you begin

  • Write, compile, and export your UDF code to a JAR file.
  • Upload the UDF JAR file to a managed or external S3 bucket in your AWS account that is used by the CDW cluster of the Hive Virtual Warehouse where you want to use the UDF. For more information about these buckets, see Adding access to external S3 buckets for CDW.
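Before uploading, it can help to confirm that the exported JAR really contains the compiled UDF class. The sketch below is one possible check, relying on the fact that a JAR is a ZIP archive. The function name `jar_contains_class` and the class name `com.example.MyUdf` are hypothetical and only serve the example.

```python
import zipfile


def jar_contains_class(jar_path: str, class_name: str) -> bool:
    """Return True if the JAR (a ZIP archive) holds the compiled class."""
    # A fully qualified name such as com.example.MyUdf is stored in the
    # archive as the entry com/example/MyUdf.class.
    entry = class_name.replace(".", "/") + ".class"
    if not zipfile.is_zipfile(jar_path):
        return False                          # missing file or not a valid archive
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()
```

For example, `jar_contains_class("my-udf.jar", "com.example.MyUdf")` returns `True` only if the archive includes the entry `com/example/MyUdf.class`.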
  1. In the CDW UI on the Overview page, locate the Hive Virtual Warehouse that uses the S3 bucket where you placed the UDF JAR file, and click the edit icon to open the details page.
  2. On the Virtual Warehouse details page, make sure the CONFIGURATIONS tab is selected, and click the HiveServer2 sub tab.
  3. On the HiveServer2 sub tab, select hive-site from the drop-down list, and click the plus sign to launch the Add Custom Configurations dialog box.
  4. In the Add Custom Configurations dialog box, add the following configuration information, and then click ADD:
    hive.server2.udf.cache.enabled = true
  5. Click APPLY in the upper right corner of the page.
  6. Verify that the configuration property and its value have been added by entering hive.server2.udf.cache.enabled in the search box. If the property has been added, its name is displayed in the KEY column of the table.
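Once caching is enabled, the UDF still has to be registered from its S3 location before queries can call it. As a rough sketch, a small helper could assemble a HiveQL `CREATE FUNCTION ... USING JAR` statement for you. The helper `create_function_sql`, the function and class names, the bucket path, and the `s3a://` scheme are all assumptions made for illustration, so substitute the values that apply to your environment.

```python
def create_function_sql(function_name: str, class_name: str, jar_uri: str) -> str:
    """Build a HiveQL CREATE FUNCTION statement that references a JAR in S3."""
    # Assumption: the JAR is addressed with an s3a:// URI.
    if not jar_uri.startswith("s3a://") or not jar_uri.endswith(".jar"):
        raise ValueError(f"expected an s3a:// URI ending in .jar, got {jar_uri!r}")
    # Reject quotes so the values cannot break out of the string literals.
    if "'" in class_name or "'" in jar_uri:
        raise ValueError("single quotes are not allowed in the class name or URI")
    return f"CREATE FUNCTION {function_name} AS '{class_name}' USING JAR '{jar_uri}'"


statement = create_function_sql(
    "my_udf", "com.example.MyUdf", "s3a://example-bucket/udfs/my-udf.jar"
)
print(statement)
```

You could then run the resulting statement in the Virtual Warehouse with whichever Hive client you normally use. The first query that calls the function downloads the JAR, and later queries use the cached copy.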