Tuning Hive Virtual Warehouses on public clouds

In Cloudera Data Warehouse (CDW) service, when you tune Hive-LLAP Virtual Warehouses, you set the auto-suspend timeout, the minimum and maximum number of nodes for your virtual cluster, when your cluster should scale up and down, and whether to use query isolation for scan-heavy, data-intensive queries.

Required role: DWAdmin

  1. Log in to the CDP web interface and navigate to the CDW service.
  2. In the CDW service, navigate to the Overview page.
  3. On the Overview page under Virtual Warehouses, click the edit icon for a warehouse:


  4. The next page provides properties that you can adjust to tune auto-scaling for your Virtual Warehouse:
    1. Set the AutoSuspend Timeout, which determines how many seconds the warehouse cluster is idle before it suspends itself:


      This setting helps to ensure you are not wasting idle resources.

    2. Set the minimum and maximum number of nodes that the cluster can contain:


      Use the minimum number of nodes setting to ensure that your queries always have resources and use the maximum number of nodes setting to contain your costs. Decide the minimum and maximum number of nodes based on your workloads similarly to how you determine node counts for your on-premises clusters. Consider the number of concurrent queries, the complexity of queries, the amount of data scanned by the queries, and the number of queries in your workloads to determine the appropriate number of nodes to set on each Virtual Warehouse instance.

    3. Choose when your cluster auto-scales up based on either the HEADROOM or the WAIT TIME settings:
      • Set the HEADROOM, which sets the number of nodes to keep free to handle increases in numbers of queries. For example, you might configure a Small-sized Hive-LLAP Virtual Warehouse with 10 nodes to accommodate a peak workload of 60 concurrent users, but a minimum of 5 concurrent users. The system can auto-scale up to support 60 concurrent users to meet the peak demand. In this case, you leave the default setting for HEADROOM Desired Free Capacity at 1:


        When HEADROOM is set to 1 for a 10-node cluster, then if there are over 9 queries concurrently executing, the cluster auto-scales up without waiting for 10 queries to fill the cluster.

      • Alternatively, you can also configure your cluster to auto-scale based on query wait times. The WAIT TIME auto-scaling configuration option sets how long queries wait in the queue to execute before the cluster auto-scales up. For example, if WaitTime Seconds is set to 10, then when executing queries are waiting in the queue for 10 seconds, the cluster auto-scales up to meet query demand:


        However, this is a less predictable method because the scaling might react to non-scalable factors to spin up clusters. For example, query wait times might increase because of inefficient queries and not because of query volume.

    4. Select Query Isolation if you have scan-heavy, data-intensive queries in your workloads:


      Query Isolation enables your Virtual Warehouse to determine, based on the value you set for the hive.query.isolation.scan.size.threshold configuration parameter, whether to spawn dedicated nodes to execute scan-heavy, data-intensive queries.

      You can set this parameter in the Virtual Warehouse details page for the warehouse:

      1. In the CDW service UI, click Virtual Warehouses in the left menu.
      2. In the list of warehouses, select the name of the Virtual Warehouse you want to configure this parameter for.
      3. On the Virtual Warehouse details page, click the Hiveserver2 tab.
      4. Type isolation in the Search Settings text box to locate the parameter.
      5. In the VALUE text box for the hive.query.isolation.scan.size.threshold parameter, enter the amount of data for your threshold in storage units. For example, 400GB.
      6. Click Apply in the upper right corner of the page to save your settings.

      After you select Query Isolation, two more configuration options appear:

      • Max Concurrent Isolated Queries: Sets the maximum number of isolated queries that can execute concurrently in their own dedicated executor nodes. Select this number based on the scan size of the data for your average scan-heavy, data-intensive query.
      • Max Nodes Per Isolated Query: Sets how many executor nodes can be spawned for each isolated query.
    5. Click Apply in the upper right corner of the page to apply your changes.