Tuning Hive Virtual Warehouses on private clouds

This topic describes how to tune Hive Virtual Warehouses in Cloudera Data Warehouse (CDW) Private Cloud.

When you tune Hive Virtual Warehouses, you set the auto-suspend timeout, the minimum and maximum number of nodes for your virtual cluster, when your cluster should scale up, and when it should scale down.

  1. Log in to the CDP web interface and navigate to the Data Warehouse service.
  2. In the Data Warehouse service, click Overview in the left navigation pane.
  3. In the Overview page under Virtual Warehouses, click

    against the required virtual warehouse and select Edit.
  4. Click the SIZING AND SCALING tab to view the properties that you can adjust to tune auto-scaling for your data warehouse:
    1. Set the AutoSuspend Timeout property under Auto scale, which determines how many seconds the warehouse cluster is idle before it suspends itself:

      This setting helps to ensure performance is not impacted by having idle resources.

    2. Set the minimum and maximum number of nodes that the cluster can contain:

      Use the minimum number of nodes setting to ensure that your workloads always have resources and use the maximum number of nodes setting to contain having too many idle resources. Decide the minimum and maximum number of nodes based on your workloads similarly to how you determine node counts for your on-premises clusters. Consider the number of concurrent queries, the complexity of queries, and the volume of queries in your workloads to determine the appropriate number of nodes to set on each Virtual Warehouse instance.

    3. Choose when your cluster auto-scales up based on the WAIT TIME setting, which sets how long queries wait in the queue to run before the cluster auto-scales up. For example, if WaitTime Seconds is set to 10, then when executing queries are waiting in the queue for 10 seconds, the cluster auto-scales up to meet query demand.
    4. Select Query Isolation if you have scan-heavy, data-intensive queries in your workloads.
      Query Isolation enables your Virtual Warehouse to determine, based on the value you set for the hive.query.isolation.scan.size.threshold configuration parameter, whether to spawn dedicated nodes to run scan-heavy, data-intensive queries.

      You can set this threshold parameter in the Virtual Warehouse details page for the warehouse:

      1. In the Data Warehouse service UI, click Virtual Warehouses in the left navigation pane.
      2. From the list of warehouses, click the Virtual Warehouse you want to configure this parameter for.
      3. In the Virtual Warehouse details page, click CONFIGURATIONS > Hiveserver2.
      4. Select hive-site from the Configuration files drop-down list and type isolation in the search text box to locate the parameter.
      5. In the VALUE text box for the hive.query.isolation.scan.size.threshold parameter, enter the amount of data for your threshold in storage units. For example, 400GB.
      6. Click APPLY to save your settings.

      After you enable Query Isolation, two more configuration options appear:

      • Max Concurrent Isolated Queries: Sets the maximum number of isolated queries that can run concurrently in their own dedicated executor nodes. Select this number based on the scan size of the data for your average scan-heavy, data-intensive query.
      • Max Nodes Per Isolated Query: Sets how many executor nodes can be spawned for each isolated query.
    5. Click APPLY.