Configuring Hive query isolation auto-scaling

You can configure your Hive Virtual Warehouse to add dedicated executors to run scan-heavy, data-intensive queries, also known as ETL queries. You learn how to enable a Virtual Warehouse auto-scaling feature and set a query isolation parameter in the Hive configuration.

  • You are familiar with the auto-scaling process.
  • You are creating a Hive Virtual Warehouse for running ETL-type queries.
  • In Cloudera Data Warehouse, you added a Hive Virtual Warehouse, configured the size of the Hive Virtual Warehouse, and configured auto-suspend as described in previous topics.
  • You obtained the DWAdmin role.
In this task, first you configure the same auto-scaling properties as described in the previous topic "Configuring Hive VW concurrency auto-scaling", and then you enable query isolation. Next, you set the hive.query.isolation.scan.size.threshold configuration parameter.
  1. In Concurrency Auto-scaling properties, in Executors: Min: , Max:, limit auto-scaling by setting the minimum and maximum number of executor nodes that can be added.
  2. Choose when your cluster auto-scales up based on one of the following choices:
  3. In Concurrency Autoscaling, enable Query Isolation.
    For example:
    Two configuration options appear.
  4. In Max Concurrent Isolated Queries, set the maximum number of isolated queries that can run concurrently in their own dedicated executor nodes.
    Select this number based on the scan size of the data for your average scan-heavy, data-intensive query.
  5. In Max Nodes Per Isolated Query, set how many executor nodes can be spawned for each isolated query.
  6. Click CREATE.
  7. In the Data Warehouse service UI, click Virtual Warehouses, Options CONFIGURATIONS > Hiveserver2 .
  8. In Configuration files, select hive-site and set hive.query.isolation.scan.size.threshold parameter to limit how much data can be scanned.
    For example, set 400GB.
  9. Click APPLY.