Configuring Hive query isolation auto-scaling

You can configure your Hive Virtual Warehouse to add dedicated executors to run scan-heavy, data-intensive queries, also known as ETL queries. You learn how to enable a Virtual Warehouse auto-scaling feature and set a query isolation parameter in the Hive configuration.

  • You are familiar with the auto-scaling process.
  • You are creating a Hive Virtual Warehouse for running ETL-type queries.
  • In Cloudera Data Warehouse, you added a Hive Virtual Warehouse, configured the size of the Hive Virtual Warehouse, and configured auto-suspend as described in previous topics.
  • You obtained the DWAdmin role.
In this task, first you configure the same auto-scaling properties as described in the previous topic Configuring Hive VW concurrency auto-scaling, and then you enable query isolation. Next, you set the hive.query.isolation.scan.size.threshold configuration parameter.
  1. In Concurrency Autoscaling properties, limit auto-scaling by setting the minimum and maximum number of executor nodes that can be added.
    Screenshot showing concurrency autoscaling settings for a Hive Virtual Warehouse.
  2. Choose when your cluster auto-scales up based on one of the following choices:
    • HEADROOM
    • WAIT TIME
  3. In Concurrency Autoscaling, select Enable Query Isolation.
    Two configuration options appear: Max Concurrent Isolated Queries and Max Nodes Per Isolated Query.
  4. In Max Concurrent Isolated Queries, set the maximum number of isolated queries that can run concurrently in their own dedicated executor nodes.
    Select this number based on the scan size of the data for your average scan-heavy, data-intensive query.
  5. In Max Nodes Per Isolated Query, set how many executor nodes can be spawned for each isolated query.
  6. Click Create Virtual Warehouse.
  7. In the Data Warehouse service Overview page, click Virtual Warehouses, and find your Virtual Warehouse. Click > Edit > Configurations > Hiveserver2.
  8. Select hive-site from the Configuration files drop-down menu and set hive.query.isolation.scan.size.threshold parameter to limit how much data can be scanned.
    For example, set 400GB.
  9. Click Apply Changes.