Hive query isolation for scan-heavy, data-intensive queries on public clouds

This topic describes how auto-scaling works for scan-heavy, data-intensive queries on Cloudera Data Warehouse (CDW) Public Cloud.

In CDW Public Cloud service, Hive Virtual Warehouses base auto-scaling on total scan size of the query. HiveServer, which receives all incoming queries, has a query planner component. When the HiveServer query planner receives queries, it examines the total scan size of each query. That is, it looks at the number of bytes read from the file system required to execute the query. If the Query Isolation feature has been enabled for a Virtual Warehouse and a query scans more than the threshold set in the hive.query.isolation.scan.size.threshold parameter, the planner runs the query in isolation. This means that an isolated standalone executor group is spawned to run the data-intensive query.

The default setting for this parameter is the data cache size (200GB) times the T-shirt size set for the Virtual Warehouse. T-shirt size is the XSMALL, SMALL, MEDIUM or LARGE sized executor group you configure for the Virtual Warehouse.

Concurrency auto-scaling, auto-suspending, and auto-resume features of the core executor groups are not impacted by any isolated queries that are running.

Scan-heavy, data-intensive queries are not scheduled in the XSMALL, SMALL, MEDIUM, or LARGE-sized executor groups you configure for your Virtual Warehouse. Instead the Hive Virtual Warehouse spawns a new dedicated executor group with the right number of executors that are required to limit query execution time to acceptable targets. This isolated executor group is only used for the single scan-heavy, data-intensive query, and it terminates after the query completes, ensuring there are no wasted resources.

Example

If you have enabled Query Isolation on a MEDIUM-sized Virtual Warehouse which creates a 20-node core executor group capable of running 20 queries concurrently, the default scan size threshold is the T-shirt size times the data cache size, or:

20 x 200GB = 4TB

In this case, if any query scans more than 4TB of data, the query planner decides to run the query in isolation, estimating the number of executor nodes needed to run the query. The number of nodes spawned to run the query does not exceed the default setting for the hive.query.isolation.max.nodes.per.query parameter, which default to 2 times the T-shirt size. For this MEDIUM-sized Virtual Warehouse, which has a 20-node core executor group, 40 nodes (2 x 20) are spawned to run the isolated query.

When the isolated query is running, if HiveServer receives another query, which scans less than 4TB of data, the planner runs that query in the core executor group and does not spawn an isolated executor group to run it. If there is capacity in the core executor group, the query runs immediately or concurrency auto-scaling provisions more capacity in the core executor group.

Assume the original data-intensive "scan-heavy" isolated query is still running. If HiveServer receives an additional "scan-heavy" query, which scans more than 4TB of data, another isolated executor group is spawned to run the additional query. However, if two additional "scan-heavy" queries are received, unless the default value (2) is changed for the hive.query.isolation.max.queries parameter, one scan-heavy query runs, while the other is queued until one of the other isolated queries has finished executing.

Query isolation process for data-intensive queries

Optimize your Virtual Warehouse using the controls described in Hive concurrency auto-scaling-for standard BI-type queries, enable Query Isolation, and set a value for the hive.query.isolation.scan.size.threshold configuration parameter. When the query planner encounters a scan-heavy, data-intensive query whose scan size exceeds the value set for the threshold, it launches the query isolation feature. Query isolation automatically spawns an “on-demand” executor group with the estimated correct number of executors for the scan-heavy query :



  1. When users connect to Hive Virtual Warehouses using SQL applications such as Hue, Data Analytics Studio (DAS), or other SQL clients that use JDBC or ODBC, the query is handled by HiveServer. First HiveServer generates a preliminary query execution plan that does not include distributing the query tasks across executor nodes. The query planner also determines the data scan size of the query.
  2. For standard BI-type queries with smaller scan sizes: HiveServer locates an available query coordinator in the Virtual Warehouse to handle the query. The full process for these queries is described in Hive auto-scaling-for standard BI-type queries.

  3. For data-intensive queries with larger scan sizes: You can enable the query isolation feature and specify a scan size threshold when you are setting up your Virtual Warehouse. Then the query isolation feature spawns an on-demand executor group with the estimated correct number of executors to handle the single query. This executor group is limited in size by the values you set for the controls in the Data Warehouse UI:



    Use the following settings to control resource usage and costs:
    • Max Concurrent Isolated Queries sets the maximum number of queries that can spawn dedicated executor groups at one time. For example, if Max Concurrent Isolated Queries is set to 3 and a dedicated executor group is spawned for each data-intensive query, only 3 dedicated executor groups can be running at one time. If another data-intensive query is received, it must wait in a queue to execute.
    • Max Nodes Per Isolated Query sets the maximum number of nodes that can be created for each data-intensive query.
  4. The scan size threshold for triggering the query isolation feature can be set with the hive.query.isolation.scan.size.threshold configuration parameter:



    1. In the Data Warehouse service UI, select Virtual Warehouses in the left menu to see a table containing all of your Virtual Warehouses, and locate the warehouse you want to set the scan size threshold for.
    2. In the detail page for the specific Virtual Warehouse, select the Hiveserver2 tab.
    3. On the Hiveserver2 tab, search on the word "isolation" in the Search Settings text box to locate the hive.query.isolation.scan.size.threshold configuration parameter.
    4. Set the value. Specify a data size in storage units. For example: "500GB" or "2TB"
    5. Click Apply.

After setting a value for this parameter, when the threshold is reached, the CDW service spawns dedicated executor groups to isolate scan-heavy, data-intensive queries and terminates them as soon as the query finishes executing.