Hive query isolation ETL jobs

You can configure the SQL engine to isolate queries that involve lengthy scanning. Isolating these queries dedicates compute resources to handle these scans, leaving other resources free to handle less intensive jobs, such as ad-hoc queries.

If you enable Query Isolation in a Hive Virtual Warehouse, any query that exceeds the hive.query.isolation.scan.size.threshold value runs in isolation. An isolated standalone executor group is spawned to run the data-intensive query. Concurrency auto-scaling, auto-suspending, and auto-resume features of the core executor groups are not impacted by any isolated queries that are running.

Enabling query isolation

While creating or editing a Hive Virtual Warehouse, you can enable or disable query isolation and configure parameters, such as the maximum concurrent isolated queries and maximum nodes per isolated query. These parameters allow you to control resource usage and costs.

Enabling or disabling query isolation for Hive Virtual Warehouses

Max Concurrent Isolated Queries (hive.query.isolation.max.queries): Specifies the maximum number of queries that can spawn dedicated executor groups at one time. For example, if Max Concurrent Isolated Queries is set to 3 and a dedicated executor group is spawned for each data-intensive query, only 3 dedicated executor groups can be running at one time. If another data-intensive query is received, it must wait in a queue to run.
Max Nodes Per Isolated Query (hive.query.isolation.max.nodes.per.query): Specifies the maximum number of executors that can be created for each data-intensive query.

Determining the default scan size threshold

To determine the default scan size threshold (hive.query.isolation.scan.size.threshold), you need to calculate a threshold based on your Virtual Warehouse T-shirt size, and then set the default scan size parameter accordingly.

Multiply the executors in your Virtual Warehouse core executor group by the default data cache size. For example, for a MEDIUM-sized Virtual Warehouse, which has 20 executors, you set the scan size threshold to 20 x 200GB = 4TB.

Any query that scans more than 4TB of data runs in isolation, the number of executors spawned to run the query does not exceed the default setting for the hive.query.isolation.max.nodes.per.query parameter, which defaults to 2 times the T-shirt size. For a MEDIUM-sized Virtual Warehouse, which has 20 executors, 40 executors (2 x 20) are spawned to run the isolated query.

When the isolated query is running, if HiveServer receives another query, which scans less than 4TB of data, the planner runs that query in the core executor group and does not spawn an isolated executor group to run it. If there is capacity in the core executor group, the query runs immediately or concurrency auto-scaling provisions more capacity in the core executor group.

Assuming the original data-intensive "scan-heavy" isolated query is still running, if HiveServer receives an additional "scan-heavy" query, which scans more than 4TB of data, another isolated executor group is spawned to run the additional query. However, if two additional "scan-heavy" queries are received, unless the default value (2) is changed for the hive.query.isolation.max.queries parameter, one scan-heavy query runs, while the other is queued until one of the other isolated queries has finished executing.

For more information about setting the default scan size threshold, see Tuning Hive Virtual Warehouses.

Query isolation process

After you have specified the default scan size threshold, optimize your Virtual Warehouse using the controls described in Hive auto-scaling and enable query isolation for the Virtual Warehouse. When the query planner encounters a scan-heavy, data-intensive query whose scan size exceeds the value set for the threshold, it launches the query isolation feature. Query isolation automatically spawns an “on-demand” executor group with the estimated correct number of executors for the scan-heavy query.

When users connect to Hive Virtual Warehouses using SQL applications such as Hue or other SQL clients that use JDBC or ODBC, the query is handled by HiveServer. First HiveServer generates a preliminary query execution plan that does not include distributing the query tasks across executors. The query planner also determines the data scan size of the query.

For standard BI-type queries with smaller scan sizes: HiveServer locates an available query coordinator in the Virtual Warehouse to handle the query.
For data-intensive queries with larger scan sizes: You can enable the query isolation feature and specify a scan size threshold when you are setting up your Virtual Warehouse. The query isolation feature spawns an on-demand executor group with the estimated correct number of executors to handle the single query. The executor group is limited in size by the values you specify for the query isolation parameters.