Hive query isolation ETL jobs
You can configure the SQL engine to isolate queries that involve lengthy scanning. Isolating these queries dedicates compute resources to handle these scans, leaving other resources free to handle less intensive jobs, such as ad-hoc queries.
If you enable Query Isolation in a Hive Virtual Warehouse, any query that exceeds the
hive.query.isolation.scan.size.threshold
value runs in isolation. An
isolated standalone executor group is spawned to run the data-intensive query. Concurrency
auto-scaling, auto-suspending, and auto-resume features of the core executor groups are not
impacted by any isolated queries that are running.
Enabling query isolation
While creating or editing a Hive Virtual Warehouse, you can enable or disable query isolation and configure parameters, such as the maximum concurrent isolated queries and maximum nodes per isolated query. These parameters allow you to control resource usage and costs.
- Max Concurrent Isolated Queries
(
hive.query.isolation.max.queries
) - Specifies the maximum number of queries that can spawn dedicated executor groups at one time. For example, if Max Concurrent Isolated Queries is set to 3 and a dedicated executor group is spawned for each data-intensive query, only 3 dedicated executor groups can be running at one time. If another data-intensive query is received, it must wait in a queue to run.
- Max Nodes Per Isolated Query
(
hive.query.isolation.max.nodes.per.query
) -
Specifies the maximum number of executors that can be created for each data-intensive query.
Determining the default scan size threshold
To determine the default scan size threshold
(hive.query.isolation.scan.size.threshold
), you need to calculate a
threshold based on your Virtual Warehouse T-shirt size, and then set the default scan size
parameter accordingly.
Multiply the executors in your Virtual Warehouse core executor group by the default data cache size. For example, for a MEDIUM-sized Virtual Warehouse, which has 20 executors, you set the scan size threshold to 20 x 200GB = 4TB.
Any query that scans more than 4TB of data runs in isolation, the number of executors
spawned to run the query does not exceed the default setting for the
hive.query.isolation.max.nodes.per.query
parameter, which defaults to 2
times the T-shirt size. For a MEDIUM-sized Virtual Warehouse, which has 20 executors, 40
executors (2 x 20) are spawned to run the isolated query.
When the isolated query is running, if HiveServer receives another query, which scans less than 4TB of data, the planner runs that query in the core executor group and does not spawn an isolated executor group to run it. If there is capacity in the core executor group, the query runs immediately or concurrency auto-scaling provisions more capacity in the core executor group.
Assuming the original data-intensive "scan-heavy" isolated query is still running, if
HiveServer receives an additional "scan-heavy" query, which scans more than 4TB of data,
another isolated executor group is spawned to run the additional query. However, if two
additional "scan-heavy" queries are received, unless the default value (2) is changed for
the hive.query.isolation.max.queries
parameter, one scan-heavy query runs,
while the other is queued until one of the other isolated queries has finished
executing.
For more information about setting the default scan size threshold, see Tuning Hive Virtual Warehouses.
Query isolation process
After you have specified the default scan size threshold, optimize your Virtual Warehouse using the controls described in Hive auto-scaling and enable query isolation for the Virtual Warehouse. When the query planner encounters a scan-heavy, data-intensive query whose scan size exceeds the value set for the threshold, it launches the query isolation feature. Query isolation automatically spawns an “on-demand” executor group with the estimated correct number of executors for the scan-heavy query.
- For standard BI-type queries with smaller scan sizes: HiveServer locates an available query coordinator in the Virtual Warehouse to handle the query.
- For data-intensive queries with larger scan sizes: You can enable the query isolation feature and specify a scan size threshold when you are setting up your Virtual Warehouse. The query isolation feature spawns an on-demand executor group with the estimated correct number of executors to handle the single query. The executor group is limited in size by the values you specify for the query isolation parameters.