Configuring Impala Virtual Warehouse
Choosing the right size of your CDW environment before migrating your workload from CDH to CDW Private Cloud on CDP is critical for preserving performance characteristics. There are several factors from your query workload to consider when choosing the capacity for your environment.
Cloudera provides you an automated script, impala_sizing_script.py, that can be run from Cloudera Manager. The script analyzes your current workload and will provide you with information about sizing requirements for the Virtual Warehouse you will be creating. Use the information as guidelines when you decide on the sizing for your VWs.
- Ensure that the warehouse is being actively managed by Cloudera manager.
- Identify target workload. Queries can be filtered using date/time range and resource pools after execution. Otherwise, all queries are factored into the resource sizing.
- Understand when workload is executed and noise factors such as ad-hoc queries.
- Configure conf file for impala_query_sizing.py
Example conf file
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Running the script generates a similar output as shown here:
$ python impala_query_sizing.py sizing.conf
Config file: sizing.conf
Individual Query Analysis
Total Queries: 243248
Total Query Time: 4563498.64 sec
Highest Resources Query ID: f24d23f2e51c1896:faafd7e600000000
Max Nodes: 224
Max Cores Per Node: 25.22
Max Data Per Node: 120.31 GB
Max Spill Per Node: 871.73 GB
Max Memory Per Node: 15.96 GB/s
Max Data Rate: 1.16 GB
Pools:
root.default
Concurrent Query Analysis
Max Concurrent Queries: 47
Max Concurrent Resources Time: 2021-08-04T15:45:42Z430
Max Concurrent Cores Per Node: 28.04
Max Concurrent Data Per Node: 242.28 GB
Max Concurrent Spill Per Node: 1100.14 GB
Max Concurrent Memory Per Node: 32.02 GB
Max Concurrent Data Rate: 2.0 GB/s
Cluster Sizing
Size Min Pods Max Pods Constrained By
CUSTOM 249 257 cpu spill
Query Counts
Cache Mem CPU Spill
Size Count Constrained Constrained Constrained Constrained
XSMALL 219188 243241 240795 219976 243188
SMALL 20695 7 2451 19918 51
MEDIUM 2754 0 0 2748 6
LARGE 571 0 2 567 2
CUSTOM 40 0 0 39 1
Average Cluster Utilization
Cache Memory CPU Spill
0.02 % 0.13 % 0.47 % 0.15 %
Use the generated output as guidelines for sizing.
- Individual Query Analysis reflects maximum resource usage of individual queries. This is used to derive the Min Pods size which represents the base size for the VW.
- Concurrent Query Analysis reflects maximum resource usage when queries overlap. This is used to derive the Max Pods size which represents the autoscale limit for the VW. VWs scale up in units of the base size and this can lead to significant additional node requirements for large VWs. In the example above, it may make more sense to increase the base size of the VW to the maximum size rather than using scaling as that would require 2x the Min size nodes.
- The Query Counts matrix provides an additional breakdown of query resource requirements that can be helpful to understand how many queries and which resource dimensions are driving up cluster size. In some cases, it may be possible to tune problematic queries, isolate them from the workload, or accept a longer SLA so that a smaller VW size can be used.
- Average Cluster Utilization reflects the total resource across the workload. Utilization can be quite low for mixed workloads when a small number of queries is driving up cluster size.