Configuring Impala Virtual Warehouse

Choosing the right size of your CDW environment before migrating your workload from CDH to CDW Private Cloud on CDP is critical for preserving performance characteristics. There are several factors from your query workload to consider when choosing the capacity for your environment.

Cloudera provides you an automated script, impala_sizing_script.py, that can be run from Cloudera Manager. The script analyzes your current workload and will provide you with information about sizing requirements for the Virtual Warehouse you will be creating. Use the information as guidelines when you decide on the sizing for your VWs.

  • Ensure that the warehouse is being actively managed by Cloudera manager.
  • Identify target workload. Queries can be filtered using date/time range and resource pools after execution. Otherwise, all queries are factored into the resource sizing.
  • Understand when workload is executed and noise factors such as ad-hoc queries.
  • Configure conf file for impala_query_sizing.py

Example conf file

cm_url=https"//my_cdw_pvc_cm:7183
user_name=default
pfile=auth.prop
CM API URL
Username for CM API
Base-64 encrypted CM password
cluster_name=MY_CDW_PVC
from=2021-07-12T00:00:00z
to=2021-08-12T00:00:00z
pool=my_tenant
Selection criteria for CM API
mem_adjustment_pct=100
cache_adjustment_pct=100
cpu_adjustment_pct=80
scratch_adjustment_pct=100
Adjustment to resource dimensions
scratch_gb_per_node=1000
cache_gb_per_node=1000
query_mem_per_node=200
vcores_per_node=16
PVC executor pod configuration
mt_dop=12
Parallel modeling assumption
skip_query_file=skipped_query.out
Output CSV for queries that don’t have profile data
pod_limit=100
prune_output_file=pruned_queries.csv
Size limit to prune queries to separate file
Output CSV for pruned queries

output_file=impala_sizing.csv

input_file=impala_sizing_30day.csv

Output CSV if querying CM
Input CSV from previous run
Steps to use the script
  1. Download the script from the following repository.
  2. From the command line, run the impala_query_sizing.py script.

Running the script generates a similar output as shown here:


                $ python impala_query_sizing.py sizing.conf
                
                Config file: sizing.conf
                
                Individual Query Analysis
                  Total Queries: 243248
                  Total Query Time: 4563498.64 sec
                  Highest Resources Query ID: f24d23f2e51c1896:faafd7e600000000
                  Max Nodes: 224
                  Max Cores Per Node: 25.22
                  Max Data Per Node: 120.31 GB
                  Max Spill Per Node: 871.73 GB
                  Max Memory Per Node: 15.96 GB/s
                  Max Data Rate: 1.16 GB
                  Pools:
                     root.default
                
                Concurrent Query Analysis
                  Max Concurrent Queries: 47
                  Max Concurrent Resources Time: 2021-08-04T15:45:42Z430
                  Max Concurrent Cores Per Node: 28.04
                  Max Concurrent Data Per Node: 242.28 GB
                  Max Concurrent Spill Per Node: 1100.14 GB
                  Max Concurrent Memory Per Node: 32.02 GB
                  Max Concurrent Data Rate: 2.0 GB/s
                
                Cluster Sizing
                Size		Min Pods	Max Pods	   Constrained By
                CUSTOM 		249 	      257 		cpu spill 
                
                Query Counts
                                         Cache       Mem         CPU         Spill
                Size     Count       Constrained Constrained Constrained Constrained
                XSMALL      219188      243241      240795      219976      243188
                SMALL        20695           7        2451       19918          51
                MEDIUM        2754           0           0        2748           6
                LARGE          571           0           2         567           2
                CUSTOM          40           0           0          39           1
                
                Average Cluster Utilization
                Cache    Memory    CPU       Spill
                0.02 %    0.13 %    0.47 %    0.15 %
                
            

Use the generated output as guidelines for sizing.

  • Individual Query Analysis reflects maximum resource usage of individual queries. This is used to derive the Min Pods size which represents the base size for the VW.
  • Concurrent Query Analysis reflects maximum resource usage when queries overlap. This is used to derive the Max Pods size which represents the autoscale limit for the VW. VWs scale up in units of the base size and this can lead to significant additional node requirements for large VWs. In the example above, it may make more sense to increase the base size of the VW to the maximum size rather than using scaling as that would require 2x the Min size nodes.
  • The Query Counts matrix provides an additional breakdown of query resource requirements that can be helpful to understand how many queries and which resource dimensions are driving up cluster size. In some cases, it may be possible to tune problematic queries, isolate them from the workload, or accept a longer SLA so that a smaller VW size can be used.
  • Average Cluster Utilization reflects the total resource across the workload. Utilization can be quite low for mixed workloads when a small number of queries is driving up cluster size.