Migrating Impala Workloads to Cloudera Private CloudPDF version

Workload selection

Depending on the size (CPU, Memory, Cache) of your Cloudera Data Warehouse environment, you must choose which workloads should be migrated from base Impala to Cloudera Data Warehouse Impala.

If you have teams of data analysts using Business Intelligence (BI) tools and applications, then you can migrate to Cloudera Data Warehouse Data Service on premises to benefit from data and result caching. If you have repeated BI workloads, see Guidelines for moving repeated BI workloads.

New features are developed and released in Cloudera Data Warehouse first. If your user base needs to use newer technologies such as Iceberg, Unified Analytics, Data and Result Caching, and so on, then you can consider moving your workloads to Cloudera Data Warehouse Data Service on premises.

If your teams are running complex and large queries that require high memory and run for a long time that cause other SLA-driven workloads to be impacted, then you can isolate them out from the Impala service on the base cluster and run them from dedicated Impala Virtual Warehouses to achieve compute isolation.

If your workloads depend on streaming datasets that are inserted, updated, or deleted frequently, then you must configure Kudu on the base cluster and you can use Impala to query Kudu on base. Cloudera Data Warehouse supports creating Impala tables in Kudu. For more information, see Configuring Impala Virtual Warehouses to create Impala tables in Kudu in Cloudera Data Warehouse on premises.

If your workloads scan extremely large datasets, then you need to consider data locality before moving the workloads from the Impala service on the base cluster to Impala in Cloudera Data Warehouse Data Service. Scanning large data sets over the network can take time. If you plan to move such workloads to Cloudera Data Warehouse, then ensure that the data can be cached for better performance. You can also increase the auto-suspend time for the Impala Virtual Warehouse to make sure that the cache is retained.

Other things that you must consider while assessing the workloads are tenant modeling and sizing followed by performance testing. This ensures that queries return correct results and that performance is acceptable.