Planning a CDW Virtual Warehouse instance

A Hive Virtual Warehouse is an instance of compute resources with various options that allow you to control the size, elasticity, and availability of the data warehouse to meet your varying workload demands. Therefore, it is essential to understand the type of workloads or number of concurrent queries that your Virtual Warehouse must serve during peak periods before deciding the size of your Virtual Warehouse.

If you are upgrading from legacy platforms or CDP to CDW PvC and migrating workloads that used to run on YARN, you may want to consider building your Virtual Warehouse instance as a copy of the compute resources that you have defined in the YARN queues. If these YARN queues are sized based on workload types, such as BI or ETL, you may use this document as a reference to plan your Virtual Warehouse instance.

In CDW, a Virtual Warehouse is categorized based on sizes that represent the size of an Executor Group that handles query requests. Queries can only run within the boundaries of an Executor Group.

Mixing workloads with different characteristics in the same Executor Group makes it difficult to predict Service Level Agreements (SLAs). Therefore, it is important that you understand your workloads and plan your Virtual Warehouses based on the workload types.

  • Business Intelligence (BI) workloads
    Business Intelligence (BI) workloads are usually smaller with more targeted datasets. Response times need to be optimized because operations depend on near real-time analysis.
  • Research BI workloads
    Research BI workloads dive deeper into data exploration than BI workloads. Datasets used in research may not always be optimized for performance. In Research BI workloads, the balance can tilt more towards the cost side of the equation than that of BI workloads.
  • Discovery ETL workloads
    Discovery ETL workloads are unpredictable and take “Research BI” into longer-running exploratory pipelines that might yield desired results.
  • Production ETL workloads
    Production ETL workloads are predictable and usually come with SLAs. They build out the data models used by BI and Research that run the business. They can also be very resource intensive and traverse through a lot of data.

You must create a Virtual Warehouse instance with these workload characteristics in mind. Data Services, where Virtual Warehouse instances are created, provide isolation for other Virtual Warehouses or Data Service instances running workloads. However, there is no isolation within an Executor Group of a Virtual Warehouse instance.

The following table lists the Virtual Warehouse sizes that you can choose based on the workload types that are described above:
XSMALL (2 executors) SMALL (10 executors) MEDIUM (20 executors) LARGE (40 executors) Custom
Research BI X X
Discovery ETL X X X
Production ETL X1 X1 X1
1 Consider the Query Isolation mode for Virtual Warehouse instances that run complex ETL-type queries requiring intensive data scanning.