Managing costs in the public cloud environments for Cloudera Data Warehouse

Cost optimization in cloud environments is top priority for enterprises. Eighty percent of cloud costs are determined by the number of compute instances you use. Compute instances can be virtual machines or containers. Cloudera Data Warehouse (CDW) Public Cloud and your cloud provider give you ways to monitor and control the cloud resources you use.

Setting resource limits with Cloudera Data Warehouse service

Cloudera Data Warehouse service provides the following ways to manage your cloud costs:

  • Choose Virtual Warehouse size: Virtual Warehouse size specifies the number of executor nodes used by the Virtual Warehouse, which translates to compute instances. Before you create a Virtual Warehouse, determine the number of concurrent queries or users your Virtual Warehouse must serve during peak periods. This information helps you determine what size of Virtual Warehouse you need. Choose the size based on the number of nodes you typically use for clusters in an on-premises deployment.
  • Set auto-scaling thresholds: When you create a Virtual Warehouse, you can define auto-scaling, which sets limits on how many cloud resources can be consumed to meet workload demands. In addition, you can also set the maximum time a Virtual Warehouse idles before shutting down. Both settings ensure that you only use cloud resources that you need when you need them, helping you to manage your costs in the cloud.

Setting resource limits with your cloud provider

Another way to manage your cloud costs is by setting effective resource limits with your cloud provider. Cloud providers offer ways to set the overall limit on numbers of virtual machines or containers that can be used for your account. You can also save on cloud expenses by shutting down resources when they are not in use. CDP has built-in functionality to shut down cloud resources when not in use. If necessary, you can also terminate resources to save on costs related to disk snapshots, reserved IP addresses, and so on. In addition, cloud providers might offer tools to help you save. For example, AWS offers AWS Trusted Advisor and CloudWatch. Microsoft Azure offers Azure Cost Management and Azure Advisor.

Always active, shared services

Several shared service nodes are always active in a Cloudera Data Warehouse environment. A number of infrastructure pods are required at all times, and sufficient nodes must be deployed for these pods. The following table lists actions and the impact of those actions on shared service nodes.
Action Impact on Shared Services Nodes Comments
Activating a Cloudera Data Warehouse environment Yes Starts with minimum 3 nodes
Creating Database Catalog Maybe It depends on the pod placement logic
Creating Impala/Hive Virtual Warehouse Yes This also adds more shared service pods to support compute nodes (weavenet, kube-proxy etc.)
Deleting Impala/Hive Virtual Warehouse Maybe

It depends on the pod placement logic.

“Cooldown” period also plays a role.

Compute autoscaling Yes Autoscaling compute nodes will add hue, hs2 pods.
The following scaling factors also affect shared services nodes.
  • The number of Virtual Warehouses you run increases the required shared service resources.
  • The T-shirt you set for Virtual Warehouses and autoscaling on executors has impact as shown in the following example:

    Hue instance (part of shared service) count = (query-executors/10) + 1

    => from 2 to 10 query executors then 1 Hue server Pod instance,

    => from 10 to 19 query executors then 2 Hue server Pod instances

    => from 20 to 29 query executors then 3 Hue server Pod instances