Fast autoscaling in COD

Cloudera Operational Database (COD) supports fast autoscaling for higher computing requirements. COD enables fast autoscaling whenever high CPU utilization or higher RPC latency is observed in the system.

The fast autoscaling is achieved using the following mechanisms.

Introduction to Compute Instance Group

A Compute node instance group in COD consists of nodes that host all the services, similar to those on Worker nodes, except the HDFS data services. The region servers on these nodes communicate with HDFS (data nodes) running on the Worker nodes.

The Compute nodes add to the higher compute capacity in the system. During scale-up operations due to higher compute requirements, the Compute nodes are scaled up using the same storage as on the existing Worker nodes. Hence, during the scale-down of these Compute nodes, the data remains in the Worker nodes and does not require data redistribution. The Worker nodes continue to host the data services and are scaled (up or down) according to the data requirements (such as Data storage and Regions per server).

Figure 1. COD Architecture

Metrics categorization

The metrics used for autoscaling are categorized into the following groups. The data metrics are used for the scaling of the Worker instance group while the compute metrics are used for the scaling of the Compute instance group.

  • Worker Group (Data-related metrics)
    1. HDFS usage
    2. Region density (Regions per server)
    3. Total Store File size across region servers
  • Compute Group (Compute-related metrics)
    1. CPU utilization
    2. RPC latency

Fast autoscaling mechanism

COD relies on a mechanism of suspending and resuming (start and stop) autoscaling, where the maximum allowed number of nodes in the system are instantiated but kept in a Stopped state. This mechanism is used for the scaling of the Compute instance group.

When a requirement for higher compute power arises in the cluster, COD starts the required number of Compute nodes. The scale-up time is drastically reduced because the Compute nodes are already initialized. Similarly, when the requirement for computing goes down, the Compute nodes are stopped but not dropped. They remain in the cluster in the Stopped state. The scale-up and scale-down time is reduced since the instance initialization is skipped during every scale-up or scale-down operation.

COD supports the suspending and resuming of fast autoscaling for the Compute nodes only. The Worker nodes continue to use the existing autoscaling mechanism.