Managing multiple CPU and GPU workloads using Resource Groups

Learn how to use Resource Groups and Resource Profiles to provision and manage multiple CPU and GPU environments within your Cloudera AI workbench. This provides enhanced control over workload scheduling and allows for the segregation of workloads based on instance types.

This feature removes the earlier limitation of one CPU instance group and at most one GPU instance group per workbench. You can now provision multiple groups of both CPU and GPU instances as needed.

Resource Group

A Resource Group serves as a parent entity that defines the specific type of node (for example, m5.xlarge) where your workloads run. Resource Groups are provisioned through the Cloudera AI Control Plane, and every workbench requires at least one CPU Resource Group, identified by the unique name specified when the group is created. Multiple Resource Groups can share the same instance type, but each Resource Group maps to exactly one instance type, and neither its name nor its instance type can be edited from within the Cloudera AI Workbench.
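As a conceptual illustration only (the attribute names below are assumptions and do not reflect the Cloudera AI Control Plane schema), a Resource Group can be thought of as a named binding to a single instance type:

```python
from dataclasses import dataclass

# Illustrative model only; attribute names are assumptions, not the
# actual Cloudera AI Control Plane schema.
@dataclass(frozen=True)
class ResourceGroup:
    name: str           # unique name chosen when the group is created
    instance_type: str  # for example "m5.xlarge"; exactly one type per group

cpu_group = ResourceGroup(name="cpu-general", instance_type="m5.xlarge")
gpu_group = ResourceGroup(name="gpu-training", instance_type="p3.2xlarge")
```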

Resource Profile

Resource Profiles are an existing Cloudera AI feature that has been enhanced and is now segregated into CPU and GPU profiles. A Resource Profile defines the combination of resources (CPU cores, memory, and GPU count) that a workload consumes on a node.

Each Resource Profile is configured for exactly one Resource Group, establishing a one-to-many relationship in which one Resource Group can have many Resource Profiles. While the Resource Group dictates the type of node, the Resource Profile dictates the specific resource combination consumed on that node, and the capacity defined by a profile must never exceed the maximum capacity of its associated Resource Group. The sketch after this paragraph makes that constraint concrete.
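The following is a minimal sketch of the capacity rule described above. The node capacities, class names, and field names are illustrative assumptions, not the product API; real capacities come from the instance type backing the Resource Group.

```python
from dataclasses import dataclass

# Hypothetical per-node capacities keyed by instance type (assumed values).
NODE_CAPACITY = {
    "m5.xlarge": {"cpu": 4, "memory_gb": 16, "gpu": 0},
    "p3.2xlarge": {"cpu": 8, "memory_gb": 61, "gpu": 1},
}

@dataclass(frozen=True)
class ResourceProfile:
    resource_group: str  # name of the single Resource Group this profile belongs to
    instance_type: str   # instance type of that Resource Group
    cpu: int
    memory_gb: int
    gpu: int = 0

def fits_resource_group(profile: ResourceProfile) -> bool:
    """A profile must never exceed the capacity of its group's node type."""
    cap = NODE_CAPACITY[profile.instance_type]
    return (profile.cpu <= cap["cpu"]
            and profile.memory_gb <= cap["memory_gb"]
            and profile.gpu <= cap["gpu"])

# One Resource Group ("cpu-general" on m5.xlarge) with several profiles.
profiles = [
    ResourceProfile("cpu-general", "m5.xlarge", cpu=1, memory_gb=2),
    ResourceProfile("cpu-general", "m5.xlarge", cpu=2, memory_gb=8),
    ResourceProfile("cpu-general", "m5.xlarge", cpu=8, memory_gb=32),  # exceeds the node
]
for p in profiles:
    status = "fits" if fits_resource_group(p) else "exceeds node capacity"
    print(f"{p.cpu} vCPU / {p.memory_gb} GB: {status}")
```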

Workload Scheduling

The Resource Group feature provides enhanced control over workload scheduling by ensuring that a specific workload lands on the desired type of node. When you create any workload (such as a Session, Job, Application, or Model), you now select two key fields: the Resource Group, which defines the instance type (node type) to be used, and the Resource Profile, which dictates the specific CPU, memory, and GPU combination the workload consumes on that chosen node.
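Conceptually, a workload creation request carries both selections. The field names in the sketch below are illustrative assumptions and do not reflect the actual Cloudera AI Workbench API.

```python
# Illustrative request body only; field names are assumptions, not the
# actual Cloudera AI Workbench API.
session_request = {
    "name": "training-session",
    "kernel": "python3",
    "resource_group": "gpu-training",   # selects the node (instance) type
    "resource_profile": {               # selects what the workload consumes on that node
        "cpu": 4,
        "memory_gb": 32,
        "gpu": 1,
    },
}

# With both fields set, the scheduler places the workload only on nodes
# belonging to the chosen Resource Group, sized by the chosen Resource Profile.
print(session_request)
```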