Managing multiple CPU and GPU workloads using Resource Groups
Learn how to use Resource Groups and Resource Profiles to provision and manage multiple CPU and GPU environments within your Cloudera AI workbench. This provides enhanced control over workload scheduling and allows for the segregation of workloads based on instance types.
This feature eliminates the previous limitation of having only one CPU and a maximum of one GPU instance group per workspace. You can now provision multiple groups of both CPU and GPU instances as needed.
Resource Group
A Resource Group is a parent entity that defines the specific node type (for example, m5.xlarge) on which your workloads run. Resource Groups are provisioned through the Cloudera AI Control Plane. Every workbench requires at least one CPU Resource Group, and each Resource Group is identified by a unique name specified when it is created. Multiple Resource Groups can share the same instance type, but a single Resource Group maps to only one instance type. Neither the name nor the instance type of a Resource Group can be edited from within the Cloudera AI Workbench.
Resource Profile
Resource Profiles are an existing Cloudera AI feature that has been enhanced and is now segregated into CPU and GPU profiles. A resource profile defines the allowed resource combination (CPU cores, memory, and GPU count) that a workload will consume on a node.
Each Resource Profile is configured for exactly one Resource Group, establishing a one-to-many relationship in which one Resource Group can have many Resource Profiles. While the Resource Group dictates the type of node, the Resource Profile dictates the specific resource combination used, and its defined resource capacity must never exceed the maximum capacity of its associated Resource Group.
Workload Scheduling
The Resource Group feature provides enhanced control over workload scheduling by ensuring that a specific workload lands on the desired type of node. When creating any workload, such as a Session, Job, Application, or Model, the user now selects two key fields: the Resource Group, which defines the instance type (node type) to be used, and the Resource Profile, which dictates the specific CPU, memory, and GPU combination the workload will consume on that node.
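The relationship between Resource Groups, Resource Profiles, and workload selection can be sketched in a few lines of Python. This is only an illustrative model, not the Cloudera AI API: the class names, the `workload_request` function, and the capacity figures are all assumptions made for the example. It shows two rules from the text: a profile cannot exceed the capacity of its group, and a workload must pair a profile with the group that profile belongs to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceGroup:
    """Hypothetical model of a Resource Group: one name, one instance type."""
    name: str
    instance_type: str
    max_cpu: float          # illustrative node capacity, not a product value
    max_memory_gb: float
    max_gpu: int = 0


@dataclass(frozen=True)
class ResourceProfile:
    """Hypothetical model of a Resource Profile tied to exactly one group."""
    group: ResourceGroup
    cpu: float
    memory_gb: float
    gpu: int = 0

    def __post_init__(self) -> None:
        # A profile must never exceed the capacity of its Resource Group.
        g = self.group
        if (self.cpu > g.max_cpu
                or self.memory_gb > g.max_memory_gb
                or self.gpu > g.max_gpu):
            raise ValueError(
                f"profile exceeds the capacity of resource group {g.name!r}"
            )


def workload_request(kind: str, group: ResourceGroup,
                     profile: ResourceProfile) -> dict:
    """Combine the two selections a user makes when creating a workload."""
    if profile.group != group:
        raise ValueError("profile does not belong to the selected resource group")
    return {
        "kind": kind,                          # Session, Job, Application, or Model
        "instance_type": group.instance_type,  # decided by the Resource Group
        "cpu": profile.cpu,                    # decided by the Resource Profile
        "memory_gb": profile.memory_gb,
        "gpu": profile.gpu,
    }


# Example: one group with two profiles (the one-to-many relationship).
cpu_group = ResourceGroup("cpu-general", "m5.xlarge", max_cpu=4, max_memory_gb=16)
small = ResourceProfile(cpu_group, cpu=1, memory_gb=2)
large = ResourceProfile(cpu_group, cpu=4, memory_gb=16)
request = workload_request("Session", cpu_group, small)
```

Validating in the profile constructor means an oversized profile fails when it is defined rather than when a workload is launched, which mirrors the constraint that a profile's capacity must fit within its group.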
Resource Group feature transition and legacy profiles
When you upgrade your environment from a version where the Resource Group feature was disabled to a version where it is enabled, the behavior of your existing Resource Profiles will change, as they must now be linked to a specific Resource Group.
- Legacy profile visibility
All existing legacy Resource Profiles that were created without an associated Resource Group will no longer be visible in the UI after the upgrade.
Consequently, when a user attempts to launch a new Session, Job, Application, or Model, they will not see the unassociated legacy profile listed. These profiles are effectively inactive until they are manually linked to a Resource Group by an Administrator.
- Automatic default profile creation
During the upgrade process, the system automatically creates the minimum required Resource Group and its associated profiles to ensure basic functionality:
- Default CPU Resource Group: The system creates one default CPU Resource Group.
- Default Profiles: The system automatically attaches two default resource profiles to this new CPU Resource Group.
These two newly created default profiles are the only Resource Profiles visible immediately after the upgrade, until administrators manually restore or create others. This ensures that users have at least a basic set of profiles available to start workloads.
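The visibility rule after an upgrade can be illustrated with a small Python sketch. It is a simplified model under stated assumptions, not the product's upgrade logic: the `Profile` class, the `upgrade` and `visible_profiles` functions, and the names `default-cpu`, `default-1`, and `default-2` are all hypothetical. A profile whose `group` is `None` stands in for a legacy profile that has no associated Resource Group.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Profile:
    """Hypothetical profile record; group=None models a legacy profile."""
    name: str
    group: Optional[str] = None


def upgrade(profiles: List[Profile]) -> List[Profile]:
    """Model the upgrade: add one default CPU group with two default profiles."""
    default_group = "default-cpu"  # assumed name, for illustration only
    defaults = [
        Profile("default-1", default_group),
        Profile("default-2", default_group),
    ]
    # Legacy profiles are kept, but remain unlinked until an administrator acts.
    return profiles + defaults


def visible_profiles(profiles: List[Profile]) -> List[str]:
    """Only profiles linked to a Resource Group are shown to users."""
    return [p.name for p in profiles if p.group is not None]


legacy = [Profile("legacy-small"), Profile("legacy-large")]
after_upgrade = upgrade(legacy)
visible_before_linking = visible_profiles(after_upgrade)

# An administrator manually links one legacy profile to a Resource Group.
after_upgrade[0].group = "default-cpu"
visible_after_linking = visible_profiles(after_upgrade)
```

In this model the legacy profiles are never deleted; they are filtered out of the visible list until they are linked, which matches the description of them as inactive rather than removed.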
