Setting up Cloudera AI Workbenches for high volume workloads
Autoscaling in Cloudera AI scales clusters up to accommodate sessions, experiments, model metrics, jobs, and applications as user demand increases. To ensure that Cloudera AI Workbenches function reliably under high volume workloads, the core Cloudera AI flows (workbench suspend, resume, backup, upgrade, and edit) have been validated with such workloads. Clusters can also be downsized by adjusting the autoscale range, without disrupting control plane capabilities.
To prepare a Cloudera AI Workbench for high volume workloads, perform the following steps to modify the pod quota limit for the Cloudera AI tenant, and then do the same for the Cloudera AI Workbench.
- Open the workbench and create a project.
- In Project settings, click the lock button.
- In Advanced settings, add the key OVERRIDE_PODQUOTA and enter the new pod limit as its value.
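Before entering a value for OVERRIDE_PODQUOTA, it can help to sanity-check it. The following is a minimal sketch, not part of Cloudera AI: the helper name `check_pod_quota` is hypothetical, and it assumes the value must be a positive integer and that the 250-pod figure listed under "Verified flows and configurations" is the relevant ceiling to compare against.

```python
from typing import Tuple

# Assumption: 250 is the verified per-namespace pod count quoted in this document.
VERIFIED_MAX_PODS = 250


def check_pod_quota(raw_value: str) -> Tuple[int, bool]:
    """Parse a candidate OVERRIDE_PODQUOTA value (hypothetical helper).

    Returns (limit, within_verified), where within_verified is True when the
    limit does not exceed the verified 250-pod figure.
    Raises ValueError if the value is not a positive integer.
    """
    try:
        limit = int(raw_value.strip())
    except ValueError:
        raise ValueError(f"pod limit must be an integer, got {raw_value!r}") from None
    if limit <= 0:
        raise ValueError(f"pod limit must be positive, got {limit}")
    return limit, limit <= VERIFIED_MAX_PODS


if __name__ == "__main__":
    limit, verified = check_pod_quota("250")
    print(f"limit={limit}, within verified range: {verified}")
```

A value above 250 is not rejected by this sketch; it is only flagged as outside the range that this document reports as verified.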
Verified flows and configurations
- Enhanced Pod Count Limit
Scaling up to 250 Pods within a single user namespace is verified, assuming that the workbench possesses sufficient computational resources and storage capacity to accommodate these pods.
- Auto-Scaling Certification
The Control Plane actions have been verified to scale up to 100 Nodes.
- Suspension and resumption of 100 Nodes have been successfully tested.
- Workbench Management
- Backup of workbenches with up to 100 Nodes.
- Upscaling and downscaling of cluster nodes between 1 and 100 has been verified by changing the worker node range in the edit section of the workbench.
- Upgrading a 100 Node workbench from version 'V-1' has been successfully tested.
- Basic sanity testing of the following workload flows was conducted for workbenches with 100 Nodes:
- Creation of Sessions, Applications, Jobs, Model Metrics and Experiments for a user after workbench creation and upgrade.
- Deletion of Sessions, Applications, Jobs, Model Metrics and Experiments for high volume workloads.
- I/O and memory usage validation during upscale and downscale operations.
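The verified node figures above can be expressed as a simple pre-check on a planned worker node autoscale range. This is an illustrative sketch only: the function and constant names are hypothetical, and the 1 to 100 bounds are taken from the verified range stated in this section, not from any product API.

```python
# Assumption: bounds reflect the 1-100 node range this document reports as verified.
VERIFIED_MIN_NODES = 1
VERIFIED_MAX_NODES = 100


def range_is_verified(min_nodes: int, max_nodes: int) -> bool:
    """Return True if the planned autoscale range lies inside the verified range.

    Raises ValueError when the range itself is malformed (min above max).
    """
    if min_nodes > max_nodes:
        raise ValueError(
            f"minimum ({min_nodes}) must not exceed maximum ({max_nodes})"
        )
    return VERIFIED_MIN_NODES <= min_nodes and max_nodes <= VERIFIED_MAX_NODES


if __name__ == "__main__":
    for planned in [(1, 100), (10, 40), (1, 150)]:
        print(planned, "verified" if range_is_verified(*planned) else "outside verified range")
```

A range outside these bounds may still work; the check only indicates whether it falls within what has been tested.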
Known issues
The following known issues were identified during the certification process and are currently being addressed.
- When Cloudera AI Workbench users create sessions concurrently, some pods may fail to start; the observed failure rate is approximately 3%.
- For high volume workbenches, Cloudera AI Workbench applications may fail to restart after modify, resume, or upgrade operations. In this case, manually restart the affected application.
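To get a feel for what a failure rate of roughly 3% means when many sessions are created at once, the arithmetic can be sketched as follows. This is an estimate under stated assumptions: the rate is treated as exactly 0.03, and pod failures are assumed to be independent of one another, which this document does not confirm. The function names are hypothetical.

```python
# Assumption: the approximate 3% figure is treated as an exact, independent
# per-pod failure probability for the purpose of this estimate.
FAILURE_RATE = 0.03


def expected_failures(concurrent_sessions: int) -> float:
    """Expected number of pods that fail to start."""
    return concurrent_sessions * FAILURE_RATE


def prob_at_least_one_failure(concurrent_sessions: int) -> float:
    """Probability that at least one pod fails, assuming independence."""
    return 1.0 - (1.0 - FAILURE_RATE) ** concurrent_sessions


if __name__ == "__main__":
    for n in (10, 50, 100, 250):
        print(
            f"{n} sessions: expect {expected_failures(n):.1f} failures, "
            f"P(at least one) = {prob_at_least_one_failure(n):.2f}"
        )
```

Under these assumptions, creating 100 sessions concurrently would be expected to produce about 3 failed pods, so retrying failed session launches is a reasonable precaution at that scale.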