Common Cloudera AI errors and solutions

The following sections describe recommended steps to start debugging common error messages you might see in the workbench logs (found under Events > View Logs).

Before you begin

Make sure you have reviewed the list of resources available to you for debugging on Cloudera AI and AWS: Troubleshooting Cloudera AI Workbenches on AWS

Timezone not properly set for scheduled job

If the timezone is no longer set correctly for a scheduled job, set it again: go to Job Settings, edit the timezone, and update the job.

AWS account resource limits exceeded (Compute, VPC, etc.)

Cloudera AI Workbench provisioning fails because Cloudera Data Platform could not get access to all the AWS resources needed to deploy a Cloudera AI Workbench. This is likely because your AWS account either does not have access to those resources or is hitting the resource limits imposed on it.

Sample errors include (from Events > View Logs):
Failed to provision cluster. Reason: Failed to wait for provisioner: Wait for status failed with status CREATE_FAILED: error creating eks cluster (cause: InvalidParameterException: Provided subnets subnet-0a648a0cc5976b7a9 Free IPs: 0 , need at least 3 IPs in each subnet to be free for this operation
Failed to mount storage. Reason: Failed to create mount target: NoFreeAddressesInSubnet: The specified subnet does not have enough free addresses to satisfy the request. 

AWS accounts have certain hard and soft resource limits imposed on them by default. For example, certain CPU/GPU instances that Cloudera AI allows you to provision might even have an initial default limit of 0 (set by AWS). This means if you attempt to provision a cluster with those instance types, your request will fail.

Aside from the CPU and GPU compute resource limits, there are other types of limits you can run into. For example, the second error shows that the subnets in your VPC do not have any more free IP addresses for the workbench (and each of the underlying Kubernetes pods). This occurs if the CIDR range specified while registering the environment was not large enough for your current needs.

You can use the AWS console to request an increase in limits as needed. Go to the AWS console for the region where the environment was provisioned and then navigate to EC2 > Limits.

For networking failures, navigate to EC2 > VPC. Search for the environment's VPC ID (available on the environment Summary page) to see the number of available IP addresses for each subnet. Request more resources as needed.
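The same subnet check can be scripted. The following is a minimal sketch, not a Cloudera-provided tool: it assumes boto3 is installed and AWS credentials are configured, and the function names and the threshold of 3 free addresses (taken from the first sample error above) are illustrative. It reads the AvailableIpAddressCount field that EC2 reports for each subnet and compares it with the subnet's total capacity, given that AWS reserves five addresses in every subnet.

```python
import ipaddress

# AWS reserves five addresses in every subnet (the first four and the last one).
AWS_RESERVED_PER_SUBNET = 5


def usable_capacity(cidr_block):
    """Return how many addresses a subnet with this CIDR block can ever hand out."""
    network = ipaddress.ip_network(cidr_block)
    return max(network.num_addresses - AWS_RESERVED_PER_SUBNET, 0)


def subnets_short_on_ips(subnets, minimum_free=3):
    """Return (subnet_id, free, capacity) for subnets with fewer than minimum_free IPs.

    `subnets` is the "Subnets" list from an EC2 DescribeSubnets response.
    The default of 3 is an assumption based on the sample error message.
    """
    short = []
    for subnet in subnets:
        free = subnet["AvailableIpAddressCount"]
        if free < minimum_free:
            capacity = usable_capacity(subnet["CidrBlock"])
            short.append((subnet["SubnetId"], free, capacity))
    return short


def fetch_subnets(vpc_id, region):
    """Fetch the subnets of one VPC. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the helpers above work without it

    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.describe_subnets(
        Filters=[{"Name": "vpc-id", "Values": [vpc_id]}]
    )
    return response["Subnets"]
```

For example, subnets_short_on_ips(fetch_subnets(vpc_id, region)) lists the subnets that would likely trigger the errors above. A small capacity value next to a low free count suggests that the CIDR range itself is too small, rather than that addresses are being held by stale resources.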

Related AWS documentation: AWS Service Limits, Amazon EC2 Resource Limits, EKS Cluster VPC Considerations, AWS CNI Custom Networking.

Access denied to AWS credential used for authentication

The cloud credential used to give Cloudera Data Platform access to your AWS account failed authentication. Therefore, Cloudera Data Platform could not provision the resources required to deploy a Cloudera AI Workbench.

Sample error (from Events > View Logs):
Failed to provision storage. Reason: Failed to create new file system: AccessDenied: User: arn:aws:iam::1234567890:user/cross-account-trust-user is not authorized to perform:

Your cloud credential gives Cloudera Data Platform access to the region and virtual network that make up the environment, which allows Cloudera Data Platform to provision resources within that environment. If authentication fails, go to your environment to see how the cloud credentials were set up and confirm whether your account has the permission to perform these actions.

Cloudera AI installation failures

While the steps to provision resources on AWS were completed successfully, the Cloudera AI Workbench installation on EKS failed.

Sample error (from Events > View Logs):
Failed to install Cloudera AI Workbench. Reason:Error: release mlx-mlx failed: timed out waiting for the condition

If you are an advanced user, you can log in to the underlying EKS cluster and use kubectl to investigate which pods are failing.
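As a starting point for that investigation, the sketch below lists pods that are not in a healthy phase. It is an illustration only: it assumes kubectl is installed and already configured for the workbench's EKS cluster, and the function names are hypothetical. It parses the JSON produced by kubectl get pods --all-namespaces -o json.

```python
import json
import subprocess

# Pod phases that are treated as healthy in this sketch.
HEALTHY_PHASES = {"Running", "Succeeded"}


def unhealthy_pods(pod_list):
    """Return (namespace, name, phase) for pods not Running or Succeeded.

    `pod_list` is the parsed JSON from
    `kubectl get pods --all-namespaces -o json`.
    """
    found = []
    for pod in pod_list.get("items", []):
        phase = pod.get("status", {}).get("phase", "Unknown")
        if phase not in HEALTHY_PHASES:
            meta = pod["metadata"]
            found.append((meta["namespace"], meta["name"], phase))
    return found


def load_pods_from_cluster():
    """Run kubectl against the current context and parse its JSON output."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "--all-namespaces", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Calling unhealthy_pods(load_pods_from_cluster()) returns the pods to inspect first. Note that the phase is a coarse signal: a pod whose containers keep restarting can still report Running, so kubectl describe pod and kubectl logs on suspicious pods remain necessary.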

Related AWS documentation: EKS and kubectl

Failures due to API throttling

These errors can be harder to prepare for due to their seemingly random nature. Occasionally, AWS will block API calls if it receives too many requests at the same time. For example, this can occur when multiple users are attempting to provision/delete/upgrade clusters at the same time.

Sample error (from Events > View Logs):
Failed to delete cluster. Reason: Failed to wait for deletion: Wait for status failed with status DELETE_FAILED: Throttling: Rate exceeded

Currently, if you see a 'Throttling: Rate exceeded' error, the recommendation is to try again later.
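If you drive provisioning or deletion from your own automation, "try again later" can be expressed as a retry with exponential backoff and random jitter. The following is a generic sketch, not part of Cloudera AI or the AWS SDK: the function name, the default delays, and the string matching on the error text are all assumptions chosen for illustration.

```python
import random
import time


def retry_on_throttling(operation, max_attempts=5, base_delay=2.0,
                        max_delay=60.0, sleep=time.sleep):
    """Call operation() and retry with backoff while it reports throttling.

    An error counts as throttling if its message contains "Throttling" or
    "Rate exceeded" (an assumption based on the sample error text).
    Any other error, or throttling on the final attempt, is re-raised.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as error:
            message = str(error)
            throttled = "Throttling" in message or "Rate exceeded" in message
            if not throttled or attempt == max_attempts:
                raise
            # Double the ceiling on each attempt, then pick a random delay
            # below it so that concurrent callers do not retry in lockstep.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
```

Here operation is any zero-argument callable that wraps the request you want to repeat. The jitter matters when several users or scripts are throttled at the same moment, because fixed delays would cause them all to retry together.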

Related AWS documentation: AWS API Request Throttling

De-provisioning failures

De-provisioning operations can sometimes fail if AWS resources are not terminated in the right order. This is usually due to timing issues where certain resources might take too long to terminate. This can result in a cascading set of failures where AWS cannot delete the next set of resources because they still have active dependencies on the previous set.

Sample error (from Events > View Logs):
Failed to delete cluster. Reason: Failed to wait for deletion: DELETE_FAILED: msg: failed to delete aws stack
Cloudformation says resource xyz has a dependent object (Service: AmazonEC2; Status Code: 400; Error Code: DependencyViolation; Request ID: 815928e2-277e-4b8b-9fed-4b89716a205b) EKS - cluster still existed, was blocking CF delete

Cloudera AI now includes a Force Delete option that removes the workbench from the Cloudera AI service. However, this does not mean that all the underlying resources have been cleaned up. This is where tags are very useful.

If you assigned tags to the workbench at the time of provisioning, you can use the AWS console or the CLI to query the tags associated with the workbench to see if any resources need to be cleaned up manually. Tags associated with a workbench are available on the workbench Details page.

You can search by tags in the EC2 and VPC services. You can also use the AWS CLI to search for specific tags: resourcegroupstaggingapi
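The tag search can also be scripted against the same Resource Groups Tagging API. The sketch below is a minimal illustration, assuming boto3 is installed and AWS credentials are configured; the function names are hypothetical, and the tag keys and values are whatever you assigned to the workbench at provisioning time.

```python
def build_tag_filters(tags):
    """Turn {"key": "value"} into the TagFilters list that GetResources expects."""
    return [{"Key": key, "Values": [value]} for key, value in sorted(tags.items())]


def leftover_arns(pages):
    """Collect resource ARNs from an iterable of GetResources response pages."""
    arns = []
    for page in pages:
        for mapping in page.get("ResourceTagMappingList", []):
            arns.append(mapping["ResourceARN"])
    return arns


def find_tagged_resources(tags, region):
    """List ARNs of resources in one region that carry all the given tags.

    Requires boto3 and AWS credentials.
    """
    import boto3  # imported lazily so the helpers above work without it

    client = boto3.client("resourcegroupstaggingapi", region_name=region)
    paginator = client.get_paginator("get_resources")
    return leftover_arns(paginator.paginate(TagFilters=build_tag_filters(tags)))
```

The call is scoped to a single region, so run it for the region where the environment was provisioned. Any ARNs that are still returned after a Force Delete are candidates for manual cleanup.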

Users unable to access provisioned Cloudera AI Workbenches

If you have provisioned a workbench but your colleagues cannot automatically access it using Cloudera Data Platform Single Sign-On, make sure that you have completed all the steps required to grant users access to workbenches: Configuring User Access to Cloudera AI. All Cloudera AI users must have Cloudera Data Platform accounts.