Troubleshooting Cloudera AI Workbenches on AWS

This topic describes the Cloudera AI Workbench provisioning workflow and tells you how to start debugging issues with Cloudera AI Workbenches on AWS.

Cloudera AI Workbench Provisioning Workflow

When you provision a new Cloudera AI Workbench on AWS, Cloudera AI performs the following actions:

Communicates with the Cloudera Management Console to check your AWS credentials. It also enables Single Sign-On so that authorized Cloudera users are automatically logged in to the new workbench.
Provisions an NFS filesystem for the workbench on your cloud service provider. On AWS, Cloudera AI provisions storage on EFS.
Provisions a Kubernetes cluster on your cloud service provider. This cluster runs the workbench infrastructure and compute resources. On AWS, Cloudera AI provisions an EKS cluster.
Mounts the provisioned NFS filesystem to the Kubernetes cluster.
Provisions TLS certificates for the workbench using Let's Encrypt.
Registers the workbench with the cloud provider's DNS service. On AWS, this is Route53.
Installs Cloudera AI onto the EKS cluster.

Troubleshooting resources

Any of the steps listed above can fail. To start debugging, you need access to one or more of the following resources.

Workbench > Details Page
Each workbench has an associated Details page that lists important information about the workbench. To access this page, sign in to Cloudera, go to Cloudera AI Workbenches and click on the workbench name.
This page lists basic information about the workbench such as who created it and when. More importantly, it includes a link to the environment where the workbench was created, a link to the underlying EKS cluster on AWS, a list of tags associated with the workbench, and the computing resources in use. The rest of this topic explains how to use these resources.
Workbench > Events Page
Each workbench also has an associated Events page that captures every action performed on the workbench. This includes creating, upgrading, and removing the workbench, among other actions. To access this page, sign in to Cloudera, go to Cloudera AI Workbenches, click on the workbench name, and then click Events.
Click the View Logs button associated with an action to see a high-level overview of all the steps performed by Cloudera AI to complete the action.
The Request ID associated with each action is especially useful in case of a failure as it allows Cloudera Support to efficiently track the series of operations that led to the failure.
Environment > Summary Page

Cloudera AI Workbenches depend quite heavily on the environment in which they are provisioned. Each environment's Summary page lists useful information that can help you debug issues with the Cloudera AI service. You can access the environment directly from the workbench Details page.
This page includes important information such as:
- Credential Setup - Tells you how security has been configured for the environment. Your cloud credential gives Cloudera access to the region and virtual network that make up the environment, which allows Cloudera to provision resources within that environment.
- Region - The AWS region where the environment is provisioned. This is especially important because it tells you which region's AWS console you might need to access for further debugging.
- Network - The VPC and subnets that were created for the environment. Each Cloudera AI Workbench requires a set of unique IP addresses to run all of its associated Kubernetes services. If you begin to run out of IP addresses, you will need these VPC and subnet IDs to debug further in the AWS console.
- Logs - When you create a Cloudera environment, you are asked to specify an S3 bucket in that environment that will be used to store logs. All Cloudera AI operational logs and Spark logs are also written to this bucket.
  
  You can use the AWS console to access these logs. Alternatively, site administrators can download these logs directly from the workbench Site Admin panel (Admin > Support).
  
  Note: If you file a support case, Cloudera Support will not automatically have access to these logs because they live in your environment.
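The Network item above mentions running out of IP addresses. As a rough sketch of how that check could be scripted instead of done by hand in the console, the following Python function lists the free IP addresses in each subnet of a VPC. It assumes the boto3 library and an EC2 client for the environment's region; the function name `subnet_ip_availability` and the placeholder VPC ID and region are illustrative and not part of Cloudera AI.

```python
from typing import Any, Dict, List


def subnet_ip_availability(ec2_client: Any, vpc_id: str) -> List[Dict[str, Any]]:
    """Return the subnets of a VPC, sorted so the most constrained come first.

    ec2_client is expected to behave like boto3.client("ec2").
    """
    # describe_subnets accepts a "vpc-id" filter and reports the number of
    # unused private IPv4 addresses as AvailableIpAddressCount.
    response = ec2_client.describe_subnets(
        Filters=[{"Name": "vpc-id", "Values": [vpc_id]}]
    )
    rows = [
        {
            "subnet_id": subnet["SubnetId"],
            "cidr": subnet["CidrBlock"],
            "az": subnet["AvailabilityZone"],
            "available_ips": subnet["AvailableIpAddressCount"],
        }
        for subnet in response["Subnets"]
    ]
    return sorted(rows, key=lambda row: row["available_ips"])


# Example usage (needs boto3 and AWS credentials; the region and VPC ID
# below are placeholders, take the real values from the Summary page):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-west-2")
#   for row in subnet_ip_availability(ec2, "vpc-0123456789abcdef0"):
#       print(row)
```

Passing the client in as an argument keeps the function independent of how credentials and the region are configured on your machine.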
AWS Management Console

If you have all the relevant information about the environment and the workbench, you can go to the AWS console (for the region where your environment was created) to investigate further. The AWS Management Console has links to dashboards for all the services used by Cloudera AI.
- EC2
  You can use the EC2 service dashboards to check the instance-type (CPU, GPU), VPC, subnet, and security group limits imposed on your AWS account. For example, there is typically a limit of 5 VPCs per region.
  If you need more resources, submit a request to Amazon to raise the limit of a resource.
- EKS
  EKS will give you more information such as the version of Kubernetes Cloudera AI is using, network information, and the status of the cluster. The workbench Details page gives you a direct link to the provisioned EKS cluster on the AWS console.
  Note: By default, users do not have Kubernetes-level access to the EKS cluster. If a user wants to use kubectl to debug issues with the EKS cluster directly, an MLAdmin must explicitly grant access using the instructions provided here: Granting Remote Access to Cloudera AI Workbenches on EKS.
- VPC
  Use the VPC ID obtained from the Cloudera environment Summary page to search for the relevant VPC where you have provisioned or are trying to provision a Cloudera AI Workbench. Each Cloudera AI Workbench requires a set of unique IP addresses to run all of its associated Kubernetes services. You can use this service to see how many IP addresses are available for each subnet.
- S3
  Use the S3 bucket configured for the environment to check or download logs for further debugging.
- Tags
  When provisioning a Cloudera AI Workbench, you have the option to assign one or more tags to the workbench. These tags are then applied to all the underlying AWS resources used by the workbench. If failures occur during provisioning or de-provisioning, it can be very useful to query the tags associated with the workbench to see if any resources need to be cleaned up manually. Tags associated with a workbench are available on the workbench Details page.
  You can search by tags in the EC2 and VPC services. You can also use the AWS CLI resourcegroupstaggingapi command to search for specific tags.
- Trusted Advisor (available with AWS Support)
  Use the Trusted Advisor dashboard for a high-level view of the state of your AWS account. The dashboard displays security risks, service limits, and possible areas to optimize resource usage. If you have access to AWS Support, Cloudera recommends reviewing your current account status with Trusted Advisor before you start provisioning Cloudera AI Workbenches.
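The tag search described in the Tags item above can also be scripted. The following is a minimal sketch, assuming the boto3 library and a client for the Resource Groups Tagging API (the same service the AWS CLI resourcegroupstaggingapi command uses). The function name `find_resources_by_tags` and the example tag key and value are hypothetical; use the tags shown on your workbench Details page.

```python
from typing import Any, Dict, List


def find_resources_by_tags(tagging_client: Any, tags: Dict[str, str]) -> List[str]:
    """Return the ARNs of resources that carry every given tag key/value pair.

    tagging_client is expected to behave like
    boto3.client("resourcegroupstaggingapi").
    """
    # Multiple tag filters in one request are combined with AND.
    tag_filters = [{"Key": key, "Values": [value]} for key, value in tags.items()]
    arns: List[str] = []
    token = ""
    while True:
        kwargs: Dict[str, Any] = {"TagFilters": tag_filters}
        if token:
            kwargs["PaginationToken"] = token
        page = tagging_client.get_resources(**kwargs)
        arns.extend(m["ResourceARN"] for m in page["ResourceTagMappingList"])
        # An empty or missing token means there are no more pages.
        token = page.get("PaginationToken", "")
        if not token:
            return arns


# Example usage (needs boto3 and AWS credentials; the region and the tag
# below are placeholders):
#   import boto3
#   client = boto3.client("resourcegroupstaggingapi", region_name="us-west-2")
#   for arn in find_resources_by_tags(client, {"owner": "data-science"}):
#       print(arn)
```

Any ARNs that are still returned after a failed provisioning or de-provisioning attempt are candidates for manual cleanup. Note that this API only returns resources in the region the client is configured for.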
