Troubleshooting ML Workspaces on AWS

This topic describes the ML workspace provisioning workflow and tells you how to start debugging issues with ML workspaces on AWS.

ML Workspace Provisioning Workflow

When you provision a new ML Workspace on AWS, CML performs the following actions:
  1. Communicates with the CDP Management Console to check your AWS credentials. It will also enable Single Sign-On so that authorized CDP users are automatically logged in to the workspace that will be created.
  2. Provisions an NFS filesystem for the workspace on your cloud service provider. On AWS, CML will provision storage on EFS.
  3. Provisions a Kubernetes cluster on your cloud service provider. This cluster runs the workspace infrastructure and compute resources. On AWS, CML provisions an EKS cluster.
  4. Mounts the provisioned NFS filesytem to the Kubernetes cluster.
  5. Provisions TLS certificates for the workspace using LetsEncrypt.
  6. Registers the workspace with the cloud provider's DNS service. On AWS, this is Route53.
  7. Installs Cloudera Machine Learning onto the EKS cluster.

Troubleshooting Resources

Any of the steps listed above can experience failures. To start debugging, you will require access to one or more of the following resources.
  • Workspace > Details Page

    Each workspace has an associated Details page that lists important information about the workspace. To access this page, sign in to CDP, go to ML Workspaces and click on the workspace name.

    This page lists basic information about the workspace such as who created it and when. More importantly, it includes a link to the environment where the workspace was created, a link to the underlying EKS cluster on AWS, a list of tags associated with the workspace, and the computing resources in use. The rest of this topic explains how to use these resources.

  • Workspace > Events Page

    Each workspace also has an associated Events page that captures every action performed on the workspace. This includes creating, upgrading, and removing the workspace, among other actions. To access this page, sign in to CDP, go to ML Workspaces, click on the workspace name, and then click Events.

    Click the View Logs button associated with an action to see a high-level overview of all the steps performed by CML to complete the action.

    The Request ID associated with each action is especially useful in case of a failure as it allows Cloudera Support to efficiently track the series of operations that led to the failure.

  • Environment > Summary Page

    CML workspaces depend quite heavily on the environment in which they are provisioned. Each environment's Summary page lists useful information that can help you debug issues with the CML service. You can access the environment directly from the workspace Details page.

    This page includes important information such as:
    • Credential Setup - Tells you how security has been configured for the environment. Your cloud credential gives CDP access to the region and virtual network that make up the environment thus allowing CDP to provision resources within that environment.
    • Region - The AWS region where the environment is provisioned. This is especially important because it tells you which region's AWS console you might need to access for further debugging.
    • Network - The VPC and subnets that were created for the environment. Each CML workspace requires a set of unique IP addresses to run all of its associated Kubernetes services. If you begin to run out of IP addresses, you will need these VPC and subnet IDs to debug further in the AWS console.
    • Logs - When you create a CDP environment, you are asked to specify an S3 bucket in that environment that will be used to store logs. All CML operational logs and Spark logs are also written to this bucket.

      You can use the AWS console to access these logs. Alternatively, Site administrators can download these logs directly from their workspace Site Admin panel (Admin > Support).

  • AWS Management Console

    If you have all the relevant information about the environment and the workspace, you can go to the AWS console (for the region where your environment was created) to investigate further. The AWS Management Console has links to dashboards for all the services used by CML.

    • EC2

      You can use the EC2 service dashboards to check the instance-type (CPU, GPU), VPC, subnet, and security group limits imposed on your AWS account. For example, there is typically a limit of 5 VPCs per region.

      If you need more resources, submit a request to Amazon to raise the limit of a resource.

    • EKS

      EKS will give you more information such as the version of Kubernetes CML is using, network information, and the status of the cluster. The workspace Details page gives you a direct link to the provisioned EKS cluster on the AWS console.

    • VPC

      Use the VPC ID obtained from the CDP environment Summary page to search for the relevant VPC where you have provisioned or are trying to provision an ML workspace. Each CML workspace requires a set of unique IP addresses to run all of its associated Kubernetes services. You can use this service to see how many IP addresses are available for each subnet.

    • S3

      Use the S3 bucket configured for the environment to check/download logs for more debugging.

    • Tags

      When provisioning an ML workspace, you will have the option to assign one or more tags to the workspace. These tags are then applied to all the underlying AWS resources used by the workspace. If failures occur during provisioning or de-provisioning, it can be very useful to simply query the tags associated with the workspace to see if any resources need to be cleaned up manually. Tags associated with a workspace are available on the workspace Details page.

      You can search by tags in the EC2 and VPC services. You can also use the AWS CLI to search for specific tags: resourcegroupstaggingapi

    • Trusted Advisor (available with AWS Support)

      Use the Trusted Advisor dashboard for a high-level view of how you are doing with your AWS account. The dashboard displays security risks, service limits, and possible areas to optimize resource usage. If you have access to AWS Support, it's a good idea to review your current account status with Trusted Advisor before you start provisioning ML workspaces.