AWS Account Prerequisites for ML Workspaces

To provision an ML workspace successfully, you must first meet the prerequisites listed below. Work through this section step by step.

  1. Review the AWS Account Prerequisites for CDP

    Verify that the AWS account that you would like to use for CDP has the required resources and that you have the permissions required to manage these resources.

    Instructions: AWS Account Requirements

  2. Review the Cloudera Machine Learning-Specific AWS Resource Requirements

    Provisioning an ML workspace will require access to the following AWS resources. Make sure your AWS account has access to these resources.

    • AWS Services used by Cloudera Machine Learning (CML)
      1. Compute - Amazon Elastic Kubernetes Service (EKS)
      2. Load Balancing - Amazon Network Load Balancer (NLB)
      3. Key Management - AWS Key Management Service (KMS)
      4. DNS - Amazon Route 53, hosted by Cloudera
      5. Persistent Storage - Amazon Elastic Block Store (EBS)
      6. Project File Storage - Amazon Elastic File System (EFS)
      7. Command Line Interface - AWS Command Line Interface (CLI)
      8. Security Token Service - AWS Security Token Service (STS)
    • VPC Requirements - You can either use an existing VPC or allow CDP to create one for you.
      • Option 1. Using your own VPC

        • Recommended requirements: Divide the address space according to the following recommended sizes:
          • 3 x /19 private subnets. Each subnet should be created in a separate Availability Zone for the EKS worker nodes.
          • 3 x /24 public subnets. These should also be created in three separate Availability Zones, using the same zones as the private subnets.
          • Ensure the CIDR block for the subnets is sized appropriately.
          • You must enable Amazon DNS with the VPC. Corporate DNS is not supported. For guidelines on how to verify your DNS settings, refer to sections 1-3 in AWS environment requirements checklist for the Data Warehouse service.

          Private subnets should have IPs that are routable over your internal VPN. Cloudera recommends making the IPs routable by setting up VPN connections between networks, and not using any public load balancers. If a fully private network configuration is not feasible and the IPs are not routable, private CML endpoints must be accessed through a SOCKS proxy. This is possible, but is not recommended.

          Tag the VPC and the subnets as shared so that Kubernetes can find them. For load balancers to be able to choose the subnets correctly, you are also required to tag private subnets with the kubernetes.io/role/internal-elb:1 tag, and public subnets with the kubernetes.io/role/elb:1 tag.

      • Option 2. CDP creates a new VPC

        If you choose to allow CDP to create a new VPC, three subnets are created automatically. In a region with three Availability Zones (AZs), one subnet is created in each AZ. If a region has only two AZs, three subnets are still created, with two of them in the same AZ.

        You will be asked to specify a valid CIDR in IPv4 range that will be used to define the range of private IPs for EC2 instances provisioned into these subnets.

      • Related AWS documentation: Amazon EKS - Cluster VPC Considerations, Creating a VPC for your Amazon EKS Cluster

    • Ports Requirements
      HTTPS access to ML workspaces is available over port 443 for the following cases:
      • internal only - should be accessible from your organization's network, but not the public internet
      • internet facing - should be accessible from the public internet as well as your internal organization's network
      This is in addition to the ports requirements noted here for CDP's default security group: Management Console - Security groups.
    • Firewall requirements

      Installations must comply with firewall requirements set by cloud providers at all times. Ensure that ports required by the provider are not closed. For example, Kubernetes services have requirements documented in Amazon EKS security group considerations.

      Also, for information on repositories that must be accessible to set up workspaces, see Outbound network access destinations for AWS.
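
    If you use your own VPC (Option 1), the recommended subnet layout can be checked before you register the environment. The following Python sketch is illustrative only and is not part of CDP. The function name check_subnet_layout and the dictionary keys (cidr, az, public, tags) are assumptions made for this example, and you would need to collect the subnet data from your own VPC. The sketch does not check the shared tag or whether the CIDR blocks overlap.

```python
import ipaddress

# Load balancer role tags named in the VPC requirements.
PRIVATE_TAG = "kubernetes.io/role/internal-elb"
PUBLIC_TAG = "kubernetes.io/role/elb"


def check_subnet_layout(subnets):
    """Return a list of problems; an empty list means the layout matches.

    Each subnet is a dict with hypothetical keys:
    cidr (str), az (str), public (bool), tags (dict of str to str).
    """
    problems = []
    private = [s for s in subnets if not s["public"]]
    public = [s for s in subnets if s["public"]]

    # Recommended sizes: 3 x /19 private subnets and 3 x /24 public subnets.
    for kind, group, max_prefix, tag in (
        ("private", private, 19, PRIVATE_TAG),
        ("public", public, 24, PUBLIC_TAG),
    ):
        if len(group) < 3:
            problems.append(f"expected 3 {kind} subnets, found {len(group)}")
        if len({s["az"] for s in group}) < len(group):
            problems.append(f"{kind} subnets share an Availability Zone")
        for s in group:
            net = ipaddress.ip_network(s["cidr"])
            # A longer prefix means a smaller address range.
            if net.prefixlen > max_prefix:
                problems.append(f"{s['cidr']} is smaller than /{max_prefix}")
            if s["tags"].get(tag) != "1":
                problems.append(f"{s['cidr']} is missing tag {tag}=1")

    # Public subnets should use the same zones as the private subnets.
    if {s["az"] for s in private} != {s["az"] for s in public}:
        problems.append("public and private subnets use different Availability Zones")
    return problems
```

    An empty result means that the subnets match the recommended sizes, zones, and load balancer tags. Otherwise, each entry in the result names one mismatch.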

  3. Review the default AWS service limits and your current AWS account limits

    By default, AWS imposes limits on AWS services for each user account. Review your account's current usage and resource limits before you start provisioning additional resources for CDP and CML.

    For example, depending on your AWS account, you might only be allowed to provision a certain number of CPU instances, or you might not have default access to GPU instances at all. Make sure to review your AWS service limits before you proceed.

    Related AWS documentation: AWS Service Limits, Amazon EC2 Resource Limits.
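
    The comparison described in this step is simple arithmetic: planned usage must fit within the limit minus what is already in use. The following Python sketch shows one way to express it. The function name quota_shortfalls and the resource names are hypothetical, and the numbers would have to come from your own AWS account.

```python
def quota_shortfalls(quotas, in_use, planned):
    """Return {resource: shortfall} for each resource that does not fit.

    quotas, in_use, and planned map a resource name (hypothetical labels
    such as "cpu_vcpus") to a number. A missing quota is treated as 0,
    which models an account with no default access to that resource.
    """
    shortfalls = {}
    for resource, extra in planned.items():
        available = quotas.get(resource, 0) - in_use.get(resource, 0)
        if extra > available:
            shortfalls[resource] = extra - available
    return shortfalls


# Example figures only; replace them with values from your account.
example = quota_shortfalls(
    quotas={"cpu_vcpus": 64, "gpu_vcpus": 0},
    in_use={"cpu_vcpus": 40},
    planned={"cpu_vcpus": 16, "gpu_vcpus": 8},
)
```

    With these example figures, the CPU request fits and the result reports a shortfall only for the GPU resource, which suggests requesting a limit increase before you provision.
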
  4. Review supported AWS regions

    CDP supports the following AWS regions: Supported AWS regions. However, the CML service requires Amazon Elastic Kubernetes Service (EKS), so make sure you select a region in which EKS is available.

    Related AWS documentation: Region Table (AWS Documentation).

  5. Set up an AWS Cloud Credential

    Create a role-based AWS credential that allows CDP to authenticate with your AWS account and has authorization to provision AWS resources on your behalf. Role-based authentication uses an IAM role with an attached IAM policy that has the minimum permissions required to use CDP.

    Once you have created the IAM role and its policy, register the role in CDP as a cloud credential. Then, reference this credential when you register the environment in the next step.

    Instructions: Introduction to the role-based provisioning credential for AWS
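
    For orientation, role-based access between AWS accounts generally relies on a trust policy attached to the IAM role. The following Python sketch builds a generic AWS cross-account trust policy document. It is an assumption about the general shape only: the function name is hypothetical, the account ID and external ID are placeholders, and the linked instructions remain the authority for the actual values and for the permissions policy.

```python
import json


def cross_account_trust_policy(trusted_account_id, external_id):
    """Build a generic IAM trust policy for a cross-account role.

    Both arguments are placeholders in this sketch; use the values
    given in the credential instructions, not the ones shown here.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # The account that is allowed to assume the role.
                "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
                "Action": "sts:AssumeRole",
                # The external ID guards against use by other parties.
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


policy = cross_account_trust_policy("123456789012", "example-external-id")
policy_json = json.dumps(policy, indent=2)
```

    The resulting JSON is the kind of document that is attached to a role as its trust relationship. The minimum permissions that CDP requires belong in a separate IAM policy attached to the same role.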

  6. Register an AWS Environment

    A CDP User with the role of Power User must register an environment for their organization. An environment determines the specific cloud provider region and virtual network in which resources can be provisioned, and includes the credential that should be used to access the cloud provider account.

    Instructions: Register an AWS Environment

  7. Ensure private subnets have outbound internet connectivity

    Ensure that your private subnets have outbound internet connectivity. Check the route tables of the private subnets to verify the internet routing. Worker nodes need this connectivity to download Docker images for Kubernetes, to exchange billing and metering information, and to perform API server registration.
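
    The route table check can be sketched in Python as follows. The input mirrors the shape of one RouteTables entry returned by the EC2 DescribeRouteTables API. The function name default_route_target is hypothetical, and the sketch only looks for an active IPv4 default route; it does not prove that traffic actually reaches the internet.

```python
# Route fields that can hold the target of a route in the EC2 API.
TARGET_KEYS = (
    "NatGatewayId",
    "TransitGatewayId",
    "GatewayId",
    "NetworkInterfaceId",
    "VpcPeeringConnectionId",
    "InstanceId",
)


def default_route_target(route_table):
    """Return the target of the active 0.0.0.0/0 route, or None if absent."""
    for route in route_table.get("Routes", []):
        if route.get("DestinationCidrBlock") != "0.0.0.0/0":
            continue
        # A blackhole route points at a target that no longer exists.
        if route.get("State") != "active":
            continue
        for key in TARGET_KEYS:
            if route.get(key):
                return route[key]
    return None
```

    A result of None for a private subnet's route table suggests that worker nodes in that subnet have no outbound path. For a private subnet, the returned target is typically a NAT gateway or a similar egress device.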

  8. Ensure the AWS Security Token Service (STS) is activated

    To successfully activate an environment in the Data Warehouse service, you must ensure that AWS STS is activated in the region in which your environment is located:
    1. In the AWS Management Console home page, select IAM under Security, Identity, & Compliance.
    2. In the Identity and Access Management (IAM) dashboard, select Account settings in the left navigation menu.
    3. On the Account settings page, scroll down to the section for Security Token Service (STS).
    4. In the Endpoints section, locate the region in which your environment is located and make sure that the STS service is activated.
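
    As a complement to the console steps above, you can probe the regional STS endpoint from a script. The following Python sketch assumes the standard AWS partition, the third-party boto3 library, and locally configured AWS credentials. The function names are hypothetical. A failed call can also be caused by missing credentials or network problems, so treat a False result only as a prompt to check the console.

```python
def regional_sts_endpoint(region):
    """Return the regional STS endpoint URL for the standard AWS partition."""
    return f"https://sts.{region}.amazonaws.com"


def sts_call_succeeds(region):
    """Return True if GetCallerIdentity succeeds against the regional endpoint."""
    # Imported here so that the URL helper works without boto3 installed.
    import boto3
    from botocore.exceptions import BotoCoreError, ClientError

    client = boto3.client(
        "sts",
        region_name=region,
        endpoint_url=regional_sts_endpoint(region),
    )
    try:
        client.get_caller_identity()
    except (BotoCoreError, ClientError):
        return False
    return True
```

    For example, calling sts_call_succeeds with the region of your environment sends one request to that region's STS endpoint and reports whether it was accepted.
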
  9. CML Role Requirements

    There are two CDP user roles associated with the CML service: MLAdmin and MLUser. Any CDP user with the EnvironmentAdmin (or higher) access level must assign these roles to users who require access to the Cloudera Machine Learning service within their environment.

    To allow users to log in to provisioned workspaces and run workloads on them, you must configure access separately.

    Instructions: Configuring User Access to ML Workspaces