AWS Account Prerequisites for

To successfully provision an Cloudera AI Workbench, there are many prerequisites that you must ensure are met. Carefully go through this section step by step.

  1. Review the AWS Account Prerequisites for Cloudera Data Platform

    Verify that the AWS account that you would like to use for Cloudera Data Platform has the required resources and that you have the permissions required to manage these resources.

    Instructions: AWS Account Requirements

  2. Review the Cloudera Machine Learning-Specific AWS Resource Requirements

    Provisioning an Cloudera AI Workbench will require access to the following AWS resources. Make sure your AWS account has access to these resources.

    • AWS Services used by Cloudera Machine Learning
      1. Compute - Amazon Elastic Kubernetes Service (EKS)
      2. Load Balancing - Amazon Network Load Balancer (NLB)
      3. Key Management - AWS Key Management Service (KMS)
      4. DNS - Amazon Route 53, hosted by Cloudera
      5. Persistent Storage - Amazon Elastic Block Store (EBS)
      6. Project File Storage - Amazon Elastic File System (EFS) for project file storage
      7. Command Line Interface - AWS Command Line Interface (CLI).
      8. Security Token Service - AWS Security Token Service (STS)
    • VPC Requirements - You can either use an existing VPC or allow Cloudera Data Platform to create one for you.
      • Option 1. Using your own VPC

        • Recommended requirements: Divide the address space according to the following recommended sizes:
          • 3 x /19 private subnets. Each subnet should be created in a separate Availability Zone for the EKS worker nodes.
          • 3 x /24 public subnets. These should also be created in three separate Availability Zones, using the same zones as the private subnets.
          • Ensure the CIDR block for the subnets is sized appropriately.
          • You must enable Amazon DNS with the VPC. Corporate DNS is not supported. For guidelines on how verify your DNS settings, refer to sections 1-3 in AWS environment requirements checklist for the Data Warehouse service.

          Private subnets should have routable IPs over your internal VPN. If IPs are not routable, private Cloudera Machine Learning endpoints will need to be accessed via a SOCKS proxy. Cloudera recommends creating routable IPs by setting up VPN connections between networks, and not using any public load balancers. If a fully-private network configuration is not feasible, use of a SOCKS proxy to access Cloudera Machine Learning is possible, but is not recommended.

          Tag the VPC and the subnets as shared so that Kubernetes can find them. For load balancers to be able to choose the subnets correctly, you are also required to tag private subnets with the kubernetes.io/role/internal-elb:1 tag, and public subnets with the kubernetes.io/role/elb:1 tag.

      • Option 2. Cloudera Data Platform creates a new VPC

        If you choose to allow Cloudera Data Platform to create a new VPC, three subnets will be created automatically. One subnet is created for each availability zone assuming three AZs per region; If a region has two AZs instead of three, then still three subnets are created, two in the same AZ.

        You will be asked to specify a valid CIDR in IPv4 range that will be used to define the range of private IPs for EC2 instances provisioned into these subnets.

      • Related AWS documentation: Amazon EKS - Cluster VPC Considerations, Creating a VPC for your Amazon EKS Cluster

    • Ports Requirements
      HTTPS access to is available over port 443 for the following cases:
      • internal only - should be accessible from your organization's network, but not the public internet
      • internet facing - should be accessible from the public internet as well as your internal organization's network
      This is in addition to the ports requirements noted here for Cloudera Data Platform's default security group: Management Console - Security groups.
    • Firewall requirements

      Installations must comply with firewall requirements set by cloud providers at all times. Ensure that ports required by the provider are not closed. For example, Kubernetes services have requirements documented in Amazon EKS security group considerations.

      Also, for information on repositories that must be accessible to set up workspaces, see Outbound network access destinations for AWS.

  3. Review the default AWS service limits and your current AWS account limits

    By default, AWS imposes certain default limits for AWS services, per-user account. Make sure you review your account's current usage status and resource limits before you start provisioning additional resources for Cloudera Data Platform and Cloudera Machine Learning.

    For example, depending on your AWS account, you might only be allowed to provision a certain number of CPU instances, or you might not have default access to GPU instances at all. Make sure to review your AWS service limits before your proceed.

    Related AWS documentation: AWS Service Limits, Amazon EC2 Resource Limits.
  4. Review supported AWS regions

    Cloudera Data Platform supports the following AWS regions: Supported AWS regions. However, the Cloudera Machine Learning service requires AWS Elastic Kubernetes Service (EKS). Make sure you select a region that includes EKS.

    Related AWS documentation: Region Table (AWS Documentation).

  5. Set up an AWS Cloud Credential

    Create a role-based AWS credential that allows Cloudera Data Platform to authenticate with your AWS account and has authorization to provision AWS resources on your behalf. Role-based authentication uses an IAM role with an attached IAM policy that has the minimum permissions required to use Cloudera Data Platform.

    Once you have created this IAM policy, register it in Cloudera Data Platform as a cloud credential. Then, reference this credential when you are registering the environment in the next step.

    Instructions: Introduction to the role-based provisioning credential for AWS

  6. Register an AWS Environment

    A Cloudera Data Platform User with the role of Power User must register an environment for their organization. An environment determines the specific cloud provider region and virtual network in which resources can be provisioned, and includes the credential that should be used to access the cloud provider account.

    Instructions: Register an AWS Environment

  7. Ensure private subnets have outbound internet connectivity

    Also, ensure that your private subnets have outbound internet connectivity. Check the route tables of private subnets to verify the internet routing. Worker nodes must be able to download Docker images for Kubernetes, billing and metering information, and to perform API server registration.

  8. Ensure the Amazon Security Token Service (STS) is activated

    To successfully activate an environment in the Data Warehouse service, you must ensure the Amazon STS is activated in your AWS VPC:
    1. In the AWS Management Console home page, select IAM under Security, Identity, & Compliance.
    2. In the Identity and Access Management (IAM) dashboard, select Account settings in the left navigation menu.
    3. On the Account settings page, scroll down to the section for Security Token Service (STS).
    4. In the Endpoints section, locate the region in which your environment is located and make sure that the STS service is activated.
  9. Cloudera Machine Learning Role Requirements

    There are two Cloudera Data Platform user roles associated with the Cloudera Machine Learning service: MLAdmin and MLUser. Any Cloudera Data Platform user with the EnvironmentAdmin (or higher) access level must assign these roles to users who require access to the Cloudera Machine Learning service within their environment.

    Furthermore, if you want to allow users to log in to provisioned workspaces and run workloads on them, this will need to be configured separately.

    Instructions: Configuring User Access to