AWS Prerequisites

Amazon Web Services (AWS) prerequisites for Cloudera Data Engineering (CDE).

1. Review the AWS account prerequisites for CDP

Refer to the CDP AWS account requirements and verify that the AWS account you are using for CDP has the required resources, and that you have the permissions required to manage these resources.

2. Review the CDE-specific AWS Resource Requirements

Provisioning a CDE service and virtual clusters requires access to the following AWS resources.

AWS Services used by Cloudera Data Engineering (CDE)
  • Network – Amazon VPC (see below for requirements)
  • Compute – Amazon Elastic Kubernetes Service (EKS)
  • Load Balancing – Amazon ELB Classic Load Balancer
  • Key Management – AWS Key Management Service (KMS)
  • DNS – Amazon Route 53 (used by CDE, but hosted in Cloudera's AWS infrastructure)
  • Persistent Instance Storage – Amazon Elastic Block Store (EBS)
  • Persistent Service and Virtual Cluster Storage – Amazon Elastic File System (EFS)
  • Database – Amazon Relational Database Service (RDS)

VPC Requirements

You can use an existing VPC, or allow CDP to create one when you create an environment.

Option 1: use your own VPC

Minimum requirements:

  • CDE requires at least two subnets, each in a different Availability Zone (AZ). If you require a public endpoint for CDE, provision at least one public subnet.
  • Ensure that the CIDR block for the subnets is sized appropriately. For each CDE environment, in addition to providing enough IPs to accommodate the maximum number of autoscaling compute instances, allow for a fixed overhead of three instances for core CDE services plus approximately one instance for every two virtual clusters.
  • You must enable DNS for the VPC.
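
As a rough sketch of the sizing rule above (the instance and cluster counts below are illustrative assumptions; note also that each instance may consume multiple IP addresses under the AWS VPC CNI, so treat this as a lower bound):

```shell
# Illustrative back-of-envelope sizing for one CDE environment.
max_compute=50        # assumed maximum autoscaling compute instances
virtual_clusters=8    # assumed number of virtual clusters

# Fixed overhead: 3 instances for core CDE services,
# plus roughly 1 instance per 2 virtual clusters (rounded up).
overhead=$(( 3 + (virtual_clusters + 1) / 2 ))
required_instances=$(( max_compute + overhead ))
echo "Plan for at least ${required_instances} instances' worth of IPs."

# For reference, a /24 subnet provides 251 usable addresses
# (AWS reserves 5 addresses in every subnet).
usable_per_24=$(( 256 - 5 ))
echo "One /24 subnet provides ${usable_per_24} usable IPs."
```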

Recommended setup:

  • Cloudera recommends that you provision at least three subnets, each in a different Availability Zone (AZ). If you do not require a public endpoint, use three private subnets. If you require a public endpoint, use at least two private subnets and one public subnet.
  • Private subnets should have routable IPs over your internal VPN. If the IPs are not routable, private CDE endpoints must be accessed through a SOCKS proxy, which is not recommended.
  • Tag the VPC and the subnets as shared so that Kubernetes can find them. For load balancers to choose the subnets correctly, you must also tag private subnets with the kubernetes.io/role/internal-elb:1 tag, and public subnets with the kubernetes.io/role/elb:1 tag.

Note that only the load balancer needs to be on a public subnet for access to CDE. By default, if private subnets are available, CDE configures EKS to run on them.
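
The subnet tagging described above can be applied with the AWS CLI. A sketch, assuming hypothetical subnet IDs (substitute your own):

```shell
# Hypothetical subnet IDs -- replace with the IDs from your VPC.
PRIVATE_SUBNETS="subnet-0aaaaaaaaaaaaaaaa subnet-0bbbbbbbbbbbbbbbb"
PUBLIC_SUBNETS="subnet-0cccccccccccccccc"

# Private subnets: tag for internal load balancers.
for s in $PRIVATE_SUBNETS; do
  aws ec2 create-tags --resources "$s" \
    --tags Key=kubernetes.io/role/internal-elb,Value=1
done

# Public subnets: tag for internet-facing load balancers.
for s in $PUBLIC_SUBNETS; do
  aws ec2 create-tags --resources "$s" \
    --tags Key=kubernetes.io/role/elb,Value=1
done
```

These commands require live AWS credentials with EC2 tagging permissions.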

Option 2: CDP creates a new VPC

If you choose to allow CDP to create a new VPC, three subnets are created automatically, one per Availability Zone, assuming three AZs per region. If a region has only two AZs, three subnets are still created, with two of them in the same AZ.

You will be asked to specify a valid IPv4 CIDR range that defines the range of private IPs for EC2 instances provisioned into these subnets.

Related AWS documentation: Amazon EKS - Cluster VPC Considerations, Creating a VPC for your Amazon EKS Cluster

Port Requirements

HTTPS access to CDE endpoints is over port 443. An endpoint can be configured in one of two ways:

  • Internal only – Should be accessible from your organization's network, but not the public internet.
  • Internet facing (public endpoint) – Should be accessible from the public internet as well as your organization's internal network.

Note: These requirements are in addition to the port requirements for CDP's default security group, noted here: Management Console - Security groups.
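
As a sketch, an internal-only endpoint might be opened to your corporate network with a security group rule like the following; the security group ID and CIDR are placeholders:

```shell
# Allow HTTPS (443) from a hypothetical corporate network range.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 443 \
  --cidr 203.0.113.0/24
```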

3. Review the default AWS service limits and your current AWS account limits

AWS imposes default service limits on each user account. Make sure you review your account's current usage status and resource limits before you start provisioning additional resources for CDP and CDE.

For example, depending on your AWS account, you may only be allowed to provision a certain number of EC2 instances. Be sure to review your AWS service limits before you proceed.

Related AWS documentation: AWS Service Limits, Amazon EC2 Resource Limits.

4. Review supported AWS regions

CDP supports the following AWS regions: Supported AWS regions. However, the CDE service also requires Amazon Elastic Kubernetes Service (EKS), so make sure you select a region in which EKS is available.

Related AWS documentation: Region Table.

5. Set up an AWS Cloud Credential

Create a role-based AWS credential that allows CDP to authenticate with your AWS account and authorizes it to provision AWS resources on your behalf. Role-based authentication uses an IAM role with an attached IAM policy that has the minimum permissions required to use CDP.

Once you have created the IAM role and policy, register the role in CDP as a cloud credential. Reference this credential when you register an AWS environment in CDP, as described in the next step.

Instructions: CDP Cloud Credential for AWS
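
For reference, the cross-account role's trust policy follows the standard AWS cross-account pattern sketched below; the account ID and external ID shown are placeholders for the values CDP provides when you create the credential:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<CLOUDERA_AWS_ACCOUNT_ID>:root"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": { "sts:ExternalId": "<EXTERNAL_ID_FROM_CDP>" }
      }
    }
  ]
}
```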

6. Register an AWS Environment in CDP

A CDP user must have the Power User role in order to register an environment. An environment determines the specific cloud provider region and virtual network in which resources can be provisioned, and includes the credential that should be used to access the cloud provider account.

Instructions: Register an AWS environment

7. CDE Role Requirements

There are two CDP user roles associated with the CDE service: DEAdmin and DEUser. Any CDP user with the EnvironmentAdmin (or higher) access level must assign these roles to users who require access to the Cloudera Data Engineering console within their environment.

In addition, allowing users to log in to provisioned workspaces and run workloads on them needs to be configured separately.

8. Set up the AWS account to run kubectl commands

  1. In the AWS console, create an IAM user (for example, kubectl-user) with Programmatic access. You do not need to grant any permissions.
  2. Note the User ARN, copy the Access key ID and Secret access key, and add a profile to ~/.aws/credentials as follows:
    [kubectl-user]

    aws_access_key_id = <Access Key ID>
    aws_secret_access_key = <Secret access key>
  3. Navigate to IAM Roles and edit the cross-account IAM role (note the Role ARN) that was created as part of the CDP prerequisites.
  4. Navigate to Trust relationships > Edit trust relationships.
  5. Add the following statement to the Statement array of the policy document, then click Update trust policy.
     {
       "Effect": "Allow",
       "Principal": {
         "AWS": "<User ARN from step 2>"
       },
       "Action": "sts:AssumeRole"
     }
  6. Download the kubeconfig file from the CDE UI and save it ( ~/.kube/cde-env1-kube-config, for example), then run the following shell commands:
    $ export AWS_PROFILE=kubectl-user
    $ unset AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN
    $ cred=$(aws sts assume-role --role-arn <Role ARN from step 3> --role-session-name test | jq .Credentials)
    $ export AWS_ACCESS_KEY_ID=$(echo "$cred" | jq -r .AccessKeyId)
    $ export AWS_SECRET_ACCESS_KEY=$(echo "$cred" | jq -r .SecretAccessKey)
    $ export AWS_SESSION_TOKEN=$(echo "$cred" | jq -r .SessionToken)
    $ export KUBECONFIG=~/.kube/cde-env1-kube-config
    $ export TILLER_NAMESPACE=tiller
  7. You should now be able to run kubectl commands.
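
A quick way to confirm connectivity, assuming the environment variables from step 6 are still set in your shell:

```shell
# List cluster nodes and namespaces to confirm kubectl can reach the EKS cluster.
kubectl get nodes
kubectl get namespaces
```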

9. Browser Requirements

Supported browsers:

  • Chrome
  • Safari

Unsupported browsers:

  • Firefox