Recommended Configuration on Amazon Web Services (AWS)
On AWS, Cloudera Data Science Workbench must be used with persistent/long-running Apache Hadoop clusters only.
CDH and Cloudera Manager Hosts
- For instructions on deploying CDH and Cloudera Manager on AWS, refer to the Cloudera Reference Architecture for AWS deployments.
Cloudera Data Science Workbench Hosts
Operations
- Use Cloudera Director to orchestrate operations. Use Cloudera Manager to monitor the cluster.
Networking
- No security group or network restrictions between hosts.
- HTTP connectivity to the corporate network for browser access. Do not use proxies or manual SSH tunnels.
Recommended Instance Types
- m4.4xlarge–m4.16xlarge
In this case, bigger is better. That is, one m4.16xlarge host is better than four m4.4xlarge hosts. AWS pricing scales linearly with instance size, and larger instances have more EBS bandwidth.
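The comparison above can be sketched numerically. The vCPU, memory, and dedicated EBS bandwidth figures below are assumptions taken from AWS's published M4 specifications, not from this guide; verify them against current AWS documentation before relying on them.

```python
# Assumed M4 specifications (from AWS's published instance tables, not this guide).
M4_SPECS = {
    "m4.4xlarge": {"vcpu": 16, "memory_gib": 64, "ebs_mbps": 2000},
    "m4.16xlarge": {"vcpu": 64, "memory_gib": 256, "ebs_mbps": 10000},
}


def fleet_totals(instance_type, count):
    """Return aggregate vCPU, memory, and EBS bandwidth for `count` hosts."""
    spec = M4_SPECS[instance_type]
    return {key: value * count for key, value in spec.items()}


four_small = fleet_totals("m4.4xlarge", 4)
one_large = fleet_totals("m4.16xlarge", 1)

# Same total compute and memory, but the single large host has more EBS
# bandwidth overall, and far more available to any one host.
print(four_small)
print(one_large)
```

Under these assumed figures, both options provide the same total vCPUs and memory, but the single m4.16xlarge host offers more EBS bandwidth than the four smaller hosts combined.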
Storage
- 100 GB root volume block device (gp2) on all hosts
- 500 GB Docker block devices (gp2) on all hosts
- 1 TB Application block device (io1) on master host
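As a rough illustration, the storage layout above could be expressed in the `BlockDeviceMappings` structure that the EC2 RunInstances API accepts. This is a minimal sketch, not a Cloudera-provided template: the function name, the device names, the choice of a single Docker device per host, the 1000 GiB reading of "1 TB", and the io1 IOPS value are all assumptions made for the example.

```python
from typing import Any, Dict, List


def block_device_mappings(is_master: bool) -> List[Dict[str, Any]]:
    """Build an EC2-style BlockDeviceMappings list for one host.

    Device names and the IOPS figure are hypothetical placeholders.
    """
    mappings = [
        # 100 GB root volume (gp2) on all hosts; device name is assumed.
        {"DeviceName": "/dev/sda1",
         "Ebs": {"VolumeSize": 100, "VolumeType": "gp2"}},
        # 500 GB Docker block device (gp2) on all hosts; one device assumed.
        {"DeviceName": "/dev/sdf",
         "Ebs": {"VolumeSize": 500, "VolumeType": "gp2"}},
    ]
    if is_master:
        # 1 TB application block device (io1) on the master host only.
        # io1 volumes require an explicit Iops value; 5000 is a placeholder.
        mappings.append(
            {"DeviceName": "/dev/sdg",
             "Ebs": {"VolumeSize": 1000, "VolumeType": "io1", "Iops": 5000}}
        )
    return mappings


master = block_device_mappings(is_master=True)
worker = block_device_mappings(is_master=False)
```

The resulting lists could be passed as the `BlockDeviceMappings` argument when launching instances with an AWS SDK, after adjusting device names and IOPS to match the chosen AMI and workload.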