Recommended Configuration on Amazon Web Services (AWS)
On AWS, Cloudera Data Science Workbench must only be used with persistent (long-running) Apache Hadoop clusters.
CDH and Cloudera Manager Hosts
- For instructions on deploying CDH and Cloudera Manager on AWS, refer to the Cloudera Reference Architecture for AWS deployments.
Cloudera Data Science Workbench Hosts
- Use Cloudera Director to orchestrate operations. Use Cloudera Manager to monitor the cluster.
- Do not place security group or network restrictions between hosts.
- HTTP connectivity to the corporate network for browser access. Do not use proxies or manual SSH tunnels.
Recommended Instance Types
For instance size, bigger is better. That is, one m4.16xlarge host is better than four m4.4xlarge hosts. AWS pricing scales linearly with instance size, so the cost is comparable, and larger instances have more EBS bandwidth.
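The capacity comparison above can be checked with a little arithmetic. The following is a minimal sketch, assuming the published m4 sizes (16 vCPUs and 64 GiB of memory for m4.4xlarge, 64 vCPUs and 256 GiB for m4.16xlarge). The `M4_SPECS` table and the `total_capacity` helper are illustrative names, not part of any Cloudera or AWS tool.

```python
from typing import Dict

# Assumed published sizes for the two m4 instance types being compared.
M4_SPECS: Dict[str, Dict[str, int]] = {
    "m4.4xlarge": {"vcpus": 16, "memory_gib": 64},
    "m4.16xlarge": {"vcpus": 64, "memory_gib": 256},
}


def total_capacity(instance_type: str, count: int) -> Dict[str, int]:
    """Return the combined vCPUs and memory of `count` hosts of one type."""
    spec = M4_SPECS[instance_type]
    return {name: value * count for name, value in spec.items()}


if __name__ == "__main__":
    four_small = total_capacity("m4.4xlarge", 4)
    one_large = total_capacity("m4.16xlarge", 1)
    # Both layouts provide the same aggregate vCPUs and memory.
    print(four_small == one_large)
```

Because the aggregate capacity is the same in both layouts, the choice comes down to the factors named above: price scales with size, and the single larger host has more EBS bandwidth.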
- 100 GB root volume block device (gp2) on all hosts
- 500 GB Docker block devices (gp2) on all hosts
- 1 TB Application block device (io1) on master host
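The volume layout above could be written down as EC2-style block device mappings. This is a sketch only: it builds plain dictionaries in the shape the EC2 API uses for `BlockDeviceMappings` and does not call AWS. The device names, the single Docker device per host, the 1000 GiB size used for 1 TB, and the io1 IOPS figure are all assumptions chosen for illustration, not values from this guide.

```python
from typing import Any, Dict, List


def block_device_mappings(is_master: bool, app_iops: int = 5000) -> List[Dict[str, Any]]:
    """Build EC2-style block device mappings for one host.

    Device names and app_iops are hypothetical placeholders.
    """
    mappings: List[Dict[str, Any]] = [
        # 100 GB root volume (gp2) on all hosts.
        {"DeviceName": "/dev/sda1", "Ebs": {"VolumeSize": 100, "VolumeType": "gp2"}},
        # 500 GB Docker block device (gp2) on all hosts; one device assumed.
        {"DeviceName": "/dev/sdf", "Ebs": {"VolumeSize": 500, "VolumeType": "gp2"}},
    ]
    if is_master:
        # 1 TB application block device (io1) on the master host only.
        # io1 volumes require a provisioned IOPS value; 5000 is an assumption.
        mappings.append(
            {
                "DeviceName": "/dev/sdg",
                "Ebs": {"VolumeSize": 1000, "VolumeType": "io1", "Iops": app_iops},
            }
        )
    return mappings


if __name__ == "__main__":
    for role, is_master in (("master", True), ("worker", False)):
        print(role, block_device_mappings(is_master))
```

Keeping the master-only application volume behind a flag means one helper can describe every host in the deployment.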