Achieving optimal results from a Hadoop implementation begins with choosing the correct hardware and software stacks. The effort invested in the planning stages can pay off dramatically in the performance and total cost of ownership (TCO) of the environment. The following composite system stack recommendations can help organizations in the planning stages:
| Machine Type | Workload Pattern | Storage | Processor (# of Cores) | Memory (GB) | Network |
|---|---|---|---|---|---|
| Slaves | Balanced workload | Four to six 2 TB disks | One quad-core | 24 | 1 GB Ethernet all-to-all |
| Masters | Balanced workload | Four to six 2 TB disks | Dual quad-core | 24 | |
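To see what a slave specification like the one above translates to in usable HDFS capacity, the arithmetic can be sketched as follows. The replication factor, overhead fraction, and 20-node cluster size are assumptions for illustration, not values from the table:

```python
# Rough cluster-capacity sketch for the "balanced workload" slave spec above.
# Assumptions (not from the table): 3x HDFS replication, ~25% of raw disk
# reserved for OS, logs, and intermediate MapReduce output, and a
# hypothetical 20-node cluster.

def usable_hdfs_tb(nodes, disks_per_node, tb_per_disk,
                   replication=3, non_hdfs_overhead=0.25):
    """Estimate usable HDFS capacity in TB for a homogeneous cluster."""
    raw_tb = nodes * disks_per_node * tb_per_disk
    return raw_tb * (1 - non_hdfs_overhead) / replication

# Six 2 TB disks per slave, 20 slaves: 240 TB raw -> 60 TB usable.
print(usable_hdfs_tb(20, 6, 2))  # 60.0
```

The point of the sketch is that replication and overhead shrink raw capacity by roughly a factor of four, which is worth keeping in mind when comparing the storage columns in these tables against actual data-volume requirements.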
| Machine Type | Workload Pattern | Storage | Processor (# of Cores) | Memory (GB) | Network |
|---|---|---|---|---|---|
| Slaves | Balanced workload | Four to six 1 TB disks | Dual quad-core | 24 | Dual 1 GB links for all nodes in a 20-node rack and 2 x 10 GB interconnect links per rack going to a pair of central switches |
| Slaves | Compute-intensive workload | Four to six 1 TB or 2 TB disks | Dual hexa-core | 24-48 | |
| Slaves | I/O-intensive workload | Twelve 1 TB disks | Dual quad-core | 24-48 | |
| Masters | All workload patterns | Four to six 2 TB disks | Dual quad-core | Depends on number of file system objects to be created by NameNode | |
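The master row above leaves memory as a function of the number of file system objects the NameNode must track. A commonly cited rule of thumb is roughly 1 GB of NameNode heap per million HDFS objects (files plus blocks); the constant below is that rule-of-thumb assumption, not a figure from the table, and should be validated against the Hadoop version in use:

```python
# Hedged NameNode heap-sizing sketch for the "depends on number of file
# system objects" cell above. The ~1 GB per million objects ratio is a
# widely quoted rule of thumb, used here as an assumption.

def namenode_heap_gb(num_objects, gb_per_million=1.0):
    """Estimate NameNode heap in GB from the HDFS object count."""
    return num_objects / 1_000_000 * gb_per_million

# 50 million files + blocks -> about 50 GB of heap.
print(namenode_heap_gb(50_000_000))  # 50.0
```

A practical consequence is that many small files inflate the object count (and therefore master memory) far faster than the same data stored in fewer, larger files.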
For Further Reading

- Best Practices for Selecting Apache Hadoop Hardware (Hortonworks blog)
- Hadoop Network and Compute Architecture Considerations by Jacob Rapp, Cisco (Hadoop World 2011 presentation)
- Hadoop network design challenge (Brad Hedlund.com)
- Scott Carey's email on smaller hardware for smaller clusters (email to general@hadoop.apache.org, Wed, 10 Aug 2011 17:24:25 GMT)
- Failure Trends in a Large Disk Drive Population (Google research paper)