Setting Up File System Partitions
Use the following as a base configuration for all nodes in your cluster:
Root partition: OS and core program files
Swap: sized at twice the system memory
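On a Linux host, the recommended swap size can be computed directly from the installed memory. The following is a minimal sketch, assuming a standard `/proc/meminfo` layout:

```shell
# Read total physical memory (in kB) and compute swap at 2x that size.
mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
swap_gb=$(( mem_kb * 2 / 1024 / 1024 ))
echo "System memory: $(( mem_kb / 1024 / 1024 )) GB; recommended swap: ${swap_gb} GB"
```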
Partitioning Recommendations for Slave Nodes
Hadoop slave node partitions: Hadoop should have its own partitions for Hadoop data files and logs. Drives should be formatted with ext3, ext4, or XFS, in that order of preference. HDFS on ext3 has been publicly tested on the Yahoo cluster, which makes it the safest choice for the underlying file system. The ext4 file system may lose data under its default options because of its "delayed allocation" (delayed writes) feature. XFS reportedly also has some data loss issues upon power failure. Do not use LVM; it adds latency and creates a bottleneck.
On slave nodes only, all Hadoop partitions should be mounted individually from drives as "/grid/[0-n]".
Hadoop Slave Node Partitioning Configuration Example:
/ (root) - 20GB (ample room for existing files, future log file growth, and OS upgrades)
/grid/0/ - [full disk GB] first partition for Hadoop to use for local storage
/grid/1/ - second partition for Hadoop to use
/grid/2/ - ...
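The layout above can be expressed in /etc/fstab. The following is an illustrative sketch only: the device names (/dev/sdb1, /dev/sdc1, /dev/sdd1) are assumptions and will differ on real hardware, and ext3 is shown because it is the preferred file system above. The noatime option is a common addition for Hadoop data partitions to avoid access-time write overhead.

```
# Example /etc/fstab entries for Hadoop slave node data partitions
# (device names are hypothetical; substitute your actual drives)
/dev/sdb1    /grid/0    ext3    defaults,noatime    0 0
/dev/sdc1    /grid/1    ext3    defaults,noatime    0 0
/dev/sdd1    /grid/2    ext3    defaults,noatime    0 0
```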
Redundancy (RAID) Recommendations
Master nodes -- Configured for reliability (RAID 10, dual Ethernet cards, dual power supplies, etc.)
Slave nodes -- RAID is not necessary; failures on these nodes are handled automatically by the cluster. All data is stored on at least three different hosts, so redundancy is built in. Slave nodes should be built for speed and low cost.
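The "at least three different hosts" guarantee comes from HDFS block replication, which defaults to a factor of 3. A minimal hdfs-site.xml fragment making that setting explicit might look like this (shown for illustration; the default already provides this value):

```xml
<configuration>
  <!-- Number of hosts each HDFS block is replicated to; 3 is the default. -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```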
Further Reading
The following additional documentation may be useful:
Hortonworks Knowledge-Base article on options for selecting your underlying Linux file system: Best Practices: Linux File Systems for HDFS
CentOS partitioning documentation: Partitioning Your System
Reference architectures from other Hadoop clusters: Hadoop Reference Architectures