Data density per drive

Hard drives come in many sizes today. Popular drive sizes are 1-4 TB, although larger drives are more common now. When picking a drive size, consider the cost per TB, replication storm in case of drive failure, and cluster performance.

  • Lower Cost Per TB: The larger the drive, the cheaper the cost per TB, which makes for lower TCO.
  • Replication Storms: Larger drive means drive failures will produce larger re-replication storms, which can take longer and saturate the network while impacting in-flight workloads.
  • Cluster Performance: In general, drive size has little impact on cluster performance. The exception is when drives have different read/write speeds and a use case that leverages this gain. MapReduce is designed for long sequential reads and writes, so latency timings are generally not as important. HBase can potentially benefit from faster drives, but that is dependent on a variety of factors, such as HBase access patterns and schema design; this also implies acquisition of more nodes. Impala and Cloudera Search workloads can also potentially benefit from faster drives, but for those applications the ideal architecture is to maintain as much data in memory as possible.

Cloudera does not support more than 100 TB per HDFS data node. You can use 12 x 8 TB spindles or 24 x 4 TB spindles. Cloudera does not support drives larger than 8 TB for HDFS data.