Data Durability

Overview

CDH provides two options for data durability (that is, how resilient data is to loss) when data is stored in HDFS: 3x replication, which HDFS was originally built on, and Erasure Coding (EC). With 3x replication, HDFS creates two copies of each block, resulting in three total instances of the data. These copies are stored on separate DataNodes to guard against data loss when a node is unreachable. When the data stored on a node is lost or inaccessible, it is replicated from one of the other nodes to a new node so that there are always three copies. The replication factor is configurable, but the default is three. Cloudera recommends keeping the replication factor at three or higher when you have three or more DataNodes; a lower replication factor can lead to data loss.
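
The replication factor is controlled cluster-wide by the dfs.replication property and can also be changed per file. The following minimal sketch, which assumes the standard Hadoop FileSystem Java API and a hypothetical file path, shows how a per-file replication factor might be set programmatically:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetReplicationExample {
        public static void main(String[] args) throws Exception {
            // Loads core-site.xml/hdfs-site.xml from the classpath;
            // dfs.replication (default 3) controls the cluster-wide default.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical path; raise or lower the replication factor for one file.
            Path file = new Path("/data/important/part-00000");
            boolean scheduled = fs.setReplication(file, (short) 3);
            System.out.println("Re-replication scheduled: " + scheduled);

            fs.close();
        }
    }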

EC is an alternative to the 3x replication scheme. When an HDFS cluster uses EC, no additional copies of the data are generated. Instead, data is striped into blocks and encoded to generate parity blocks. If data is missing or corrupt, HDFS uses the remaining data and parity blocks to reconstruct the missing pieces in the background. This process provides a level of data durability similar to replication but at a lower storage cost. With both data protection schemes, replication and EC, recovery happens in the background and requires no direct input from a user.
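
To illustrate the idea behind parity (a conceptual sketch only, not how the HDFS codecs are implemented), the following example uses simple XOR parity, the simplest of the codecs described later: one parity cell is computed from the data cells, and any single lost cell can be rebuilt from the cells that remain.

    public class XorParitySketch {
        // Compute one parity cell as the XOR of all data cells (equal-length cells).
        static byte[] parity(byte[][] dataCells) {
            byte[] p = new byte[dataCells[0].length];
            for (byte[] cell : dataCells) {
                for (int i = 0; i < p.length; i++) {
                    p[i] ^= cell[i];
                }
            }
            return p;
        }

        public static void main(String[] args) {
            byte[][] cells = { "stripe".getBytes(), "d data".getBytes(), "cells!".getBytes() };
            byte[] p = parity(cells);

            // Simulate losing cell 1, then rebuild it from the survivors plus the parity cell.
            byte[][] survivors = { cells[0], cells[2], p };
            byte[] rebuilt = parity(survivors);      // XOR of survivors and parity == lost cell
            System.out.println(new String(rebuilt)); // prints "d data"
        }
    }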

EC can be the only data protection policy in effect, or it can be used alongside 3x replication in a hybrid deployment. This decision should be based on the temperature of the data stored in HDFS, that is, how often the data is accessed. Additionally, EC is applied only when data is written: to use EC, you must either copy existing data into directories that have an EC policy set or write new data to such directories, as in the sketch below. For 3x replication, no additional steps are required.
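
For example, the following sketch assumes the Hadoop 3 DistributedFileSystem Java API and a hypothetical /data/cold directory; it sets an EC policy on the directory so that data written there afterwards is erasure coded:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;

    public class SetEcPolicySketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // EC operations are exposed on DistributedFileSystem, not the generic FileSystem;
            // this assumes fs.defaultFS points at an HDFS cluster.
            DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

            // Hypothetical directory for cold data; the policy must already be
            // enabled on the cluster (RS-6-3-1024k is a built-in policy).
            Path coldDir = new Path("/data/cold");
            dfs.mkdirs(coldDir);
            dfs.setErasureCodingPolicy(coldDir, "RS-6-3-1024k");

            // Files written under /data/cold from now on are erasure coded;
            // existing files elsewhere keep 3x replication until copied into this directory.
            System.out.println("Policy: " + dfs.getErasureCodingPolicy(coldDir));

            dfs.close();
        }
    }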

Understanding Erasure Coding Policies

The EC policy determines how data is encoded and decoded. An EC policy is made up of the following parts: codec-number of data blocks-number of parity blocks-cell size.

  • Codec: The erasure codec that the policy uses. It can be XOR or Reed-Solomon (RS).
  • Number of Data Blocks: The number of data blocks per stripe. The higher this number, the more nodes that need to be read when accessing data because HDFS attempts to distribute the blocks evenly across DataNodes.
  • Number of Parity Blocks: The number of parity blocks per stripe. Even if a file does not use up all the data blocks available to it, the number of parity blocks will always be the total number listed in the policy.
  • Cell Size: The size of one basic unit of striped data.

For example, an RS-6-3-1024k policy has the following attributes:

  • Codec: Reed-Solomon
  • Number of Data Blocks: 6
  • Number of Parity Blocks: 3
  • Cell Size: 1024k
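
Because the naming scheme is mechanical, a policy name can be decomposed directly. The following sketch (a hypothetical helper, not part of the HDFS API) splits a policy name into its four parts and derives the data stripe width used in the sizing discussion below:

    public class EcPolicyNameSketch {
        public static void main(String[] args) {
            // codec-dataBlocks-parityBlocks-cellSize, for example RS-6-3-1024k
            String policy = "RS-6-3-1024k";
            String[] parts = policy.split("-");

            String codec = parts[0];                       // "RS" or "XOR"
            int dataBlocks = Integer.parseInt(parts[1]);   // data blocks per stripe
            int parityBlocks = Integer.parseInt(parts[2]); // parity blocks per stripe
            String cellSize = parts[3];                    // striping cell size, 1024k = 1 MB

            int stripeWidth = dataBlocks + parityBlocks;   // 9 for RS-6-3-1024k
            System.out.printf("%s: %d data + %d parity = stripe width %d, cell %s%n",
                    codec, dataBlocks, parityBlocks, stripeWidth, cellSize);
        }
    }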

The sum of the number of data blocks and parity blocks is the data stripe width. When you make hardware plans for your cluster, the number of racks should at least equal the stripe width in order for the data to be resistant to rack failures. Ideally, the number of racks exceeds the data stripe width to account for downtime and outages. If there are fewer racks than the data stripe width, HDFS spreads data blocks across multiple nodes to maintain fault tolerance at the node level. When distributing blocks to racks, HDFS attempts to distribute the blocks evenly across all racks. Because of this behavior, Cloudera recommends setting up each rack with a similar number of DataNodes. Otherwise, racks with fewer DataNodes may be filled up faster than racks with more DataNodes.

To achieve node-level fault tolerance, the number of nodes needs to equal the data stripe width. For example, in order for an RS-6-3-1024k policy to be node failure tolerant, you need at least nine nodes. For rack-level fault tolerance, spread the nine nodes evenly across three racks. The data and parity blocks, when distributed evenly, lead to the following placement on the racks:
  • Rack 1: Three blocks
  • Rack 2: Three blocks
  • Rack 3: Three blocks

Cloudera recommends at least nine racks, though, which leads to one data or parity block on each rack.
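
The same reasoning can be written out as arithmetic. The following sketch, which assumes the blocks of a stripe are distributed evenly across racks, shows why both the three-rack and nine-rack layouts above tolerate the loss of a single rack for RS-6-3-1024k:

    public class RackToleranceSketch {
        public static void main(String[] args) {
            int dataBlocks = 6, parityBlocks = 3;          // RS-6-3-1024k
            int stripeWidth = dataBlocks + parityBlocks;   // 9 blocks per stripe

            for (int racks : new int[] {3, 9}) {
                // With even distribution, each rack holds at most this many blocks of a stripe.
                int blocksPerRack = (int) Math.ceil((double) stripeWidth / racks);
                // Losing one rack loses blocksPerRack blocks; reconstruction is possible
                // as long as no more than parityBlocks blocks are missing.
                boolean survivesOneRack = blocksPerRack <= parityBlocks;
                System.out.printf("%d racks: up to %d blocks per rack, survives a rack failure: %b%n",
                        racks, blocksPerRack, survivesOneRack);
            }
        }
    }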

A policy with a wide data stripe like RS-6-3-1024k comes with a tradeoff, though: data must be read from six blocks, which increases read time. Therefore, the larger the cluster and the colder the data, the more appropriate it is to use EC policies with wide data stripes.

Comparing Replication and Erasure Coding

Consider the following factors when you examine which data protection scheme to use:

Data Temperature
Data temperature refers to how often data is accessed. EC works best with cold data that is accessed and modified infrequently. Replication is more suitable for hot data, data that is accessed and modified frequently.
I/O Cost
EC has higher I/O costs than replication for the following reasons:
  • EC spreads data across nodes and racks, which means reading and writing data comes at a higher cost.
  • Parity blocks are generated when data is written, which impacts write speed.
  • If data is missing or corrupt, a DataNode reads the remaining data and parity blocks in order to reconstruct the data. This process requires CPU and network resources.

Cloudera recommends at least a 10 Gb/s network connection if you want to use EC.

Storage Cost
EC has a lower storage overhead than replication because multiple copies of data are not maintained. Instead, a number of parity blocks are generated based on the EC policy. For the same amount of data, EC stores fewer blocks than 3x replication in most cases. For example, with an RS (10,4) policy, HDFS stores four parity blocks for each set of 10 data blocks, 14 blocks in total, whereas 3x replication stores each of those 10 data blocks three times, 30 blocks in total. The one case where 3x replication can require fewer blocks is when data is stored in small files.
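
As a back-of-the-envelope comparison (setting aside the small-file case discussed under File Size), the following sketch computes the raw storage used by the RS (10,4) example versus 3x replication:

    public class StorageOverheadSketch {
        public static void main(String[] args) {
            int dataBlocks = 10;

            // RS (10,4): 4 parity blocks per 10 data blocks.
            int ecParity = 4;
            int ecTotal = dataBlocks + ecParity;                 // 14 blocks stored
            double ecOverhead = (double) ecTotal / dataBlocks;   // 1.4x raw storage

            // 3x replication: every block is stored three times.
            int repTotal = dataBlocks * 3;                       // 30 blocks stored
            double repOverhead = 3.0;

            System.out.printf("RS (10,4): %d blocks stored, %.1fx raw storage%n", ecTotal, ecOverhead);
            System.out.printf("3x replication: %d blocks stored, %.1fx raw storage%n", repTotal, repOverhead);
        }
    }
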
File Size
Erasure coding works best with larger files. The total number of blocks is determined by data blocks + parity blocks, which is the data stripe width discussed earlier.

With RS (6,3), each block group can hold up to (128 MB * 6) = 768 MB of data, where 128 MB is the default block size. Inside each block group there are 9 blocks in total: 6 data blocks, each holding up to 128 MB, and 3 parity blocks. For the best utilization of blocks, the size of an erasure-coded file should be close to 768 MB. For a chunk of data smaller than the block size, HDFS uses one data block, but the full number of parity blocks is still required. This leads to a situation where erasure-coded files generate more blocks than 3x replication because of the required parity blocks.
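
To make the small-file effect concrete, the following sketch assumes the 128 MB default block size and an RS (6,3) policy, and counts the blocks HDFS allocates for a 1 MB file and a 768 MB file under each scheme (the data-block count is simplified for sizes in between):

    public class SmallFileBlockCountSketch {
        static final long BLOCK = 128L * 1024 * 1024;  // default HDFS block size, 128 MB

        // Blocks used by 3x replication: one block per 128 MB, each stored three times.
        static long replicatedBlocks(long fileSize) {
            long dataBlocks = Math.max(1, (fileSize + BLOCK - 1) / BLOCK);
            return dataBlocks * 3;
        }

        // Blocks used by RS (6,3): data blocks as needed, plus 3 parity blocks per
        // block group of up to 6 data blocks, even for tiny files.
        // (Simplified: exact counts for in-between sizes depend on how cells are striped.)
        static long erasureCodedBlocks(long fileSize) {
            long dataBlocks = Math.max(1, (fileSize + BLOCK - 1) / BLOCK);
            long blockGroups = (dataBlocks + 5) / 6;   // each group holds up to 768 MB
            return dataBlocks + blockGroups * 3;
        }

        public static void main(String[] args) {
            long small = 1L * 1024 * 1024;     // 1 MB file
            long large = 768L * 1024 * 1024;   // fills one block group exactly
            System.out.printf("1 MB file:   replication %d blocks, EC %d blocks%n",
                    replicatedBlocks(small), erasureCodedBlocks(small));
            System.out.printf("768 MB file: replication %d blocks, EC %d blocks%n",
                    replicatedBlocks(large), erasureCodedBlocks(large));
        }
    }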