Erasure coding overview
Data durability describes how resilient data is to loss. When data is stored in HDFS, CDP provides two options for data durability. You can use replication, which HDFS was originally built on, or Erasure Coding (EC).
- Replication
- HDFS creates two copies of the data, resulting in three total instances of the data. These copies are stored on separate DataNodes to guard against data loss when a node is unreachable. When the data stored on a node is lost or inaccessible, it is replicated from one of the other nodes to a new node so that there are always multiple copies. The replication factor is configurable, but the default is three. Cloudera recommends keeping the replication factor at three or higher when you have three or more DataNodes. A lower replication factor leaves the data more vulnerable to DataNode failures because fewer copies are spread across fewer DataNodes.
When data is written to an HDFS cluster that uses replication, additional copies of the data are automatically created. No additional steps are required.
Replication supports all data processing engines that CDP supports.
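As an illustration, the replication factor of existing data can be inspected or changed with the standard HDFS shell; this is a minimal sketch, and the path /user/example/data used below is hypothetical.

```bash
# Write a file; HDFS replicates its blocks automatically (default factor: 3).
hdfs dfs -put localfile.csv /user/example/data/

# Print the replication factor of the file (%r is the replication count).
hdfs dfs -stat "%r" /user/example/data/localfile.csv

# Change the replication factor for an existing path; -w waits until the
# target replication is reached before returning.
hdfs dfs -setrep -w 3 /user/example/data/localfile.csv
```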
- Erasure Coding (EC)
- EC is an alternative to replication. When an HDFS cluster uses EC, no additional direct copies of the data are generated. Instead, data is striped into blocks and encoded to generate parity blocks. If there are any missing or corrupt blocks, HDFS uses the remaining data and parity blocks to reconstruct the missing pieces in the background. This process provides a similar level of data durability to 3x replication but at a lower storage cost.
Additionally, EC is applied when data is written. This means that to use EC, you must first create a directory and configure it for EC. Then, you can either replicate existing data or write new data into this directory, as sketched in the example after the list below.
EC supports the following data processing engines:
- Hive
- MapReduce
- Spark
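A command-line sketch of the workflow described above, using the standard `hdfs ec` subcommands; the directory names and the choice of the built-in RS-6-3-1024k Reed-Solomon policy are illustrative assumptions, not requirements.

```bash
# List the erasure coding policies available on the cluster.
hdfs ec -listPolicies

# Enable a built-in Reed-Solomon policy (6 data blocks + 3 parity blocks,
# 1024 KB striping cells); some built-in policies are disabled by default.
hdfs ec -enablePolicy -policy RS-6-3-1024k

# Create a directory and apply the EC policy to it. Only data written
# (or copied) into the directory after this point is erasure coded.
hdfs dfs -mkdir -p /data/cold
hdfs ec -setPolicy -path /data/cold -policy RS-6-3-1024k

# Copy existing replicated data into the EC directory to store it with EC.
hadoop distcp /data/hot/2023 /data/cold/2023
```

Note that setting a policy on a directory does not rewrite files already stored elsewhere; existing data is erasure coded only once it is written or copied into the EC-enabled directory.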
With both data durability schemes, replication and EC, recovery happens in the background and requires no direct input from a user.
HDFS clusters can be configured with a single data durability scheme (3x replication or EC), or with a hybrid scheme where EC-enabled directories co-exist on a cluster with directories protected by the traditional 3x replication model. This decision should be based on the temperature of the data stored in HDFS, that is, how often the data is accessed. Hot data, which is accessed frequently, should use replication. Cold data, which is accessed less frequently, can take advantage of EC's storage savings.
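On a hybrid cluster, you can check which scheme protects a given directory; this sketch assumes the hypothetical paths from the example above.

```bash
# A replicated directory reports no erasure coding policy...
hdfs ec -getPolicy -path /data/hot

# ...while an EC-enabled directory reports the policy applied to it.
hdfs ec -getPolicy -path /data/cold
```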