Enabling Erasure Coding
Before You Begin
Before you enable Erasure Coding (EC), perform the following tasks:
- Note the limitations for EC.
- Verify that the clusters run CDH 6.1.0 or higher.
- Determine which EC policy you want to use. See Understanding Erasure Coding Policies.
- Determine whether you want to use EC for existing data or new data. To use EC for existing data, you need to copy that data into a directory that has an EC policy set, using distcp or Cloudera Manager's Backup and Disaster Recovery (BDR).
- Verify that your cluster setup meets the rack and node requirements described in Best Practices for Rack and Node Setup for EC.
Limitations
EC does not support the following:
- XOR codecs
- Certain HDFS functions: hflush, hsync, concat, setReplication, truncate, and append. For more information, see Erasure Coding Limitations.
Using Erasure Coding for Existing Data
The procedure described in this section explains how to use EC for existing data. To customize or tune EC, see Advanced Erasure Coding Configuration.
- Create a new directory or choose an existing directory.
- View the supported EC policies with the following command:
hdfs ec -listPolicies
- Enable a supported EC policy:
hdfs ec -enablePolicy -policy <policy>
- Set the EC policy for the directory you want to use with the following command:
hdfs ec -setPolicy -path <directory> [-policy <policyName>]
- path. Required. Specify the HDFS directory you want to apply the EC policy to.
- policy. Optional. The EC policy to use for the directory you specified. If you omit this parameter, the policy configured in the Cloudera Manager setting Default Policy when Setting Erasure Coding is used.
- Copy the data to the directory you set an EC policy for. You can use the distcp tool or Cloudera Manager's Backup and Disaster Recovery process.
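The steps above can be sketched as a small shell script. This is a minimal sketch under stated assumptions, not a definitive procedure: the paths /data/warm and /data/warm_ec are hypothetical, RS-6-3-1024k is used only as an example policy, and the DRY_RUN switch and the run and ec_existing_data helpers are illustrative names that do not come from the product.

```shell
#!/usr/bin/env bash
# Sketch: move existing data into an erasure-coded directory.
# SRC_DIR, EC_DIR, and EC_POLICY are hypothetical example values.
SRC_DIR="${SRC_DIR:-/data/warm}"
EC_DIR="${EC_DIR:-/data/warm_ec}"
EC_POLICY="${EC_POLICY:-RS-6-3-1024k}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands (default); 0 = run them

# Print the command, and execute it unless DRY_RUN is 1.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" != "1" ]; then
    "$@"
  fi
}

# Run the documented steps in order; stop at the first failing command.
ec_existing_data() {
  # Create the target directory.
  run hdfs dfs -mkdir -p "$EC_DIR" &&
  # View the supported EC policies.
  run hdfs ec -listPolicies &&
  # Enable the chosen policy.
  run hdfs ec -enablePolicy -policy "$EC_POLICY" &&
  # Set the policy on the target directory.
  run hdfs ec -setPolicy -path "$EC_DIR" -policy "$EC_POLICY" &&
  # Copy the existing data into the EC directory.
  run hadoop distcp "$SRC_DIR" "$EC_DIR"
}

ec_existing_data
```

With the default DRY_RUN=1 the script only prints the five commands, which makes it possible to review them before running with DRY_RUN=0 on a real cluster.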
Using Erasure Coding for New Data
The procedure described in this section explains how to use EC with new data. To customize or tune EC, see Advanced Erasure Coding Configuration.
To use EC for new data, complete the following steps:
- Create a new directory or choose an existing directory.
- View the supported EC policies with the following command:
hdfs ec -listPolicies
- Enable a supported EC policy:
hdfs ec -enablePolicy -policy <policy>
- Set the EC policy for the directory you want to use with the following command:
hdfs ec -setPolicy -path <directory> [-policy <policyName>]
- path. Required. Specify the HDFS directory you want to apply the EC policy to.
- policy. Optional. The EC policy to use for the directory you specified. If you omit this parameter, the policy configured in the Cloudera Manager setting Default Policy when Setting Erasure Coding is used.
- Set the destination for the data to the directory you enabled EC for. No further action is required. Data written to the directory is erasure coded based on the policy you set.
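As a quick check before new data is written, the policy on the directory can be read back with hdfs ec -getPolicy. The sketch below assumes that the command prints the policy name as one line of its output, which should be confirmed on your cluster. The path /data/new_ec and the helper name has_ec_policy are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: confirm that a directory carries the expected EC policy.
# has_ec_policy is a hypothetical helper; it assumes `hdfs ec -getPolicy`
# prints the policy name as one whole line of its output.
has_ec_policy() {
  local dir="$1" expected="$2"
  hdfs ec -getPolicy -path "$dir" 2>/dev/null | grep -qxF -- "$expected"
}

# Example usage (hypothetical path and policy), run only if hdfs is installed.
if command -v hdfs >/dev/null 2>&1; then
  if has_ec_policy /data/new_ec RS-6-3-1024k; then
    echo "new files under /data/new_ec will be erasure coded"
  else
    echo "policy not set as expected on /data/new_ec" >&2
  fi
fi
```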
Using ISA-L
Intel Intelligent Storage Acceleration Library (ISA-L) is an open-source collection of optimized low-level functions used for storage applications. The library can improve EC performance when the Reed-Solomon (RS) codecs are used. ISA-L is packaged and shipped with CDH, and it is enabled by default.
You can verify that it is being used by running the following command:
hadoop checknative
The command returns the libraries that the cluster uses. ISA-L should be listed as one of the enabled libraries. The following example shows what the command might return:
18/07/06 11:06:08 INFO bzip2.Bzip2Factory: Successfully loaded & initialized native-bzip2 library system-native
18/07/06 11:06:08 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
Native library checking:
hadoop: true /opt/cloudera/parcels/CDH-6.x-1.cdh6.x.p0.460066/lib/hadoop/lib/native/libhadoop.so.1.0.0
zlib: true /lib64/libz.so.1
zstd : false
snappy: true /opt/cloudera/parcels/CDH-6.x-1.cdh6.x.p0.460066/lib/hadoop/lib/native/libsnappy.so.1
lz4: true revision:10301
bzip2: true /lib64/libbz2.so.1
openssl: true /lib64/libcrypto.so
ISA-L: true /opt/cloudera/parcels/CDH-6.x-1.cdh6.x.p0.460066/lib/hadoop/lib/native/libisal.so.2
Note the last line, which indicates that ISA-L is enabled: ISA-L: true /opt/cloudera/parcels/CDH-6.x-1.cdh6.x.p0.460066/lib/hadoop/lib/native/libisal.so.2
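The same check can be scripted, for example as part of a host validation step. This is a minimal sketch: isal_enabled is a hypothetical helper name, and it assumes the ISA-L line has the same shape as in the example output above.

```shell
#!/usr/bin/env bash
# Sketch: succeed only if checknative output (read from stdin)
# contains a line reporting ISA-L as true.
# isal_enabled is a hypothetical helper name.
isal_enabled() {
  grep -Eq '^ISA-L:[[:space:]]*true'
}

# Example usage, run only where the hadoop command is available.
# 2>&1 merges any log lines written to stderr into the scanned text.
if command -v hadoop >/dev/null 2>&1; then
  if hadoop checknative 2>&1 | isal_enabled; then
    echo "ISA-L is enabled"
  else
    echo "ISA-L is not reported as enabled" >&2
  fi
fi
```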
Advanced Erasure Coding Configuration
The information in this section is optional. All the required steps are described in Using Erasure Coding for Existing Data or Using Erasure Coding for New Data.
You can customize the behavior of EC through a combination of the hdfs ec subcommand and the Cloudera Manager Admin Console.
For information about the hdfs ec subcommand, see the Erasure Coding Administrative Commands documentation.
You can configure certain properties for EC with the Cloudera Manager Admin Console:
- Select Clusters and choose the HDFS cluster you want to configure.
- Navigate to the Configuration tab and select the Erasure Coding category.
- Configure the advanced EC properties.
- DataNode Striped Read Timeout: DataNode reconstruction striped read timeout in milliseconds.
- DataNode Striped Read Threads: Number of threads used by the DataNode to read striped blocks during background reconstruction work.
- Erasure Coding Reconstruction Weight: Relative weight of the resources used by EC background recovery tasks compared to replicated block recovery. An EC recovery task requires reading multiple blocks (6 in the case of RS-6-3-1024k), whereas replicated block recovery requires reading only a single replica. The number of blocks that must be read to complete recovery is multiplied by this weight to determine the total weight of the recovery task. These units of weight count against the limit set in the dfs.namenode.replication.max-streams property, so higher values result in fewer reconstruction tasks being able to run concurrently.
- Default Policy when Setting Erasure Coding: The erasure coding policy used when enabling erasure coding for a directory without specifying a policy.
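The multiplication described for Erasure Coding Reconstruction Weight can be worked through with a short calculation. This only illustrates the arithmetic stated above: the weight of 0.5 and the limit of 4 are made-up example values, not documented defaults, and the helper names are hypothetical. The assumption that each task's weight is simply subtracted from the limit is a simplification; how the cluster rounds or enforces these numbers is not covered here.

```shell
#!/usr/bin/env bash
# Sketch: task weight = blocks read during recovery * reconstruction weight.
ec_task_weight() {
  awk -v blocks="$1" -v weight="$2" 'BEGIN { print blocks * weight }'
}

# Rough count of tasks that fit under a limit, ASSUMING each task's
# weight is simply counted against that limit. Fails if the weight is not positive.
ec_tasks_under_limit() {
  awk -v limit="$1" -v task="$2" 'BEGIN {
    if (task <= 0) exit 1
    print int(limit / task)
  }'
}

# RS-6-3-1024k reads 6 blocks; 0.5 and 4 are illustrative values only.
w="$(ec_task_weight 6 0.5)"
echo "EC recovery task weight: $w"
echo "EC recovery tasks under a limit of 4: $(ec_tasks_under_limit 4 "$w")"
```

Doubling the example weight to 1 doubles the task weight to 6, which no longer fits under the example limit of 4 in this simplified model. That matches the statement that higher values allow fewer concurrent reconstruction tasks.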