Enabling Erasure Coding

Before You Begin

Before you enable Erasure Coding (EC), perform the following tasks:

  • Note the limitations for EC.
  • Verify that the clusters run CDH 6.1.0 or higher.
  • Determine which EC policy you want to use: Understanding Erasure Coding Policies
  • Determine if you want to use EC for existing data or new data. If you want to use EC for existing data, you need to replicate that data with distcp or BDR.
  • Verify that your cluster setup meets the rack and node requirements described in Best Practices for Rack and Node Setup for EC.

Limitations

EC does not support the following:
  • XOR codecs
  • Certain HDFS functions: hflush, hsync, concat, setReplication, truncate and append. For more information, see Erasure Coding Limitations.

Using Erasure Coding for Existing Data

The procedure described in this section explains how to use EC for existing data. To customize or tune EC, see Advanced Erasure Coding Configuration.

To use EC for existing data, complete the following procedure:
  1. Create a new directory or choose an existing directory.
  2. View the supported EC policies with the following command:
    hdfs ec -listPolicies
  3. Enable a supported EC policy:
    hdfs ec -enablePolicy -policy <policy>
  4. Set the EC policy for the directory you want to use with the following command:
    hdfs ec -setPolicy -path <directory> [-policy <policyName>]
    • path. Required. Specify the HDFS directory you want to apply the EC policy to.
    • policy. Optional. The EC policy you want to use for the directory you specified. If you do not provide this parameter, the EC policy you specified in the Default Policy when Setting Erasure Coding setting from Cloudera Manager is used.
  5. Copy the data to the directory you set an EC policy for. You can use the distcp tool or Cloudera Manager's Backup and Disaster Recovery process.

Using Erasure Coding for New Data

The procedure described in this section explains how to use EC with new data. To customize or tune EC, see Advanced Erasure Coding Configuration.

To use EC for new data, complete the following steps:

  1. Create a new directory or choose an existing directory.
  2. View the supported EC policies with the following command:
    hdfs ec -listPolicies
  3. Enable a supported EC policy:
    hdfs ec -enablePolicy -policy <policy>
  4. Set the EC policy for the directory you want to use with the following command:
    hdfs ec -setPolicy -path <directory> [-policy <policyName>]
    • path. Required. Specify the HDFS directory you want to apply the EC policy to.
    • policy. Optional. The EC policy you want to use for the directory you specified. If you do not provide this parameter, the EC policy you specified in the Fallback Erasure Coding Policy setting from Cloudera Manager is used.
  5. Set the destination for the data to the director you enabled EC for. No action beyond that is required. When data is written to the directory, it will be erasure coded based on the policy you set.

Using ISA-L

Intel Intelligent Storage Acceleration Library (ISA-L) is an open-source collection of optimized low-level functions used for storage applications. The library can improve EC performance when the Reed-Solomon (RS) codecs are used. ISA-L is packaged and shipped with CDH. Additionally it is enabled by default. .

You can verify that it is being used by running the following command:

hadoop checknative

The command returns the libraries that the cluster uses. ISA-L should be listed as one of the enabled libraries. The following example shows what the command might return:

18/07/06 11:06:08 INFO bzip2.Bzip2Factory: Successfully loaded & initialized native-bzip2 library system-native
18/07/06 11:06:08 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
Native library checking:
hadoop: true /opt/cloudera/parcels/CDH-6.x-1.cdh6.x.p0.460066/lib/hadoop/lib/native/libhadoop.so.1.0.0
zlib: true /lib64/libz.so.1
zstd : false
snappy: true /opt/cloudera/parcels/CDH-6.x-1.cdh6.x.p0.460066/lib/hadoop/lib/native/libsnappy.so.1
lz4: true revision:10301
bzip2: true /lib64/libbz2.so.1
openssl: true /lib64/libcrypto.so
ISA-L: true /opt/cloudera/parcels/CDH-6.x-1.cdh6.x.p0.460066/lib/hadoop/lib/native/libisal.so.2

Note the last line that indicates ISA-L: true /opt/cloudera/parcels/CDH-6.x-1.cdh6.x.p0.460066/lib/hadoop/lib/native/libisal.so.2

Advanced Erasure Coding Configuration

The information in this section is optional. All the required steps are described in Using Erasure Coding for Existing Data or Using Erasure Coding for New Data.

You can customize the behavior of EC through a combination of the hdfs ec subcommand and the Cloudera Manager Admin Console.

For information about the hdfs ec see the Erasure Coding Administrative Commands documentation.

You can configure certain properties for EC with the Cloudera Manager Admin Console:

  1. Select Clusters and choose the HDFS cluster you want to configure.
  2. Navigate to the Configuration tab and select the Erasure Coding category.
  3. Configure the advanced EC properties.
    • DataNode Striped Read Timeout: The timeout for striped reads during background data reconstruction in milliseconds.
    • DataNode Striped Read Threads: The number of threads that a DataNode can use during background data reconstruction.
    • Erasure Coding Reconstruction Weight: The relative weight of resources used by EC for data recovery. The number of blocks that must be read is based on the EC policy used. For example, RS-6-3-1024k requires six blocks to be read. Replication only requires one block to be read. Higher values result in fewer reconstruction tasks being able to run concurrently. The number of blocks required to be read to recover data is multiplied by this weight to determine the total weight of the recovery task. The total weight of the recovery task counts against the limit set with the dfs.namenode.replication.max-streams property.
    • Fallback Erasure Coding Policy: The fallback Erasure Coding policy that HDFS uses if no policy is specified when you run the -setPolicy command.