HDFS replication policies

You can use HDFS replication policies in CDP Private Cloud Data Services Replication Manager to copy or replicate HDFS files and directories between CDP Private Cloud Base 7.1.8 or higher clusters. Before you create HDFS replication policies, you must understand the guidelines for the replication process and the limitations of HDFS replication policies.

Guidelines

To replicate HDFS data successfully using HDFS replication policies, ensure that the following guidelines are met during replication:

  • All the source files in the directory are closed, because replication fails if a source file is open.

  • The source directory is not modified during replication, because files added during replication are not replicated, and replication fails if an existing file is deleted during replication.

  • Log files are closed before the next replication job is initiated, because log files are updated during replication.

  • Latency between the source cluster NameNode and the destination cluster NameNode is less than 80 milliseconds for best performance. High latency between clusters can cause replication jobs to run slowly, although the jobs do not fail. You can test the latency using the Linux ping command, as shown in the sketch after this list.
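
The following is a minimal pre-flight sketch, not a Replication Manager feature, that checks two of these guidelines from a host with the HDFS client installed: it lists source files that are still open for write using the hdfs fsck -openforwrite command, and measures the average round-trip latency to the destination NameNode using ping. The source path (/data/source) and NameNode hostname (nn.destination.example.com) are placeholders that you replace with your own values.

    import subprocess

    SOURCE_PATH = "/data/source"                    # placeholder: source directory of the policy
    DEST_NAMENODE = "nn.destination.example.com"    # placeholder: destination cluster NameNode host

    def files_open_for_write(path):
        # hdfs fsck -openforwrite reports files that are still open for write;
        # replication fails for such files.
        result = subprocess.run(
            ["hdfs", "fsck", path, "-openforwrite"],
            capture_output=True, text=True,
        )
        return [line for line in result.stdout.splitlines() if "OPENFORWRITE" in line]

    def average_latency_ms(host, count=5):
        # ping prints a summary line such as:
        # rtt min/avg/max/mdev = 0.051/0.069/0.094/0.015 ms
        result = subprocess.run(
            ["ping", "-c", str(count), host],
            capture_output=True, text=True,
        )
        for line in result.stdout.splitlines():
            if "min/avg/max" in line:
                return float(line.split("=")[1].split("/")[1])
        return None

    if __name__ == "__main__":
        open_files = files_open_for_write(SOURCE_PATH)
        if open_files:
            print("Source files still open for write (close them before replication):")
            for line in open_files:
                print("  " + line)

        latency = average_latency_ms(DEST_NAMENODE)
        if latency is not None and latency >= 80:
            print("Average NameNode latency %.1f ms exceeds the recommended 80 ms." % latency)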

Limitations

  • A single replication job can handle a maximum of 100 million files.

  • A replication policy that runs more frequently than once every 8 hours can handle a maximum of 10 million files. A sketch for checking the source file count against these limits appears after this list.

  • The throughput of a replication job depends on the absolute read and write throughput of the source and destination clusters.
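
Before scheduling a policy, you can check whether the source directory stays within these file-count limits. The following is a minimal sketch that reads the file count reported by the standard hdfs dfs -count command; the source path (/data/source) is a placeholder for your own directory.

    import subprocess

    SOURCE_PATH = "/data/source"    # placeholder: source directory of the policy

    # hdfs dfs -count prints: DIR_COUNT  FILE_COUNT  CONTENT_SIZE  PATHNAME
    result = subprocess.run(
        ["hdfs", "dfs", "-count", SOURCE_PATH],
        capture_output=True, text=True, check=True,
    )
    file_count = int(result.stdout.split()[1])

    if file_count > 100_000_000:
        print(f"{file_count} files: exceeds the 100 million file limit for a single replication job.")
    elif file_count > 10_000_000:
        print(f"{file_count} files: schedule the policy no more often than once every 8 hours.")
    else:
        print(f"{file_count} files: within both file-count limits.")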