HDFS replication policy considerations
Before you create an HDFS replication policy, you must understand how source data is affected if it is added or deleted during replication, the impact of network latency, performance and scalability limitations, guidelines for snapshot diff-based replication, and when to bypass Sentry ACLs during replication.
Source data during replication
When a replication job runs, ensure that the source directory is not modified. A file added during replication is not replicated. If you delete a file during replication, the replication fails.

Network latency during replication
High latency among clusters can cause replication jobs to run more slowly, but does not cause them to fail.

Performance and scalability limitations for HDFS replication policies
HDFS replication has some performance and scalability limitations.

Guidelines for using snapshot diff-based replication
By default, Replication Manager uses snapshot differences ("diff") to improve performance by comparing HDFS snapshots and replicating only the files that have changed in the source directory. While Hive metadata requires a full replication, the data stored in Hive tables can take advantage of snapshot diff-based replication.

HDFS replication from Sentry-enabled clusters
When you run an HDFS replication policy on a Sentry-enabled source cluster, the replication policy copies files and tables along with their permissions. Cloudera Manager version 6.3.1 or later is required to run HDFS replication policies on a Sentry-enabled source cluster.
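Snapshot diff-based replication depends on HDFS snapshots being enabled on the source directory. A minimal sketch of the underlying mechanism using the standard HDFS CLI is shown below; the path /data/source and the snapshot names s1 and s2 are placeholders, not values from this document:

```shell
# Allow snapshots on the source directory (requires HDFS superuser privileges)
hdfs dfsadmin -allowSnapshot /data/source

# Take a snapshot, let the source data change, then take another
hdfs dfs -createSnapshot /data/source s1
# ... files are added, modified, or deleted in /data/source ...
hdfs dfs -createSnapshot /data/source s2

# Report only the differences between the two snapshots; Replication Manager
# uses this kind of diff to limit replication to changed files
hdfs snapshotDiff /data/source s1 s2
```

The diff output lists created, deleted, modified, and renamed paths, which is why a snapshot diff-based policy can avoid re-copying unchanged files.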