Guidelines for snapshot diff-based replication

By default, Replication Manager uses snapshot differences ("snapshot diff") to improve performance: it compares HDFS snapshots and replicates only the files that changed in the source directory.

While Hive metadata requires a full replication, the data stored in Hive tables can take advantage of snapshot diff-based replication.
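
To make the idea concrete, the following is a minimal Python sketch of how a snapshot diff report narrows the work to changed paths. It assumes a report in the style printed by the `hdfs snapshotDiff` command, where each entry starts with `+` (created), `-` (deleted), `M` (modified), or `R` (renamed). The function name `parse_snapshot_diff`, the sample report, and the paths are illustrative only; this is not a description of how Replication Manager is implemented internally.

```python
from typing import Dict, List

# Entry markers used in an `hdfs snapshotDiff` style report.
MARKERS = {"+": "created", "-": "deleted", "M": "modified", "R": "renamed"}


def parse_snapshot_diff(report: str) -> Dict[str, List[str]]:
    """Group the paths in a snapshot diff report by change type."""
    changes: Dict[str, List[str]] = {name: [] for name in MARKERS.values()}
    for line in report.splitlines():
        parts = line.split(None, 1)
        # Skip the header line and anything that is not a diff entry.
        if len(parts) != 2 or parts[0] not in MARKERS:
            continue
        changes[MARKERS[parts[0]]].append(parts[1].strip())
    return changes


# Hypothetical report for illustration; not captured from a real cluster.
SAMPLE_REPORT = """\
Difference between snapshot s1 and snapshot s2 under directory /data/sales:
M\t.
+\t./2024/part-0003
-\t./2023/part-0001
R\t./tmp/part-0002 -> ./2024/part-0002
"""

if __name__ == "__main__":
    for kind, paths in parse_snapshot_diff(SAMPLE_REPORT).items():
        print(kind, paths)
```

Only the paths listed in such a report need to be copied, deleted, or renamed on the target, which is why a diff-based run can be much shorter than a complete replication.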

To use this feature, follow these guidelines:

  • The source and target clusters must be managed by Cloudera Manager 5.15.0 or higher.
  • The source and target clusters must run CDH 5.13.3 or higher, 5.14.2 or higher, or 5.15.0 or higher.
  • Verify that HDFS snapshots are immutable.

    In the Cloudera Manager Admin Console, go to Clusters > HDFS service > Configuration section and search for Enable Immutable Snapshots.

  • Do not use snapshot diff for globbed paths, because it is not optimized for them.
  • Set the snapshot root directory as low in the hierarchy as possible.
  • The user configured to run the job (the run-as user) must be either a superuser or the owner of the snapshottable root, because that user needs permission to list the snapshots.
  • Decide whether Replication Manager should abort or continue the replication when a snapshot diff fails. If you configure it to continue, Replication Manager performs a complete replication instead, which can take longer.
  • Replication Manager performs a complete replication when one or more of the following settings change: Delete Policy, Preserve Policy, Target Path, or Exclusion Path.
  • For a policy to run using snapshot diff, the paths from both the source and destination clusters in the replication policy must be snapshottable or under a snapshottable root.
  • If the source data contains an encrypted subdirectory, create an exclusion regex in the replication policy to exclude that subdirectory, and create a separate replication policy to replicate it. Snapshot diff-based replication might fail if an encrypted subdirectory exists in the source data.
  • If you create a Hive replication policy to replicate a database, ensure that all the HDFS paths for the tables in that database are either snapshottable or under a snapshottable root. For example, if the database has external tables, all the external table HDFS data locations must be snapshottable too. Otherwise, Replication Manager fails to generate a diff report and, without one, does not use snapshot diff.
  • After every replication, Replication Manager retains a snapshot on the source cluster. Using the snapshot copy on the source cluster, Replication Manager performs incremental backups for the next replication cycle. Replication Manager retains snapshots on the source cluster only if:
    • The source and target clusters are managed by Cloudera Manager 5.15 or higher.
    • The source and target clusters run CDH 5.13.3 or higher, 5.14.2 or higher, or 5.15 or higher.
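
The version and path guidelines above can be checked before a policy is created. The following Python sketch is one way to express them, under stated assumptions: the CDH rule is read per release line (5.13.x needs 5.13.3 or higher, 5.14.x needs 5.14.2 or higher, and any release from 5.15.0 onward qualifies), version strings are plain dotted numbers, and the list of snapshottable directories is supplied by the caller, for example gathered from the output of `hdfs lsSnapshottableDir`. The function names and sample paths are hypothetical and are not part of Replication Manager.

```python
import posixpath
from typing import Iterable, Tuple

Version = Tuple[int, int, int]


def parse_version(text: str) -> Version:
    """Turn '5.14.2' into (5, 14, 2); missing parts default to 0."""
    parts = [int(p) for p in text.split(".")[:3]]
    parts += [0] * (3 - len(parts))
    return (parts[0], parts[1], parts[2])


def cdh_supports_snapshot_diff(version: str) -> bool:
    """Assumed reading of the CDH rule: a minimum maintenance release per line."""
    v = parse_version(version)
    if v >= (5, 15, 0):
        return True
    minimum = {(5, 13): 3, (5, 14): 2}.get(v[:2])
    return minimum is not None and v[2] >= minimum


def cm_supports_snapshot_diff(version: str) -> bool:
    """Cloudera Manager must be 5.15.0 or higher."""
    return parse_version(version) >= (5, 15, 0)


def under_snapshottable_root(path: str, roots: Iterable[str]) -> bool:
    """True if path is itself a snapshottable root or lies beneath one."""
    path = posixpath.normpath(path)
    for root in roots:
        root = posixpath.normpath(root)
        if path == root or root == "/" or path.startswith(root + "/"):
            return True
    return False


if __name__ == "__main__":
    roots = ["/data/warehouse"]  # hypothetical snapshottable directory
    print(cdh_supports_snapshot_diff("5.14.2"))
    print(under_snapshottable_root("/data/warehouse/sales", roots))
```

For a Hive policy, the path check would be run against every table location in the database, including external table locations.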