Guidelines for Snapshot Diff-based Replication

By default, Replication Manager improves performance by using snapshot differences ("diffs"): it compares HDFS snapshots and replicates only the files that have changed in the source directory.
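The idea behind a snapshot diff can be illustrated with a small Python sketch. It is a conceptual model only, not the actual implementation in Replication Manager or HDFS. The function name `snapshot_diff` and the representation of a snapshot as a mapping from file path to modification time are assumptions made for illustration.

```python
def snapshot_diff(old, new):
    """Compare two snapshot listings (hypothetical format: path -> modification time).

    Returns a tuple (to_copy, deleted):
      to_copy  - paths that are new or whose modification time changed
      deleted  - paths present in the old snapshot but absent from the new one
    """
    # A file must be copied if it is new or its modification time differs.
    to_copy = sorted(p for p, mtime in new.items() if old.get(p) != mtime)
    # A file was deleted if it exists only in the old snapshot.
    deleted = sorted(p for p in old if p not in new)
    return to_copy, deleted


# Illustrative data only; the paths and timestamps are made up.
old_snapshot = {"/data/a.txt": 100, "/data/b.txt": 100, "/data/c.txt": 100}
new_snapshot = {"/data/a.txt": 100, "/data/b.txt": 250, "/data/d.txt": 300}

to_copy, deleted = snapshot_diff(old_snapshot, new_snapshot)
print(to_copy)   # ['/data/b.txt', '/data/d.txt']
print(deleted)   # ['/data/c.txt']
```

In this example only two of the three files in the new snapshot would be transferred, which is where the performance gain over a full copy comes from.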

While Hive metadata requires a full replication, the data stored in Hive tables can take advantage of snapshot diff-based replication.

To use this feature, follow these guidelines:

  • If a Hive replication schedule replicates a database, ensure that the HDFS path of every table in that database is either snapshottable or under a snapshottable root. For example, if the database contains external tables, the HDFS data locations of those external tables must be snapshottable too. Otherwise, Replication Manager cannot generate a diff report, and without a diff report it does not use snapshot diff.
  • After every replication, Replication Manager retains a snapshot on the source cluster and uses it to perform an incremental backup in the next replication cycle. Replication Manager retains snapshots on the source cluster only if:
    • Cloudera Manager on the source and target clusters is version 5.15 or higher
    • CDH on the source and target clusters is 5.13.3 or a later 5.13.x release, 5.14.2 or a later 5.14.x release, or 5.15 or higher
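The first guideline above, that every table location must be snapshottable or sit under a snapshottable root, can be checked before a schedule is created. The following is a minimal sketch, assuming you have already collected the table locations (for example, from each table's metadata) and the list of snapshottable directories (for example, from the output of `hdfs lsSnapshottableDir`) as plain absolute paths without a scheme or authority. The function name `uncovered_paths` and the sample paths are hypothetical and are not part of Replication Manager.

```python
from pathlib import PurePosixPath


def uncovered_paths(table_paths, snapshottable_dirs):
    """Return the table paths that are neither snapshottable themselves
    nor located under a snapshottable root."""
    roots = [PurePosixPath(d) for d in snapshottable_dirs]
    missing = []
    for raw in table_paths:
        path = PurePosixPath(raw)
        # Covered if the path equals a snapshottable directory
        # or one of its ancestors is a snapshottable directory.
        covered = any(path == root or root in path.parents for root in roots)
        if not covered:
            missing.append(raw)
    return missing


# Illustrative paths only.
table_locations = [
    "/user/hive/warehouse/sales.db/orders",  # managed table
    "/data/external/clicks",                 # external table
]
snapshottable = ["/user/hive/warehouse"]

print(uncovered_paths(table_locations, snapshottable))
# ['/data/external/clicks']
```

An empty result suggests that all listed locations are covered. Any path that is returned would need to be made snapshottable, or placed under a snapshottable root, before a diff report can be generated for the database.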