Using Snapshots with Replication
Some replications, especially long-running ones, can fail because source files are modified while the replication is in progress. You can prevent such failures by using snapshots in conjunction with replication. This use of snapshots is automatic with CDH versions 5.0 and higher. To take advantage of it, you must enable the relevant directories for snapshots (also called making the directory snapshottable). For performance reasons, do not enable snapshots on the root directory.
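As a minimal sketch of the step above, the following Python helper builds the `hdfs dfsadmin -allowSnapshot` command for a directory and refuses the root directory, in line with the performance advice. The helper name `allow_snapshot_command` and the example path are hypothetical, and the command is only constructed here, not run.

```python
import posixpath


def allow_snapshot_command(directory):
    """Return the argv list that enables HDFS snapshots on `directory`.

    Hypothetical helper for illustration; it only builds the command.
    """
    if not directory.startswith("/"):
        raise ValueError("expected an absolute HDFS path: %r" % directory)
    # Collapse "..", "." and repeated or trailing slashes.
    normalized = "/" + posixpath.normpath(directory).lstrip("/")
    if normalized == "/":
        raise ValueError("do not enable snapshots on the root directory")
    return ["hdfs", "dfsadmin", "-allowSnapshot", normalized]


if __name__ == "__main__":
    # "/data/sales/" is an example path, not one taken from this guide.
    print(" ".join(allow_snapshot_command("/data/sales/")))
```

The resulting list can be passed to `subprocess.run` on a host with the HDFS client installed and sufficient privileges.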
When the replication job runs, it checks whether the specified source directories are snapshottable. Before replicating any files, the job creates point-in-time snapshots of these directories and uses them as the source for file copies. This ensures that the replicated data is consistent with the source data as of the start of the replication job. After the replication completes, the latest snapshot is retained on the source cluster for use by subsequent runs; the snapshot itself is not replicated.
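The snapshot-then-copy idea can be illustrated with a manual equivalent. The sketch below is not what Cloudera Manager actually runs; it only plans two commands, `hdfs dfs -createSnapshot` followed by a `hadoop distcp` that reads from the snapshot path (`<dir>/.snapshot/<name>`) rather than from the live directory. The function name, snapshot name, and target URI are hypothetical.

```python
import posixpath


def plan_snapshot_copy(source_dir, snapshot_name, target_uri):
    """Return the commands for a manual snapshot-based copy.

    Hypothetical illustration: the replication job performs the
    equivalent steps itself; these commands only mirror the idea.
    """
    source_dir = source_dir.rstrip("/") or "/"
    # HDFS exposes each snapshot under <dir>/.snapshot/<name>.
    snapshot_path = posixpath.join(source_dir, ".snapshot", snapshot_name)
    return [
        # Step 1: freeze a point-in-time view of the source directory.
        ["hdfs", "dfs", "-createSnapshot", source_dir, snapshot_name],
        # Step 2: copy from the frozen view, not from the live files.
        ["hadoop", "distcp", snapshot_path, target_uri],
    ]


if __name__ == "__main__":
    # Example values only; none of them come from this guide.
    for command in plan_snapshot_copy(
        "/data/sales", "repl-001", "hdfs://backup-nn:8020/data/sales"
    ):
        print(" ".join(command))
```

Because the copy reads from the snapshot, files modified in the live directory during the copy do not affect it.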
A directory is snapshottable if it has been enabled for snapshots or if one of its parent directories has been enabled for snapshots. Subdirectories of a snapshottable directory are included in the snapshot. To enable an HDFS directory for snapshots (to make it snapshottable), see Enabling and Disabling HDFS Snapshots.
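The rule above can be expressed as a short check: walk from the directory up toward the root and see whether any directory on the way has been enabled. The function below is a sketch with a hypothetical name, and it takes the list of enabled directories as an argument; in practice that list could be gathered from the output of `hdfs lsSnapshottableDir`.

```python
import posixpath


def is_snapshottable(directory, enabled_dirs):
    """True if `directory` or one of its ancestors is enabled for snapshots."""
    enabled = {posixpath.normpath(d) for d in enabled_dirs}
    current = posixpath.normpath(directory)
    while True:
        if current in enabled:
            return True
        parent = posixpath.dirname(current)
        if parent == current:  # reached the top; nothing was enabled
            return False
        current = parent


if __name__ == "__main__":
    # Example paths only.
    enabled = ["/user/hive/warehouse"]
    print(is_snapshottable("/user/hive/warehouse/db1.db/t1", enabled))
    print(is_snapshottable("/data/external", enabled))
```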
Hive/Impala Replication with Snapshots
If you are using Hive/Impala replication, make the Hive warehouse directory snapshottable. To find the location of this directory:
- Open Cloudera Manager and browse to the Hive service.
- Click the Configuration tab.
- In the Search box, type hive.metastore.warehouse.dir.
The Hive Warehouse Directory property is displayed. Its value is the HDFS location of the Hive warehouse directory.
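Outside Cloudera Manager, the same property can be read from a Hive client configuration file. The sketch below parses Hadoop-style configuration XML with the Python standard library; the function name is hypothetical, and the embedded sample, including its value, is illustrative rather than taken from a real cluster.

```python
import xml.etree.ElementTree as ET

# Illustrative sample in the Hadoop configuration XML layout.
SAMPLE_HIVE_SITE = """<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
</configuration>"""


def warehouse_dir(hive_site_xml):
    """Return the value of hive.metastore.warehouse.dir, or None if unset."""
    root = ET.fromstring(hive_site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name") == "hive.metastore.warehouse.dir":
            return prop.findtext("value")
    return None


if __name__ == "__main__":
    print(warehouse_dir(SAMPLE_HIVE_SITE))
```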
If you are using external tables in Hive, also make the directories hosting any external tables not stored in the Hive warehouse directory snapshottable.
Similarly, if you are using Impala and are replicating any Impala tables using Hive/Impala replication, ensure that the storage locations for the tables and associated databases are also snapshottable. See Enabling and Disabling HDFS Snapshots.
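To decide which additional directories need to be made snapshottable, it can help to list the table and database locations that fall outside the Hive warehouse directory. The following sketch does that comparison on plain path strings; the function name and all example paths are hypothetical, and how the locations are collected (for example, from table metadata) is left out.

```python
import posixpath


def locations_outside_warehouse(warehouse_dir, table_locations):
    """Return the locations that are not inside the warehouse directory."""
    warehouse = posixpath.normpath(warehouse_dir)
    # Compare against "<warehouse>/" so that a sibling such as
    # "/user/hive/warehouse2" is not mistaken for a subdirectory.
    prefix = warehouse.rstrip("/") + "/"
    outside = []
    for location in table_locations:
        path = posixpath.normpath(location)
        if path != warehouse and not path.startswith(prefix):
            outside.append(path)
    return outside


if __name__ == "__main__":
    # Example paths only.
    locations = [
        "/user/hive/warehouse/sales.db/orders",
        "/data/external/events",
    ]
    for path in locations_outside_warehouse("/user/hive/warehouse", locations):
        print(path)
```

Each location returned this way, or one of its parent directories, would then need to be enabled for snapshots.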