Using Snapshots with Replication
Some replications, especially long-running ones, can fail because source files are modified while the replication is in progress. You can prevent such failures by using snapshots in conjunction with replication. With CDH 5.0 and higher, replication uses snapshots automatically, provided that you have enabled the relevant directories for snapshots (also called making the directories snapshottable).
When the replication job runs, it checks whether the specified source directories are snapshottable. If they are, then before replicating any files the job creates point-in-time snapshots of these directories and uses the snapshots as the source for file copies. This ensures that the replicated data is consistent with the source data as of the start of the replication job. The job deletes these snapshots after the replication completes.
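The sequence described above (create a snapshot, copy from it, delete it) can be sketched with the standard HDFS command-line tools. This is only an illustration of the idea, not what the replication job runs internally. The function name replication_commands, the source path, the target URI, and the snapshot name are all hypothetical. The sketch builds the commands without executing them.

```python
import posixpath


def replication_commands(source_dir, target_uri, snapshot_name):
    """Return the HDFS CLI commands for a snapshot-backed copy, in order.

    Assumes source_dir is already snapshottable. All arguments are
    illustrative values supplied by the caller.
    """
    # A snapshot is exposed read-only under <dir>/.snapshot/<name>.
    snapshot_path = posixpath.join(source_dir, ".snapshot", snapshot_name)
    return [
        # 1. Freeze a point-in-time view of the source directory.
        ["hdfs", "dfs", "-createSnapshot", source_dir, snapshot_name],
        # 2. Copy from the snapshot rather than from the live directory.
        ["hadoop", "distcp", snapshot_path, target_uri],
        # 3. Remove the snapshot once the copy has finished.
        ["hdfs", "dfs", "-deleteSnapshot", source_dir, snapshot_name],
    ]


commands = replication_commands(
    "/data/events",                       # hypothetical source directory
    "hdfs://backup-nn:8020/data/events",  # hypothetical target cluster
    "replication-run-1",                  # hypothetical snapshot name
)
for command in commands:
    print(" ".join(command))
```

Because the copy reads from the .snapshot path, files that change in the live directory during the copy do not affect the data being transferred.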
A directory is snapshottable if it has been enabled for snapshots or if one of its parent directories has been enabled for snapshots. Subdirectories of a snapshottable directory are included in the snapshot. To enable an HDFS directory for snapshots (to make it snapshottable), see Enabling HDFS Snapshots.
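The rule above can be expressed as a small check: a path is covered if it, or any of its ancestors, appears in the set of directories enabled for snapshots. The following is a minimal sketch. The function name is_snapshottable and the example paths are hypothetical, and the list of enabled directories is assumed to be supplied by the caller (for example, collected from the output of hdfs lsSnapshottableDir).

```python
import posixpath


def is_snapshottable(path, enabled_dirs):
    """Return True if path, or any ancestor of path, is in enabled_dirs."""
    enabled = {posixpath.normpath(d) for d in enabled_dirs}
    current = posixpath.normpath(path)
    while True:
        if current in enabled:
            return True
        parent = posixpath.dirname(current)
        if parent == current:  # reached the top; nothing left to check
            return False
        current = parent


enabled = ["/user/hive/warehouse"]  # hypothetical enabled directory
print(is_snapshottable("/user/hive/warehouse/sales.db/orders", enabled))
print(is_snapshottable("/user/hive", enabled))
```

The check walks upward one path component at a time, so a sibling directory whose name merely starts with the same characters (such as /user/hive/warehouse2) is not treated as covered.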
Hive Replication with Snapshots
If you are using Hive replication, Cloudera recommends that you make the Hive warehouse directory snapshottable. The Hive warehouse directory is the HDFS location specified by the hive.metastore.warehouse.dir property (the default location is /user/hive/warehouse).
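To find out which directory to make snapshottable, you can look up the property in the Hive configuration. The sketch below assumes the configuration is available as text in the standard Hadoop XML layout (a configuration element containing property elements with name and value children). The function name warehouse_dir and the sample configuration are hypothetical.

```python
import xml.etree.ElementTree as ET

DEFAULT_WAREHOUSE_DIR = "/user/hive/warehouse"


def warehouse_dir(hive_site_xml):
    """Return hive.metastore.warehouse.dir, or the default if it is unset."""
    root = ET.fromstring(hive_site_xml)
    for prop in root.iter("property"):
        name = prop.findtext("name", "").strip()
        if name == "hive.metastore.warehouse.dir":
            value = prop.findtext("value", "").strip()
            if value:
                return value
    return DEFAULT_WAREHOUSE_DIR


# Hypothetical configuration with a non-default warehouse location.
sample = """<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/warehouse/hive</value>
  </property>
</configuration>"""

print(warehouse_dir(sample))
```

If the property is absent or empty, the function falls back to the default location named above.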
If you are using external tables in Hive, also make snapshottable any directories that host external tables stored outside the Hive warehouse directory.
Similarly, if you are using Cloudera Impala and replicating Impala tables with Hive replication, ensure that the storage locations of those tables and their associated databases are also snapshottable. See Enabling HDFS Snapshots.
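As an illustration of the recommendations in this section, the following sketch works out which directories need to be enabled: the warehouse directory itself, plus every table or database location that lies outside it. The function name dirs_to_enable and all paths are hypothetical. The list of storage locations is assumed to have been gathered separately, for example from the table metadata. The printed hdfs dfsadmin -allowSnapshot commands are the HDFS command-line way to enable a directory and are shown as one possible alternative to the procedure in Enabling HDFS Snapshots.

```python
import posixpath


def dirs_to_enable(warehouse_dir, table_locations):
    """Return the warehouse directory plus each location outside it."""
    warehouse = posixpath.normpath(warehouse_dir)
    prefix = warehouse.rstrip("/") + "/"
    needed = [warehouse]
    for location in table_locations:
        loc = posixpath.normpath(location)
        # Locations inside the warehouse are covered by its snapshot.
        inside = loc == warehouse or loc.startswith(prefix)
        if not inside and loc not in needed:
            needed.append(loc)
    return needed


locations = [
    "/user/hive/warehouse/sales.db/orders",  # managed table, already covered
    "/data/external/clicks",                 # external table
    "/data/external/clicks/",                # same location, trailing slash
    "/user/hive/warehouse_old/t",            # outside despite similar prefix
]
result = dirs_to_enable("/user/hive/warehouse", locations)
for directory in result:
    print("hdfs dfsadmin -allowSnapshot " + directory)
```

The sketch does not collapse external locations that are nested inside one another. In that case, enabling the outermost directory is enough, because its subdirectories are included in the snapshot.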