Snapshot Replication using HDP Clusters
You can optionally enable HDFS snapshots for replication in Replication Manager. Understanding how snapshots work, along with the benefits and costs involved, can help you decide whether to enable snapshot replication.
Understanding HDFS Snapshots
HDFS snapshots are read-only point-in-time copies of the filesystem. You can enable snapshots on the entire filesystem, or on a subtree of the filesystem. For Replication Manager, you enable snapshots at a dataset level.
Enabling snapshots on a directory requires HDFS admin permissions, because it impacts the NameNode. When you enable snapshots on a directory, all of its subdirectories are automatically enabled for snapshots as well, so a snapshot of a directory includes all content in that directory, including subdirectories. If a directory contains snapshots but is no longer snapshot-enabled, you must delete those snapshots before you enable the snapshot capability on the directory again.
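Replication Manager performs these steps for you when a policy runs. For orientation only, the following minimal sketch shows the standard HDFS commands that correspond to enabling snapshots, creating a snapshot, and cleaning up. It only builds the command lines and does not run them. The function names and the /data/sales path are hypothetical and chosen for illustration.

```python
from typing import List, Optional


def allow_snapshot_cmd(path: str) -> List[str]:
    # Marks a directory as snapshot-enabled (snapshottable).
    # Requires HDFS superuser privilege.
    return ["hdfs", "dfsadmin", "-allowSnapshot", path]


def create_snapshot_cmd(path: str, name: Optional[str] = None) -> List[str]:
    # Creates a snapshot of a snapshot-enabled directory.
    # If no name is given, HDFS generates a timestamp-based name.
    cmd = ["hdfs", "dfs", "-createSnapshot", path]
    if name:
        cmd.append(name)
    return cmd


def delete_snapshot_cmd(path: str, name: str) -> List[str]:
    # Deletes one named snapshot from a directory.
    return ["hdfs", "dfs", "-deleteSnapshot", path, name]


def disallow_snapshot_cmd(path: str) -> List[str]:
    # Removes the snapshot capability from a directory.
    # HDFS rejects this while the directory still contains snapshots.
    return ["hdfs", "dfsadmin", "-disallowSnapshot", path]


if __name__ == "__main__":
    # Hypothetical dataset path, for illustration only.
    for cmd in (
        allow_snapshot_cmd("/data/sales"),
        create_snapshot_cmd("/data/sales", "s1"),
        delete_snapshot_cmd("/data/sales", "s1"),
        disallow_snapshot_cmd("/data/sales"),
    ):
        print(" ".join(cmd))
```

To actually run one of these commands you could pass the list to subprocess.run as an HDFS superuser, but in a Replication Manager deployment the product normally manages snapshots itself.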
Snapshots must be taken on the highest-level parent directory that is snapshot-enabled. Snapshot operations are not allowed on a directory if one of its parent directories is already snapshot-enabled (snapshottable) or if descendants already contain snapshots.
For example, in the directory tree image below, if directory-1 is snapshot-enabled but you want to replicate subdirectory-2, you cannot select only subdirectory-2 for replication. You must select directory-1 for your replication policy.
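The nesting rule can be sketched as a small path check. This is an illustrative model of the rule as described here, not the check that HDFS or Replication Manager actually runs. The function names are hypothetical, and the paths are the example names above written as absolute paths.

```python
from pathlib import PurePosixPath
from typing import Iterable, Optional


def is_ancestor(a: PurePosixPath, b: PurePosixPath) -> bool:
    # True if a is a strict ancestor of b.
    return a != b and a in b.parents


def can_enable_snapshots(candidate: str, snapshottable: Iterable[str]) -> bool:
    # A directory cannot be snapshot-enabled if an ancestor or a
    # descendant is already snapshot-enabled. A directory that is
    # already snapshot-enabled itself is treated as allowed here.
    cand = PurePosixPath(candidate)
    for existing in map(PurePosixPath, snapshottable):
        if is_ancestor(existing, cand) or is_ancestor(cand, existing):
            return False
    return True


def policy_root(target: str, snapshottable: Iterable[str]) -> Optional[str]:
    # Returns the snapshot-enabled directory that must be selected in
    # the replication policy in order to cover target, or None if no
    # snapshot-enabled directory covers it.
    tgt = PurePosixPath(target)
    for existing in map(PurePosixPath, snapshottable):
        if existing == tgt or is_ancestor(existing, tgt):
            return str(existing)
    return None


if __name__ == "__main__":
    enabled = ["/directory-1"]
    print(can_enable_snapshots("/directory-1/subdirectory-2", enabled))
    print(policy_root("/directory-1/subdirectory-2", enabled))
```

With directory-1 snapshot-enabled, the check rejects subdirectory-2 as a separate snapshot root and reports directory-1 as the directory to select for the policy.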
There is no limit to the number of snapshot-enabled directories you can have. A snapshot-enabled directory can accommodate 65,536 simultaneous snapshots.
Creating a snapshot does not copy blocks in the DataNodes. The snapshot files record only the block list and the file size, so no data is copied.
When snapshots are initially created, a directory named .snapshot is created on the source and destination clusters, under the directory being copied. All snapshots are retained within .snapshot directories. By default, the last three snapshots of a file or directory are retained. Snapshots older than the last three are automatically deleted.
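The retention behavior can be pictured with the following minimal sketch. It assumes snapshot names are listed oldest first and uses the default of three described above. The function name and the snapshot names are hypothetical, and the actual cleanup is performed by Replication Manager, not by code like this.

```python
from typing import List, Sequence


def snapshots_to_delete(names_oldest_first: Sequence[str], keep: int = 3) -> List[str]:
    # Returns the snapshots that fall outside the retention window,
    # that is, everything older than the most recent `keep` snapshots.
    if keep < 0:
        raise ValueError("keep must be non-negative")
    cutoff = max(len(names_oldest_first) - keep, 0)
    return list(names_oldest_first[:cutoff])


if __name__ == "__main__":
    # Hypothetical snapshot names, oldest first.
    existing = ["snap-1", "snap-2", "snap-3", "snap-4", "snap-5"]
    print(snapshots_to_delete(existing))
```

In this example the two oldest snapshots are selected for deletion and the last three remain in the .snapshot directory.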
Requirements for snapshot-based replication
You must have HDFS superuser privilege to enable or disable snapshot operations.
Replication using snapshots requires that the target filesystem data being replicated is identical to the source data for a given snapshot. There must not be any modification to the data on the target. Otherwise, the integrity of the snapshot cannot be guaranteed on the target and replication can fail in various ways.
Benefits of snapshots
Snapshot-based replication helps you to avoid unnecessarily copying renamed files and directories. If a large directory is renamed on the source side, a regular DistCp update operation sees the renamed directory as a new one and copies the entire directory.
Generating copy lists during incremental synchronization is more efficient with snapshots than with a regular DistCp update, which can take a long time to scan the whole directory and detect identical files. And because snapshots are read-only point-in-time copies, modification of source files during replication is not an issue, as it can be with other replication methods.
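To illustrate why a snapshot difference is cheaper than a full scan, the following sketch groups the entries of a snapshot diff report by type. It assumes the text layout printed by the hdfs snapshotDiff command, in which each entry starts with a label (+ for created, - for deleted, M for modified, R for renamed) followed by a path, and a rename is written as "old -> new". The SyncPlan class, the function name, and the sample paths are hypothetical. This is a simplified model, not the logic that DistCp or Replication Manager uses internally.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple


@dataclass
class SyncPlan:
    renamed: List[Tuple[str, str]] = field(default_factory=list)
    deleted: List[str] = field(default_factory=list)
    created: List[str] = field(default_factory=list)
    modified: List[str] = field(default_factory=list)


def plan_from_diff(report_lines: Iterable[str]) -> SyncPlan:
    # Groups snapshot diff entries by type. Lines that do not start
    # with a known label, such as the report header, are skipped.
    plan = SyncPlan()
    for raw in report_lines:
        parts = raw.strip().split(None, 1)
        if len(parts) != 2:
            continue
        label, rest = parts
        if label == "R":
            old, sep, new = rest.partition(" -> ")
            if sep:
                plan.renamed.append((old, new))
        elif label == "-":
            plan.deleted.append(rest)
        elif label == "+":
            plan.created.append(rest)
        elif label == "M":
            plan.modified.append(rest)
    return plan


if __name__ == "__main__":
    # Hypothetical report for a dataset in which a large directory was renamed.
    report = [
        "Difference between snapshot s1 and snapshot s2 under directory /data:",
        "M\t.",
        "R\t./big_dir -> ./big_dir_renamed",
        "+\t./new.csv",
        "-\t./old.csv",
    ]
    plan = plan_from_diff(report)
    print(plan.renamed)
    print(plan.created, plan.deleted)
```

In this model the renamed directory appears as a single rename entry that can be replayed on the target, rather than as a new directory whose files all need to be copied, and only the created and modified paths need a data transfer.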
A snapshot cannot be modified. This protects the data against accidental or intentional modification, which is helpful in governance.
Considerations for using snapshots
There is a memory cost to enabling and maintaining snapshots. Tracking the modifications that are made relative to a snapshot increases the memory footprint on the NameNode and can therefore stress NameNode memory.
Because of the additional memory requirements, use snapshot replication in the situations where it is most useful. Such circumstances might include: if you expect to do a lot of directory renaming, if the directory tree is very large, or if you expect changes to be made to source files while replication jobs execute.
On the Additional Settings page of the Create Replication Policy wizard, you must select Enable snapshot based replication. To perform snapshot-based replication, you must be running HDP 2.6.5 or higher, or any supported HDP 3.x version.
The image below shows the settings in Replication Manager for including snapshot-based replication in the replication policy.
Considerations for HDP 2.6.5: The HDP 2.6.5 release does not fully support snapshots. Snapshots do not capture a point-in-time copy of open files, which can lead to data loss. Snapshot-diff-based replication is disabled in HDP 2.6.5. If you enable snapshot-based replication in HDP 2.6.5, the snapshots are used as the replication source, without diff-based replication. Because the snapshot is used as the source, a modification made to the source during replication, such as a file deletion, does not cause a file-not-found exception, and the replication process continues.