How Replication Manager uses snapshots

You can create snapshot policies in CDP Private Cloud Base Replication Manager to take HDFS and Ozone snapshots at regular intervals. HDFS and Hive replication policies leverage HDFS snapshots and Ozone replication policies leverage Ozone snapshots to implement incremental data replication.

You can also create HBase snapshot policies to create HBase snapshots at regular intervals in Replication Manager. There are several use cases that leverage HBase snapshots. For more information, see HBase snapshot use cases.

HBase snapshots are enabled for all HBase tables by default. HBase snapshots are point-in-time backups of tables that do not copy the data and have minimal impact on RegionServers. HBase snapshots are supported for clusters running CDH 4.2 or higher. You can also create an HBase snapshot using the Cloudera Operational Database (COD) CLI.

HDFS snapshots

Understand what HDFS snapshots are and how Replication Manager helps with successful HDFS replication.

HDFS snapshots are point-in-time backups of directories or the entire filesystem, taken without actually cloning the data. HDFS snapshots improve data replication performance and prevent errors caused by changes to a source directory. These snapshots appear on the filesystem as read-only directories that can be accessed just like other ordinary directories.

Some replications, especially those that require a long time to finish, can fail because source files are modified during the replication process. You can prevent such failures by using snapshot policies in Replication Manager. This use of snapshots is automatic with CDH versions 5.0 and higher. To take advantage of this, you must enable the relevant directories for snapshots (also called making the directory snapshottable).
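Outside of Cloudera Manager, a directory is made snapshottable with the HDFS admin command hdfs dfsadmin -allowSnapshot. The following Python sketch only builds that command line for a list of directories. The helper name and the example paths are hypothetical, and actually running the commands requires HDFS superuser privileges on a host that has the HDFS client installed.

```python
import shlex


def allow_snapshot_commands(directories):
    """Return one `hdfs dfsadmin -allowSnapshot` argv list per directory."""
    commands = []
    for directory in directories:
        # HDFS admin commands expect absolute paths; reject anything else early.
        if not directory.startswith("/"):
            raise ValueError(f"expected an absolute HDFS path, got {directory!r}")
        commands.append(["hdfs", "dfsadmin", "-allowSnapshot", directory])
    return commands


if __name__ == "__main__":
    # Hypothetical source directories used by a replication policy.
    for argv in allow_snapshot_commands(["/data/sales", "/data/logs"]):
        print(shlex.join(argv))
```

The argv lists can be passed to subprocess.run as-is, which avoids shell quoting problems for paths that contain spaces.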

When the replication job runs, it checks whether the specified source directory is snapshottable. If it is, the job creates a point-in-time snapshot of the directory before replicating any files and uses the snapshot as the source for file copies. This ensures that the replicated data is consistent with the source data as of the start of the replication job. After the replication process completes, the latest snapshot is retained for use by subsequent runs.
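To illustrate where the file copies read from, the sketch below builds the command that creates a named HDFS snapshot and the read-only path under which that snapshot is exposed (the .snapshot subdirectory of the snapshottable directory). Replication Manager performs these steps internally; the function names and the snapshot name used here are hypothetical and do not reflect the names that Replication Manager generates.

```python
from pathlib import PurePosixPath


def create_snapshot_command(directory, snapshot_name):
    """argv for creating a named snapshot of a snapshottable HDFS directory."""
    return ["hdfs", "dfs", "-createSnapshot", directory, snapshot_name]


def snapshot_path(directory, snapshot_name):
    """Read-only path where HDFS exposes the snapshot contents."""
    return str(PurePosixPath(directory) / ".snapshot" / snapshot_name)


if __name__ == "__main__":
    source_dir = "/data/sales"        # hypothetical source directory
    name = "example-run-0001"         # hypothetical snapshot name
    print(" ".join(create_snapshot_command(source_dir, name)))
    # Files are then copied from the frozen view, not from the live directory.
    print(snapshot_path(source_dir, name))
```

Because the snapshot path is read-only and fixed at creation time, changes made to the live directory while the copy is running do not affect what is copied.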

A directory is snapshottable because it has been enabled for snapshots, or because a parent directory is enabled for snapshots. Subdirectories of a snapshottable directory are included in the snapshot.
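The rule above can be expressed as a simple ancestor check. This is a minimal sketch of the rule as stated in this section, not of how HDFS or Replication Manager implements it; the function name and the example paths are hypothetical.

```python
from pathlib import PurePosixPath


def is_covered_by_snapshots(path, enabled_dirs):
    """True if `path` or one of its parent directories is enabled for snapshots."""
    candidate = PurePosixPath(path)
    for enabled in enabled_dirs:
        enabled_path = PurePosixPath(enabled)
        # Compare whole path components so /data/sales does not match /data/salesforce.
        if candidate == enabled_path or enabled_path in candidate.parents:
            return True
    return False


if __name__ == "__main__":
    enabled = ["/data/sales"]  # hypothetical directory enabled for snapshots
    for p in ["/data/sales", "/data/sales/2024/q1", "/data/salesforce", "/data"]:
        print(p, is_covered_by_snapshots(p, enabled))
```

Comparing path components instead of raw string prefixes avoids treating a sibling directory with a similar name as covered.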

For more information, see Using HDFS snapshots.

Ozone snapshots and replication methods

Understand what Ozone snapshots are and what you can replicate with Ozone snapshots. Also, learn about the replication methods you can choose for Ozone replication policies to replicate data.

What Ozone snapshots are

Ozone snapshots are point-in-time backups of buckets and volumes, taken without actually cloning the data. You can leverage snapshots and snapshot-diffs to implement incremental replication in Ozone replication policies.

Ozone data replication methods

Ozone snapshots are enabled by default for all the buckets and volumes. If the incremental replication feature is also enabled on the source and target clusters, you can choose one of the following methods to replicate Ozone data during the Ozone replication policy creation process:

Full file listing
By default, Ozone replication policies use the full file listing method, which takes longer to replicate data. In this method, the first Ozone replication policy job run is a bootstrap job; that is, all the data in the chosen buckets is replicated. During subsequent replication policy runs, Replication Manager performs the following high-level steps:
  1. Lists all the files.
  2. Performs a checksum and metadata check on them to identify the relevant files to copy. This step depends on the advanced options you choose during the replication creation process. During this identification process, some unchanged files are skipped if they do not meet the criteria set by the chosen advanced options.
  3. Copies the identified files from the source cluster to the target cluster.
Incremental only
In this method, the first replication policy job run is a bootstrap job, and subsequent job runs are incremental jobs.
To perform an incremental job, Replication Manager leverages Ozone snapshots and the snapshot-diff capability to generate a diff report. The diff report contains the changed or new data from the source cluster. Each subsequent replication policy job run replicates data based on the diff report.
Incremental with fallback to full file listing
In this method, the first replication policy job run is a bootstrap job, and subsequent job runs are incremental jobs. However, if the snapshot-diff fails during a replication policy job run, the next job run is a full file listing run. After the full file listing run succeeds, the subsequent runs are incremental runs. This method takes a longer time to replicate data if the replication policy job falls back to the full file listing method.
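The three methods can be compared with a small in-memory model. In the sketch below, files_to_copy mimics the full file listing steps using a checksum comparison only (the real selection also depends on the advanced options chosen for the policy), and plan_runs lists the run type of each job run for a given method. All names, the method identifiers, and the treatment of a failed snapshot-diff under the "incremental only" method (the next run simply tries an incremental run again) are assumptions made for illustration, not Replication Manager behavior or APIs.

```python
import hashlib

METHODS = ("full_file_listing", "incremental_only", "incremental_with_fallback")


def files_to_copy(source, target):
    """Full file listing: list every source file and select those whose
    checksum differs from, or is missing on, the target.
    `source` and `target` map object keys to bytes."""
    selected = []
    for key in sorted(source):
        src_sum = hashlib.sha256(source[key]).hexdigest()
        dst = target.get(key)
        if dst is None or hashlib.sha256(dst).hexdigest() != src_sum:
            selected.append(key)
    return selected


def plan_runs(method, diff_ok):
    """Return the run type of each job run. The first run is always a bootstrap.
    `diff_ok` holds one flag per later run: would snapshot-diff succeed?"""
    if method not in METHODS:
        raise ValueError(f"unknown method: {method}")
    runs = ["bootstrap"]
    fallback_next = False
    for ok in diff_ok:
        if method == "full_file_listing":
            runs.append("full_file_listing")
        elif fallback_next:
            # Previous snapshot-diff failed, so this run lists all files.
            # The sketch assumes the full file listing run succeeds.
            runs.append("full_file_listing")
            fallback_next = False
        elif ok:
            runs.append("incremental")
        else:
            runs.append("incremental_failed")
            fallback_next = method == "incremental_with_fallback"
    return runs


if __name__ == "__main__":
    print(files_to_copy({"a": b"1", "b": b"2", "c": b"3"}, {"a": b"1", "b": b"x"}))
    print(plan_runs("incremental_with_fallback", [True, False, True, True]))
```

In the fallback example, the failed incremental run is followed by exactly one full file listing run, after which the policy returns to incremental runs.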

Hive/Impala replication using snapshots

If you are using Hive external table replication, Cloudera recommends that you make the Hive Warehouse Directory snapshottable.

The Hive warehouse directory is located in the HDFS file system in the location specified by the hive.metastore.warehouse.dir property. The default location is /user/hive/warehouse.

To locate the Hive warehouse directory, perform the following steps:
  1. Go to the Cloudera Manager > HDFS service > Configuration tab.
  2. Search for the hive.metastore.warehouse.dir property to view the location of the directory.

After you locate the directory, enable snapshots for the directory.
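If you have a copy of the Hive client configuration rather than access to Cloudera Manager, the same property can be read from hive-site.xml, which uses the standard Hadoop configuration layout of property elements with name and value children. The sketch below is an illustration only: the function name and the sample XML are hypothetical, and the fallback value is the default location stated in this section.

```python
import xml.etree.ElementTree as ET

# Default location as stated in this section; your cluster may differ.
DEFAULT_WAREHOUSE_DIR = "/user/hive/warehouse"


def warehouse_dir_from_hive_site(xml_text):
    """Return hive.metastore.warehouse.dir from hive-site.xml content,
    or the documented default when the property is absent or empty."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        name = prop.findtext("name", default="").strip()
        if name == "hive.metastore.warehouse.dir":
            value = prop.findtext("value", default="").strip()
            if value:
                return value
    return DEFAULT_WAREHOUSE_DIR


if __name__ == "__main__":
    # Hypothetical configuration fragment.
    sample = """
    <configuration>
      <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/warehouse/custom/hive</value>
      </property>
    </configuration>
    """
    print(warehouse_dir_from_hive_site(sample))
```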

If you are using external tables in Hive, also make snapshottable the directories that host any external tables stored outside the Hive warehouse directory.

Similarly, if you are using Impala and are replicating any Impala tables using Hive/Impala replication, ensure that the storage locations for the tables and associated databases are also snapshottable.

Orphaned snapshots

When a snapshot policy includes a limit on the number of snapshots to keep, Cloudera Manager checks the total number of stored snapshots each time a new snapshot is added, and automatically deletes the oldest existing snapshot if necessary.
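The retention check described above amounts to trimming the oldest entries once the limit is exceeded. The following is a minimal sketch of that logic under the assumption that snapshots are tracked as a list ordered from oldest to newest; the function name and snapshot names are hypothetical and this is not Cloudera Manager code.

```python
def snapshots_to_delete(existing, new_snapshot, keep_limit):
    """Return the oldest snapshots that must be deleted so that at most
    `keep_limit` snapshots remain after `new_snapshot` is added.
    `existing` is ordered from oldest to newest."""
    if keep_limit < 1:
        raise ValueError("keep_limit must be at least 1")
    all_snapshots = list(existing) + [new_snapshot]
    excess = len(all_snapshots) - keep_limit
    return all_snapshots[:excess] if excess > 0 else []


if __name__ == "__main__":
    # With a limit of 3, adding a fourth snapshot removes the oldest one.
    print(snapshots_to_delete(["s1", "s2", "s3"], "s4", keep_limit=3))
```

Note that the check only happens when a new snapshot is added, which is why snapshots that a policy no longer produces are never trimmed.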

When a snapshot policy is edited or deleted, files, directories, or tables that were removed from the policy may leave "orphaned" snapshots behind that are not deleted automatically because they are no longer associated with a current snapshot policy. Cloudera Manager never selects these snapshots for automatic deletion because selection for deletion only occurs when the policy creates a new snapshot containing those files, directories, or tables.

You can delete snapshots manually through Cloudera Manager or by creating a command-line script that uses the HDFS or HBase snapshot commands. Orphaned snapshots can be hard to locate for manual deletion. Snapshot policies automatically receive the prefix cm-auto followed by a globally unique identifier (GUID). You can locate all snapshots for a specific policy by searching for the prefix cm-auto-guid that is unique to that policy.
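A command-line cleanup script for HDFS snapshots could be structured as in the sketch below, which filters snapshot names by policy prefix and builds the matching hdfs dfs -deleteSnapshot commands. The exact name layout after the cm-auto-guid prefix is an assumption, and the function names, GUIDs, directory, and snapshot names are hypothetical. Review the resulting list before running any delete command.

```python
def policy_snapshots(snapshot_names, guid):
    """Select the snapshots whose names start with the policy prefix cm-auto-<guid>."""
    prefix = f"cm-auto-{guid}"
    return [name for name in snapshot_names if name.startswith(prefix)]


def delete_snapshot_commands(directory, names):
    """argv lists that delete the given snapshots of one snapshottable directory."""
    return [["hdfs", "dfs", "-deleteSnapshot", directory, name] for name in names]


if __name__ == "__main__":
    # Hypothetical snapshot names found under /data/sales/.snapshot
    found = [
        "cm-auto-1f0e7c2a-aaaa-bbbb-cccc-000000000001-example-1",
        "cm-auto-1f0e7c2a-aaaa-bbbb-cccc-000000000001-example-2",
        "cm-auto-9b3d5e4f-aaaa-bbbb-cccc-000000000002-example-1",
        "manual-backup",
    ]
    orphaned = policy_snapshots(found, "1f0e7c2a-aaaa-bbbb-cccc-000000000001")
    for argv in delete_snapshot_commands("/data/sales", orphaned):
        print(" ".join(argv))
```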

To avoid orphaned snapshots, delete snapshots before editing or deleting the associated snapshot policy, or record the identifying names of the snapshots you want to delete. The policy prefix is displayed in the summary of the policy in the policy list and appears in the delete dialog box. Recording the snapshot names, including the associated policy prefix, is necessary because the prefix associated with a policy cannot be determined after the policy has been deleted, and snapshot names do not contain recognizable references to snapshot policies.