How Replication Manager uses snapshots

You can create snapshot policies in CDP Private Cloud Base Replication Manager to take HDFS and Ozone snapshots at regular intervals. HDFS and Hive replication policies leverage HDFS snapshots and Ozone replication policies leverage Ozone snapshots to implement incremental data replication.

You can also create HBase snapshot policies to create HBase snapshots at regular intervals in Replication Manager. There are several use cases that leverage HBase snapshots. For more information, see HBase snapshot use cases.

HBase snapshots are enabled for all HBase tables by default. An HBase snapshot is a point-in-time backup of a table that does not copy the data and has minimal impact on RegionServers. HBase snapshots are supported for clusters running CDH 4.2 or higher. You can also create an HBase snapshot using the Cloudera Operational Database (COD) CLI.

HDFS snapshots

Understand what HDFS snapshots are and how they help Replication Manager during replication.

HDFS snapshots are point-in-time backups of directories that do not clone the data. HDFS snapshots improve data replication performance and prevent errors caused by changes to a source directory. These snapshots appear on the filesystem as read-only directories that can be accessed like ordinary directories.

A directory is called snapshottable after it has been enabled for snapshots, or if a parent directory is enabled for snapshots. Subdirectories of a snapshottable directory are included in the snapshot.
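The rule above can be sketched as a simple path check. This is an illustration only, not Replication Manager code; the function name is_snapshottable and the enabled_dirs set are hypothetical.

```python
import posixpath


def is_snapshottable(path, enabled_dirs):
    """Return True if path, or any parent directory of it, is enabled for snapshots."""
    enabled = {posixpath.normpath(d) for d in enabled_dirs}
    current = posixpath.normpath(path)
    while True:
        if current in enabled:
            return True
        parent = posixpath.dirname(current)
        if parent == current:  # reached the top of the path without a match
            return False
        current = parent
```

Note that the check walks whole path components, so a directory such as /data/salesforce is not treated as a child of /data/sales.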

For more information, see the HDFS Snapshots Best Practices blog.

Some replications, especially those that take a long time to finish, can fail because source files are modified during the replication process. You can prevent such failures by using snapshot policies in Replication Manager. This use of snapshots is automatic with CDH versions 5.0 and higher. To take advantage of this, you must enable the relevant directories for snapshots (also called making the directory snapshottable).

When the replication job runs, it checks whether the specified source directory is snapshottable. Before replicating any files, the replication job creates point-in-time snapshots of these directories and uses them as the source for file copies. This ensures that the replicated data is consistent with the source data as of the start of the replication job. After the replication process completes, the latest snapshot is retained for use by subsequent runs.
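Replication Manager creates and manages these snapshots itself. For orientation only, the following sketch builds the standard HDFS commands that make a directory snapshottable and take a manual snapshot of it, and shows where such a snapshot appears on the filesystem. The helper names, the directory, and the snapshot name are hypothetical; the commands are only constructed here, not executed.

```python
import posixpath


def allow_snapshot_cmd(directory):
    """Command that makes a directory snapshottable (requires HDFS superuser rights)."""
    return ["hdfs", "dfsadmin", "-allowSnapshot", directory]


def create_snapshot_cmd(directory, name):
    """Command that takes a named snapshot of a snapshottable directory."""
    return ["hdfs", "dfs", "-createSnapshot", directory, name]


def snapshot_path(directory, name):
    """Read-only path under which HDFS exposes the named snapshot."""
    return posixpath.join(directory, ".snapshot", name)
```

On a host with an HDFS client, each command list could be passed to subprocess.run(cmd, check=True). For example, snapshot_path("/data/sales", "manual-1") gives /data/sales/.snapshot/manual-1.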

For more information, see Using HDFS snapshots.

Hive/Impala replication using snapshots

Before you create Hive external table replication policies, ensure that you enable snapshots for the databases and directories that contain the required external tables. Before you replicate Impala tables, ensure that the storage locations for the tables and associated databases are also snapshottable.

For example, if the databases reside in a custom location, such as /apps/folder1/folder2/[sales.db, marketing.db, hr.db, etc.], you can enable snapshots at the following database or directory levels depending on your requirement:
  • /apps/folder1/folder2/sales.db
  • /apps/folder1/folder2/marketing.db
  • /apps/folder1/folder2/hr.db

You can also isolate the database-level snapshots from each other so that the Hive external table replication policy replicates only the specified database.

The following table shows sample custom locations that contain the external tables and the recommended directory level to enable snapshots to isolate the database-level snapshots:
Sample custom location of external tables                           | Recommended directory level to enable snapshots
/data/folder1/folder2/sales/[table1, table2, table3 ... tablen]     | /data/folder1/folder2/sales
/data/folder1/folder2/marketing/[table1, table2, table3 ... tablen] | /data/folder1/folder2/marketing
/data/folder1/folder2/hr/[table1, table2, table3 ... tablen]        | /data/folder1/folder2/hr
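
The pattern in the table can be expressed as a short helper. This is a sketch that assumes each external table directory sits directly under its database-level directory, as in the sample locations; the function names are hypothetical.

```python
import posixpath


def recommended_snapshot_dir(table_location):
    """Return the database-level directory that contains the table directory."""
    return posixpath.dirname(posixpath.normpath(table_location))


def snapshot_dirs_for(table_locations):
    """Collect the distinct database-level directories for a set of table locations."""
    return sorted({recommended_snapshot_dir(loc) for loc in table_locations})
```

Enabling snapshots on each returned directory, rather than on a shared parent such as /data/folder1/folder2, keeps the database-level snapshots isolated from each other.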

Orphaned snapshots

When you edit or delete a snapshot policy, the snapshots for the files, directories, or tables that were removed from the snapshot policy are retained. These are known as orphaned snapshots. These snapshots are not deleted automatically because they are no longer associated with a snapshot policy.

You can identify and delete these orphaned snapshots manually through Cloudera Manager, or by creating a command-line script that uses the HDFS or HBase snapshot commands.

To avoid orphaned snapshots, you can choose one of the following methods depending on your requirements.
  • Delete the snapshots before you edit or delete the associated snapshot policy.

    Cloudera Manager assigns every HDFS snapshot policy a snapshot prefix that consists of cm-auto followed by a globally unique identifier (GUID). You can view the snapshot prefix in the policy summary in the policy list, and in the delete modal window.

  • Identify the orphaned snapshots for a deleted snapshot policy using its cm-auto-guid, and delete the snapshots.
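
A command-line script for the HDFS case could follow the outline below. This is a minimal sketch under stated assumptions: the snapshot names have already been listed (for example, from the .snapshot directory of the snapshottable path), the policy prefix has been copied from the policy summary, and the function names and sample values are hypothetical. The delete commands are only built, not run.

```python
def find_orphaned_snapshots(snapshot_names, policy_prefix):
    """Return the snapshot names that start with the deleted policy's prefix."""
    return [name for name in snapshot_names if name.startswith(policy_prefix)]


def delete_snapshot_cmds(directory, snapshot_names):
    """Build one standard HDFS deleteSnapshot command per snapshot name."""
    return [
        ["hdfs", "dfs", "-deleteSnapshot", directory, name]
        for name in snapshot_names
    ]
```

Review the list returned by find_orphaned_snapshots before running any delete command, because snapshot deletion cannot be undone.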

Ozone snapshots and replication methods

Understand what Ozone snapshots are and what you can replicate with Ozone snapshots. Also, learn about the replication methods you can choose for Ozone replication policies to replicate data.

What Ozone snapshots are

Ozone snapshots are point-in-time backups of buckets and volumes that do not clone the data. You can leverage snapshots and snapshot-diffs to implement incremental replication in Ozone replication policies.

Ozone data replication methods

Ozone snapshots are enabled by default for all the buckets and volumes. If the incremental replication feature is also enabled on the source and target clusters, you can choose one of the following methods to replicate Ozone data during the Ozone replication policy creation process:

Full file listing
By default, Ozone replication policies use the full file listing method, which takes a longer time to replicate data. In this method, the first Ozone replication policy job run is a bootstrap job; that is, all the data in the chosen buckets is replicated. During subsequent replication policy runs, Replication Manager performs the following high-level steps:
  1. Lists all the files.
  2. Performs a checksum and metadata check on them to identify the relevant files to copy. This step depends on the advanced options you choose during the replication creation process. During this identification process, some unchanged files are skipped if they do not meet the criteria set by the chosen advanced options.
  3. Copies the identified files from the source cluster to the target cluster.
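
The three steps above can be sketched as a comparison of two listings. This is an illustration, not the Replication Manager implementation: it assumes each listing is a dictionary that maps a file path to a (size, checksum) pair, whereas the real comparison depends on the advanced options chosen for the policy. The function name is hypothetical.

```python
def files_to_copy(source_listing, target_listing):
    """Return source paths that are missing on the target or differ in metadata."""
    changed = []
    for path, metadata in sorted(source_listing.items()):
        # A file is copied when the target has no entry for it,
        # or when its size or checksum differs from the source.
        if target_listing.get(path) != metadata:
            changed.append(path)
    return changed
```

Every file in the source listing is examined on every run, which is why this method takes longer than a diff-based run as the data set grows.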
Incremental only
In this method, the first replication policy job run is a bootstrap job, and subsequent job runs are incremental jobs.
To perform the incremental job, Replication Manager leverages Ozone snapshots and the snapshot-diff capability to generate a diff report. The diff report contains the changed or new data from the source cluster. The subsequent replication policy job run replicates data based on the diff report.
Incremental with fallback to full file listing
In this method, the first replication policy job run is a bootstrap job, and subsequent job runs are incremental jobs. However, if the snapshot-diff fails during a replication policy job run, the next job run is a full file listing run. After the full file listing run succeeds, the subsequent runs are incremental runs. This method takes a longer time to replicate data if the replication policy job falls back to the full file listing method.
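
The sequence of run types in the fallback method can be traced with a small state machine. This is a sketch of the behavior described above; the function name is hypothetical, and the handling of a failed bootstrap or full file listing run (retried as the same type) is an assumption that the text does not state.

```python
def run_types(outcomes):
    """Return the type of each job run, given whether each run succeeded.

    outcomes holds one boolean per run. For an incremental run,
    False means that the snapshot-diff failed during that run.
    """
    types = []
    next_type = "bootstrap"
    for succeeded in outcomes:
        types.append(next_type)
        if next_type == "incremental":
            # A failed snapshot-diff makes the next run a full file listing run.
            next_type = "incremental" if succeeded else "full file listing"
        elif succeeded:
            # A successful bootstrap or full file listing run resumes incremental runs.
            next_type = "incremental"
        # Assumption: a failed bootstrap or full file listing run is retried as-is.
    return types
```

For example, run_types([True, True, False, True, True]) yields a bootstrap run, two incremental runs, one full file listing run, and then an incremental run again.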