Replication Manager in CDP Private Cloud Base

CDP Private Cloud Base Replication Manager is a service in Cloudera Manager. You can create replication policies in this service to replicate data across data centers for use cases that include disaster recovery, running hybrid workloads, migrating data to or from the cloud, and generic backup and restore. You can also create HDFS, HBase, or Ozone snapshot policies to take snapshots of HDFS directories, HBase tables, or Ozone buckets, respectively.

The Cloudera Manager Admin Console provides the following key functionalities that Replication Manager can leverage:

  • Select datasets that are critical for your business operations.
  • Monitor and track progress of your snapshots and replication jobs through a central console and easily identify issues or files that failed to be transferred.
  • Issue alerts when a snapshot or replication job fails or is aborted so that the problem can be diagnosed quickly.

You can also perform a dry run of the replication policy to verify the configuration and to understand the cost of the overall operation before actually copying the entire dataset.

Replication Manager provides the following functionalities that you can use to accomplish your data replication goals:

Atlas replication policies

These replication policies replicate the metadata and data lineage of all the Hive external tables, Iceberg tables, and any other Atlas-supported entities between CDP Private Cloud Base 7.1.9 SP1 clusters using Cloudera Manager 7.11.3 CHF7 or higher. During an Atlas replication policy run, Replication Manager exports the Atlas metadata and data lineage to a staging directory in the target cluster, and then imports them into the target cluster. You specify the staging directory when you create the replication policy.

Some use cases where you can use Atlas replication policies include:

  • Disaster recovery scenarios. You can back up the Atlas metadata and data lineage periodically, and restore it to the same cluster or a different cluster as required.
  • High availability scenarios.
  • Prevent accidental access to Ranger policies and Atlas metadata for specific Hive external tables and Iceberg tables. You can accomplish this by running Ranger, Hive external table, and Iceberg replication policies on the required tables in the disaster-recovery cluster. The replication policies replicate the data and its associated metadata and access controls.

HDFS replication policies

These policies replicate HDFS data and metadata from CDH (version 5.10 and higher) clusters to CDP Private Cloud Base (version 7.0.3 and higher) clusters.

Some use cases where you can use HDFS replication policies include:

  • copying data from legacy on-premises systems to Amazon S3, Microsoft ADLS Gen2 (ABFS), and GCP, or from cloud buckets to on-premises systems.
  • replicating required data to another cluster to run load-intensive workflows on it, which optimizes the primary cluster performance.
  • deploying a complete backup-restore solution for your enterprise.

Hive external table replication policies

These policies replicate HDFS, Hive external tables (without manual translation of Hive datasets to HDFS datasets, or vice versa), Hive metastore data, Impala metadata (catalog server metadata) associated with Impala tables registered in the Hive metastore, Impala data, and Sentry permissions to Ranger from CDH (version 5.10 and higher) clusters to CDP Private Cloud Base (version 7.0.3 and higher) clusters. In this instance, applications that depend on external table definitions stored in Hive operate on both the replica and the source as the table definitions are updated.

Some use cases where you might find these replication policies useful are to:

  • back up legacy data for future use or archive cold data.
  • replicate or move data to cloud clusters to run analytics.
  • implement a complete backup and disaster recovery solution.

Hive ACID table replication policies

These policies replicate HDFS, Hive managed (ACID) data and metadata between CDP Private Cloud Base (version 7.1.8 and higher) clusters using Cloudera Manager version 7.7.1 or higher.

Some use cases where security-conscious organizations, such as financial organizations, can use these replication policies are to:

  • replicate non-sensitive data to cloud deployments to use as a backup.
  • migrate data to another cluster to run load-intensive workflows.
  • use the failover functionality to make the disaster recovery cluster your primary cluster so that the data ingestion being performed by a replication policy is uninterrupted.

Iceberg replication policies

Iceberg replication policies replicate Iceberg tables between CDP Private Cloud Base 7.1.9 or higher clusters using Cloudera Manager 7.11.3 or higher.

Iceberg replication policies can:

  • replicate metadata and catalog from the source cluster Hive Metastore (HMS) to target cluster HMS.

    The catalog is an HDFS file that has a list of data files and manifest files to copy from the source cluster to the target cluster. The manifest files contain the metadata for the data files.

  • replicate data files in the HDFS storage system from the source cluster to the target cluster. The Iceberg replication policy can replicate only between HDFS storage systems.
  • replicate all the snapshots from the source cluster by default. This allows you to run time travel queries on the target cluster.

Some use cases where you can use Iceberg replication policies are to:

  • implement disaster recovery by replicating Iceberg tables between on-premises clusters.
  • implement passive disaster recovery with planned failover and incremental replication at regular intervals between two similar systems. For example, between one HDFS system and another.

Ozone replication policies

You can create Ozone replication policies to replicate data in Ozone buckets between CDP Private Cloud Base 7.1.8 or higher clusters using Cloudera Manager 7.7.1 or higher.

Ozone replication policies support data replication between:

  • FSO buckets in source and target clusters using the ofs protocol.
  • legacy buckets in source and target clusters using the ofs protocol.
  • OBS buckets in source and target clusters that support S3A filesystem using the S3A scheme or replication protocol.

You can use these policies to replicate or migrate the required Ozone data to another cluster to run load-intensive workloads, to back up data, or to support backup-restore use cases.
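
The bucket-type rules listed for Ozone replication can be summarized as a small lookup. The following is only an illustrative sketch: the names `OZONE_REPLICATION_SCHEMES` and `scheme_for` are hypothetical and are not part of any Replication Manager API. Rejecting mixed bucket types is an assumption made here because the list only describes source and target buckets of the same type.

```python
# Illustrative lookup only; these names are hypothetical, not a Cloudera API.
OZONE_REPLICATION_SCHEMES = {
    "FSO": "ofs",     # FSO buckets replicate over the ofs protocol
    "LEGACY": "ofs",  # legacy buckets also replicate over the ofs protocol
    "OBS": "s3a",     # OBS buckets replicate over the S3A scheme
}


def scheme_for(source_type: str, target_type: str) -> str:
    """Return the scheme for a source/target bucket pair of the same type."""
    src = source_type.upper()
    dst = target_type.upper()
    for bucket_type in (src, dst):
        if bucket_type not in OZONE_REPLICATION_SCHEMES:
            raise ValueError(f"unknown bucket type: {bucket_type}")
    if src != dst:
        # Assumption: only same-type pairs are described, so reject the rest.
        raise ValueError("source and target bucket types differ")
    return OZONE_REPLICATION_SCHEMES[src]
```

For example, `scheme_for("FSO", "FSO")` returns `"ofs"`, and `scheme_for("OBS", "OBS")` returns `"s3a"`.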

Ranger replication policies

The Ranger replication policies migrate the Ranger policies and roles for HDFS, Hive, and HBase services between Kerberos-enabled CDP Private Cloud Base 7.1.9 or higher clusters using Cloudera Manager 7.11.3. They can also migrate Ranger audit logs in HDFS.

Some use cases where you can use Ranger replication policies are:

  • when Ranger is used for file system-level access control for HDFS and Hive and you want to copy the Ranger policies to another cluster for backup purposes.
  • when you want to move/replicate Ranger policies for Hive (SQL) or HBase data to another cluster for disaster recovery purposes.
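
The version requirements stated for the Hive ACID, Iceberg, Ozone, and Ranger replication policies can be collected into a simple pre-check. This is a minimal sketch, not a Cloudera tool: `MINIMUMS`, `parse_version`, and `meets_minimum` are hypothetical names. It assumes purely numeric version strings, so it leaves out Atlas replication policies, whose requirements include service pack and hotfix qualifiers (7.1.9 SP1, 7.11.3 CHF7). It also assumes that the Cloudera Manager 7.11.3 requirement for Ranger policies is a minimum.

```python
from typing import Tuple

# (minimum CDP Private Cloud Base version, minimum Cloudera Manager version),
# taken from the policy descriptions; the structure itself is hypothetical.
MINIMUMS = {
    "hive_acid": ((7, 1, 8), (7, 7, 1)),
    "iceberg": ((7, 1, 9), (7, 11, 3)),
    "ozone": ((7, 1, 8), (7, 7, 1)),
    "ranger": ((7, 1, 9), (7, 11, 3)),  # assumption: 7.11.3 treated as a minimum
}


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '7.11.3' into (7, 11, 3) so that components compare numerically."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(policy: str, cdp_version: str, cm_version: str) -> bool:
    """Check a cluster's versions against the minimums for one policy type."""
    min_cdp, min_cm = MINIMUMS[policy]
    return (parse_version(cdp_version) >= min_cdp
            and parse_version(cm_version) >= min_cm)
```

Comparing integer tuples rather than raw strings matters here: as strings, `"7.11.3"` sorts before `"7.7.1"`, which would wrongly reject a newer Cloudera Manager release.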

HDFS, HBase, and Ozone snapshot policies

The HDFS, HBase, or Ozone snapshot policies take regular point-in-time snapshots of HDFS directories, HBase tables, or Ozone buckets, respectively.

Snapshots act as a backup, and you can restore an HDFS directory, HBase table, or Ozone bucket to a previous version or to another location on the same HDFS, HBase, or Ozone service as necessary. Snapshots are also used by HDFS, Hive, and Ozone replication policies. The first replication policy run replicates all the data and metadata from the chosen directories. Subsequent replication policy runs use snapshot diffs to replicate only the changed data.
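
The difference between the first run and later runs can be illustrated with a toy model. This is a conceptual sketch only, not how Replication Manager is implemented: a snapshot is modeled as a hypothetical dictionary that maps a path to a content fingerprint, and `snapshot_diff` and `replicate` are made-up names. Propagating deletions to the target is also an assumption of this sketch.

```python
from typing import Dict, List, Optional

Snapshot = Dict[str, str]  # hypothetical model: path -> content fingerprint


def snapshot_diff(old: Snapshot, new: Snapshot) -> Dict[str, List[str]]:
    """List the paths created, modified, or deleted between two snapshots."""
    return {
        "created": sorted(p for p in new if p not in old),
        "modified": sorted(p for p in new if p in old and new[p] != old[p]),
        "deleted": sorted(p for p in old if p not in new),
    }


def replicate(source: Snapshot, target: Snapshot,
              previous: Optional[Snapshot]) -> int:
    """Update target from source and return the number of paths copied."""
    if previous is None:
        # First run: no earlier snapshot exists, so copy everything.
        target.update(source)
        return len(source)
    # Later runs: copy only what changed since the previous snapshot.
    diff = snapshot_diff(previous, source)
    changed = diff["created"] + diff["modified"]
    for path in changed:
        target[path] = source[path]
    for path in diff["deleted"]:
        target.pop(path, None)  # assumption: deletions are propagated
    return len(changed)


snap1 = {"/data/a": "v1", "/data/b": "v1"}
snap2 = {"/data/a": "v2", "/data/c": "v1"}
target: Snapshot = {}
first_copied = replicate(snap1, target, previous=None)
second_copied = replicate(snap2, target, previous=snap1)
```

In this example, the first run copies both paths, and the second run copies only `/data/a` (modified) and `/data/c` (created) while removing `/data/b` from the target.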