CDP has multiple deployment methodologies in both public cloud and on-premises. These form factors have multiple, interrelated mechanisms for storing data at various organizational levels within the cluster. Each layer has specific requirements related to how and whether data should be replicated. These layers include local file storage on individual nodes, HDFS, and Ozone.
HDFS must be configured for high availability (HA). Three components that enable high availability of data on HDFS are Namenode, Journalnode, and Datanode services. Multiple cluster leader nodes run both Namenode and JournalNode services to maintain the filesystem metadata. Datanode services run on each worker node, maintaining the individual data blocks and providing data availability through block replication.
For information about enabling high availability on HDFS, see Enabling HA on HDFS with Cloudera Manager.
HDFS utilizes the DistCp tool to replicate data between clusters. CDP utilizes Replication Manager within Cloudera Manager to manage the replication policies for a given directory structure in HDFS. Replication Manager utilizes a version of the DistCp tool specific for Cloudera, to integrate with CDP cluster management. Replication policies are triggered on a periodic basis, providing an asynchronous replication path for data on HDFS. HDFS supports replicating encrypted data safely across clusters. HDFS Transparent Data Encryption must be enabled for encrypting data at rest.
For information about utilizing Replication Manager for HDFS data replication, see HDFS Replication
Ozone must be configured for high availability. Three components that enable high availability for data on Ozone are Ozone Manager, Storage Container Manager, and Datanode services. The Ozone Manager and Storage Container Manager run on multiple leader nodes to maintain availability of the object store metadata and background operational tasks. Datanode services run on each Ozone worker node, maintaining individual data blocks and providing data availability through replication.
Backup can be achieved by using DistCp to replicate data between separate clusters. Currently, this is facilitated through direct use of the DistCp tool and manually scheduling these through an orchestration tool like Oozie.
Local file system
You may have local file system synchronization requirements between the two disaster recovery clusters. These areas are outside the scope of this document and CDP replication capabilities do not cover this use case.