Preparing to create an HBase replication policy
Before you create HBase replication policies in CDP Public Cloud Replication Manager, you must prepare the clusters, register cloud storage in Replication Manager, and verify cluster access.
-
Do the source cluster and target cluster meet the requirements to create an
HBase replication policy?
The following image shows a high-level view of the support matrix for HBase replication policies. For the complete list of supported clusters and scenarios, consult the Support matrix for Replication Manager on CDP Public Cloud:
-
Is the source CDH cluster or source CDP Private Cloud Base cluster registered
as a classic cluster on the Management Console?
CDH clusters and CDP Private Cloud Base clusters are managed by Cloudera Manager. To enable these on-premises clusters for Replication Manager, you must register them as classic clusters on the Management Console. After registration, you can use them for data migration purposes. For information about registering an on-premises cluster as a classic cluster, see Adding a CDH cluster and Adding a CDP Private Cloud Base cluster.
-
Are the following steps complete on the source CDP Private Cloud Base cluster
or source CDH cluster (these steps are not required for COD sources)?
- Complete Step 1 of Migrating HBase data from CDH/HDP to
COD CDP Public Cloud to install the HBase replication plugin
parcel in the CDH source clusters.
This step is applicable for CDH versions 7.2.x that are lower than 7.2.2, versions 7.1.x that are lower than 7.1.5, and for versions lower than 7.x.
- Create the /user/hbase folder for the
hbase user in HDFS in the source cluster
using the following
commands:
sudo -u hdfs hdfs dfs -mkdir /user/hbase
sudo -u hdfs hdfs dfs -chown hbase:hbase /user/hbase
These commands allow the HBase replication policy to replicate the existing data in the source cluster.
This step applies to Cloudera Manager versions 7.4.3 or lower, and to Cloudera Manager 7.4.4 only if the API version is lower than v45. The endpoint
http://[***CLOUDERA MANAGER HOST***]:[***CLOUDERA MANAGER PORT***]/api/version
shows the API version of Cloudera Manager.
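The version conditions for the /user/hbase step can be sketched as a small bash check. This is only an illustration: the function names `version_lt` and `needs_hbase_home_dir` are hypothetical, and the sketch assumes plain dotted numeric versions (for example 7.4.4) and an API version string of the form v45, as returned by the /api/version endpoint.

```shell
# Returns 0 (true) if dotted numeric version $1 is strictly lower than $2.
version_lt() {
  local IFS=.
  local -a a=($1) b=($2)
  local i x y
  for ((i = 0; i < ${#a[@]} || i < ${#b[@]}; i++)); do
    x=${a[i]:-0}   # missing components count as 0
    y=${b[i]:-0}
    if ((10#$x < 10#$y)); then return 0; fi
    if ((10#$x > 10#$y)); then return 1; fi
  done
  return 1         # versions are equal
}

# Hypothetical helper. Returns 0 if the /user/hbase step applies.
# $1: Cloudera Manager version, for example 7.4.4
# $2: API version string, for example v45
needs_hbase_home_dir() {
  local cm_version=$1 api=${2#v}
  # Cloudera Manager 7.4.3 or lower
  if version_lt "$cm_version" 7.4.4; then
    return 0
  fi
  # Cloudera Manager 7.4.4, only when the API version is lower than v45
  if [[ $cm_version == 7.4.4 ]] && ((api < 45)); then
    return 0
  fi
  return 1
}

# Example use (CM_USER, CM_PASS, CM_HOST, CM_PORT are assumed variables):
#   api=$(curl -s -u "$CM_USER:$CM_PASS" "http://$CM_HOST:$CM_PORT/api/version")
#   if needs_hbase_home_dir 7.4.4 "$api"; then
#     sudo -u hdfs hdfs dfs -mkdir /user/hbase
#     sudo -u hdfs hdfs dfs -chown hbase:hbase /user/hbase
#   fi
```

The check only encodes the two conditions stated above. Whether your Cloudera Manager instance requires authentication for the endpoint depends on your deployment.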
-
Is the required target cluster (Data Hub or COD) available and healthy?
-
Do you have the necessary permission to run the HBase replication jobs on
YARN?
To verify whether you have the necessary permission, perform the following steps:
- Go to the source Cloudera Manager > YARN service > Configuration tab.
- Ensure one of the following conditions is met:
- The Allowed System Users property must
have hbase. Otherwise, add
hbase to the existing property value.
This property lists the users permitted (or allowed) to run containers. Note that the users with IDs lower than the Minimum User ID property might be permitted (or allowed) to run containers.
- The Minimum User ID property is set to a
value that is lower than the HBase user's ID.
To check the HBase user's ID, SSH into a cluster node and run the id hbase command. The following sample snippet shows the HBase user's ID when you run the id hbase command:
# id hbase
uid=39993(hbase) gid=39993(hbase) groups=39993(hbase)
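The two conditions above can be expressed as a short bash sketch. The function names `uid_from_id_output` and `hbase_can_run_containers` are hypothetical, and the sketch assumes you copy the Minimum User ID and Allowed System Users values from the YARN Configuration tab by hand.

```shell
# Extracts the numeric uid from `id <user>` output such as
# "uid=39993(hbase) gid=39993(hbase) groups=39993(hbase)".
uid_from_id_output() {
  local out=$1
  local re='uid=([0-9]+)'
  if [[ $out =~ $re ]]; then
    printf '%s\n' "${BASH_REMATCH[1]}"
  else
    return 1
  fi
}

# Hypothetical helper. Returns 0 if at least one condition is met.
# $1: numeric uid of the hbase user
# $2: value of the Minimum User ID property
# $3: value of the Allowed System Users property (comma-separated)
hbase_can_run_containers() {
  local uid=$1 min_uid=$2 allowed=$3
  local IFS=, user
  # Condition 1: hbase is listed in Allowed System Users
  for user in $allowed; do
    user=${user//[[:space:]]/}
    [[ $user == hbase ]] && return 0
  done
  # Condition 2: Minimum User ID is lower than the hbase uid
  ((min_uid < uid))
}

# Example use on a cluster node (1000 and "nobody,impala" are sample values):
#   uid=$(uid_from_id_output "$(id hbase)")
#   hbase_can_run_containers "$uid" 1000 "nobody,impala" && echo "OK"
```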
-
Is the required cloud credential that you want to use in the replication policy
registered with the Replication Manager service?
For more information, see Working with cloud credentials. You can also add the following advanced configuration settings to use Google Cloud, Amazon S3, and ADLS accounts in Replication Manager:
-
Go to the source Cloudera Manager > Clusters > HDFS service > Configuration tab.
-
Locate the HDFS Client Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml property.
- Add the following key-value pairs to register a Google account to use in Replication Manager:
- fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
- fs.gs.project.id=[***ENTER THE PROJECT ID***]
- fs.gs.system.bucket=[***ENTER THE BUCKET NAME***]
- fs.gs.working.dir=/
- fs.gs.auth.service.account.enable=true
- fs.gs.auth.service.account.email=[***ENTER THE SERVICE PRINCIPAL EMAIL ID***]
- fs.gs.auth.service.account.keyfile=/[***ENTER THE LOCAL PATH OF THE P12 FILE***]
- Add the following key-value pairs to register an S3 account to use in
Replication Manager:
- fs.s3a.access.key=[***ENTER THE SESSION ACCESS KEY***]
- fs.s3a.secret.key=[***ENTER THE SESSION SECRET KEY***]
- Add the following key-value pairs to register an ADLS account to use in
Replication Manager:
- fs.azure.account.oauth2.client.id=[***ENTER THE ABFS STORAGE CLIENT ID***]
- fs.azure.account.oauth2.client.secret=[***ENTER THE ABFS STORAGE CLIENT SECRET KEY***]
-
Save and restart the HDFS service for the changes to take effect.
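If you prefer to enter the settings in XML form, the key-value pairs above map onto standard Hadoop `<property>` elements. The following bash sketch converts key=value lines into that form. The function names `xml_escape` and `render_properties` are hypothetical, and the sketch assumes that you paste the result into the XML view of the safety valve; verify that your Cloudera Manager version offers this view.

```shell
# Escapes the characters that are special in XML text content.
xml_escape() {
  printf '%s' "$1" | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

# Hypothetical helper. Reads key=value lines on stdin and prints one
# Hadoop <property> element per line. Blank lines, lines that start
# with #, and lines without "=" are skipped.
render_properties() {
  local line name value
  while IFS= read -r line || [[ -n $line ]]; do
    [[ -z $line || $line == \#* ]] && continue
    [[ $line != *=* ]] && continue
    name=${line%%=*}    # text before the first "="
    value=${line#*=}    # text after the first "="
    printf '<property>\n  <name>%s</name>\n  <value>%s</value>\n</property>\n' \
      "$(xml_escape "$name")" "$(xml_escape "$value")"
  done
}

# Example use with the S3 keys (replace the placeholders with real values):
#   render_properties <<'EOF'
#   fs.s3a.access.key=[***ENTER THE SESSION ACCESS KEY***]
#   fs.s3a.secret.key=[***ENTER THE SESSION SECRET KEY***]
#   EOF
```

Splitting on the first "=" keeps values that themselves contain "=" intact, which can matter for secret keys.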
-
- Do you have the required cluster access to create or view replication policies?
-
When you use COD on Microsoft Azure, have you assigned the Storage Blob
Data Owner or Storage Blob Data Contributor role to the managed identity
of the source cluster on the destination storage data container, and vice
versa for bidirectional replication?
The roles allow writing a snapshot in the destination cluster container.
-
Does DNS resolution work as expected between the source and destination
clusters?
- Is the outgoing SSH port open on the Cloudera Manager host?
- Do you need to replicate data securely? If so, ensure that the SSL/TLS certificate exchange between two Cloudera Manager instances that manage source and target clusters respectively is configured. For more information, see Configuring SSL/TLS certificate exchange between two Cloudera Manager instances.
-
Are the following ports open and available for Replication Manager?
Table 1. Minimum ports required for HBase replication policies

| Ports | Service | Description |
|---|---|---|
| 2181 and 16020 | Destination hosts of the AWS cluster or ADLS cluster (target cluster), and the Cloudera Manager server port on the source cluster | Verify whether the ports 16020 for worker security group and 2181 for worker, master, and leader groups are open for connection from the source cluster to the destination cluster on AWS or Azure. This ensures that the source HBase service can communicate with Zookeeper and HBase services on the destination hosts uninterruptedly. For more information, see Ports for HBase replication. |
| 16000 | HMaster | Open the port on the Master Nodes (HBase Master Node and any back-up HBase Master node). Before you select the Validate Replication option during the first HBase replication policy creation between two specific clusters, you must ensure that the port is open on the target cluster. |
| 7180 or 7183 | Cloudera Manager Admin Console HTTP | Open on the source cluster to enable Data lake Cloudera Manager to communicate to the on-premises Cloudera Manager. Connects to destination SDX Data Lake Cloudera Manager. |
| 9000 | Cloudera Manager Agent | Open on the source and target cluster to retrieve diagnostic and log information. |
| 6000-6049 | Cluster Connectivity Manager (CCM) | Required for SSL connections to the Control Plane via CCM to communicate with Replication Manager. |
| 80 or 443 | Data transfer from secondary node for AWS / ADLS Gen2 | Outgoing port. Open on all the HDFS nodes for AWS and ADLS Gen2. |
| 8443 | Data Lake cluster | Outgoing port. Configure the port on the Data Lake cluster as the outgoing port for CDP Management Console to communicate with Cloudera Manager and Knox. |
| 8032 | YARN Resource Manager | Open on the source and target cluster to access the YARN ResourceManager. |
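To spot-check some of the ports in Table 1 from a source cluster host, a minimal bash sketch such as the following might help. The function names `port_open` and `check_ports` are hypothetical. The sketch assumes a bash build with /dev/tcp support and the coreutils `timeout` command, and it only tests whether a TCP connection can be opened; it does not validate the service behind the port.

```shell
# Returns 0 if a TCP connection to host:port succeeds within the timeout.
# $1: host, $2: port, $3: timeout in seconds (optional, default 3)
port_open() {
  local host=$1 port=$2 wait_s=${3:-3}
  timeout "$wait_s" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Hypothetical helper. Checks several ports on one host, prints one
# line per port, and returns 1 if any port is unreachable.
check_ports() {
  local host=$1 port status=0
  shift
  for port in "$@"; do
    if port_open "$host" "$port"; then
      printf '%s:%s open\n' "$host" "$port"
    else
      printf '%s:%s unreachable\n' "$host" "$port"
      status=1
    fi
  done
  return "$status"
}

# Example use (TARGET_HOST is an assumed variable naming a destination host):
#   check_ports "$TARGET_HOST" 2181 16020 16000
```

A port reported as unreachable can mean a closed security group, a firewall rule, or a service that is not running, so treat the result as a starting point for investigation.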
Consider the following best practices while using CDP Public Cloud on Microsoft Azure ADLS Gen2 (ABFS):
- Ensure that the on-premises cluster (port 443) can access the https://login.microsoftonline.com endpoint. This is because the Hadoop client in the on-premises cluster (CDH/CDP Private Cloud Base) connects to the endpoint to acquire the access tokens before it connects to Azure ADLS storage. For more information, see the General Azure guidelines row in the Azure-specific endpoints table.
- Ensure that the steps mentioned in the General Azure guidelines and Azure Data Lake Storage Gen 2 rows in the Azure-specific endpoints table are complete so that the endpoint connects to the target path successfully.
- (Optional) Complete the steps mentioned in the Optimize HBase replication policy performance when replicating HBase tables with several TB data FAQ if you choose Perform Initial Snapshot during HBase replication policy creation to replicate HBase tables with several TB data.