Troubleshooting replication policies in CDP Public Cloud

The troubleshooting scenarios in this topic help you to troubleshoot issues in the Replication Manager service in CDP Public Cloud.

Different methods to identify errors related to failed replication policy

What are the different methods to identify errors while troubleshooting a failed replication policy?

You can choose one of the following methods to identify the errors to troubleshoot a job failure:
  • On the Replication Policies page, click the failed job in the Job History pane. The errors for the failed job appear.
    The following sample image shows the Job History pane for a replication policy job:
  • In the source and target Cloudera Manager, click Running Commands on the left navigation bar. The recent command history shows the failed commands.
    The following sample image shows the Running Commands page for an HBase replication policy:
  • On the source cluster and target cluster, open the service logs to track the errors (For example, HBase service logs).

    You can also search on the Cloudera Manager > Diagnostics > Logs page to view the logs.

HDFS replication policy fails due to export HTTPS_PROXY environment variable

HDFS replication policies fail when the export HTTPS_PROXY environment variable is set to access AWS through proxy servers. How to resolve this issue?

Remedy

To resolve this issue, perform the following steps:
  1. Open the core-site.xml file on the source cluster.
  2. Enter the following properties in the file:
    <property>
      <name>fs.s3a.proxy.host</name>
      <description>Hostname of the (optional) proxy server for S3 connections.</description>
    </property>
    
    <property>
      <name>fs.s3a.proxy.port</name>
      <description>Proxy server port. If this property is not set
        but fs.s3a.proxy.host is, port 80 or 443 is assumed (consistent with
        the value of fs.s3a.connection.ssl.enabled).</description>
    </property>
    
  3. Save and close the file.
  4. Restart the source Cloudera Manager.
  5. Run the failed HDFS replication policies in Replication Manager.
    Replication Manager completes the replication successfully.

Cannot find destination clusters for HBase replication policies

When you ping destination clusters using their host names, the source cluster hosts for HBase replication policies do not find the destination clusters. How to resolve this issue?

Cause

This might occur for on-premises clusters such as CDP Private Cloud Base clusters or CDH clusters because the source clusters are not on the same network as the destination Data Hub. Therefore, hostnames cannot be resolved by the DNS service on the source cluster.

Remedy

Add the destination Region Server and Zookeeper IP to host name mappings in the /etc/hosts files of all the Region Servers on the source cluster.
The following snippet shows the contents in a sample /etc/hosts file:
10.115.74.181 dx-7548-worker2.dx-hbas.x2-8y.dev.dr.work
10.115.72.28 dx-7548-worker1.dx-hbas.x2-8y.dev.dr.work
10.115.73.231 dx-7548-worker0.dx-hbas.x2-8y.dev.dr.work
10.115.72.20 dx-7548-master1.dx-hbas.x2-8y.dev.dr.work
10.115.74.156 dx-7548-master0.dx-hbas.x2-8y.dev.dr.work
10.115.72.70 dx-7548-leader0.dx-hbas.x2-8y.dev.dr.work

HBase replication policy fails when Perform Initial Snapshot is chosen

An HBase replication policy fails for COD on Microsoft Azure when the "Perform Initial Snapshot" option is chosen but data replication is successful when the option is not chosen. How to resolve this issue?

Cause

This issue appears when the required managed identity of source roles are not assigned.

Remedy

Assign the managed identity of source roles, Storage Blob Data Owner or Storage Blob Data Contributor, to the destination storage data container and vice versa for bidirectional replication.
The roles allow writing a snapshot in the destination cluster container.

Optimize HBase replication policy performance when replicating HBase tables with several TB data

Can HBase replication policy performance be optimized when replicating HBase tables with several TB of data if the "Perform Initial Snapshot" option is chosen during HBase replication policy creation?

Complete the following manual steps to optimize HBase replication policy performance when replicating several TB of HBase data if you choose the Perform Initial Snapshot option during HBase replication policy creation.

Remedy

  1. Before you create the HBase replication policy, perform the following steps:
    1. Navigate to the source Cloudera Manager > YARN service > Configuration tab.
    2. Search for the mapreduce.task.timeout parameter.
    3. Increase the value or set it to 0 to switch off the timeout.
    4. Restart the YARN service.
    5. Navigate to the source Cloudera Manager > HBase service > Configuration tab.
    6. Search and configure the following key-value pairs:
      • hbase.snapshot.master.timeout.millis = 840000
      • hbase.client.sync.wait.timeout.msec = 180000

      • hbase.client.operation.timeout = 2400000

      • hbase.client.procedure.future.get.timeout.msec = 3000000

      • hbase.hfilearchiver.thread.pool.max=100

      • hbase.snapshot.thread.pool.max=24

    7. Restart the HBase service.
    8. Perform steps e through g on the target Cloudera Manager.
  2. When you create the HBase replication policy for the first time using the above configured source cluster, you must increase the Maximum Map Slots value to a higher number on the Advanced Settings page.
  3. If Store File Tracking (SFT) is enabled in the target COD, perform the steps mentioned in the COD migration topic after the replication policy creation is complete.

Partition metadata replication takes a long time to complete

How can partition metadata replication be improved when the Hive tables use several Hive partitions?

Hive metadata replication process takes a long time to complete when the Hive tables use several Hive partitions. This is because the Hive partition parameters are compared during the import stage of the partition metadata replication process and if the exported and existing partition parameters do not match, the partition is dropped and recreated. You can configure a key-value pair to support partition metadata replication.

  1. Go to the Cloudera Manager > Clusters > Hive service > Configuration tab.
  2. Search for the Hive Replication Environment Advanced Configuration Snippet (Safety Valve) property.
  3. Enter the HIVE_IGNORED_PARTITION_PARAMETERS=[***comma separated list of Hive partition parameters***] key-value pair.
    For example,
    HIVE_IGNORED_PARTITION_PARAMETERS=transient_lastDdlTime,totalSize,numRows,COLUMN_STATS_ACCURATE,numFiles

    The partition parameter names you provide are not compared during the import stage of the partition metadata replication process. Therefore, even if the partition parameters do not match between the exported and existing partitions, the partition is not dropped or recreated. After you configure this key-value pair, the import stage of the partition metadata replication process completes faster.

  4. Save the changes, and restart the Hive service.