Migrating data from CDH or HDP to CDP Public Cloud

Before you migrate your data to a CDP Public Cloud deployment, you must download certain tools and scripts required for the data migration. Your CDH or HDP cluster is the source cluster, and your CDP Data Hub cluster or Cloudera Operational Database (COD) experience is the destination cluster deployed on the public cloud.

You must install the following tools on your source cluster before you start the data migration process.

  • Replication plugin obtained from Cloudera
  • Cloudera DataPlane Platform (DP)
  • AWS Command Line Interface (AWS CLI)
  • Python 3
  • jq JSON processor
  1. Set up the DataPlane Platform on the source cluster by running the dp configure command in the command-line interface of the source cluster:
    dp configure --server https://console.cdp.cloudera.com/ --apikeyid CDP_KEY_ID --privatekey CDP_PRIVATE_KEY
    
  2. Run this command in the command-line interface of the source cluster to set up an AWS CLI profile. The profile must have write access to the AWS environment.
    aws configure --profile hbase-rep

    You must provide the AWS keys when the command prompts for credentials. The AWS credentials specified here must have write access, because the install script needs to change security group definitions. For additional security, you can create a new user in the AWS Management Console, add it to an admin group, create new credentials for this user, and download these credentials to use only for this migration.
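
    The command prompts for four values; a typical session looks like the following. The access key, secret key, and region shown here are placeholders (AWS's documented example values), not real credentials.

    ```
    aws configure --profile hbase-rep
    AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
    AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
    Default region name [None]: us-west-2
    Default output format [None]: json
    ```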

  3. On the source and destination HBase clusters, configure the plugin path in the HBase Client Environment Advanced Configuration Snippet (Safety Valve) for hbase-env.sh for the Gateway role, and in the RegionServer Environment Advanced Configuration Snippet (Safety Valve) for the RegionServer role.

    For the source cluster running CDH, provide the following plugin path:

    HBASE_CLASSPATH=$HBASE_CLASSPATH:/opt/cloudera/parcels/cloudera-opdb-replication-[***REPLICATION PLUGIN VERSION***]-cdh[***CDH VERSION***]-SNAPSHOT/lib/*
    

    For example, HBASE_CLASSPATH=$HBASE_CLASSPATH:/opt/cloudera/parcels/cloudera-opdb-replication-1.0-cdh5.14.4-SNAPSHOT/lib/*

    For the source cluster running HDP, find the following line in Ambari UI > CONFIGS > ADVANCED > Advanced hbase-env > hbase-env template:
    export HBASE_CLASSPATH=${HBASE_CLASSPATH}
    Modify this line to have the following content:
    export HBASE_CLASSPATH=${HBASE_CLASSPATH}:/usr/hdp/cloudera-opdb-replication-[***REPLICATION PLUGIN VERSION***]-hdp[***version***]-SNAPSHOT/lib/*

    For example, export HBASE_CLASSPATH=${HBASE_CLASSPATH}:/usr/hdp/cloudera-opdb-replication-1.0-hdp2.6.5-SNAPSHOT/lib/*

    For the destination Data Hub cluster or COD instance, provide the following plugin path:

    HBASE_CLASSPATH=/opt/cloudera/parcels/cloudera-opdb-replication-[***REPLICATION PLUGIN VERSION***]-SNAPSHOT/lib/*
    
  4. On the destination Data Hub cluster, add the following property values to the HBase Client Advanced Configuration Snippet (Safety Valve) for hbase-site.xml.
    <property>
       <name>hbase.security.external.authenticator</name>
       <value>com.cloudera.hbase.security.pam.PamAuthenticator</value>
     </property>
    
    <property>
       <name>hbase.security.replication.credential.provider.path</name>
       <value>cdprepjceks://hdfs@ns1/hbase-replication/credentials.jceks</value>
    </property>
    
    <property>
        <name>hbase.client.sasl.provider.extras</name>
        <value>com.cloudera.hbase.security.provider.CldrPlainSaslClientAuthenticationProvider</value>
    </property>
    
    <property>
        <name>hbase.client.sasl.provider.class</name>
        <value>com.cloudera.hbase.security.provider.CldrPlainSaslAuthenticationProviderSelector</value>
    </property>
    
    <property>
        <name>hbase.server.sasl.provider.extras</name>
        <value>com.cloudera.hbase.security.provider.CldrPlainSaslServerAuthenticationProvider</value>
    </property>
  5. Restart the RegionServers and add the client configurations on both source and destination clusters.
  6. Run the install script located at $PATH/cloudera-opdb-replication-[***REPLICATION PLUGIN VERSION***]-cdh[***CDH VERSION***]-SNAPSHOT/bin, where $PATH is the location of the extracted tarball files.
    You need the following information to successfully run the install script:
    • The replication plugin obtained from Cloudera

    • A user with root access to the source cluster hosts

    • A text file containing the IP addresses of each RegionServer host on the source cluster

    • Your destination Data Hub cluster name or your COD instance name

    • The system user name and password to be created on your Data Hub cluster or COD instance

    • Your Customizable Single Sign-On (CSSO) user credentials to log into the destination Data Hub or COD hosts

    • Your SSH credentials file for the Cloudbreak user
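
    The RegionServer host list is a plain-text file with one IP address per line. A minimal sketch of how to create and sanity-check it (the file name and addresses below are examples; substitute your own hosts):

    ```shell
    # Write one RegionServer IP address per line.
    cat > regionserver-hosts.txt <<'EOF'
    10.0.0.11
    10.0.0.12
    10.0.0.13
    EOF

    # Sanity check: every RegionServer should appear exactly once.
    sort regionserver-hosts.txt | uniq -d   # prints nothing if there are no duplicates
    wc -l < regionserver-hosts.txt          # should equal the number of RegionServers
    ```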

  7. Take the snapshot in Cloudera Manager.
    1. Select the HBase service.
    2. Click the Table Browser tab.
    3. Click a table.
    4. Click Take Snapshot.
    5. Specify the name of the snapshot, and click Take Snapshot.
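
    If you prefer the HBase shell over the Cloudera Manager UI, the same snapshot can be taken with the standard snapshot command; the table and snapshot names below are examples:

    ```
    hbase shell
    snapshot 'my_table', 'my_table_snapshot_1'
    list_snapshots
    ```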
  8. Export the snapshot using the ExportSnapshot tool.
    The ExportSnapshot tool executes a MapReduce job, similar to distcp, to copy files to the destination cluster. It works at the file-system level, so the HBase cluster can be offline. You must run the ExportSnapshot command as the hbase user or the user that owns the files.

    Run this command on the command line of the source cluster to export a snapshot from the source to the destination cluster:
    hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot <snapshot name> -copy-to hdfs://destination:hdfs_port/hbase -mappers 16
    Here, hdfs://destination:hdfs_port/hbase points to the destination CDP Public Cloud cluster. Replace the HDFS server path and port with the ones used by your cluster.
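
    You can optionally confirm that the snapshot copied over intact before proceeding. The SnapshotInfo tool, which ships with HBase, reports the snapshot's table and file references; run it on the destination cluster (keep the same snapshot name you exported):

    ```
    hbase org.apache.hadoop.hbase.snapshot.SnapshotInfo -snapshot <snapshot name> -files
    ```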
  9. Enable the peer on the source cluster using the enable_peer("<peerID>") command.
    Run this command in the HBase Shell on the source cluster to enable the peer:
    enable_peer("ID1")
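
    You can confirm that the peer is active from the HBase Shell on the source cluster; both of the following are standard HBase Shell commands:

    ```
    list_peers
    status 'replication'
    ```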
  10. Use the HashTable/SyncTable tool to ensure that data is synchronized between your source and destination clusters.

    Run the HashTable command on the source cluster and the SyncTable command on the destination cluster to synchronize the table data between your source and destination clusters.

    On the source cluster:
    HashTable [options] <tablename> <outputpath>

    On the destination cluster:
    SyncTable [options] <sourcehashdir> <sourcetable> <targettable>
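
    A concrete invocation might look like the following; the table name, output path, and NameNode address are placeholders. The --dryrun option makes SyncTable report differences without modifying the target table:

    ```
    # On the source cluster: compute hashes of my_table into an HDFS directory.
    hbase org.apache.hadoop.hbase.mapreduce.HashTable my_table /tmp/hashes/my_table

    # On the destination cluster: compare against the source hashes.
    # --dryrun=true only reports divergent ranges; drop it to repair them.
    hbase org.apache.hadoop.hbase.mapreduce.SyncTable --dryrun=true \
      hdfs://source-namenode:8020/tmp/hashes/my_table my_table my_table
    ```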

    For more information and examples for using HashTable and SyncTable, see Use HashTable and SyncTable tool.