Migrating data from CDH to CDP Public Cloud

Follow these steps to migrate your data and metadata from a CDH cluster to a CDP Public Cloud cluster.

Registering source and target clusters for data migration

Register your source and target clusters for migration.

  1. Click New Migration on the left navigation pane.
  2. Select Cloudera Distributed Hadoop 6 from the drop-down menu.
  3. Register the source CDH cluster by providing the Cloudera Manager URL of the CDH cluster, the Cloudera Manager admin user and Cloudera Manager admin password.
  4. Click Connect. If the connection is successful, click Next.
  5. Select Migrating Data to CDP Public Cloud. Click Next.

    A page for connecting to the CDP Control Plane is displayed.

  6. Enter the Control Plane URL, admin user and password. Click Connect.

    A green checkmark appears upon a successful connection to the Control Plane.

  7. Select the target cluster, and click Scan.
  8. Select one of the radio buttons to choose how to connect to the source cluster nodes: enter the SSH user and port, or click Choose File to upload the SSH key.
  9. Provide the S3 bucket access key, S3 bucket secret key, and the S3 bucket base paths for HDFS files and Hive external tables.
  10. Click Next.
    The Overview page appears.
  11. Click Create.
    The Migration Creation status page appears. When all of the migration creation steps are complete, click the Go To Migrations link. In-progress migrations appear as blue tiles on the Migrations page.
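
The registration wizard asks for a fair number of values, so it can help to collect and sanity-check them before you start. The following Python sketch is only an illustration and is not part of CMA: the RegistrationInputs class, its field names, and the find_problems function are hypothetical, and the checks (non-empty values, http or https URLs, a valid port range) are assumptions about what a reasonable pre-check might look like.

```python
from __future__ import annotations

from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class RegistrationInputs:
    """Hypothetical container for the values requested by the wizard."""
    cm_url: str              # Cloudera Manager URL of the source CDH cluster
    cm_user: str             # Cloudera Manager admin user
    cm_password: str         # Cloudera Manager admin password
    control_plane_url: str   # CDP Control Plane URL
    ssh_user: str            # SSH user for the source cluster nodes
    ssh_port: int            # SSH port for the source cluster nodes
    s3_access_key: str       # S3 bucket access key
    s3_secret_key: str       # S3 bucket secret key
    s3_hdfs_base_path: str   # S3 base path for HDFS files
    s3_hive_base_path: str   # S3 base path for Hive external tables


def find_problems(inputs: RegistrationInputs) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    for name, value in vars(inputs).items():
        # Report every string field that was left blank.
        if isinstance(value, str) and not value.strip():
            problems.append(f"{name} is empty")
    for name in ("cm_url", "control_plane_url"):
        value = getattr(inputs, name)
        if not value.strip():
            continue  # already reported as empty
        parsed = urlparse(value)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append(f"{name} is not a valid http(s) URL")
    if not 1 <= inputs.ssh_port <= 65535:
        problems.append("ssh_port is out of range")
    return problems
```

An empty result from find_problems only means that the values look well-formed. The Connect buttons in the wizard remain the real test of whether the URLs and credentials work.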

Performing the data migration

This section helps you understand the steps required to perform the data migration and how to approach the migration workflow.

Running the initial cluster data scan

The Assessment stage includes an initial cluster data scan that must complete successfully before you can proceed with the migration.

The initial cluster data scan collects Hive and HDFS data, then runs a tool to detect any problems with the Hive tables.

Click the Play button at the bottom of Initial Cluster Data Scan in the Assessment box to start the cluster scan. When the scan completes, a green checkmark appears next to the scan link and you can view the validation results in the Output tab.

Working with the master table

The master table displays the data sets found by the source cluster scan and lets you manage them. In the master table, you can review potential issues with the Hive tables, and create and assign labels that help you migrate related data sets together.

Reviewing the data sets

In the Datasets tab of the Master Table page, you can review all of the Hive tables identified in the initial cluster scan, as well as the associated HDFS locations.

In the SRE column, you can review any potential issues that may arise with a certain table during the migration.

You can click the link in the SRE column to get warnings and recommendations for actions to take before you proceed with the migration. These results are produced by the Hive-SRE tool. For more information, see the Git repository.

Using labels

In the Labels tab of the Master Table page, you can create and assign labels that allow you to classify related data sets and migrate them together, creating multiple replication policies. There is no limit to the number of labels that you can create. These labels belong to the master table and can be reused in subsequent migrations from the same source.

  1. In the Master Table, click the Labels tab at the top left and then the Create button.
  2. Give the label a name and select a color for the label. Then click Save.

    The new label appears within the Master Table.

  3. To assign the label to a table, return to the Datasets tab and click Assign.
  4. Type the name of the label in the Label name field, then select the checkbox next to any tables that you want to assign the label to.
  5. Click the Save icon. The labels appear next to the selected tables in the master table.

    You can repeat this process for HDFS locations as well, and assign as many labels as you would like.
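
Because a data set can carry several labels and a label can cover both Hive tables and HDFS locations, it can be useful to think of labeling as a many-to-many mapping. The following Python sketch models that idea in memory so that you can spot data sets that have no label yet. It is purely illustrative: the function names and the sample data set names are hypothetical, and nothing here calls CMA.

```python
def assign_label(assignments, label, datasets):
    """Add the given data sets to a label (label name -> set of data sets)."""
    assignments.setdefault(label, set()).update(datasets)


def unlabeled(assignments, all_datasets):
    """Return the data sets that have not been given any label, sorted."""
    labeled = set()
    for members in assignments.values():
        labeled |= members
    return sorted(set(all_datasets) - labeled)


# Hypothetical data set names, for illustration only.
assignments = {}
assign_label(assignments, "finance", ["db.sales", "/data/raw/sales"])
assign_label(assignments, "archive", ["db.sales"])
remaining = unlabeled(assignments, ["db.sales", "/data/raw/sales", "db.hr"])
```

In this example, db.sales carries two labels, while db.hr is left in remaining because no label has been assigned to it yet.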

Selecting and submitting the data for replication

Once you have labeled all of the data sets that you want to include in a particular migration, you can use those labels to create the data replication policies that perform the actual data migration. In the background, CMA uses Replication Manager to create replication policies from the labeled data sets that you have created.

  1. In the Labels tab of the Master Table, click Replicate and type the name of the label that you assigned to the data set that you want to migrate. Press Enter.
  2. Click the Save icon. In the background CMA generates a replication policy, but it is not submitted immediately. You can repeat the previous step to include as many labels as necessary for the migration.

    When you have successfully added all of the labels that you want to replicate for this specific migration, you are ready to proceed with the migration. Return to the migration home by clicking the arrow next to Master Table or selecting Migrations in the left-hand navigation.

  3. Click the Start button in the Data and Metadata Migration box to immediately submit the HDFS and Hive replication policies and start the data replication process.
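
To make the relationship between labels and replication policies more concrete, the following Python sketch groups labeled data sets into one plan per selected label, keeping Hive tables and HDFS locations separate. This is a conceptual model only, under the assumption that each label added for replication yields its own policy. The PolicyPlan class, the plan_policies function, and the sample names are hypothetical, and the sketch does not reflect how CMA or Replication Manager build policies internally.

```python
from dataclasses import dataclass, field


@dataclass
class PolicyPlan:
    """Hypothetical summary of what one label would replicate."""
    label: str
    hive_tables: list = field(default_factory=list)
    hdfs_paths: list = field(default_factory=list)


def plan_policies(labeled, selected_labels):
    """Build one plan per selected label.

    labeled is a list of (name, kind, labels) tuples, where kind is
    "hive" or "hdfs" and labels is the set of labels on that data set.
    """
    plans = []
    for label in selected_labels:
        plan = PolicyPlan(label)
        for name, kind, labels in labeled:
            if label not in labels:
                continue
            if kind == "hive":
                plan.hive_tables.append(name)
            elif kind == "hdfs":
                plan.hdfs_paths.append(name)
        plans.append(plan)
    return plans


# Hypothetical data, for illustration only.
labeled = [
    ("db.sales", "hive", {"finance"}),
    ("/data/raw/sales", "hdfs", {"finance", "archive"}),
    ("db.hr", "hive", {"hr"}),
]
plans = plan_policies(labeled, ["finance", "archive"])
```

Here db.hr is not part of any plan because its label was not selected, which mirrors the way only the labels added with Replicate are included in a given migration.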