Migrating SQL queries

During the Hive migration beside the SQL query, the query related tables and data are also migrated from a CDH cluster to a Data Hub cluster.

Before the migration, the source cluster is scanned to collect the SQL queries, tables and data from Hive or Impala. This migration can be used in cases when there is a heavy SQL query load and you want to unload the less time sensitive queries to another cluster. Using the scheduling feature of the underlying Replication Manager, you can keep the queries in sync between the source and target cluster. During the migration process, the SQL queries are not affected on the source cluster and can remain in running state.
  • Ensure that CMA is set up correctly using the steps in Setting up CMA server.
  • Ensure that you have met the requirements detailed in Reviewing prerequisites before migration.
  • Ensure that you have a CDH 5 or CDH 6 cluster registered as a source from which you want to migrate your Hive queries. If you do not have a source cluster yet, complete the steps in Registering source clusters.
  • Ensure that you have a Data Engineering Data Hub cluster registered as a target cluster to which you want to migrate your Hive queries. If you do not have a target cluster yet, complete the steps in Registering target clusters.
  1. Click on the CDH cluster you want to use for the migration on the Clusters page.
  2. Click Start Scanning to open the Scan Settings.
  3. Select Hive table scan, Hive table check and Hive workflow scan.
    1. Provide the Hive query parser input.
      You can pre-scan Hive2 SQL queries against Hive3 with the Hive Workflow scan option. When selecting this Hive Workflow option, you need to provide the location of your queries as shown in the following example:
      • HDFS paths
        • With default namespace: hdfs:///dir/, hdfs:///dir/file
        • With specified namespace: hdfs://namespace1/dir, hdfs://namespace1/dir/file
        • With namenode address: hdfs://nameNodeHost:port:/dir, hdfs://nameNodeHost:port:/dir/file
      • Native file paths
        • your/local/dir
        • nodeFQDN:/your/local/dir/sqlFile
    2. Click Scan selected.
      You will be redirected to the scanning progress, where you can monitor if the scanning process was successful or encountered any error.
  4. Click on Hive SQL to view the collected queries when the scan is finished.
    You can also find the tables that are related to the queries under Hive tables.
  5. Add the Hive queries to Collections.
    Collections serve as an organization method to sort and bundle the queries into groups for the migration. You can create more collections beside the Default collection based on your requirements. The Hive tables that belong to the Hive queries are automatically added to the same collection.

    After you are finished with sorting the queries to collections, you can start the migration process by creating the migration plan.

  6. Click Create Migration or select Migrations > Start Your First Migration.
    1. Select the source cluster, and click Next.
    2. Select the target cluster, and click Next.
    3. Select the type of migration, and click Next.
    4. Select the collections that you want to migrate, and click Next.
      You can select if the migration should Run Now or be completed in a Scheduled Run. Run Now means that the Hive queries in the selected collections are going to be migrated as soon as the process starts. When choosing the Scheduled Run, you can select the start date of the migration, and set a frequency in which the migration process should proceed. In case your goal is to keep the queries in sync between the source and target cluster, select the Scheduled Run with a frequent time period for migration.
    5. Review the default configurations that are filled out automatically.
    6. Click Next.
      An overview of the migration plan is displayed. At this point, you can go back and change any configuration if the information is not correct. If the information is correct, click Create.
  7. Click Go to Migrations when the migration plan is successfully created.
  8. Click on the CDH to CDP PC migration to start the migration.
    The steps are displayed that are going to be completed during the migration.
  9. Click to start migration.

    During the Hive SQL migration, a replication policy is created using the Replication Manager. When the policy is created, click to start uploading the SQL migration. At this step, the Hive scripts from the source cluster are copied to the Hive S3 bucket on the target. When the Hive SQL Migration is finished, click to finalize the replication policies.

When all of the steps are successfully completed, the migration of Hive queries from CDH to CDP Public Cloud is finished. You can restart the queries on the target Data Engineering Data Hub cluster using Command Line Interface (CLI) or Hue.