Migrating Spark applications

During the Spark workflow migration, the job JAR file, job properties, and other Spark job-related data are migrated from a CDH or Cloudera Private Cloud Base cluster to a Cloudera Data Hub cluster.

Before the migration, the source cluster is scanned to collect the Spark application JAR files. During the migration process, the Spark applications on the source cluster are not affected and can remain running. The source code of the Spark applications also remains the same; only the JAR files are copied from the source cluster to the destination cluster. To migrate files that the Spark application depends on, make sure to select them when creating the migration. When the migration is finished, the job definitions are stored in the S3 bucket and the application properties are stored in the local filesystem.
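The resulting layout can be sketched as follows. The bucket name and local path below are placeholders for illustration only; the actual locations come from your migration setup.

```shell
# Placeholder locations -- substitute the bucket and path from your own migration plan.
JOB_DEFS_S3="s3://example-migration-bucket/spark/job-definitions/"  # job definitions (S3)
APP_PROPS_DIR="/var/lib/migration/spark/app-properties"             # application properties (local FS)

# After the migration you could inspect both locations, for example:
#   aws s3 ls "${JOB_DEFS_S3}"
#   ls -l "${APP_PROPS_DIR}"
echo "Job definitions: ${JOB_DEFS_S3}"
echo "App properties:  ${APP_PROPS_DIR}"
```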
  1. On the Clusters page, click the CDH or Cloudera Private Cloud Base cluster you want to use for the migration.
  2. Click Start Scanning to open the Scan Settings.
  3. Select Spark application scan.
    1. Provide the Number of latest days to scan to define the period from which the Spark applications are collected.
    2. Click Scan selected.
      You are redirected to the scanning progress page, where you can monitor whether the scan completed successfully or encountered any errors.
  4. When the scan is finished, click Start Mapping to view the collected job definitions.
  5. Add Spark workloads to Collections.
    Collections serve as an organization method to sort and bundle the job definitions into groups for the migration. You can create more collections besides the Default collection based on your requirements.

    After you finish sorting the Spark workloads and their dependencies into collections, you can start the migration process by creating the migration plan.

  6. Click Create Migration or select Migrations > Start Your First Migration.
    1. Select the source cluster, and click Next.
    2. Select the destination cluster, and click Next.
    3. Select the type of migration, and click Next.
    4. Select the collections that you want to migrate, and click Next.
      You can select whether the migration should Run Now or be completed in a Scheduled Run. Run Now means that the job definitions in the selected collections are migrated as soon as the process starts. When choosing Scheduled Run, you can select the start date of the migration and set the frequency at which the migration process runs.
    5. Provide the Knox token to access Cloudera Manager of the Cloudera Data Hub cluster in Cloudera Public Cloud.
      1. Navigate to the destination Cloudera Data Hub cluster.
      2. Select Knox Token from the list of services.
      3. Click Token generation, and provide the name and lifetime of the token.
      4. Click Generate Token.
      5. Copy the generated token, and navigate back to the migration plan. Paste the token to the Knox Token field.
    6. Ensure that the path of the Folder for the Spark scripts on the target Data Lake is correct.
    7. Click Next.
      An overview of the migration plan is displayed. At this point, you can go back and change any configuration if the information is not correct. If the information is correct, click Create.
  7. Click Go to Migrations when the migration plan is successfully created.
  8. Click the CDH to Cloudera Public Cloud or Cloudera Private Cloud Base to Cloudera Public Cloud migration to start the migration.
    The steps that will be completed during the migration are displayed.
    You can use the Master Table to review the Collections and Datasets of the Spark workloads and their dependencies that will be migrated.
    Under Configuration, you can view how the configurations are mapped from the source cluster to the destination cluster. The configurable parameters on the target cluster are filled in automatically based on how the destination cluster was configured for the migration, but you can change these parameters based on your requirements before executing the migration.
  9. Click to start migration.
    You also have the option to click and select Run All, in which case the migration steps are executed automatically. Choosing Run All in Current Phase enables you to manually start the next phase of the migration.

    During the Spark migration, HDFS replication policies are created if the Spark application has any HDFS file dependencies. In the next step, the Spark shell scripts are created on the selected cloud storage. During finalization, there is a manual step to review whether the Spark application migration to the destination cluster was successful. You can choose how to verify that the Spark applications are on the destination cluster; for example, you can use the CLI commands provided on the screen.

    To finish the migration of Spark applications, click and select Mark Active Step as Completed.
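One way to perform that manual review is to compare file checksums between the two clusters. The sketch below uses hypothetical HDFS paths; the real paths come from your HDFS replication policy, and the `hdfs dfs` commands must be run on a host with HDFS client access.

```shell
# Hypothetical JAR locations -- replace with the paths from your HDFS replication policy.
SRC_JAR="hdfs://source-nn:8020/user/spark/apps/etl-app.jar"
DST_JAR="hdfs://dest-nn:8020/user/spark/apps/etl-app.jar"

# On a host with an HDFS client you could run, for example:
#   hdfs dfs -checksum "${SRC_JAR}"
#   hdfs dfs -checksum "${DST_JAR}"
# Matching checksums indicate the application JAR was copied intact.
for jar in "${SRC_JAR}" "${DST_JAR}"; do
  echo "verify: hdfs dfs -checksum ${jar}"
done
```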

When all of the steps are successfully completed, the migration of Spark applications from CDH or Cloudera Private Cloud Base to Cloudera Public Cloud is finished. You can restart the Spark applications on the destination Cloudera Data Hub cluster using the Command Line Interface (CLI).
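As a minimal sketch, restarting a migrated application from the command line might look like the following. The class name and JAR path are placeholders, and the exact spark-submit options (executors, memory, queue, and so on) depend on your application's migrated job properties.

```shell
# Placeholder values -- take the real ones from your migrated job properties.
APP_CLASS="com.example.etl.Main"
APP_JAR="s3a://example-migration-bucket/spark/job-definitions/etl-app.jar"

# Build the spark-submit invocation for the destination Data Hub cluster (YARN).
SUBMIT_CMD="spark-submit --master yarn --deploy-mode cluster --class ${APP_CLASS} ${APP_JAR}"

# Run it on a cluster gateway host:
#   ${SUBMIT_CMD}
echo "${SUBMIT_CMD}"
```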