Migrating Spark applications

During the Spark workflow migration, the job JAR files, job properties, and other Spark job-related data are migrated from a CDH or Cloudera Base on premises cluster to a Cloudera Data Hub cluster.

Before the migration, the source cluster is scanned to collect the Spark application JAR files. During the migration process, the Spark applications are not affected on the source cluster and can remain in a running state. The source code of the Spark applications also remains the same; only the JAR files are copied from the Source cluster to the Destination cluster. To migrate files that the Spark applications depend on, make sure to select them when creating the migration. When the migration is finished, the job definitions are stored in the S3 bucket and the application properties are stored in the local filesystem.
  1. On the Clusters page, click the CDH or Cloudera Base on premises cluster that you want to use for the migration.
  2. Click Start Scanning to open the Scan Settings.
  3. Select Spark application scan.
    1. Provide the Number of latest days to scan to define the period from which the Spark applications are collected.
    2. Click Scan selected.
      You will be redirected to the scanning progress view, where you can monitor whether the scan was successful or encountered any errors.
  4. When the scan is finished, click Start Mapping to view the collected job definitions.
  5. Add Spark workloads to Collections.
    Collections serve as an organization method to sort and bundle the job definitions into groups for the migration. You can create more collections besides the Default collection based on your requirements.

    After you finish sorting the Spark workloads and their dependencies into collections, you can start the migration process by creating the migration plan.

  6. Click Create Migration or select Migrations > Start Your First Migration.
    1. Select the source cluster, and click Next.
    2. Select the destination cluster, and click Next.
    3. Select the type of migration, and click Next.
    4. Select the collections that you want to migrate, and click Next.
      You can select whether the migration should Run Now or be completed in a Scheduled Run. Run Now means that the job definitions in the selected collections are migrated as soon as the process starts. When choosing Scheduled Run, you can select the start date of the migration and set the frequency at which the migration process should run.
    5. Provide the Cloudera workload username and password that will be used to access Cloudera Manager of the Cloudera Data Hub cluster in Cloudera on cloud.
    6. Ensure that the path of the Folder for the Spark scripts on the target Data Lake is correct.
    7. Click Next.
      An overview of the migration plan is displayed. At this point, you can go back and change any configuration if the information is not correct. If the information is correct, click Create.
  7. Click Execute Migrations when the migration plan is successfully created.

    You can use the Mapping tab to update the Spark workloads, and you can view the steps that will be completed during the migration on the Execution Step tab.

    Under Configuration, you can view how the configurations are mapped from the Source cluster to the Destination cluster. The configurable parameters on the destination cluster are filled in automatically based on how the Destination cluster was configured for the migration, but you can change these parameters based on your requirements before executing the migration.

  8. Click Run to start the migration.
    You also have the option to click the arrow and select Run All, in which case all of the migration steps are executed automatically. Choosing Run All in Current Phase runs only the steps of the current phase, so you can manually start the next phase of the migration.

    During the Spark migration, HDFS replication policies are created if the Spark applications have any HDFS file dependencies. In the next step, the Spark shell scripts are created on the selected cloud storage. During finalization, there is a manual step to verify that the Spark applications were successfully migrated to the Destination cluster. You can verify this in any way you choose, for example by using the CLI commands provided on the screen.
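The manual verification step amounts to comparing the JAR files you expect against what is actually present on the destination's cloud storage. The sketch below shows that comparison; the bucket, folder, and JAR names are illustrative, and in practice the listing would come from a command such as `hdfs dfs -ls` run against the target path:

```python
# Hypothetical JAR names taken from the migrated collections.
expected_jars = {"etl-daily.jar", "sessionize.jar"}

# Illustrative listing of the destination cloud storage folder,
# e.g. the output of `hdfs dfs -ls s3a://migration-bucket/spark/`.
listed = [
    "s3a://migration-bucket/spark/etl-daily.jar",
    "s3a://migration-bucket/spark/sessionize.jar",
]

# Keep only the file names and report anything that did not arrive.
found = {path.rsplit("/", 1)[-1] for path in listed}
missing = expected_jars - found
print("missing:", sorted(missing))  # an empty list means all applications arrived
```

An empty `missing` list confirms that every expected application JAR is present on the Destination cluster's storage.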

When all of the steps are successfully completed, the migration of Spark applications from CDH or Cloudera Base on premises to Cloudera on cloud is finished. You can restart the Spark applications on the destination Cloudera Data Hub cluster using the Command Line Interface (CLI) or Livy, as instructed in the last step.
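As a minimal sketch of the Livy option, the following builds the JSON body that Livy's batch endpoint (`POST /batches`) expects when submitting an application. The host, JAR path, class name, and arguments are placeholders; substitute the actual values shown in the migration's final step:

```python
import json

# Hypothetical Livy endpoint on the destination Data Hub cluster.
LIVY_URL = "http://livy-host.example.com:8998/batches"

def livy_batch_request(jar_path, main_class, args=()):
    """Build the JSON body for a Livy batch submission."""
    return json.dumps({
        "file": jar_path,          # the migrated JAR on cloud storage
        "className": main_class,   # the application's entry point
        "args": list(args),
    })

body = livy_batch_request(
    "s3a://migration-bucket/spark/etl-daily.jar",  # illustrative path
    "com.example.EtlDaily",                        # illustrative class
    ["--date", "2024-01-01"],
)
# To submit, POST the body with requests.post(LIVY_URL, data=body,
#     headers={"Content-Type": "application/json"})
print(body)
```

Alternatively, the same application could be restarted from the cluster command line with `spark-submit --class <main-class> <jar-path-on-cloud-storage>`.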