During the Spark workflow migration, the job JAR files, job properties, and other Spark
job-related data are migrated from a CDH or Cloudera Private Cloud Base cluster to a Cloudera Data Hub cluster.
Before the migration, the source cluster is scanned to
collect the Spark application JAR files. During the migration process, the Spark
applications on the source cluster are not affected and can remain in a running state. The
source code of the Spark applications also remains the same; only the JAR files are
copied from the source cluster to the destination cluster. To migrate the files that
the Spark application depends on, ensure that you select them during the
migration creation. When the migration is finished, the job definitions are stored in
the S3 bucket and the application properties are stored in the local filesystem.
Ensure that you have a CDH 5, CDH 6, or Cloudera Private Cloud Base cluster registered as a
source from which you want to migrate your Spark applications. If you do not
have a source cluster yet, complete the steps in Registering source clusters.
Ensure that you have a Cloudera Data Hub cluster registered as
a destination cluster to which you want to migrate your Spark applications. If
you do not have a destination cluster yet, complete the steps in Registering destination clusters.
On the Clusters page, click the CDH or Cloudera Private Cloud Base
cluster that you want to use for the migration.
Click Start Scanning to open the Scan
Settings.
Select Spark application scan.
Provide the Number of latest days to scan to
define the period from which the Spark applications are collected.
Click Scan selected.
You are redirected to the scanning progress page, where you can monitor
whether the scanning process was successful or encountered any errors.
When the scan is finished, click Start Mapping to view the collected
job definitions.
Add Spark workloads to Collections.
Collections serve as an organization method to sort and bundle the job
definitions into groups for the migration. In addition to the
Default collection, you can create more collections based on your
requirements.
After you finish sorting the Spark workloads and
their dependencies into collections, you can start the migration process by
creating the migration plan.
Click Create Migration or select Migrations > Start Your First Migration.
Select the source cluster, and click Next.
Select the destination cluster, and click
Next.
Select the type of migration, and click
Next.
Select the collections that you want to migrate, and click
Next.
You can select whether the migration should Run Now
or be completed in a Scheduled Run.
Run Now means that the job definitions
in the selected collections are migrated as soon as the
process starts. When choosing Scheduled Run,
you can select the start date of the migration and set a frequency at
which the migration process should run.
Provide the Knox token to access Cloudera Manager of the Cloudera Data Hub cluster in Cloudera Public Cloud.
Navigate to the destination Cloudera Data Hub
cluster.
Select Knox Token from the list of
services.
Click Token generation, and provide the
name and lifetime of the token.
Click Generate Token.
Copy the generated token, and navigate back to the migration
plan. Paste the token to the Knox Token
field.
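The generated Knox token acts as a bearer token. As a rough sketch of what the tool does with it (the gateway host and proxy path below are placeholders, not values from this procedure), a Cloudera Manager API call through the Knox gateway can be composed like this:

```shell
# All values below are placeholders; substitute your own gateway host,
# topology path, and the token you generated and copied.
KNOX_TOKEN="eyJhbGciOi..."   # truncated example token, not a real value
GATEWAY="https://my-datahub-gateway.example.com/my-datahub/cdp-proxy-api"

# Compose a Cloudera Manager API request authenticated with the Knox token.
CM_REQUEST="curl -H 'Authorization: Bearer ${KNOX_TOKEN}' ${GATEWAY}/cm-api/v41/clusters"
echo "${CM_REQUEST}"
```

The token is sent in the Authorization header on every request, so it must remain valid for the full duration of the migration; choose the token lifetime accordingly.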
Ensure that the path of the Folder for the Spark scripts on
the target Data Lake is correct.
Click Next.
An overview of the migration plan is displayed. At this point, you can
go back and change any configuration if the information is not correct.
If the information is correct, click
Create.
Click Go to Migrations when the migration plan is
successfully created.
Click the CDH to Cloudera Public Cloud or
Cloudera Private Cloud Base to
Cloudera Public Cloud
migration to start the migration.
The steps that are going to be completed during the
migration are displayed.
You can use the Master Table to review the
Collections and Datasets of the
Spark workload and its dependencies that will be migrated. Under Configuration, you can view how the
configurations are mapped from the source cluster to the destination
cluster. The configurable parameters on the destination cluster are filled out
automatically based on how the destination cluster was configured for the
migration, but you can change these parameters based on your requirements before
executing the migration.
Click to start
migration.
You also have the option to click and select Run All. In this
case, the migration steps are executed automatically. Choosing Run All in
Current Phase enables you to manually start the next phase of
the migration.
During the Spark migration, the HDFS replication policies are created if
there are any HDFS file dependencies for the Spark application. In the next
step, the Spark shell scripts are created on the selected cloud storage.
During the finalization, there is a manual step to review whether the Spark
application migration to the destination cluster was successful. You can
choose how to verify that the Spark applications are on the destination
cluster; for example, you can use the CLI commands provided on the screen.
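As a minimal verification sketch (the bucket and folder names below are assumptions; use the bucket and the Folder for the Spark scripts configured in your migration plan), you can list the migrated artifacts on cloud storage from a destination cluster node:

```shell
# Placeholder paths; substitute the bucket and folder from your migration plan.
BUCKET="s3a://my-datalake-bucket"
SCRIPTS_DIR="${BUCKET}/spark-scripts"   # the Folder for the Spark scripts

# Compose the listing command to run on a destination cluster node;
# a non-empty listing confirms the Spark shell scripts arrived.
LIST_SCRIPTS="hdfs dfs -ls ${SCRIPTS_DIR}"
echo "${LIST_SCRIPTS}"
```

Because the application properties are stored in the local filesystem, also confirm them there on the destination host rather than on cloud storage.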
To finish the migration of Spark applications, click and
select Mark Active Step as Completed.
When all of the steps are successfully completed, the migration of Spark
applications from CDH or Cloudera Private Cloud Base to Cloudera Public Cloud is finished. You can restart the Spark
applications on the destination Cloudera Data Hub cluster using the command
line interface (CLI).
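A restart can be sketched with spark-submit; the main class and JAR location below are hypothetical, so substitute your application's own values and any options your job originally used:

```shell
# Hypothetical values; replace with your application's main class and
# the migrated JAR's location on cloud storage.
MAIN_CLASS="com.example.MySparkApp"
APP_JAR="s3a://my-datalake-bucket/spark-jars/my-spark-app.jar"

# Compose the spark-submit command to run on the destination Data Hub cluster,
# submitting to YARN in cluster deploy mode.
SUBMIT="spark-submit --master yarn --deploy-mode cluster --class ${MAIN_CLASS} ${APP_JAR}"
echo "${SUBMIT}"
```

Since the source code and JAR files are unchanged by the migration, the original submit options should carry over; only storage paths that moved to the cloud need to be updated.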