Upgrading Apache Spark 2.4.8 (with 3.3.2 bundled) on 7.2.17 to Spark 3 on 7.3.1

The following steps help you upgrade from Apache Spark 2.4.8 (with 3.3.2 bundled) on Cloudera Public Cloud 7.2.17 to Spark 3.4.1 on 7.3.1.

Source cluster version: 7.2.17
Source cluster Spark 2 version: 2.4.8
Source cluster Spark 3 version: 3.3.2 (bundled)
Target cluster version: 7.3.1
Target cluster Spark 3 version: 3.4.1
Spark 2 used with connectors [1]: no

Intermediate in-place cluster upgrade

Upgrade the cluster OS from CentOS 7 to Red Hat Enterprise Linux (RHEL) 8.

Sidecar migration of Data Hub clusters

The new 7.2.17 Data Hub cluster needs to use Spark 3 and Livy 3 instead of Spark 2 and Livy 2.

Depending on the template you used for your existing Data Hub clusters, a new custom template might be needed that contains Spark 3 instead of Spark 2. Alternatively, the built-in 7.2.17 - Data Engineering: Apache Spark3 template can be used, as it contains Spark 3 only.

  1. Check the current services in your template, and add the built-in 7.2.17 - Data Engineering: Apache Spark3 template.
  2. If the built-in 7.2.17 - Data Engineering: Apache Spark3 template does not work for you, create a custom template. Replace the Spark 2 and Livy 2 references with Spark 3 and Livy 3, respectively.
  3. Add a new Spark 3-based 7.2.17 Data Hub cluster to the environment, using your custom template or the built-in 7.2.17 - Data Engineering: Apache Spark3 template.
  4. Migrate all non-Spark workloads from the old Data Hub cluster to the new cluster.

Application migration tasks (Spark 2 to 3)

  1. Follow the Spark application migration documentation to migrate your Apache Spark applications from version 2.4.8 to 3.3.2.
    1. Check the supported Java versions.
    2. Check the supported Scala version.
    3. Check the supported Python versions.
    4. Account for changed or versioned Spark commands in your code (spark-submit, pyspark, etc.).
    5. Check supported versions for Spark connectors.
    6. Check the logging library used in your code.
    7. Check the compatibility of 3rd-party libraries used in your code.
    8. Check Spark behavior changes and refactor your code.
  2. Migrate all Spark 2 applications in the old Data Hub cluster to Spark 3 applications in the new cluster.
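
The check for changed or versioned Spark commands can be partly automated. The following is a minimal sketch, not an official tool: it scans the text of a shell script or job definition for unversioned command names and suggests a versioned counterpart. The mapping to spark3-submit, spark3-shell, and pyspark3 is an assumption; confirm which command names are actually installed on your cluster. The function name find_unversioned_commands is hypothetical.

```python
import re

# Assumed mapping from unversioned command names to Spark 3 versioned names.
# Verify these against the commands available on your own cluster.
VERSIONED_COMMANDS = {
    "spark-submit": "spark3-submit",
    "spark-shell": "spark3-shell",
    "pyspark": "pyspark3",
}

# Match a command only as a standalone token, so that already versioned
# names such as "spark3-submit" or "pyspark3" are not reported.
_PATTERN = re.compile(
    r"(?<![\w-])("
    + "|".join(re.escape(name) for name in VERSIONED_COMMANDS)
    + r")(?![\w-])"
)


def find_unversioned_commands(script_text):
    """Return a list of (line_number, command, suggested_replacement) tuples."""
    findings = []
    for line_number, line in enumerate(script_text.splitlines(), start=1):
        for match in _PATTERN.finditer(line):
            command = match.group(1)
            findings.append((line_number, command, VERSIONED_COMMANDS[command]))
    return findings


if __name__ == "__main__":
    sample = "#!/bin/bash\nspark-submit --class Foo app.jar\n"
    for line_number, command, suggestion in find_unversioned_commands(sample):
        print(f"line {line_number}: {command} -> consider {suggestion}")
```

The sketch is intended for shell scripts and workflow definitions. In Python sources it also flags lines such as "import pyspark", so review each finding by hand before changing anything.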

Post-application migration tasks

  1. Move Spark 2 event logs to the Spark 3 event logs directory.
  2. Drop the old Data Hub cluster.
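
Moving the event logs can be scripted. The sketch below only builds the argument list for the Hadoop `hdfs dfs -mv` command and prints it; it does not run anything. Both directory paths in the example are placeholders chosen for illustration, not values taken from this guide. In a cloud environment the event logs may live in object storage, so look up the configured event log directories of both Spark services before moving anything. The function name build_event_log_move is hypothetical.

```python
import shlex


def build_event_log_move(spark2_log_dir, spark3_log_dir):
    """Build the argument list that moves every entry from the Spark 2
    event log directory into the Spark 3 event log directory."""
    # The glob is expanded by the Hadoop FS shell itself, provided the
    # argument list is executed without a local shell.
    source_glob = spark2_log_dir.rstrip("/") + "/*"
    destination = spark3_log_dir.rstrip("/") + "/"
    return ["hdfs", "dfs", "-mv", source_glob, destination]


if __name__ == "__main__":
    # Placeholder paths: replace them with the directories configured
    # for your Spark 2 and Spark 3 history servers.
    command = build_event_log_move(
        "/user/spark/applicationHistory",
        "/user/spark/spark3ApplicationHistory",
    )
    print(shlex.join(command))
```

Run the move before you drop the old Data Hub cluster, and confirm that the Spark 3 History Server lists the moved applications.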

In-place cluster upgrade

  1. Upgrade the Data Lake cluster to 7.3.1.
    1. Check the support matrix for Data Hub upgrades.
    2. Stop all Data Hubs attached to the environment.
    3. From the Management Console, click Data Lakes > Environment Name, scroll to the bottom of the Data Lake details page, and click the Upgrade tab.
    4. Click the Target Cloudera Runtime Version drop-down menu to see any available upgrades.
    5. If you want to skip the automatic backup that is taken before the upgrade, uncheck the Automatic backup box.
    6. Click Validate and Prepare to check for any configuration issues and begin the Cloudera Runtime parcel download and distribution.
    7. Click Upgrade to initiate the upgrade.
    8. Click the Event History tab to monitor the upgrade process and verify that it completes successfully.
    For more information, see Data Lake upgrade.
  2. Upgrade the new Data Hub cluster to 7.3.1.
    1. Check the support matrix for Data Hub upgrades.
    2. Start the cluster.
    3. Check the current version of Cloudera Runtime.
    4. If your cluster uses Streams Replication Manager, export or migrate aggregated metrics.
    5. If you use autoscaling, disable autoscaling on the cluster.
    6. Upgrade the cluster.
    7. Monitor the upgrade progress using the Data Hub Event History tab.
    8. When the upgrade is complete, verify the new version.
    9. If you disabled autoscaling on the cluster, you can re-enable it after upgrade.
    For more information, see Upgrading Data Hubs.
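
If you verify the new version from a script, compare version numbers numerically rather than as strings. The following is a minimal sketch with hypothetical function names. It assumes the reported version is a dotted numeric string such as 7.3.1, optionally followed by a suffix that starts with a hyphen; that suffix format is an assumption.

```python
def parse_runtime_version(version):
    """Turn a string such as '7.3.1' into a tuple of integers such as (7, 3, 1).

    Anything after the first '-' is ignored (assumed build suffix).
    """
    numeric_part = version.split("-", 1)[0]
    return tuple(int(part) for part in numeric_part.split("."))


def upgrade_reached_target(reported_version, target_version):
    """Return True when the reported version is at or above the target."""
    return parse_runtime_version(reported_version) >= parse_runtime_version(target_version)


if __name__ == "__main__":
    print(upgrade_reached_target("7.2.17", "7.3.1"))
    print(upgrade_reached_target("7.3.1", "7.3.1"))
```

Tuples of integers are used because plain string comparison orders "7.2.9" after "7.2.17", which is wrong for version numbers.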

Application migration tasks (Spark 3.x to 3.4.1)

Follow the Spark application migration documentation to migrate your Apache Spark applications from version 3.3.2 to 3.4.1:
  1. Refactor your Spark application code.

Final steps

After the upgrade and application migration are complete:
  1. Check the status of your Data Lakes, Data Hubs, and clusters.
  2. Perform benchmark testing on your applications. See Spark Application Migration.
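
One simple way to evaluate benchmark results is to compare the median run time of an application before and after the migration. The sketch below is illustrative only: the function name compare_runtimes, the 10% tolerance, and the sample timings are assumptions, not values from the migration documentation.

```python
from statistics import median


def compare_runtimes(baseline_seconds, candidate_seconds, tolerance=0.10):
    """Compare median run times of the old and the migrated application.

    Returns (ratio, regressed). ratio is the candidate median divided by
    the baseline median; regressed is True when the candidate is slower
    than the baseline by more than the given tolerance.
    """
    ratio = median(candidate_seconds) / median(baseline_seconds)
    return ratio, ratio > 1.0 + tolerance


if __name__ == "__main__":
    # Made-up timings in seconds for three runs on each cluster.
    ratio, regressed = compare_runtimes([100, 110, 90], [120, 125, 115])
    print(f"ratio={ratio:.2f} regressed={regressed}")
```

The median is used instead of the mean so that a single unusually slow run, for example during cluster warm-up, does not dominate the comparison.
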

[1] Oozie, Solr, Phoenix, Hive Warehouse Connector, Spark Schema Registry