You must perform a number of tasks before refactoring workloads.
- Identify all the workloads in the cluster (CDH/HDP) which are running on Spark 1.6 - 2.3.
Classify the workloads.
Classification of workloads will help in clean-up of the unwanted workloads, plan resources and efforts for workload migration and post upgrade testing.Example workload classifications:
- Spark Core (scala)
- Java-based Spark jobs
- SQL, Datasets, and DataFrame
- Structured Streaming
- MLlib (Machine Learning)
- PySpark (Python on Spark)
- Batch Jobs
- Scheduled Jobs
- Ad-Hoc Jobs
- Critical/Priority Jobs
- Huge data Processing Jobs
- Time taking jobs
- Resource Consuming Jobs etc.
- Failed Jobs
Identify configuration changes
Check the current Spark jobs configuration.
- Spark 1.6 - 2.3 workload configurations which have dependencies on job properties like scheduler, old python packages, classpath jars and might not be compatible post migration.
- In CDP, Capacity Scheduler is the default and recommended scheduler. Follow Fair Scheduler to Capacity Scheduler transition guide to have all the required queues configured in the CDP cluster post upgrade. If any configuration changes are required, modify the code as per the new capacity scheduler configurations.
- For workload configurations, see the Spark History server UI http://spark_history_server:18088/history/<application_number>/environment/.
- Identify and capture workloads having data storage locations (local and HDFS) to refactor the workloads post migration.
- Refer to unsupported Apache Spark features, and plan refactoring accordingly.