Configuring Spark jobs for huge shuffle data

You can configure Spark jobs to use persistent volumes for shuffle data, either to improve performance or to handle shuffle data that is too large for the local disks.

By default, the Cloudera Data Engineering (CDE) service does not use a PersistentVolume for shuffle data, so shuffle data spills over to the local disks. If local disk space is insufficient, or local disk performance degrades, configure the Spark jobs to use a persistent volume for shuffle data at the job level.

Make sure that you have a compatible storage class that supports local volumes.

When creating a new job or editing an existing job, add the required configurations in the Job details > Configurations field.
For example, if your cluster has the local volume storage class local-path and you need a maximum volume size of 10 Gi, add the following configurations:
Configuration key                                                                                Value
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path             /data
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly         false
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName      OnDemand
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit      10Gi
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass   local-path
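The structure of these keys follows the Spark on Kubernetes convention: the segment after persistentVolumeClaim. is a volume name, and a name starting with spark-local-dir- tells Spark to use that volume for shuffle and spill data, while the OnDemand claim name asks Spark to create a fresh PVC for each executor pod. As an illustration only (this helper is not part of CDE or Spark), a small Python sketch that generates the key/value pairs for a given storage class and size limit might look like:

```python
# Hypothetical helper (not part of CDE): builds the Spark-on-Kubernetes
# config key/value pairs for an on-demand shuffle volume. The volume name
# must start with "spark-local-dir-" for Spark to use it for shuffle data.
def shuffle_volume_conf(storage_class, size_limit, mount_path="/data",
                        volume_name="spark-local-dir-1", role="executor"):
    prefix = (f"spark.kubernetes.{role}.volumes."
              f"persistentVolumeClaim.{volume_name}")
    return {
        f"{prefix}.mount.path": mount_path,
        f"{prefix}.mount.readOnly": "false",
        # "OnDemand" makes Spark create a new PVC per executor pod.
        f"{prefix}.options.claimName": "OnDemand",
        f"{prefix}.options.sizeLimit": size_limit,
        f"{prefix}.options.storageClass": storage_class,
    }

# Print the pairs to paste into the Job details > Configurations field.
conf = shuffle_volume_conf("local-path", "10Gi")
for key, value in conf.items():
    print(f"{key}  {value}")
```

Each generated key/value pair corresponds to one row of the table above; only the storage class name, size limit, and mount path normally vary between clusters.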