Configuring Spark jobs for huge shuffle data
You can configure Spark jobs to use persistent volumes for shuffle data, either to improve performance or to handle large amounts of shuffle data.
By default, the Cloudera Data Engineering (CDE) service does not use a PersistentVolume for shuffle data, so shuffle data spills over to the local disks. If local disk space is insufficient or local disk performance degrades, you must configure the Spark job to use a persistent volume for shuffle data at the job level.
Make sure that you have a compatible storage class that supports local volumes.
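To check which storage classes are available, you can list them with kubectl; this is a minimal sketch that assumes you have kubectl access to the underlying Kubernetes cluster:

```
# List the storage classes in the cluster; the class you reference in the
# job configuration (for example, local-path) must appear in this output.
kubectl get storageclass
```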
When creating a new job or editing an existing job, add the required configurations in the Configurations field.
For example, if you have the local volume storage class local-path and need a maximum volume size of 10 Gi, add the following configurations:
| Configuration key | Configuration value |
|---|---|
| spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path | /data |
| spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly | false |
| spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName | OnDemand |
| spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit | 10Gi |
| spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass | local-path |
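The same settings can also be supplied on the command line. The following is a minimal sketch using the CDE CLI, assuming the CLI is already configured for your virtual cluster; the job name and application file shown here are placeholders, and flag names may vary slightly between CLI versions:

```
# Submit a Spark job with on-demand PVCs mounted for executor shuffle data.
# Each --conf mirrors a row in the table above.
cde spark submit pvc-shuffle-example.py \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path=/data \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly=false \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName=OnDemand \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit=10Gi \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass=local-path
```

With claimName set to OnDemand, Spark requests a new PersistentVolumeClaim for each executor, so the sizeLimit and storageClass values apply per executor rather than to a single shared volume.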