Configuring Spark jobs for large shuffle data
Learn how to configure Spark jobs to use persistent volumes for shuffle data, either to improve performance or to handle large amounts of shuffle data.
By default, the Cloudera Data Engineering service does not use a persistent volume for shuffle data, so shuffle data spills over to the local disks. If local disk space is insufficient or local disk performance degrades, configure the Spark jobs to use a persistent volume for shuffle data at the job level.
Before you begin

- You must have a compatible storage class that supports local volumes.
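To see which storage classes are already available in your cluster, you can run a quick check (assuming you have kubectl access to the cluster):

kubectl get storageclass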
Steps

- Copy the following manifest into a new file named cde-storageclass.yaml:
apiVersion: v1
kind: Namespace
metadata:
  name: local-path-storage-cde
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: local-path-provisioner-service-account
  namespace: local-path-storage-cde
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: local-path-provisioner-role-cde
rules:
  - apiGroups: [""]
    resources: ["nodes", "persistentvolumeclaims", "configmaps"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["endpoints", "persistentvolumes", "pods"]
    verbs: ["*"]
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["create", "patch"]
  - apiGroups: ["storage.k8s.io"]
    resources: ["storageclasses"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: local-path-provisioner-bind-cde
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: local-path-provisioner-role-cde
subjects:
  - kind: ServiceAccount
    name: local-path-provisioner-service-account
    namespace: local-path-storage-cde
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: local-path-provisioner-cde
  namespace: local-path-storage-cde
spec:
  replicas: 1
  selector:
    matchLabels:
      app: local-path-provisioner-cde
  template:
    metadata:
      labels:
        app: local-path-provisioner-cde
    spec:
      serviceAccountName: local-path-provisioner-service-account
      containers:
        - name: local-path-provisioner
          image: <YOUR_REGISTRY>/cloudera_thirdparty/rancher/local-path-provisioner:v0.0.31
          imagePullPolicy: IfNotPresent
          command:
            - local-path-provisioner
            - --debug
            - start
            - --config
            - /etc/config/config.json
            - --provisioner-name
            - rancher.io/local-path-cde
          volumeMounts:
            - name: config-volume
              mountPath: /etc/config/
          env:
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
      volumes:
        - name: config-volume
          configMap:
            name: local-path-config
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-path-cde
provisioner: rancher.io/local-path-cde
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
---
kind: ConfigMap
apiVersion: v1
metadata:
  name: local-path-config
  namespace: local-path-storage-cde
data:
  config.json: |-
    {
      "nodePathMap": [
        {
          "node": "DEFAULT_PATH_FOR_NON_LISTED_NODES",
          "paths": ["/ecs/ecs/local-storage-cde"]
        }
      ]
    }
  setup: |-
    #!/bin/sh
    while getopts "m:s:p:" opt
    do
      case $opt in
        p) absolutePath=$OPTARG ;;
        s) sizeInBytes=$OPTARG ;;
        m) volMode=$OPTARG ;;
      esac
    done
    mkdir -m 700 -p ${absolutePath}
    chown 1345:1345 ${absolutePath}
  teardown: |-
    #!/bin/sh
    while getopts "m:s:p:" opt
    do
      case $opt in
        p) absolutePath=$OPTARG ;;
        s) sizeInBytes=$OPTARG ;;
        m) volMode=$OPTARG ;;
      esac
    done
    rm -rf ${absolutePath}
  helperPod.yaml: |-
    apiVersion: v1
    kind: Pod
    metadata:
      name: helper-pod
    spec:
      containers:
        - name: helper-pod
          image: <YOUR_REGISTRY>/cloudera_thirdparty/hardened/busybox:glibc-1.37.0-r0-202410311742
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: true
            runAsUser: 0   # Run as root user
            runAsGroup: 0  # Use root group

- In the cde-storageclass.yaml file that you have created, locate the <YOUR_REGISTRY> placeholder and replace it with your container registry path.
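For example, you can replace the placeholder in one pass with sed (a minimal sketch; registry.example.com is a stand-in for your actual registry host):

sed -i 's|<YOUR_REGISTRY>|registry.example.com|g' cde-storageclass.yaml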
- Open a terminal with kubectl access to your cluster and run the following command to set up the storage class for Cloudera Data Engineering:

kubectl apply -f cde-storageclass.yaml
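You can then confirm that the provisioner deployment is running and the storage class was created (an optional check, not part of the documented procedure):

kubectl get pods -n local-path-storage-cde
kubectl get storageclass local-path-cde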
- When creating a new job or editing an existing job in the Cloudera Data Engineering UI, add the required configurations in the Job details > Configurations field.
For example, if you use the local-path-cde local volume storage class and require a volume size of up to 10 Gi, add the following configuration keys and values:

spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path = /data
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly = false
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName = OnDemand
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit = 10Gi
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass = local-path-cde
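If you manage jobs with the CDE CLI rather than the UI, the same settings can be passed as --conf options. The following is a sketch, assuming the CLI is configured for your virtual cluster; my-shuffle-job.py is a placeholder application file, not part of the original procedure:

cde spark submit \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path=/data \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly=false \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName=OnDemand \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit=10Gi \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass=local-path-cde \
  my-shuffle-job.py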
