Configuring Spark jobs for large shuffle data

Learn how to configure Spark jobs to use persistent volumes for shuffle data, either to improve performance or to handle very large shuffle data.

By default, the Cloudera Data Engineering service does not use a persistent volume for shuffle data, so shuffle data spills to the local disks. If local disk space is insufficient or local disk performance is inadequate, you must configure the Spark jobs to use a persistent volume for shuffle data at the job level.

  • You must have a compatible storage class that supports local volumes.

  1. Copy the following manifest into a new file named cde-storageclass.yaml:
    apiVersion: v1
    kind: Namespace
    metadata:
      name: local-path-storage-cde
    
    ---
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: local-path-provisioner-service-account
      namespace: local-path-storage-cde
    
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: local-path-provisioner-role-cde
    rules:
      - apiGroups: [ "" ]
        resources: [ "nodes", "persistentvolumeclaims", "configmaps" ]
        verbs: [ "get", "list", "watch" ]
      - apiGroups: [ "" ]
        resources: [ "endpoints", "persistentvolumes", "pods" ]
        verbs: [ "*" ]
      - apiGroups: [ "" ]
        resources: [ "events" ]
        verbs: [ "create", "patch" ]
      - apiGroups: [ "storage.k8s.io" ]
        resources: [ "storageclasses" ]
        verbs: [ "get", "list", "watch" ]
    
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: local-path-provisioner-bind-cde
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: local-path-provisioner-role-cde
    subjects:
      - kind: ServiceAccount
        name: local-path-provisioner-service-account
        namespace: local-path-storage-cde
    
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: local-path-provisioner-cde
      namespace: local-path-storage-cde
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: local-path-provisioner-cde
      template:
        metadata:
          labels:
            app: local-path-provisioner-cde
        spec:
          serviceAccountName: local-path-provisioner-service-account
          containers:
            - name: local-path-provisioner
              image: <YOUR_REGISTRY>/cloudera_thirdparty/rancher/local-path-provisioner:v0.0.31
              imagePullPolicy: IfNotPresent
              command:
                - local-path-provisioner
                - --debug
                - start
                - --config
                - /etc/config/config.json
                - --provisioner-name
                - rancher.io/local-path-cde
              volumeMounts:
                - name: config-volume
                  mountPath: /etc/config/
              env:
                - name: POD_NAMESPACE
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.namespace
          volumes:
            - name: config-volume
              configMap:
                name: local-path-config
    
    ---
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: local-path-cde
    provisioner: rancher.io/local-path-cde
    volumeBindingMode: WaitForFirstConsumer
    reclaimPolicy: Delete
    
    ---
    kind: ConfigMap
    apiVersion: v1
    metadata:
      name: local-path-config
      namespace: local-path-storage-cde
    data:
      config.json: |-
        {
                "nodePathMap":[
                {
                        "node":"DEFAULT_PATH_FOR_NON_LISTED_NODES",
                        "paths":["/ecs/ecs/local-storage-cde"]
                }
                ]
        }
      setup: |-
        #!/bin/sh
        while getopts "m:s:p:" opt
        do
          case $opt in
          p)
          absolutePath=$OPTARG
          ;;
          s)
          sizeInBytes=$OPTARG
          ;;
          m)
          volMode=$OPTARG
          ;;
          esac
        done
    
        mkdir -m 700 -p ${absolutePath}
        chown 1345:1345 ${absolutePath}
      teardown: |-
        #!/bin/sh
        while getopts "m:s:p:" opt
        do
          case $opt in
          p)
          absolutePath=$OPTARG
          ;;
          s)
          sizeInBytes=$OPTARG
          ;;
          m)
          volMode=$OPTARG
          ;;
          esac
        done
    
        rm -rf ${absolutePath}
      helperPod.yaml: |-
        apiVersion: v1
        kind: Pod
        metadata:
          name: helper-pod
        spec:
          containers:
          - name: helper-pod
            image: <YOUR_REGISTRY>/cloudera_thirdparty/hardened/busybox:glibc-1.37.0-r0-202410311742
            imagePullPolicy: IfNotPresent
            securityContext:
              privileged: true
              runAsUser: 0         # Run as root user
              runAsGroup: 0        # Use root group
  2. In the cde-storageclass.yaml file that you created, replace both occurrences of the <YOUR_REGISTRY> placeholder with your container registry path.
  3. Open a terminal with kubectl access to your cluster and run the following command to set up the storage class for Cloudera Data Engineering:
    kubectl apply -f cde-storageclass.yaml
  4. When creating a new job or editing an existing job in the Cloudera Data Engineering UI, add the required configurations in the Job details > Configurations field.
    For example, if you use the local-path-cde local volume storage class and require a volume size of up to 10 Gi, add the following configurations:
    Configuration key                                                                               Value
    spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path            /data
    spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly        false
    spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName     OnDemand
    spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit     10Gi
    spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass  local-path-cde
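Outside the Cloudera Data Engineering UI, the same settings from step 4 can be expressed as standard Spark-on-Kubernetes properties, for example in a spark-defaults-style fragment or as --conf arguments. This is a sketch of the equivalent key=value form:

```properties
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path=/data
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly=false
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName=OnDemand
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit=10Gi
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass=local-path-cde
```

With claimName set to OnDemand, Spark creates a new persistent volume claim for each executor from the named storage class, and sizeLimit caps each claim at 10Gi.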
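The placeholder substitution in step 2 can also be done non-interactively with sed. In this sketch, registry.example.com is an assumed registry host and sample.yaml is a throwaway demonstration file; in practice, point the same sed command at cde-storageclass.yaml:

```shell
# Demonstration input containing the placeholder (stand-in for cde-storageclass.yaml).
echo 'image: <YOUR_REGISTRY>/cloudera_thirdparty/rancher/local-path-provisioner:v0.0.31' > sample.yaml

# Replace every <YOUR_REGISTRY> occurrence in place with the assumed registry host.
sed -i 's|<YOUR_REGISTRY>|registry.example.com|g' sample.yaml

cat sample.yaml
```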