Configuring Spark jobs for large shuffle data

Learn how to configure Spark jobs to use persistent volumes for shuffle data, either to improve performance or to handle very large shuffle data.

By default, the Cloudera Data Engineering service does not use a persistent volume for shuffle data, so shuffle data spills to the local disks. If local disk space is insufficient or local disk performance is inadequate, you must configure the Spark jobs to use a persistent volume for shuffle data at the job level.

  • You must have a compatible storage class that supports local volumes.

  1. Copy the following manifest into a new file named cde-storageclass.yaml:
    apiVersion: v1
    kind: Namespace
    metadata:
      name: local-path-storage-cde
    
    ---
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: local-path-provisioner-service-account
      namespace: local-path-storage-cde
    
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: local-path-provisioner-role-cde
    rules:
      - apiGroups: [ "" ]
        resources: [ "nodes", "persistentvolumeclaims", "configmaps" ]
        verbs: [ "get", "list", "watch" ]
      - apiGroups: [ "" ]
        resources: [ "endpoints", "persistentvolumes", "pods" ]
        verbs: [ "*" ]
      - apiGroups: [ "" ]
        resources: [ "events" ]
        verbs: [ "create", "patch" ]
      - apiGroups: [ "storage.k8s.io" ]
        resources: [ "storageclasses" ]
        verbs: [ "get", "list", "watch" ]
    
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: local-path-provisioner-bind-cde
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: local-path-provisioner-role-cde
    subjects:
      - kind: ServiceAccount
        name: local-path-provisioner-service-account
        namespace: local-path-storage-cde
    
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: local-path-provisioner-cde
      namespace: local-path-storage-cde
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: local-path-provisioner-cde
      template:
        metadata:
          labels:
            app: local-path-provisioner-cde
        spec:
          serviceAccountName: local-path-provisioner-service-account
          containers:
            - name: local-path-provisioner
              image: <YOUR_REGISTRY>/cloudera_thirdparty/rancher/local-path-provisioner:v0.0.31
              imagePullPolicy: IfNotPresent
              command:
                - local-path-provisioner
                - --debug
                - start
                - --config
                - /etc/config/config.json
                - --provisioner-name
                - rancher.io/local-path-cde
              volumeMounts:
                - name: config-volume
                  mountPath: /etc/config/
              env:
                - name: POD_NAMESPACE
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.namespace
          volumes:
            - name: config-volume
              configMap:
                name: local-path-config
    
    ---
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: local-path-cde
    provisioner: rancher.io/local-path-cde
    volumeBindingMode: WaitForFirstConsumer
    reclaimPolicy: Delete
    
    ---
    kind: ConfigMap
    apiVersion: v1
    metadata:
      name: local-path-config
      namespace: local-path-storage-cde
    data:
      config.json: |-
        {
                "nodePathMap":[
                {
                        "node":"DEFAULT_PATH_FOR_NON_LISTED_NODES",
                        "paths":["/ecs/ecs/local-storage-cde"]
                }
                ]
        }
      setup: |-
        #!/bin/sh
        while getopts "m:s:p:" opt
        do
          case $opt in
          p)
          absolutePath=$OPTARG
          ;;
          s)
          sizeInBytes=$OPTARG
          ;;
          m)
          volMode=$OPTARG
          ;;
          esac
        done
    
        mkdir -m 700 -p ${absolutePath}
        chown 1345:1345 ${absolutePath}
      teardown: |-
        #!/bin/sh
        while getopts "m:s:p:" opt
        do
          case $opt in
          p)
          absolutePath=$OPTARG
          ;;
          s)
          sizeInBytes=$OPTARG
          ;;
          m)
          volMode=$OPTARG
          ;;
          esac
        done
    
        rm -rf ${absolutePath}
      helperPod.yaml: |-
        apiVersion: v1
        kind: Pod
        metadata:
          name: helper-pod
        spec:
          containers:
          - name: helper-pod
            image: <YOUR_REGISTRY>/cloudera_thirdparty/hardened/busybox:glibc-1.37.0-r0-202410311742
            imagePullPolicy: IfNotPresent
            securityContext:
              privileged: true
              runAsUser: 0         # Run as root user
              runAsGroup: 0        # Use root group
  2. In the cde-storageclass.yaml file that you created, replace both occurrences of the <YOUR_REGISTRY> placeholder with your container registry path.
  3. Open a terminal with kubectl access to your cluster and run the following command to set up the storage class for Cloudera Data Engineering:
    kubectl apply -f cde-storageclass.yaml
  4. When creating a new job or editing an existing job in the Cloudera Data Engineering UI, add the required configurations in the Job details > Configurations field.
    For example, if you use the local-path-cde local volume storage class and require a volume size of up to 10 Gi, add the following configurations:
    Configuration key                                                                               Value
    spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path            /data
    spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly        false
    spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName     OnDemand
    spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit     10Gi
    spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass  local-path-cde
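Outside the Cloudera Data Engineering UI, the same settings from step 4 can be expressed as standard Spark-on-Kubernetes properties, for example in a spark-defaults-style fragment or as --conf arguments. This is a sketch of the equivalent key=value form:

```properties
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path=/data
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly=false
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName=OnDemand
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit=10Gi
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass=local-path-cde
```

With claimName set to OnDemand, Spark creates a new persistent volume claim for each executor from the named storage class, and sizeLimit caps each claim at 10Gi.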
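The placeholder substitution in step 2 can also be done non-interactively with sed. In this sketch, registry.example.com is an assumed registry host and sample.yaml is a throwaway demonstration file; in practice, point the same sed command at cde-storageclass.yaml:

```shell
# Demonstration input containing the placeholder (stand-in for cde-storageclass.yaml).
echo 'image: <YOUR_REGISTRY>/cloudera_thirdparty/rancher/local-path-provisioner:v0.0.31' > sample.yaml

# Replace every <YOUR_REGISTRY> occurrence in place with the assumed registry host.
sed -i 's|<YOUR_REGISTRY>|registry.example.com|g' sample.yaml

cat sample.yaml
```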