Ephemeral storage

Ephemeral storage is scratch space that a Cloudera AI session, job, application, or model can use. This feature enables better scheduling of Cloudera AI pods and provides a safety valve that prevents runaway computations from consuming all available scratch space on the node.

By default, each user pod in Cloudera AI requests 0 GB of scratch space (no guaranteed reservation) and is allowed to use up to 10 GB. These settings can be applied site-wide or overridden on a per-project basis.

How Spark uses ephemeral storage in Cloudera AI

Spark drivers and executors write shuffle files, spilled RDD/DataFrame blocks, broadcast variables, and task logs under directories referenced by SPARK_LOCAL_DIRS.

On Kubernetes these paths are mounted as one emptyDir volume per pod; emptyDir is wiped as soon as the pod terminates, so the data is ephemeral.
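You can check how much of this scratch space is in use from inside a running session or the Spark driver. A minimal sketch, assuming the runtime exposes SPARK_LOCAL_DIRS (the /tmp fallback is an assumption):

    import os
    import shutil

    # Spark writes shuffle and spill files under the directories listed in
    # SPARK_LOCAL_DIRS; each one is backed by the pod's emptyDir volume.
    for local_dir in os.environ.get("SPARK_LOCAL_DIRS", "/tmp").split(","):
        total, used, free = shutil.disk_usage(local_dir)
        print(f"{local_dir}: {used / 1024**3:.1f} GB used, {free / 1024**3:.1f} GB free")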

If this volume fills up, the kubelet evicts the pod and Spark surfaces errors such as:
  • java.io.IOException: No space left on device
  • org.apache.spark.shuffle.MetadataFetchFailedException

This is followed by a Kubernetes event similar to "Evicted: The node was low on resource: ephemeral-storage".

How the CML UI maps to Kubernetes resources

Table 1. Mapping to Kubernetes Resources

CML field                          | Pod spec element                       | What it does
Ephemeral Storage (GB) – Request   | resources.requests.ephemeral-storage   | Used by the scheduler for bin-packing and by the cluster autoscaler
Ephemeral Storage (GB) – Max       | resources.limits.ephemeral-storage     | Hard ceiling; usage above the limit triggers pod eviction
Both the driver and every executor inherit the values you set here (or an override in Project Settings → Advanced).
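If you want to confirm that the executor pods picked up these values, you can inspect them directly. A hedged sketch using the kubernetes Python client, assuming the package is installed, the session's service account is allowed to list pods, and the namespace name below is hypothetical (spark-role=executor is the standard label Spark on Kubernetes applies to executor pods):

    from kubernetes import client, config

    config.load_incluster_config()
    core = client.CoreV1Api()

    namespace = "my-project-namespace"  # hypothetical; use your project's namespace
    pods = core.list_namespaced_pod(namespace, label_selector="spark-role=executor")
    for pod in pods.items:
        resources = pod.spec.containers[0].resources
        print(pod.metadata.name,
              "request:", (resources.requests or {}).get("ephemeral-storage"),
              "limit:", (resources.limits or {}).get("ephemeral-storage"))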

Sizing guidelines for common Spark workloads

Table 2. Sizing Guidelines

Workload pattern                         | Rule of thumb (across all executors)     | Rationale
SQL/ETL with light aggregations          | ≈ 1 × largest input size                 | Minimal shuffle spill
Joins, `groupByKey`, heavy shuffle       | 2–3 × largest input size                 | Shuffle writes often exceed input volume
ML pipelines with .cache() / .persist()  | Cached dataset size × number of replicas | Cached blocks are duplicated
Quick workflow: start with a generous limit, run the job once, open Spark UI → Executors → Shuffle Spill (Disk), and set the per-pod limit to the peak spill divided by the number of executors (see the sketch below).
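The same numbers are available programmatically from the driver's REST API, so you do not have to read them off the UI. A rough sketch, assuming the Spark UI is reachable on localhost:4040 from the session, the requests package is installed, and the executor count below is a placeholder:

    import requests

    api = "http://localhost:4040/api/v1"           # default Spark UI port; adjust if needed
    app_id = requests.get(f"{api}/applications").json()[0]["id"]
    stages = requests.get(f"{api}/applications/{app_id}/stages").json()

    # Largest disk spill observed in any single stage, as an approximation of peak spill.
    peak_spill = max((s.get("diskBytesSpilled", 0) for s in stages), default=0)

    num_executors = 10                             # placeholder: executors used by this job
    suggested_gb = peak_spill / num_executors / 1024**3
    print(f"Suggested per-pod ephemeral storage limit: ~{suggested_gb:.1f} GB plus headroom")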

Tips to reduce Spark’s scratch‑disk footprint

Table 3.

Goal                                     | Knob                                                                                                       | Notes
Fewer shuffle bytes                      | spark.sql.shuffle.partitions (set closer to the number of executors) and spark.sql.adaptive.enabled=true  | Adaptive Query Execution coalesces partitions on the fly
Eliminate shuffle joins                  | Broadcast the small side with the /*+ BROADCAST(t) */ hint                                                | Keeps data in RAM when feasible
Compress spill data                      | Ensure spark.shuffle.compress=true (the default)                                                          | Small CPU cost, large disk savings
Use RAM-backed volumes (SSD-less nodes)  | spark.kubernetes.local.dirs.tmpfs=true and raise spark.{driver,executor}.memoryOverheadFactor             | Mounts the emptyDir as tmpfs, so spill consumes memory instead of disk
Persist scratch across pod restarts      | Mount a PVC at /spark-local with spark.kubernetes.executor.volumes.persistentVolumeClaim.<name>.mount.path | Gives Spark a dedicated disk
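Several of these knobs can be set together when the session builds its SparkSession. A sketch in PySpark, assuming Spark 3.3 or later; the values and file paths are illustrative only:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("ephemeral-storage-tuning")
        # Let Adaptive Query Execution coalesce shuffle partitions at runtime.
        .config("spark.sql.adaptive.enabled", "true")
        .config("spark.sql.shuffle.partitions", "64")        # tune toward the executor count
        # Keep shuffle and spill data compressed (the default, shown for clarity).
        .config("spark.shuffle.compress", "true")
        # Back the local dirs with tmpfs and budget extra memory to cover it.
        .config("spark.kubernetes.local.dirs.tmpfs", "true")
        .config("spark.executor.memoryOverheadFactor", "0.4")
        .getOrCreate()
    )

    # Broadcast the small side of a join so no shuffle files are written for it.
    orders = spark.read.parquet("/data/orders")              # illustrative paths
    dims = spark.read.parquet("/data/dimensions")
    joined = orders.join(dims.hint("broadcast"), "dim_id")

If your project uses a spark-defaults.conf file, the same properties can be set there instead so that every session picks them up.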

Change site-wide ephemeral storage configuration

Go to Site Administration > Settings > Advanced to change the ephemeral storage request (minimum) and the maximum limit for the entire site.

Override site-wide ephemeral storage configuration

If you want to customize the ephemeral storage settings for a single project, open the project, click Project Settings > Advanced, and adjust the ephemeral storage parameters.