Ephemeral storage

Ephemeral storage is scratch space that a Cloudera AI session, job, application, or model can use. This feature enables better scheduling of Cloudera AI pods and provides a safety valve that prevents runaway computations from consuming all available scratch space on the node.

By default, each user pod in Cloudera AI requests 0 GB of scratch space (no guaranteed reservation) and is allowed to use up to 10 GB. These settings can be applied site-wide or overridden on a per-project basis.

How Spark uses ephemeral storage in Cloudera AI

Spark drivers and executors write shuffle files, spilled RDD/DataFrame blocks, broadcast variables, and task logs under directories referenced by SPARK_LOCAL_DIRS.

On Kubernetes these paths are mounted as one emptyDir volume per pod; emptyDir is wiped as soon as the pod terminates, so the data is ephemeral.
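You can check how much of this scratch space is in use from inside a running session or the Spark driver. A minimal sketch, assuming the runtime exposes SPARK_LOCAL_DIRS (the /tmp fallback is an assumption):

    import os
    import shutil

    # Spark writes shuffle and spill files under the directories listed in
    # SPARK_LOCAL_DIRS; each one is backed by the pod's emptyDir volume.
    for local_dir in os.environ.get("SPARK_LOCAL_DIRS", "/tmp").split(","):
        total, used, free = shutil.disk_usage(local_dir)
        print(f"{local_dir}: {used / 1024**3:.1f} GB used, {free / 1024**3:.1f} GB free")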

If this volume fills up, the kubelet evicts the pod and Spark surfaces errors such as:
  • java.io.IOException: No space left on device
  • org.apache.spark.shuffle.MetadataFetchFailedException

This is followed by a Kubernetes event similar to "Evicted: The node was low on resource: ephemeral-storage".

How the CML UI maps to Kubernetes resources

Table 1. Mapping to Kubernetes Resources

CML field                          | Pod spec element                       | What it does
Ephemeral Storage (GB) – Request   | resources.requests.ephemeral-storage   | Used by the scheduler for bin-packing and by the cluster autoscaler
Ephemeral Storage (GB) – Max       | resources.limits.ephemeral-storage     | Hard ceiling; usage above the limit triggers pod eviction
Both the driver and every executor inherit the values you set here (or an override in Project Settings → Advanced).
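If you want to confirm that the executor pods picked up these values, you can inspect them directly. A hedged sketch using the kubernetes Python client, assuming the package is installed, the session's service account is allowed to list pods, and the namespace name below is hypothetical (spark-role=executor is the standard label Spark on Kubernetes applies to executor pods):

    from kubernetes import client, config

    config.load_incluster_config()
    core = client.CoreV1Api()

    namespace = "my-project-namespace"  # hypothetical; use your project's namespace
    pods = core.list_namespaced_pod(namespace, label_selector="spark-role=executor")
    for pod in pods.items:
        resources = pod.spec.containers[0].resources
        print(pod.metadata.name,
              "request:", (resources.requests or {}).get("ephemeral-storage"),
              "limit:", (resources.limits or {}).get("ephemeral-storage"))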

Sizing guidelines for common Spark workloads

Table 2. Sizing Guidelines

Workload pattern                         | Rule of thumb (across all executors)     | Rationale
SQL/ETL with light aggregations          | ≈ 1 × largest input size                 | Minimal shuffle spill
Joins, `groupByKey`, heavy shuffle       | 2–3 × largest input size                 | Shuffle writes often exceed input volume
ML pipelines with .cache() / .persist()  | Cached dataset size × number of replicas | Cached blocks are duplicated
Quick workflow: start with a generous limit, run the job once, open Spark UI → Executors → Shuffle Spill (Disk), and set the per-pod limit to the peak spill divided by the number of executors (see the sketch below).
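The same numbers are available programmatically from the driver's REST API, so you do not have to read them off the UI. A rough sketch, assuming the Spark UI is reachable on localhost:4040 from the session, the requests package is installed, and the executor count below is a placeholder:

    import requests

    api = "http://localhost:4040/api/v1"           # default Spark UI port; adjust if needed
    app_id = requests.get(f"{api}/applications").json()[0]["id"]
    stages = requests.get(f"{api}/applications/{app_id}/stages").json()

    # Largest disk spill observed in any single stage, as an approximation of peak spill.
    peak_spill = max((s.get("diskBytesSpilled", 0) for s in stages), default=0)

    num_executors = 10                             # placeholder: executors used by this job
    suggested_gb = peak_spill / num_executors / 1024**3
    print(f"Suggested per-pod ephemeral storage limit: ~{suggested_gb:.1f} GB plus headroom")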

Tips to reduce Spark’s scratch‑disk footprint

Table 3.

Goal                                     | Knob                                                                                                       | Notes
Fewer shuffle bytes                      | spark.sql.shuffle.partitions (set closer to the number of executors) and spark.sql.adaptive.enabled=true  | Adaptive Query Execution coalesces partitions on the fly
Eliminate shuffle joins                  | Broadcast the small side with the /*+ BROADCAST(t) */ hint                                                | Keeps data in RAM when feasible
Compress spill data                      | Ensure spark.shuffle.compress=true (the default)                                                          | Small CPU cost, large disk savings
Use RAM-backed volumes (SSD-less nodes)  | spark.kubernetes.local.dirs.tmpfs=true and raise spark.{driver,executor}.memoryOverheadFactor             | Mounts the emptyDir as tmpfs, so spill consumes memory instead of disk
Persist scratch across pod restarts      | Mount a PVC at /spark-local with spark.kubernetes.executor.volumes.persistentVolumeClaim.<name>.mount.path | Gives Spark a dedicated disk
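Several of these knobs can be set together when the session builds its SparkSession. A sketch in PySpark, assuming Spark 3.3 or later; the values and file paths are illustrative only:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("ephemeral-storage-tuning")
        # Let Adaptive Query Execution coalesce shuffle partitions at runtime.
        .config("spark.sql.adaptive.enabled", "true")
        .config("spark.sql.shuffle.partitions", "64")        # tune toward the executor count
        # Keep shuffle and spill data compressed (the default, shown for clarity).
        .config("spark.shuffle.compress", "true")
        # Back the local dirs with tmpfs and budget extra memory to cover it.
        .config("spark.kubernetes.local.dirs.tmpfs", "true")
        .config("spark.executor.memoryOverheadFactor", "0.4")
        .getOrCreate()
    )

    # Broadcast the small side of a join so no shuffle files are written for it.
    orders = spark.read.parquet("/data/orders")              # illustrative paths
    dims = spark.read.parquet("/data/dimensions")
    joined = orders.join(dims.hint("broadcast"), "dim_id")

If your project uses a spark-defaults.conf file, the same properties can be set there instead so that every session picks them up.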

Change site-wide ephemeral storage configuration

Go to Site Administration > Settings > Advanced to change the ephemeral storage request (minimum) and the maximum limit for the entire site.

Override site-wide ephemeral storage configuration

If you want to customize the ephemeral storage settings for a single project, open the project, click Project Settings > Advanced, and adjust the ephemeral storage parameters.