Troubleshooting: Configuring the Spark engine to optimize the rewrite compaction job
Optimize the RewriteDatafiles (compaction) task in a Cloudera Lakehouse Optimizer policy by configuring the Spark engine, relieving an overloaded YARN NodeManager host, and tuning policy parameters.
Problem
The Cloudera Lakehouse Optimizer policy for the RewriteDatafiles (compaction) task failed on a very large Iceberg table with java.lang.RuntimeException: Failed while running parallel task, caused by java.lang.OutOfMemoryError: Java heap space.
The Iceberg table in this use case has the following details:
- Snapshots = 9,603
- Total size = ~708 GB
- Data files = 2,728,148 (numerous very small files)
- Partitions = 100 buckets
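Figures like the ones above can be collected from Iceberg's built-in metadata tables. A sketch of the Spark SQL queries involved, where catalog.db.big_table is a placeholder for the actual table identifier:

```sql
-- Illustrative queries against Iceberg metadata tables.
-- catalog.db.big_table is a placeholder table identifier.
SELECT count(*) FROM catalog.db.big_table.snapshots;    -- snapshot count

SELECT count(*) AS data_files,
       sum(file_size_in_bytes) AS total_bytes
FROM catalog.db.big_table.files;                        -- file count and total size

SELECT count(*) FROM catalog.db.big_table.partitions;   -- partition count
```

A very high file count relative to total size, as in this table, is the signal that compaction planning itself will be metadata-heavy.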
Cause
The RewriteDatafiles task failed because of the following cascade of errors:
- The table contained millions of small data files, which generated a large number of manifest files. During the compaction process, the Spark driver had to scan and decompress these manifests to plan the file groups. This planning work exhausted the driver heap, causing the out-of-memory (OOM) error.
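A back-of-the-envelope estimate shows why planning alone can exhaust a modestly sized driver heap. The per-file byte figure below is an illustrative assumption, not a measured Iceberg constant:

```python
# Rough estimate of driver heap needed just to hold per-file metadata
# while RewriteDatafiles plans file groups.
DATA_FILES = 2_728_148            # data-file count from the table details
BYTES_PER_FILE_ENTRY = 1_024      # assumed in-memory metadata per file (illustrative)
DRIVER_HEAP_BYTES = 2 * 1024**3   # e.g. a 2 GiB driver heap

planning_bytes = DATA_FILES * BYTES_PER_FILE_ENTRY
print(f"estimated planning footprint: {planning_bytes / 1024**3:.1f} GiB")
print(f"fits in a 2 GiB driver heap: {planning_bytes < DRIVER_HEAP_BYTES}")
```

Even under this conservative assumption the metadata footprint alone exceeds a small driver heap, before decompressed manifests and task bookkeeping are counted.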
- One of the YARN nodes was also running heavy services (HDFS NameNode, HBase RegionServer) that consumed significant memory. This caused a cluster resource crunch, severely reducing the effective memory available to Spark and YARN workloads.
Solution
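As summarized in the overview, recovery involved giving the Spark engine enough driver and executor memory for metadata-heavy planning, moving the co-located heavy services off the overloaded YARN NodeManager host, and tuning the compaction policy to work in smaller increments. A sketch of the Spark-side settings in spark-defaults.conf style follows; the values are illustrative and must be sized to the cluster's actual capacity:

```properties
# Illustrative Spark sizing for a compaction run over millions of files.
# Values are examples, not recommendations for every cluster.
spark.driver.memory=16g
spark.driver.memoryOverhead=4g
spark.executor.memory=8g
spark.executor.cores=4
```

On the policy side, Iceberg's RewriteDataFiles action exposes options such as partial-progress.enabled (commit rewritten groups incrementally instead of in one large commit), max-concurrent-file-group-rewrites, and target-file-size-bytes, which limit how much work is in flight at once and reduce peak memory pressure.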
Outcome
- The service successfully scanned Iceberg metadata and planned file groups.
- The service started rewriting file groups.
- In this specific scenario, the optimization reduced the number of files from ~2.2M to ~0.7M.
