General guidelines
Learn about general guidelines, known issues, and limitations when scaling Cloudera Data Engineering deployments.
Typically, the total physical resources of a cluster are available for all Spark and Airflow
jobs. However, in Cloudera Data Engineering
on premises, capacity is consumed in a stack, and Cloudera Data Engineering jobs can only access the remaining unused resources. The
following layers of consumption exist:
- Physical hardware – The total bare-metal or virtual machine capacity.
- Platform overhead – Resources consumed by the host Operating System, Kubernetes services, and platform-level daemonsets for storage and networking.
- Cloudera Data Services on premises overhead – Resources consumed by the Cloudera Data Services on premises infrastructure, including the Cloudera Control Plane and Cloudera Shared Data Experience services.
- Cloudera Data Engineering Service and Virtual Cluster infrastructure overhead – A fixed amount of resources consumed by the Cloudera Data Engineering service and each Virtual Cluster's internal pods such as Airflow or API server.
- Available Job Capacity – The remaining resources available to run Spark executors and Airflow task pods.
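The layered consumption model above amounts to a simple subtraction: job capacity is whatever remains after each overhead layer is taken from the physical total. The following sketch illustrates this with hypothetical layer sizes (the numbers are examples for illustration, not Cloudera-published figures):

```python
# Illustrative only: the layer sizes below are hypothetical example numbers.
def available_job_capacity(total_gb, platform_gb, data_services_gb, cde_infra_gb):
    """Subtract each overhead layer of the stack from the total physical capacity."""
    return total_gb - platform_gb - data_services_gb - cde_infra_gb

# Example: 256 GB of physical RAM, minus platform, Data Services,
# and CDE/Virtual Cluster infrastructure overhead.
remaining_gb = available_job_capacity(
    total_gb=256, platform_gb=16, data_services_gb=48, cde_infra_gb=24
)
print(remaining_gb)  # 168 GB left for Spark executors and Airflow task pods
```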
Limitations
- Resource fragmentation – Jobs can remain in the Pending state even when monitoring dashboards indicate free resources in the cluster. This often occurs when the available capacity is scattered across multiple nodes in small chunks, none of which are large enough to host the job. The following resource fragmentation cases exist:
  - CPU Fragmentation – For example, a cluster might have 8 available CPU cores, but if this capacity is distributed as 0.5 cores on 16 different nodes, a Spark job requesting a single 4-core executor cannot be scheduled. The job remains in the Pending state because no single node can satisfy its request.
  - Memory Fragmentation – For example, a cluster might have 100 GB of free memory, but if this memory is distributed into 10 GB chunks on 10 different nodes, a job requesting an executor with 20 GB of RAM cannot be scheduled.
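The fragmentation cases above come down to one rule: a pod is schedulable only if a single node can satisfy its whole request, regardless of the aggregate free capacity. This can be sketched as follows (node counts and sizes are the hypothetical figures from the CPU example):

```python
def schedulable(free_cores_per_node, requested_cores):
    """A pod fits only if at least one node alone has the requested amount free."""
    return any(free >= requested_cores for free in free_cores_per_node)

# 16 nodes with 0.5 free cores each: 8 cores free in aggregate,
# yet a single 4-core executor cannot be placed anywhere.
nodes = [0.5] * 16
print(sum(nodes))              # 8.0 cores free in total
print(schedulable(nodes, 4))   # False -- the job stays Pending
```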
- Control Log Ingestion Rate – Platform logging services have a maximum ingestion capacity on the order of 450 KiB/second of disk writes per pod. To ensure log reliability, Spark jobs must not generate logs at an excessive rate. If a job's log output significantly exceeds this rate, it can overwhelm the logging pipeline and result in log loss.
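A quick back-of-the-envelope check can tell you whether a job's logging volume approaches the per-pod limit. The sketch below uses hypothetical line rates and line sizes:

```python
# The ~450 KiB/s figure is the per-pod ingestion capacity noted above;
# the line rates and sizes are hypothetical examples.
INGEST_LIMIT_BYTES_PER_SEC = 450 * 1024

def log_rate_ok(lines_per_second, avg_bytes_per_line):
    """Return True if the estimated log write rate stays under the limit."""
    return lines_per_second * avg_bytes_per_line <= INGEST_LIMIT_BYTES_PER_SEC

print(log_rate_ok(1000, 200))  # True  (~195 KiB/s, well under the limit)
print(log_rate_ok(5000, 200))  # False (~977 KiB/s, risks log loss)
```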
- Memory Overhead (Java or Scala vs Python) – For memory-critical workloads, the implementation language can matter: the lower the overhead of the job, the more executors can be scheduled. For example, in one memory-intensive workload, the Java or Scala job required a 30% memory overhead for stable execution, while the equivalent Python job required a 40% overhead. The lower overhead of the Java or Scala job allowed more executors to be scheduled, improving concurrency.
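To see why the overhead factor affects concurrency, consider a rough per-node calculation. This sketch uses hypothetical node and executor sizes together with the 30% and 40% factors from the example above:

```python
def executors_per_node(node_mem_gb, executor_mem_gb, overhead_factor):
    """Each executor pod requests its heap memory plus the overhead fraction."""
    pod_mem_gb = executor_mem_gb * (1 + overhead_factor)
    return int(node_mem_gb // pod_mem_gb)

# Hypothetical 64 GB node with 8 GB executors:
print(executors_per_node(64, 8, 0.30))  # 6 executors (JVM job, 30% overhead)
print(executors_per_node(64, 8, 0.40))  # 5 executors (PySpark job, 40% overhead)
```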
Configuration override – Memory overhead can be configured per job, either as a percentage factor or as an absolute value.
- Job Level Configuration – You can set these parameters directly in the Spark configurations for each job.
  - General overhead factor: spark.kubernetes.memoryOverheadFactor=0.2 sets a 20% overhead for both the driver and executors.
  - Executor-specific overhead: spark.executor.memoryOverhead=2g sets a fixed 2 GB overhead for executors, regardless of the executor memory size.
  - Driver-specific overhead: spark.driver.memoryOverhead=1g sets a fixed 1 GB overhead for the driver.
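Outside of the Cloudera Data Engineering job UI, the same properties are standard Spark configuration keys, so they can also be passed on the command line. A sketch of a spark-submit invocation (the application file name is a placeholder):

```shell
# Illustrative invocation; my_job.py is a placeholder application.
# Note: the absolute overheads, when set, take precedence over the factor
# for their respective component.
spark-submit \
  --conf spark.kubernetes.memoryOverheadFactor=0.2 \
  --conf spark.executor.memoryOverhead=2g \
  --conf spark.driver.memoryOverhead=1g \
  my_job.py
```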
- Virtual Cluster Level Configuration – The memory overhead factors are defined at the virtual cluster level and can be adjusted there:
  - defaultMemoryOverheadFactor (for regular JVM jobs)
  - nonJVMMemoryOverheadFactor (for PySpark jobs)
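The exact placement of these fields depends on your virtual cluster configuration; as a purely hypothetical sketch of the shape, with values matching Spark's usual defaults of 10% for JVM and 40% for non-JVM workloads:

```yaml
# Hypothetical fragment: field placement within the actual virtual
# cluster configuration may differ.
defaultMemoryOverheadFactor: 0.1   # regular JVM (Java/Scala) jobs
nonJVMMemoryOverheadFactor: 0.4    # PySpark and other non-JVM jobs
```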