General known issues with Cloudera Data Engineering
Learn about the general known issues with the Cloudera Data Engineering (CDE) service on public clouds, the impact or changes to the functionality, and the workaround.
- COMPX-7085: Scheduler crashes due to Out Of Memory (OOM) error in case of clusters with more than 200 nodes
The resource requirements of the YuniKorn scheduler pod depend on cluster size, that is, on the number of nodes and the number of pods. Currently, the scheduler is configured with a memory limit of 2Gi. On a cluster with more than 200 nodes, 2Gi may not be enough, and the scheduler can crash with an OOM error.
- Workaround: Increase the resource requests and limits for the scheduler. Edit the YuniKorn scheduler deployment to raise the memory limit to 16Gi and the memory request to 8Gi, as in the following resources section:
    resources:
      limits:
        cpu: "4"
        memory: 16Gi
      requests:
        cpu: "2"
        memory: 8Gi
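As a rough illustration of applying these values without hand-editing the deployment, the following Python sketch builds a JSON Patch (RFC 6902) document that sets the resources section. The function name and the container index are assumptions made for illustration: confirm which entry in the deployment's containers list is the scheduler before using such a patch.

```python
import json

# Resource values taken from the workaround above.
SCHEDULER_RESOURCES = {
    "limits": {"cpu": "4", "memory": "16Gi"},
    "requests": {"cpu": "2", "memory": "8Gi"},
}


def build_resources_patch(container_index=0):
    """Return a JSON Patch that sets the resources of one container.

    container_index is an assumption: check which entry in the
    deployment's containers list is the scheduler container.
    """
    path = f"/spec/template/spec/containers/{container_index}/resources"
    # "add" on an existing object member replaces its value, so this
    # works whether or not a resources section is already present.
    return [{"op": "add", "path": path, "value": SCHEDULER_RESOURCES}]


if __name__ == "__main__":
    # Print the patch as JSON so it can be reviewed before it is applied.
    print(json.dumps(build_resources_patch()))
```

The printed JSON could then be passed to kubectl patch with --type=json and the -p option, targeting the scheduler deployment. The deployment name and namespace are not given here and must be looked up in your cluster.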
- COMPX-6949: Stuck jobs prevent cluster scale down
Because of hanging jobs, the cluster cannot scale down even when there is no ongoing activity. This can happen when a node is removed unexpectedly, leaving some pods stuck in the Pending state. These pending pods prevent the cluster from downscaling.
- Workaround: Terminate the jobs manually.
- DEX-3997: Python jobs using virtual environment fail with import error
- Running a Python job that uses a virtual environment resource fails with an import error, such as:
    Traceback (most recent call last):
      File "/tmp/spark-826a7833-e995-43d2-bedf-6c9dbd215b76/app.py", line 3, in <module>
        from insurance.beneficiary import BeneficiaryData
    ModuleNotFoundError: No module named 'insurance'
- Workaround: Do not set the spark.pyspark.driver.python configuration parameter when using a Python virtual environment resource in a job.
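If job configurations are assembled programmatically, one way to follow this workaround is to remove the parameter before the job is submitted. The sketch below is only an illustration: it assumes the Spark configuration is held in a plain Python dictionary, and the helper name strip_driver_python and the sample values are hypothetical, not part of any CDE API.

```python
DRIVER_PYTHON_KEY = "spark.pyspark.driver.python"


def strip_driver_python(spark_conf):
    """Return a copy of spark_conf without spark.pyspark.driver.python.

    The input dictionary is left unmodified.
    """
    return {k: v for k, v in spark_conf.items() if k != DRIVER_PYTHON_KEY}


if __name__ == "__main__":
    # Hypothetical job configuration used only for demonstration.
    conf = {
        "spark.executor.memory": "4g",
        "spark.pyspark.driver.python": "/usr/bin/python3",
    }
    print(strip_driver_python(conf))
```

Returning a new dictionary rather than mutating the input keeps the original configuration available for jobs that do not use a virtual environment resource.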