General known issues with Cloudera Data Engineering

Learn about the general known issues with the Cloudera Data Engineering (CDE) service on public clouds, the impact or changes to the functionality, and the workaround.

DEX-6873 Kubernetes 1.21 will fail service account token renewal after 90 days
Cloudera Data Engineering (CDE) services on AWS running CDE 1.14 and later with Kubernetes 1.21 observe failed jobs after 90 days of service uptime.
Restart specific components to force token regeneration, using one of the following options:

Option 1) Using kubectl:

  1. Set up kubectl for CDE.
  2. Delete calico-node pods.
    kubectl delete pods --selector k8s-app=calico-node --namespace kube-system
  3. Delete Livy pods for all Virtual Clusters.
    kubectl delete pods --selector app.kubernetes.io/name=livy --all-namespaces

    If for some reason only one Livy pod needs to be fixed:

    1. Find the virtual cluster ID through the UI under Cluster Details.
    2. Delete Livy pod:
      export VC_ID=<VC ID>
      kubectl delete pods --selector app.kubernetes.io/name=livy --namespace ${VC_ID}

Option 2) Using K8s dashboard

  1. On the Service Details page, copy the RESOURCE SCHEDULER link.
  2. In the copied link, replace yunikorn with dashboard and open the resulting link in the browser.
  3. In the top left corner find the namespaces dropdown and choose All namespaces.
  4. Search for calico-node.
  5. For each pod in the Pods table click the Delete option from the hamburger menu.
  6. Search for livy.
  7. For each pod in the Pods table click the Delete option from the hamburger menu.
  8. If for some reason only one Livy pod needs to be fixed, find the Virtual Cluster ID in the UI under Cluster Details and delete only the pod whose name starts with that Virtual Cluster ID.
DEX-7286 In-place upgrade (Technical Preview) issue: Certificate expired error in browser
Certificates fail after an in-place upgrade from CDE 1.14.
Start the certificate update:

Get cluster ID

  1. Navigate to the Cloudera Data Engineering Overview page by clicking the Data Engineering tile in the Cloudera Data Platform (CDP) management console.
  2. Edit the service details.
  3. Copy the cluster ID field to the clipboard.
  4. In a terminal set the CID environment variable to this value.
    export CID=cluster-1234abcd

Get session token

  1. Navigate to the Cloudera Data Engineering Overview page by clicking the Data Engineering tile in the Cloudera Data Platform (CDP) management console.
  2. Right-click the page and select Inspect.
  3. Click the Application tab.
  4. Click Cookies and select the URL of the console.
  5. Select cdp-session-token.
  6. Double-click the displayed cookie value, then right-click and select Copy.
  7. Open a terminal and set the CST environment variable to the copied value.
    export CST=<Paste value of cookie here>

Force TLS certificate update

curl -b cdp-session-token=${CST} -X 'PATCH' -H 'Content-Type: application/json' -d '{"status_update":"renewTLSCerts"}' "https://<URL OF CONSOLE>/dex/api/v1/cluster/${CID}"
DEX-7051 EnvironmentPrivilegedUser role cannot be used with CDE
The EnvironmentPrivilegedUser role cannot currently be used to access CDE. A user who has this role cannot interact with CDE and receives an "access denied" error.
Cloudera recommends not using or assigning the EnvironmentPrivilegedUser role to users who need to access CDE.
CDPD-40396 Iceberg migration fails on partitioned Hive table created by Spark without location
Iceberg provides a migrate procedure for migrating a Parquet/ORC/Avro Hive table to Iceberg. If the table was created using Spark without specifying location and is partitioned, the migration fails.
By default, the table has the TRANSLATED_TO_EXTERNAL property set to true. Unset this property by running ALTER TABLE ... UNSET TBLPROPERTIES ('TRANSLATED_TO_EXTERNAL'), then run the migrate procedure.
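
For example, a minimal sketch of this workaround using the spark-sql shell, assuming a Hive table named db.sample_table and that Iceberg's SQL extensions and the spark_catalog session catalog are configured in the Spark session (adjust the table and catalog names for your environment):

# Unset the property that blocks migration on the partitioned Hive table
spark-sql -e "ALTER TABLE db.sample_table UNSET TBLPROPERTIES ('TRANSLATED_TO_EXTERNAL')"

# Run the Iceberg migrate procedure on the same table
spark-sql -e "CALL spark_catalog.system.migrate('db.sample_table')"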
Strict DAG declaration in Airflow 2.2.5
CDE 1.16 introduces Airflow 2.2.5, which is stricter about DAG declaration than the Airflow version previously supported in CDE. In Airflow 2.2.5, the DAG timezone must be a pendulum.tz.Timezone, not datetime.timezone.utc.
If you upgrade to CDE 1.16, make sure that you have updated your DAGs according to the Airflow documentation; otherwise your DAGs cannot be created in CDE and the restore process cannot restore them.

Example of valid DAG:

import pendulum
from airflow import DAG
from airflow.operators.dummy import DummyOperator

dag = DAG("my_tz_dag", start_date=pendulum.datetime(2016, 1, 1, tz="Europe/Amsterdam"))
op = DummyOperator(task_id="dummy", dag=dag)

Example of invalid DAG:

from datetime import timezone
from dateutil import parser
from airflow import DAG
from airflow.operators.dummy import DummyOperator

dag = DAG("my_tz_dag", start_date=parser.isoparse('2020-11-11T20:20:04.268Z').replace(tzinfo=timezone.utc))
op = DummyOperator(task_id="dummy", dag=dag)
COMPX-5494: YuniKorn recovery intermittently deletes existing placeholders
During recovery, YuniKorn may intermittently delete existing placeholder pods, and other placeholder pods may remain after recovery completes. This can cause unexpected behavior during rescheduling.
There is no workaround for this issue. To avoid unexpected behavior, Cloudera recommends manually removing all placeholder pods before restarting the scheduler.
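
As a sketch, one way to locate leftover placeholder pods with kubectl, assuming they follow the usual YuniKorn gang-scheduling naming with a tg- prefix (verify the names and namespaces before deleting anything):

# List pods across all namespaces whose names look like YuniKorn gang-scheduling placeholders
kubectl get pods --all-namespaces | grep ' tg-'

# Delete a pod only after confirming it is a leftover placeholder
kubectl delete pod <placeholder-pod-name> --namespace <namespace>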
DWX-8257: CDW Airflow Operator does not support SSO

Although Virtual Warehouses (VWs) in Cloudera Data Warehouse (CDW) support SSO, SSO is incompatible with the CDE Airflow service because, for the time being, the Airflow CDW Operator supports only workload username and password authentication.

Disable SSO in the VW.
COMPX-7085: Scheduler crashes due to Out Of Memory (OOM) error on clusters with more than 200 nodes

The resource requirements of the YuniKorn scheduler pod depend on the cluster size, that is, on the number of nodes and pods. Currently, the scheduler is configured with a memory limit of 2Gi. When running on a cluster that has more than 200 nodes, the 2Gi memory limit may not be enough, which can cause the scheduler to crash because of OOM.

Increase resource requests and limits for the scheduler. Edit the YuniKorn scheduler deployment to increase the memory limit to 16Gi.

For example:

resources:
  limits:
    cpu: "4"
    memory: 16Gi
  requests:
    cpu: "2"
    memory: 8Gi
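
One way to apply this change, as a sketch that assumes the scheduler runs as a Deployment named yunikorn-scheduler in the yunikorn namespace (the actual names can differ in your cluster), is to edit the deployment with kubectl:

# Opens the scheduler deployment in an editor; update the resources section as shown above
kubectl edit deployment yunikorn-scheduler --namespace yunikorn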
COMPX-6949: Stuck jobs prevent cluster scale down

Because of hanging jobs, the cluster cannot scale down even when there is no ongoing activity. This can happen when an unexpected node removal causes some pods to become stuck in the Pending state; these pending pods prevent the cluster from scaling down.

Terminate the jobs manually.
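
To identify the pods that are blocking the scale down, a minimal kubectl sketch (pod and namespace names are placeholders) is:

# List pods stuck in the Pending phase across all namespaces
kubectl get pods --all-namespaces --field-selector=status.phase=Pending

# After terminating the corresponding job, delete any pod that remains stuck
kubectl delete pod <pending-pod-name> --namespace <namespace>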
DEX-3997: Python jobs using virtual environment fail with import error
Running a Python job that uses a virtual environment resource fails with an import error, such as:
Traceback (most recent call last):
  File "/tmp/spark-826a7833-e995-43d2-bedf-6c9dbd215b76/app.py", line 3, in <module>
    from insurance.beneficiary import BeneficiaryData
ModuleNotFoundError: No module named 'insurance'
Do not set the spark.pyspark.driver.python configuration parameter when using a Python virtual environment resource in a job.
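
For example, a sketch of submitting such a job with the CDE CLI, assuming a Python environment resource named my-python-env and an application file app.py (flag names may vary slightly between CDE versions):

# Rely on the Python virtual environment resource; do not override spark.pyspark.driver.python
cde spark submit --python-env-resource-name my-python-env app.py

# Avoid this: overriding the driver Python bypasses the virtual environment and
# causes ModuleNotFoundError for packages installed in it
cde spark submit --python-env-resource-name my-python-env \
  --conf spark.pyspark.driver.python=/usr/bin/python3 app.py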