Learn about the general known issues with the Cloudera Data
Engineering (CDE) service on public clouds, their impact on
functionality, and the available workarounds.
- DEX-6873 Kubernetes 1.21 will fail service account token renewal
after 90 days
- Cloudera Data Engineering (CDE) on AWS running CDE 1.14
and above with Kubernetes 1.21 may observe failed jobs after 90 days of service
uptime.
- Restart specific components to force regeneration of the token,
using one of the following options:
-
Option 1) Using kubectl:
- Set up kubectl for CDE.
- Delete calico-node
pods.
kubectl delete pods --selector k8s-app=calico-node --namespace kube-system
- Delete Livy pods for all Virtual
Clusters.
kubectl delete pods --selector app.kubernetes.io/name=livy --all-namespaces
If only one Livy pod needs to be fixed:
- Find the virtual cluster ID through the UI under Cluster
Details.
- Delete the Livy pod:
export VC_ID=<VC ID>
kubectl delete pods --selector app.kubernetes.io/name=livy --namespace ${VC_ID}
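Optionally, confirm that the deleted pods have been recreated and are Running,
reusing the same selectors as above:
kubectl get pods --selector k8s-app=calico-node --namespace kube-system
kubectl get pods --selector app.kubernetes.io/name=livy --all-namespaces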
Option 2) Using the K8s dashboard:
-
- On the Service Details page, copy the RESOURCE
SCHEDULER link.
- In the copied link, replace yunikorn with
dashboard and open the resulting link in the browser.
- In the top left corner, open the namespaces dropdown and choose All
namespaces.
- Search for calico-node.
- For each pod in the Pods table, click the
Delete option in the hamburger menu.
- Search for livy.
- For each pod in the Pods table, click the
Delete option in the hamburger menu.
- If only one Livy pod needs to be fixed, find the Virtual Cluster ID
through the UI under Cluster Details and delete only the pod
whose name starts with the Virtual Cluster ID.
- DEX-7286 In-place upgrade (Technical Preview) issue: certificate
expired error shown in browser
- Certificates fail after an in-place upgrade from CDE 1.14.
- Renew the certificates as follows:
-
Get cluster ID
- Navigate to the Cloudera Data Engineering Overview page by
clicking the Data Engineering tile in the Cloudera
Data Platform (CDP) management console.
- Edit the service details.
- Copy the cluster ID field to the clipboard.
- In a terminal, set the CID environment variable to this
value.
export CID=cluster-1234abcd
Get session token
- Navigate to the Cloudera Data Engineering Overview page by
clicking the Data Engineering tile in the Cloudera
Data Platform (CDP) management console.
- Right-click and select Inspect.
- Click the Application tab.
- Click
Cookies and select the URL of the console.
- Select cdp-session-token.
- Double-click the displayed cookie value, then right-click and select
Copy.
- Open a terminal.
export CST=<Paste value of cookie here>
Force TLS certificate update
curl -b cdp-session-token=${CST} -X 'PATCH' -H 'Content-Type: application/json' -d '{"status_update":"renewTLSCerts"}' "https://<URL OF CONSOLE>/dex/api/v1/cluster/${CID}"
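Optionally, verify the renewed certificate dates from a terminal; a sketch using
openssl, with <host> standing in for your console or virtual cluster host:
echo | openssl s_client -connect <host>:443 -servername <host> 2>/dev/null | openssl x509 -noout -dates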
- DEX-7051 EnvironmentPrivilegedUser role cannot be used with CDE
- The EnvironmentPrivilegedUser role cannot currently be used to
access CDE. A user who has this role cannot interact with CDE; an
"access denied" error occurs.
- Cloudera recommends not using or assigning the
EnvironmentPrivilegedUser role for accessing CDE.
- CDPD-40396 Iceberg migration fails on partitioned Hive table created by Spark without
location
- Iceberg provides a migrate procedure for migrating a Parquet/ORC/Avro Hive table to
Iceberg. If the table is partitioned and was created using Spark without specifying a
location, the migration fails.
- By default, the table has the TRANSLATED_TO_EXTERNAL property set to
true. Unset this property by running ALTER TABLE ...
UNSET TBLPROPERTIES ('TRANSLATED_TO_EXTERNAL'), and then run the
migrate procedure.
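For illustration, a minimal sketch using spark-sql, assuming a hypothetical
table named db.sample_table and an Iceberg-enabled Spark configuration (the
table and catalog names are placeholders, not from this issue):
# Unset the property that blocks migration (db.sample_table is a hypothetical name).
spark-sql -e "ALTER TABLE db.sample_table UNSET TBLPROPERTIES ('TRANSLATED_TO_EXTERNAL')"
# Run the Iceberg migrate procedure through the session catalog.
spark-sql -e "CALL spark_catalog.system.migrate('db.sample_table')"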
- Strict DAG declaration in Airflow 2.2.5
- CDE 1.16 introduces Airflow 2.2.5, which is stricter about DAG declaration than the
previously supported Airflow version in CDE. In Airflow 2.2.5, the DAG timezone must be a
pendulum.tz.Timezone, not datetime.timezone.utc.
- If you upgrade to CDE 1.16, make sure that you have updated your
DAGs according to the Airflow documentation; otherwise your DAGs
cannot be created in CDE and the restore process cannot restore
these DAGs.
Example of valid
DAG:
import pendulum
from airflow import DAG
from airflow.operators.dummy import DummyOperator

dag = DAG("my_tz_dag", start_date=pendulum.datetime(2016, 1, 1, tz="Europe/Amsterdam"))
op = DummyOperator(task_id="dummy", dag=dag)
Example of invalid
DAG:
from datetime import timezone
from dateutil import parser
from airflow import DAG
from airflow.operators.dummy import DummyOperator

# Invalid: tzinfo is datetime.timezone.utc rather than a pendulum timezone
dag = DAG("my_tz_dag", start_date=parser.isoparse('2020-11-11T20:20:04.268Z').replace(tzinfo=timezone.utc))
op = DummyOperator(task_id="dummy", dag=dag)
- COMPX-5494: YuniKorn recovery intermittently deletes existing placeholders
- On recovery, YuniKorn may intermittently delete placeholder pods, and leftover
placeholder pods may remain after recovery. This may cause unexpected behavior during
rescheduling.
- There is no workaround for this issue. To avoid any unexpected behavior, Cloudera
suggests removing all the placeholders manually before restarting the
scheduler.
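A minimal sketch for finding and removing leftover placeholders with kubectl; the
tg- name prefix for placeholder pods is an assumption, so verify how placeholders
are named on your cluster before deleting anything:
# List pods whose names use the assumed "tg-" placeholder prefix.
kubectl get pods --all-namespaces | grep ' tg-'
# Delete a specific placeholder once identified.
kubectl delete pod <placeholder-pod-name> --namespace <namespace>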
- DWX-8257: CDW Airflow Operator does not support SSO
-
Although Virtual Warehouse (VW) in Cloudera Data Warehouse (CDW) supports SSO, this is
incompatible with the CDE Airflow service as, for the time being, the Airflow CDW
Operator only supports workload username/password authentication.
- Disable SSO in the VW.
- COMPX-7085: Scheduler crashes due to Out Of Memory (OOM) error in case of clusters with
more than 200 nodes
-
Resource requirement of the YuniKorn scheduler pod depends on cluster size, that is,
the number of nodes and the number of pods. Currently, the scheduler is configured with
a memory limit of 2Gi. When running on a cluster that has more than 200 nodes, the
memory limit of 2Gi may not be enough. This can cause the scheduler to crash because of
OOM.
-
Increase resource requests and limits for the scheduler. Edit the YuniKorn
scheduler deployment to increase the memory limit to 16Gi.
For
example:
resources:
  limits:
    cpu: "4"
    memory: 16Gi
  requests:
    cpu: "2"
    memory: 8Gi
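A minimal sketch of making this change with kubectl; the deployment and namespace
names are assumptions (verify them with kubectl get deployments --all-namespaces):
# Open the scheduler deployment for editing and adjust the resources block above.
kubectl edit deployment yunikorn-scheduler --namespace yunikorn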
- COMPX-6949: Stuck jobs prevent cluster scale down
-
Because of hanging jobs, the cluster is unable to scale down even when there are no
ongoing activities. This may happen when an unexpected node removal occurs, causing
some pods to be stuck in the Pending state. These pending pods prevent the cluster from
downscaling.
- Terminate the jobs manually.
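To locate the stuck pods first, a quick check with standard kubectl:
# List pods stuck in the Pending phase across all namespaces.
kubectl get pods --all-namespaces --field-selector=status.phase=Pending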
- DEX-3997: Python jobs using virtual environment fail with import error
- Running a Python job that uses a virtual environment resource fails with an import
error, such as:
-
Traceback (most recent call last):
File "/tmp/spark-826a7833-e995-43d2-bedf-6c9dbd215b76/app.py", line 3, in <module>
from insurance.beneficiary import BeneficiaryData
ModuleNotFoundError: No module named 'insurance'
- Do not set the
spark.pyspark.driver.python
configuration parameter when using a Python virtual environment resource in a job.
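For illustration, a hedged sketch of submitting such a job with the CDE CLI, relying
on the virtual environment resource instead of overriding the driver Python; the
resource and file names here are hypothetical, and flag availability may vary by CLI
version:
# Attach the Python virtual environment resource; do not also pass
# --conf spark.pyspark.driver.python=... for this job.
cde spark submit app.py --python-env-resource-name my-python-env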