Cloudera Data Engineering known issues archive

Learn about the archived known issues of the Cloudera Data Engineering service, the impact or changes to the functionality, and the workaround.

DEX-14094: New pipeline editor can create a cde_job step of its own pipeline job which causes recursion and looping

If you add a Cloudera Data Engineering Job step in the pipeline editor and select the same job as the pipeline job from the Select Job drop-down list while configuring the pipeline job using the editor, then running the pipeline job results in a recursive loop.

For example, if you create a pipeline job named test-dag and select the same test-dag job from the Select Job drop-down list while adding the Cloudera Data Engineering Job step, then running the pipeline job results in a recursive loop.

This issue is resolved.
DEX-13465: Cloudera Data Engineering Airflow DAG Code Fetch is not working
The embedded Airflow UI within the Cloudera Data Engineering Job pages does not correctly show the "Code" view.
Open the Jobs page through the Virtual Cluster link and use the old Cloudera Data Engineering UI, or manually open the Airflow UI through the Virtual Cluster details page and navigate to the appropriate DAG details.
This issue is resolved.
DEX-14067: Email alerting is not sending email to the recipient even after the job fails
When a Virtual Cluster is created through the UI, SMTP configuration and authentication is empty. Enabling Email Alerting after VC creation allows users to add the SMTP parameters, but the updated password is not fetched when sending email alerts.
Restart the runtime-api pod after adding the SMTP configuration. On restart, the secrets are re-fetched and all services are reinitialized with the updated secrets.
  1. List all pods in the namespace (for example, dex-app-dtznsgc5):
    kubectl get pods -n [***NAMESPACE***]
  2. Copy the runtime-api pod name. (For example, dex-app-dtznsgc5-api-b5f8d7cb9-gdbxb.)
  3. Delete the runtime-api pod:
    kubectl delete pod [***POD NAME***] -n [***NAMESPACE***]
  4. A new instance of the runtime-api pod is initialized.
  5. Test email alerting. (For example, via a job run failure.)
This issue is resolved.
DEX-14027: Spark 3.5.1 jobs are failing with error 'org.apache.hadoop.fs.s3a.impl.InstantiationIOException'
Spark 3.5.1 RAZ-enabled Cloudera Data Engineering clusters fail to initialize RAZ S3 plugin library due to recent backward incompatible changes in the library, and jobs fail with error org.apache.hadoop.fs.s3a.impl.InstantiationIOException.
Depending on the configuration level, the following workarounds are available through the API or CLI, since setting empty parameters in the UI is not possible.
  • (preferred) as a VC-level config setting (not applicable to Cloudera Data Engineering sessions), with the API, use:
    curl '[***CLUSTER APP INSTANCE API ENDPOINT***]' \
      -X 'PATCH' \
      -H 'Connection: keep-alive' \
      -H 'Content-Type: application/json' \
      -H  'Cookie: [***COOKIE CONTAINING CDP-TOKEN***]' \
      --data-raw '{"config":{"sparkConfigs":{"spark.hadoop.fs.s3a.http.signer.class":"org.apache.ranger.raz.hook.s3.RazS3SignerPlugin","spark.hadoop.fs.s3a.http.signer.enabled":"true","spark.hadoop.fs.s3a.custom.signers":"","spark.hadoop.fs.s3a.s3.signing-algorithm":""}}}'
  • as a per session configuration (VC-level configuration will not be applied), in the CLI, run:
    ./cde --vcluster-endpoint [***CLUSTER APP INSTANCE API ENDPOINT***] session create \
    --name [***CDE SESSION NAME***] \
    --conf "spark.hadoop.fs.s3a.custom.signers=" \
    --conf "spark.hadoop.fs.s3a.http.signer.class=org.apache.ranger.raz.hook.s3.RazS3SignerPlugin" \
    --conf "spark.hadoop.fs.s3a.http.signer.enabled=true" \
    --conf "spark.hadoop.fs.s3a.s3.signing-algorithm=" 
  • in the CLI, at job creation, run:
    ./cde --vcluster-endpoint [***VIRTUAL CLUSTER JOB API URL***] job run --name [***CDE JOB NAME***] \
    --conf "spark.hadoop.fs.s3a.custom.signers=" \
    --conf "spark.hadoop.fs.s3a.http.signer.class=org.apache.ranger.raz.hook.s3.RazS3SignerPlugin" \
    --conf "spark.hadoop.fs.s3a.http.signer.enabled=true" \
    --conf "spark.hadoop.fs.s3a.s3.signing-algorithm="
This issue is resolved.
DEX-13975: 'spark.catalog.listTables()' command in job is failing in Python Spark for Spark 3.5.1
Using catalog.listTables() with Iceberg tables results in an exception org.apache.spark.sql.catalyst.parser.ParseException: [PARSE_SYNTAX_ERROR] Syntax error at or near end of input.
Set spark.sql.legacy.useV1Command=true for running catalog.listTables().
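For reference, a minimal PySpark sketch of this workaround, assuming the configuration is applied when the session is built; the application name and database name are illustrative:

# Minimal sketch: enable the legacy V1 command path before listing tables.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("list-tables-workaround")  # illustrative application name
    .config("spark.sql.legacy.useV1Command", "true")
    .getOrCreate()
)

# catalog.listTables() should now return table metadata instead of raising PARSE_SYNTAX_ERROR.
for table in spark.catalog.listTables("default"):  # "default" database assumed
    print(table.name, table.tableType)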
This issue is resolved.
DEX-12630: Cloudera Data Engineering Service failed to run jobs on SSD Based clusters
When a customer creates a new Cloudera Data Engineering service with an SSD Instance enabled on Cloudera Data Engineering version greater than or equal to 1.19.4, Spark and Airflow jobs do not start at all. The same problem happens if an existing Cloudera Data Engineering service is upgraded to 1.19.3 or greater and has SSD Instance enabled.
Create a new Cloudera Data Engineering service without SSD Instance enabled for versions up to and including 1.20.3-h1. From 1.20.3-h2, you can create a new Cloudera Data Engineering service with SSD Instance enabled. However, you cannot upgrade an existing SSD-based service; you must create a new one.
This issue is resolved.
DEX 12451: Service creation fails when "Enable Public Loadbalancer" is selected in an environment with only private subnets
When creating a service with the Enable Public Load Balancer option selected, the service creation fails with the following error:
CDE Service: 1.20.3-b15 ENV: dsp-storage-mow-priv (in mow-priv) and dsp-storage-aws-dev-newvpc (mw-dev) – Environment is configured only with private subnets, there are no public subnets. dex-base installation failed, events: [{"ResourceKind":"Service","ResourceName":"dex-base-nginx-56g288bq-controller","ResourceNamespace":"dex-base-56g288bq","Message":"Error syncing load balancer: failed to ensure load balancer: could not find any suitable subnets for creating the ELB","Timestamp":"2024-02-08T09:55:28Z"}]
When creating a service and enabling a public load balancer, configure at least one public subnet in the environment. For more information, see Enabling a Cloudera Data Engineering service.
This issue is resolved.
ENGESC-22921: Livy pod failure (CrashLoopBackoff)
Insufficient Livy overhead memory causes the Livy service to crash and the pod to restart.
Increase the readiness/liveness timeout to 60 seconds so that the Livy pod can start.
This issue is resolved.
DEX-11086: Cancelled statement(s) not canceled by Livy
Currently, Livy statements cannot be cancelled immediately after using /sessions/{name}/statements/{id}/cancel. The status is returned as Cancelled but the background job continues to run.
There is a limitation on what can be cancelled. For example, if something is running exclusively on the driver, such as Thread.sleep(), it cannot be cancelled.
This issue is resolved.
DEX-9939: Tier 2 Node groups are created with the same size as Tier 1 node groups during service creation. They cannot be edited during service edit
If a service is created with 1 as the minimum on-demand scale limit, two nodes will run for Tier 1 and Tier 2. Even if a service is edited with the minimum reduced to 0, the Tier 2 node will still run. This will be fixed in the Cloudera Data Engineering 1.20 release.
You must manually edit the node group parameters from the AWS or Azure console. First, find the log entry titled "Started provisioning a managed cluster, provisionerID: liftie-xxxxxxxx" and note the Liftie ID for the cluster so that you can continue with the steps below.
In AWS:
  1. Log in to the AWS Management Console.
  2. Navigate to EC2 > Auto Scaling groups.
  3. Find <liftie-id>-spt2-<hash>-NodeGroup and click on the name to open the Instance Group Details page.
  4. Under Group Details, click Edit.
  5. Update Maximum capacity from 5 to 10, and click Update.
  6. Repeat the steps above to update the Maximum capacity of the <liftie-id>-cmp2-5<hash>-NodeGroup from 2 to 5. The Cluster Autoscaler applies the changes and the issue resolves; the number of simultaneously running jobs increases.
In Azure:
  1. Navigate to the Kubernetes Service Details > Log page.
  2. Navigate to the Node Pools page, and locate the Node Pool starting with cmp2.
  3. Click on Scale Node pools and edit the capacity.
This issue is resolved.
DEX-9852: FreeIPA certificate mismatch issues for new Spark 3.3 Virtual Clusters
In Cloudera Data Engineering 1.19, when creating a new Virtual Cluster based on Spark 3.3, and submitting any job in the pods, the following error occurs: "start TGT gen failed for user : rpc error: code = Unavailable desc = Operation not allowed while app is in recovery state."
Manually copy over the certificate from the FreeIPA server.
This issue is resolved.
DEX-9932: Name length causes Pod creation error for Cloudera Data Engineering Sessions
In Cloudera Data Engineering 1.19, Kubernetes pod names are limited to 63 characters, while Cloudera Data Engineering Sessions allow names of up to 56 characters.
Create a Cloudera Data Engineering Session with a name of less than 56 characters to avoid the pod creation error.
This issue is resolved.
DEX-9895: Cloudera Data Engineering Virtual Cluster API response displays default Spark version as 2.4.7
In Cloudera Data Engineering 1.19, the Spark version 3.2.3 is the expected default in a Cloudera Data Engineering Spark Virtual Cluster, but Spark 2.4.7 displays instead. This issue will be fixed in Cloudera Data Engineering 1.20.
This issue is resolved.
DEX-9790: Single tab Session support for Virtual Cluster selection
In Cloudera Data Engineering 1.19, the Virtual Cluster selection in the Jobs, Resources, Runs, and Sessions page is not preserved if the user attempts to open Cloudera Data Engineering in another browser tab/window.
When you open Cloudera Data Engineering in another tab, you must re-select the Virtual Cluster that you want to use in the new tab.
This issue is resolved.
DEX-10044: Handle adding tier 2 auto scaling groups during in-place upgrades
Since auto scaling groups (ASGs) are not added or updated during the upgrade, the tier 2 ASGs are not created. This results in pods that cannot be scheduled. This error applies to services created in Cloudera Data Engineering 1.18 and then upgraded to 1.19.
Create a new Cloudera Data Engineering 1.19 service; this issue applies only to upgraded clusters and does not occur on newly created services.
This issue is resolved.
DEX-10107: Spark 3.3 in Cloudera Data Engineering 1.19 has a limitation of characters for job names
Jobs with names longer than 23 characters can fail in Spark 3.3 with the following exception: 23-05-14 10:14:16 ERROR ExecutorPodsSnapshotsStoreImpl: Going to stop due to IllegalArgumentException java.lang.IllegalArgumentException: '$JOB_NAME' in spark.kubernetes.executor.podNamePrefix is invalid. must conform https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#dns-label-names and the value length <= 47
Change the name of the job:
  1. Clone the job with a new name using the Cloudera Data Engineering UI, CLI, or API.
  2. Set the app name in the job itself, for example conf.setAppName("Custom Job Name"), as shown in the sketch after this list.
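A minimal PySpark sketch of the second option, assuming the job builds its own SparkConf; the application name shown is an illustrative short value:

# Minimal sketch: set a short application name so the derived
# spark.kubernetes.executor.podNamePrefix stays within the allowed length.
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf().setAppName("short-job-name")  # illustrative short name
spark = SparkSession.builder.config(conf=conf).getOrCreate()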
This issue is resolved.
DEX-10055: Interacting with a killed session
When you interact with a long-running killed Spark session, the session might become unresponsive. Refrain from interacting with the long-running killed session. This will be fixed in a future release of Cloudera Data Engineering.
This issue is resolved.
DEX-8769: The table entity type on Atlas is spark_tables instead of hive_tables on Spark3 Virtual Clusters
Tables that are created using a Spark3 Virtual Cluster on an AWS setup will have spark_tables type instead of hive_tables on Atlas Entities.
On a Spark3 Virtual Cluster, enableHiveSupport() must be called in the following way: spark = SparkSession.builder.enableHiveSupport().getOrCreate(). Alternatively, you can use Spark2 instead of Spark3, as this issue does not occur in Spark2.
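A minimal PySpark sketch of this call; the table name and columns are placeholders:

# Minimal sketch: build the session with Hive support so tables created by the
# job are registered with the hive_tables entity type in Atlas.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Illustrative table; the name and columns are assumptions.
spark.sql("CREATE TABLE IF NOT EXISTS example_table (id INT, name STRING)")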
This issue is resolved.
DEX-8774: Job and run cloning is not fully supported in Cloudera Data Engineering 1.17 through 1.18.1
Currently, cloning jobs and job runs is not supported in Cloudera Data Engineering 1.17 through 1.18.1.
Clone jobs and job runs by navigating to the Administration page and clicking View Jobs on the respective Virtual Cluster.
This issue is resolved.
DEX-8515: The Spark History Server user interface is not visible in Cloudera Data Engineering
During job execution in Cloudera Data Engineering 1.18, the Spark History Server user interface is not visible. This error will be fixed in Cloudera Data Engineering 1.18.1.
This issue is resolved.
DEX-6163: Error message with Spark 3.2 and Cloudera Data Engineering
For Cloudera Data Engineering 1.16 through 1.18, you might see the error message "Service account may have been revoked" with Spark 3.2 and Cloudera Data Engineering. Despite what the message states, this is not the core issue: the error is harmless and displays, as part of the shutdown process, only after a job has already failed for another reason. Look for other exceptions to find the root cause. This issue will be fixed in Cloudera Data Engineering 1.18.1.
This issue is resolved.
DEX-7653: Updating Airflow Job/Dag file throws a 404 error
A 404 error occurs when you update an Airflow Job/DAG file with a modified DAG ID or name through the following steps:
  1. Create an Airflow job using a Simple Dag file. Use the Create Only option.
  2. Edit the Airflow Job and delete the existing DAG file.
  3. Upload the same DAG file with the DAG ID and name modified in its content.
  4. Choose a different Resource Folder.
  5. Use the Update and Run option.

    The 404 error occurs.

    To avoid this issue, ensure that you do not modify the DAG ID in step 3. If you must change the DAG ID in the DAG file, create a new file.

This issue is resolved.
CDPD-40396 Iceberg migration fails on partitioned Hive table created by Spark without location
Iceberg provides a migrate procedure to migrate a Parquet/ORC/Avro Hive table to Iceberg. If the table is partitioned and was created using Spark without a specified location, the migration fails.
If you are using Data Lake 7.2.15.2 or higher, this known issue does not occur. Otherwise, unset the TRANSLATED_TO_EXTERNAL table property, which defaults to true, by completing the following steps:
  1. Run ALTER TABLE ... UNSET TBLPROPERTIES ('TRANSLATED_TO_EXTERNAL') to unset the property.
  2. Run the migrate procedure, as in the sketch after these steps.
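For illustration, a minimal PySpark sketch of these two steps, assuming an Iceberg-enabled Spark session; the table name default.sample and the spark_catalog catalog are placeholders:

# Minimal sketch; the table and catalog names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Step 1: unset the property that blocks the migration.
spark.sql("ALTER TABLE default.sample UNSET TBLPROPERTIES ('TRANSLATED_TO_EXTERNAL')")

# Step 2: run the Iceberg migrate procedure on the same table.
spark.sql("CALL spark_catalog.system.migrate('default.sample')")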
This issue is resolved.
DEX-5857 Persist job owner across Cloudera Data Engineering backup restores
Currently, the user who runs the cde backup restore command is, by default, the user with permission to run the restored jobs. This may cause Cloudera Data Engineering jobs to fail if this workload user differs from the user who ran the jobs on the source Cloudera Data Engineering service where the backup was performed, because the two users may have different privileges.
Ensure that the cde backup restore command is run by the same user who runs the Cloudera Data Engineering jobs in the source Cloudera Data Engineering cluster where the backup was performed. Alternatively, ensure that the user running the restore has the same set of permissions as the user running the jobs in that source cluster.
This issue is resolved.
DEX-7483 User interface bug for in-place upgrade (Tech Preview)
The user interface incorrectly states that the Data Lake version 7.2.15 and above is required. The correct minimum version is 7.2.14.
This issue is resolved.
DEX-6873 Kubernetes 1.21 will fail service account token renewal after 90 days
Cloudera Data Engineering on AWS running version Cloudera Data Engineering 1.14 through 1.16 using Kubernetes 1.21 will observe failed jobs after 90 days of service uptime.
Restart specific components to force token regeneration, using one of the following options:

Option 1) Using kubectl:

  1. Set up kubectl for Cloudera Data Engineering.
  2. Delete calico-node pods.
    kubectl delete pods --selector k8s-app=calico-node --namespace kube-system
  3. Delete Livy pods for all Virtual Clusters.
    kubectl delete pods --selector app.kubernetes.io/name=livy --all-namespaces

    If for some reason only one Livy pod needs to be fixed:

    1. Find the virtual cluster ID through the UI under Cluster Details.
    2. Delete Livy pod:
      export VC_ID=<VC ID>
      kubectl delete pods --selector app.kubernetes.io/name=livy --namespace ${VC_ID}

Option 2) Using K8s dashboard

  1. On the Service Details page copy the RESOURCE SCHEDULER link.
  2. Replace the yunikorn part of the link with dashboard and open the resulting link in the browser.
  3. In the top left corner find the namespaces dropdown and choose All namespaces.
  4. Search for calico-node.
  5. For each pod in the Pods table click the Delete option from the hamburger menu.
  6. Search for livy.
  7. For each pod in the Pods table click the Delete option from the hamburger menu.
  8. If for some reason only one Livy pod needs to be fixed, find the Virtual Cluster ID through the UI under Cluster Details and delete only the pod whose name starts with the Virtual Cluster ID.
This issue is resolved.
DEX-7286 In place upgrade (Technical Preview) issue: Certificate expired showing error in browser
Certificates fail after an in-place upgrade from 1.14.
Start the certificate upgrade:

Get cluster ID

  1. Navigate to the Cloudera Data Engineering Overview page by clicking the Data Engineering tile in the Cloudera Data Platform (Cloudera) Management Console.
  2. Edit the service details.
  3. Copy the Cluster ID field to the clipboard.
  4. In a terminal, set the CID environment variable to this value.
    export CID=cluster-1234abcd

Get session token

  1. Navigate to the Cloudera Data Engineering Overview page by clicking the Data Engineering tile in the Cloudera Data Platform (Cloudera) Management Console.
  2. Right-click and select Inspect.
  3. Click the Application tab.
  4. Click Cookies and select the URL of the console.
  5. Select cdp-session-token.
  6. Double-click the displayed cookie value, then right-click and select Copy.
  7. Open a terminal and set the CST environment variable to this value.
    export CST=<Paste value of cookie here>

Force TLS certificate update

curl -b cdp-session-token=${CST} -X 'PATCH' -H 'Content-Type: application/json' -d '{"status_update":"renewTLSCerts"}' "https://<URL OF CONSOLE>/dex/api/v1/cluster/${CID}"
This issue is resolved.
CDPD-40396 Iceberg migration fails on partitioned Hive table created by Spark without location
Iceberg provides a migrate procedure for migrating a Parquet/ORC/Avro Hive table to Iceberg. If the table was created using Spark without specifying location and is partitioned, the migration fails.
By default, the table has a TRANSLATED_TO_EXTERNAL property set to true. Unset this property by running ALTER TABLE ... UNSET TBLPROPERTIES ('TRANSLATED_TO_EXTERNAL') and then run the migrate procedure.
This issue is resolved.
COMPX-5494: Yunikorn recovery intermittently deletes existing placeholders
On recovery, Yunikorn may intermittently delete placeholder pods. After recovery, there may be remaining placeholder pods. This may cause unexpected behavior during rescheduling.
There is no workaround for this issue. To avoid any unexpected behavior, Cloudera suggests removing all the placeholders manually before restarting the scheduler.
This issue is resolved.
DWX-8257: Cloudera Data Warehouse Airflow Operator does not support SSO

Although Virtual Warehouse (VW) in Cloudera Data Warehouse supports SSO, this is incompatible with the Cloudera Data Engineering Airflow service as, for the time being, the Airflow Cloudera Data Warehouse Operator only supports workload username/password authentication.

Disable SSO in the VW.
This issue is resolved.
COMPX-7085: Scheduler crashes due to Out Of Memory (OOM) error in case of clusters with more than 200 nodes

The resource requirements of the YuniKorn scheduler pod depend on the cluster size, that is, the number of nodes and the number of pods. Currently, the scheduler is configured with a memory limit of 2Gi. When running on a cluster that has more than 200 nodes, the memory limit of 2Gi may not be enough. This can cause the scheduler to crash because of OOM.

Increase resource requests and limits for the scheduler. Edit the YuniKorn scheduler deployment to increase the memory limit to 16Gi.

For example:

resources:
  limits:
    cpu: "4"
    memory: 16Gi
  requests:
    cpu: "2"
    memory: 8Gi
This issue is resolved.
DEX-3997: Python jobs using virtual environment fail with import error
Running a Python job that uses a virtual environment resource fails with an import error, such as:
Traceback (most recent call last):
  File "/tmp/spark-826a7833-e995-43d2-bedf-6c9dbd215b76/app.py", line 3, in <module>
    from insurance.beneficiary import BeneficiaryData
ModuleNotFoundError: No module named 'insurance'
Do not set the spark.pyspark.driver.python configuration parameter when using a Python virtual environment resource in a job.
This issue is resolved.