General known issues with Cloudera Data Engineering
Learn about the general known issues with the Cloudera Data Engineering (CDE) service on public clouds, their impact on functionality, and available workarounds.
- DEX-12616: Node Count shows zero in /metrics request
- Cloudera Data Engineering (CDE) 1.20.3 introduced compatibility with Kubernetes version 1.27. With this update, kube_state_metrics no longer provides label and annotation metrics by default. Earlier, CDE used label information, which was automatically exposed, to calculate the Node Count for both Core and All-Purpose nodes. Because of this change in kube_state_metrics, that information is no longer available by default, and as a result the Node Count shows zero in /metrics, in charts, and in the user interface.
- DEX-11340: Kill all the alive sessions in prepare-for-upgrade phase of stop-gap solution for upgrade
- If Spark sessions are running during the CDE upgrade, they are not automatically killed, leaving them in an unknown state during and after the upgrade.
- DEX-14094: New pipeline editor can create a cde_job step of its own pipeline job which causes recursion and looping
- If you add a CDE Job step in the pipeline editor and, while configuring the pipeline job, select the pipeline job itself from the Select Job drop-down list, running the pipeline job results in a recursive loop. For example, if you create a pipeline job named test-dag and select test-dag from the Select Job drop-down list while adding the CDE Job step, running the pipeline loops recursively, as the sketch below shows.
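A minimal sketch of the pattern, assuming the CDEJobRunOperator import path used in CDE's embedded Airflow; the job and task names are hypothetical:

```python
# Sketch of a pipeline DAG that triggers a CDE job (hypothetical names).
# If job_name pointed at the pipeline's own job ("test-dag"), every run
# would schedule another run of the same pipeline, looping forever.
from datetime import datetime

from airflow import DAG
from cloudera.cdp.airflow.operators.cde_operator import CDEJobRunOperator

with DAG(
    dag_id="test-dag",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    run_step = CDEJobRunOperator(
        task_id="cde_job_step",
        job_name="some-other-job",  # must not be "test-dag" itself
    )
```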
- DEX-14084: No error response for Airflow Python virtual environment at Virtual Cluster level for view only access user
- If a user with a view-only role on a Virtual Cluster (VC) tries to create an Airflow Python virtual environment on that VC, the request is blocked with a 403 error. However, the 403 no-access error is not displayed in the UI.
- DEX-13465: CDE Airflow DAG Code Fetch is not working
- The embedded Airflow UI within the CDE Job pages does not correctly show the "Code" view.
- DEX-11639: "CPU" and "Memory" Should Match Tier 1 and Tier 2 Virtual Clusters AutoScale
- The CPU and Memory options on the service or cluster edit page display combined values for Core (Tier 1) and All-Purpose (Tier 2) nodes. However, they should display separate values for Core and All-Purpose.
- DEX-12482: [Intermittent] Diagnostic Bundle generation taking several hours to generate
- Diagnostic bundle generation can intermittently take several hours because of low EBS throughput and IOPS on the base node.
- DEX-14253: CDE Spark Jobs are getting stuck due to the unavailability of the spot instances
- The unavailability of AWS spot instances may cause CDE Spark jobs to get stuck.
- DEX-14192: Some Spark 3.5.1 jobs have slightly higher memory requirements
- Some jobs running on Spark 3.5.1 have slightly higher memory requirements, resulting in the driver pods being killed with a Kubernetes OOMKilled error.
- DEX-14173: VC Creation is failing with "Helm error: 'timed out waiting for the condition', no events found for chart"
- On busy Kubernetes clusters, installing CDE or a Virtual Cluster may fail with an error message showing Helm error: 'timed out waiting for the condition', no events found for chart.
- DEX-14067: Email alerting is not sending email to the recipient even after the job fails
- When a Virtual Cluster is created through the UI, the SMTP configuration and authentication are empty. Enabling Email Alerting after VC creation allows users to add the SMTP parameters, but the updated password is not fetched when email alerts are sent.
- DEX-14027: Spark 3.5.1 jobs are failing with error 'org.apache.hadoop.fs.s3a.impl.InstantiationIOException'
- Spark 3.5.1 RAZ-enabled CDE clusters fail to initialize the RAZ S3 plugin library due to recent backward-incompatible changes in the library, and jobs fail with the error org.apache.hadoop.fs.s3a.impl.InstantiationIOException.
- DEX-13975: 'spark.catalog.listTables()' command in job is failing in Python Spark for Spark 3.5.1
- Using catalog.listTables() with Iceberg tables results in the exception org.apache.spark.sql.catalyst.parser.ParseException: [PARSE_SYNTAX_ERROR] Syntax error at or near end of input. A possible workaround is sketched below.
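As a possible workaround, listing tables through Spark SQL avoids the failing catalog call. A minimal sketch, assuming an active SparkSession named spark and a hypothetical database db:

```python
# Sketch: SHOW TABLES via Spark SQL as an alternative to
# spark.catalog.listTables(), which fails with PARSE_SYNTAX_ERROR on
# Iceberg tables in Spark 3.5.1. "db" is a hypothetical database name.
for row in spark.sql("SHOW TABLES IN db").collect():
    print(row.namespace, row.tableName)
```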
- DEX-13957: CDE metrics and graphs show no data
- CDE versions 1.20.3 and 1.21 use Kubernetes version 1.27, in which kube_state_metrics does not provide label and annotation metrics by default. For this reason, the node count shows zero for Core and All-Purpose nodes in the CDE UI and in charts.
- DEX-12630: CDE Service failed to run jobs on SSD Based clusters
- When you create a new CDE service with SSD instances enabled on CDE version 1.19.4 or later, Spark and Airflow jobs do not start at all. The same problem occurs if an existing CDE service with SSD instances enabled is upgraded to 1.19.3 or later.
- DEX-12451: Service creation fails when "Enable Public Loadbalancer" is selected in an environment with only private subnets
- When creating a service with the Enable Public Load Balancer option selected in an environment that has only private subnets, the service creation fails with an error similar to the following:
    Environment is configured only with private subnets, there are no public subnets. dex-base installation failed, events: [{"ResourceKind":"Service","ResourceName":"dex-base-nginx-56g288bq-controller","ResourceNamespace":"dex-base-56g288bq","Message":"Error syncing load balancer: failed to ensure load balancer: could not find any suitable subnets for creating the ELB","Timestamp":"2024-02-08T09:55:28Z"}]
- DEX-11498: Spark job failing with error: "Exception in thread "main" org.apache.hadoop.fs.s3a.AWSBadRequestException"
- When users in the Milan and Jakarta regions use the Hadoop s3a client to access AWS S3 storage, that is, using s3a://bucket-name/key to access a file, an error may occur. This is a known issue in Hadoop.
- ENGESC-22921: Livy pod failure (CrashLoopBackoff)
- Insufficient Livy overhead memory causes the Livy service to crash and triggers the pod to restart.
- DEX-11086: Cancelled statement(s) not canceled by Livy
- Currently, Livy statements cannot be cancelled immediately using /sessions/{name}/statements/{id}/cancel. The status is returned as Cancelled, but the background job continues to run; see the polling sketch below.
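Until this is fixed, a client should verify the statement state after the cancel call instead of trusting the immediate response. A minimal sketch against the Livy REST API; the URL, IDs, and polling interval are hypothetical placeholders, and authentication is omitted:

```python
# Sketch: issue a cancel, then poll the statement state rather than trusting
# the immediate "cancelled" response. livy_url, session_id, and statement_id
# are hypothetical placeholders; add authentication as your cluster requires.
import time

import requests

livy_url = "https://livy.example.com"  # placeholder
session_id, statement_id = 0, 0        # placeholders

stmt = f"{livy_url}/sessions/{session_id}/statements/{statement_id}"
requests.post(f"{stmt}/cancel")

for _ in range(12):  # poll for up to about a minute
    if requests.get(stmt).json().get("state") == "cancelled":
        break
    time.sleep(5)
```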
- DEX-9939: Tier 2 Node groups are created with the same size as Tier 1 node groups during service creation. They cannot be edited during service edit
- If a service is created with a minimum on-demand scale limit of 1, two nodes run: one for Tier 1 and one for Tier 2. Even if the service is edited to reduce the minimum to 0, the Tier 2 node continues to run. This will be fixed in the CDE 1.20 release.
- DEX-9852: FreeIPA certificate mismatch issues for new Spark 3.3 Virtual Clusters
- In CDE 1.19, when you create a new Virtual Cluster based on Spark 3.3 and submit any job, the following error occurs in the pods: "start TGT gen failed for user : rpc error: code = Unavailable desc = Operation not allowed while app is in recovery state."
- DEX-10147: Grafana issue for virtual clusters with the same name
- In CDE 1.19, when you have two different CDE services with the same name under the same environment, and you click the Grafana charts for the second CDE service, metrics for the Virtual Cluster in the first CDE service will display.
- DEX-9932: Name length causes Pod creation error for CDE Sessions
- In CDE 1.19, Kubernetes pod names are limited to 63 characters, while CDE Session names can be up to 56 characters; long session names can therefore push the generated pod name over the limit, causing a pod creation error.
- DEX-9895: CDE Virtual Cluster API response displays default Spark version as 2.4.7
- In CDE 1.19, Spark 3.2.3 is the expected default in a CDE Spark Virtual Cluster, but the API response displays Spark 2.4.7 instead. This issue will be fixed in CDE 1.20.
- DEX-9112: VC deployment frequently fails when deployed through the CDP CLI
- In CDE 1.19, when a Virtual Cluster is deployed using the CDP CLI, it fails frequently as the pods fail to start. However, creating a Virtual cluster using the UI is successful.
- DEX-9790: Single tab Session support for Virtual Cluster selection
- In CDE 1.19, the Virtual Cluster selection in the Jobs, Resources, Runs, and Sessions page is not preserved if the user attempts to open CDE in another browser tab/window.
- DEX-10044: Handle adding tier 2 auto scaling groups during in-place upgrades
- Because auto scaling groups (ASGs) are not added or updated during the upgrade, the Tier 2 ASGs are not created, resulting in pods that cannot be scheduled. This error applies to services created in CDE 1.18 and then upgraded to 1.19.
- DEX-10107: Spark 3.3 in CDE 1.19 has a limitation of characters for job names
- Jobs with names longer than 23 characters can fail in Spark 3.3 with the following exception:
    23-05-14 10:14:16 ERROR ExecutorPodsSnapshotsStoreImpl: Going to stop due to IllegalArgumentException java.lang.IllegalArgumentException: '$JOB_NAME' in spark.kubernetes.executor.podNamePrefix is invalid. must conform https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#dns-label-names and the value length <= 47
- DEX-10055: Interacting with a killed session
- When you interact with a long-running killed Spark session, the session might become unresponsive. Refrain from interacting with the long-running killed session. This will be fixed in a future release of CDE.
- DEX-9879: Infinite while loops not working in CDE Sessions
- If an infinite while loop is submitted as a statement, the session hangs indefinitely: no new statements can be sent and the Session stays in a busy state. Sample input:
    while(True) { print("hello") }
- DEX-9898: CDE CLI input reads break after interacting with a Session
- After interacting with a Session through the sessions interact command, input to the CDE CLI on the terminal breaks. In the example below, ^M displays instead of proceeding:
    > cde session interact --name sparkid-test-6
    WARN: Plaintext or insecure TLS connection requested, take care before continuing. Continue? yes/no [no]: yes^M
- DEX-9881: Multi-line command error for Spark-Scala Session types in the CDE CLI
- In CDE 1.19, multi-line input into a Scala session on the CDE CLI does not work as expected in some cases; the CLI interaction throws an error before reading the complete input. Sample input:
    scala> type |
- DEX-9756: Unable to run large raw Scala jobs
- Scala code with more than 2000 lines could result in an error.
- DEX-8679: Job fails with permission denied on a RAZ environment
- When a job that accesses files runs longer than the delegation token renewal time on a RAZ-enabled CDP environment, the job fails with the following error:
    Failed to acquire a SAS token for get-status on /.../words.txt due to org.apache.hadoop.security.AccessControlException: Permission denied.
- DEX-8769: The table entity type on Atlas is spark_tables instead of hive_tables on Spark3 Virtual Clusters
- Tables created using a Spark3 Virtual Cluster on an AWS setup have the spark_tables type instead of hive_tables in Atlas Entities.
- DEX-8774: Job and run cloning is not fully supported in CDE 1.17 through 1.18.1
- Currently, cloning jobs and runs is not supported in CDE 1.17 through 1.18.1.
- DEX-3706: The CDE home page not displaying for some users
- The CDE home page does not display Virtual Clusters or the Quick Action bar if the user is part of hundreds of user groups or subgroups.
- DEX-8515: The Spark History Server user interface is not visible in CDE
- During job execution in CDE 1.18, the Spark History Server user interface is not visible. This error will be fixed in CDE 1.18.1.
- DEX-6163: Error message with Spark 3.2 and CDE
- For CDE 1.16 through 1.18, if you see the error message "Service account may have been revoked" with Spark 3.2 and CDE, note that this is not the core issue despite what the message states. The error is harmless: it displays as part of the shutdown process and only after a job has already failed for another reason, so look for other exceptions. This issue will be fixed in CDE 1.18.1.
- DEX-7653: Updating Airflow Job/Dag file throws a 404 error
- A 404 error occurs when you update an Airflow job's DAG file with a modified DAG ID or name, as in the following steps:
- Create an Airflow job using a Simple Dag file. Use the Create Only option.
- Edit the Airflow Job and delete the existing DAG file.
- Upload the same DAG file with a modified DAG ID and Name in its content.
- Choose a different Resource Folder.
- Use the Update and Run option.
The 404 error occurs.
To avoid this issue, ensure that you do not modify the DAG ID in step 3. If you must change the DAG ID in the DAG file, create a new file instead; see the sketch after this list.
This issue will be fixed in CDE 1.18.1.
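A minimal sketch of where the DAG ID lives in the DAG file; the ID is hypothetical, and the point is to keep it stable when re-uploading an edited file to an existing job:

```python
# Sketch: the DAG ID is declared inside the DAG file. Changing it while
# updating an existing Airflow job triggers the 404 described above.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator

with DAG(
    dag_id="simple-dag",  # hypothetical ID; keep it unchanged across updates
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
) as dag:
    start = DummyOperator(task_id="start")
```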
- DEX-8283: False Positive Status is appearing for the Raw Scala Syntax issue
- Raw Scala jobs that fail due to syntax errors are reported as succeeded by CDE, as shown in this example:
    spark.range(3)..show()
- DEX-8281: Raw Scala Scripts fail due to the use of the case class
- Implicit conversions which involve implicit Encoders for case classes, usually supported by importing spark.implicits._, don't work in Raw Scala jobs in CDE. These include converting Scala objects, including RDDs, Datasets, DataFrames, and Columns. For example, the following operations will fail on CDE:
    import org.apache.spark.sql.Encoders
    import spark.implicits._

    case class Case(foo:String, bar:String)

    // 1: an attempt to obtain schema via the implicit encoder for case class fails
    val encoderSchema = Encoders.product[Case].schema
    encoderSchema.printTreeString()

    // 2: an attempt to convert RDD[Case] to DataFrame fails
    val caseDF = sc
      .parallelize(1 to 3)
      .map(i => Case(f"$i", "bar"))
      .toDF

    // 3: an attempt to convert DataFrame to Dataset[Case] fails
    val caseDS = spark
      .read
      .json(List("""{"foo":"1","bar":"2"}""").toDS)
      .as[Case]
- DEX-7001: When Airflow jobs are run, the privileges of the user who created the job are applied, not those of the user who submitted the job
- If an Airflow job created by User A contains Spark jobs and is run by another user (User B), the Spark jobs run as User A rather than the submitting user. Regardless of who submits the Airflow job, it runs with the privileges of the user who created it. This causes issues when the job submitter has fewer privileges than the job owner. We recommend that the same user create and run the Spark and Airflow jobs.
- CDPD-40396 Iceberg migration fails on partitioned Hive table created by Spark without location
- Iceberg provides a migrate procedure to migrate a Parquet/ORC/Avro Hive table to Iceberg. If the table was created using Spark, is partitioned, and no location was specified, the migration fails. The procedure call is sketched below.
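The migrate procedure is invoked through Spark SQL. A minimal sketch, assuming an active SparkSession named spark; the catalog and table names are hypothetical:

```python
# Sketch: Iceberg's migrate procedure called from Spark SQL. The catalog and
# table names are hypothetical. For a partitioned table that Spark created
# without an explicit location, this call fails as described above.
spark.sql("CALL spark_catalog.system.migrate('db.sample_table')")
```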
- DEX-5857 Persist job owner across CDE backup restores
- Currently, the user who runs the cde backup restore command has, by default, the permissions to run the restored jobs. This may cause CDE jobs to fail if the workload user differs from the user who ran the jobs on the source CDE service where the backup was performed, because the workload user may have different privileges than the user who is expected to run the job.
- DEX-7483 User interface bug for in-place upgrade (Tech Preview)
- The user interface incorrectly states that Data Lake version 7.2.15 or above is required. The correct minimum version is 7.2.14.
- DEX-6873 Kubernetes 1.21 will fail service account token renewal after 90 days
- Cloudera Data Engineering (CDE) on AWS running version CDE 1.14 through 1.16 using Kubernetes 1.21 will observe failed jobs after 90 days of service uptime.
- DEX-7286 In place upgrade (Technical Preview) issue: Certificate expired showing error in browser
- Certificates fail after an in-place upgrade from CDE 1.14, resulting in a certificate expired error in the browser.
- DEX-7051 EnvironmentPrivilegedUser role cannot be used with CDE
- The EnvironmentPrivilegedUser role cannot currently be used by users who want to access CDE. If a user has this role, they cannot interact with CDE because an "access denied" error occurs.
- Strict DAG declaration in Airflow 2.2.5
- CDE 1.16 introduces Airflow 2.2.5, which is stricter about DAG declaration than the previously supported Airflow version in CDE. In Airflow 2.2.5, the DAG timezone must be a pendulum.tz.Timezone, not datetime.timezone.utc, as the sketch below shows.
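A minimal sketch of a DAG declaration that satisfies the stricter validation; the DAG and task names are hypothetical:

```python
# Sketch: use a pendulum-aware start_date. A start_date built with
# datetime.timezone.utc is rejected by Airflow 2.2.5's stricter validation.
import pendulum

from airflow import DAG
from airflow.operators.dummy import DummyOperator

with DAG(
    dag_id="tz_aware_dag",  # hypothetical
    start_date=pendulum.datetime(2022, 1, 1, tz="UTC"),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    start = DummyOperator(task_id="start")
```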
- COMPX-5494: Yunikorn recovery intermittently deletes existing placeholders
- On recovery, Yunikorn may intermittently delete placeholder pods, and some placeholder pods may remain after recovery. This may cause unexpected behavior during rescheduling.
- DWX-8257: CDW Airflow Operator does not support SSO
- Although a Virtual Warehouse (VW) in Cloudera Data Warehouse (CDW) supports SSO, SSO is incompatible with the CDE Airflow service because, for the time being, the Airflow CDW Operator supports only workload username and password authentication.
- COMPX-7085: Scheduler crashes due to Out Of Memory (OOM) error in case of clusters with more than 200 nodes
- The resource requirements of the YuniKorn scheduler pod depend on the cluster size, that is, the number of nodes and pods. Currently, the scheduler is configured with a memory limit of 2Gi. When running on a cluster that has more than 200 nodes, this limit may not be enough, which can cause the scheduler to crash with an Out Of Memory (OOM) error.
- COMPX-6949: Stuck jobs prevent cluster scale down
- Because of hanging jobs, the cluster cannot scale down even when there are no ongoing activities. This may happen when an unexpected node removal causes some pods to be stuck in the Pending state; these pending pods prevent the cluster from downscaling.
- DEX-3997: Python jobs using virtual environment fail with import error
- Running a Python job that uses a virtual environment resource fails with an import error.