General known issues with Cloudera Data Engineering

Learn about the general known issues with the Cloudera Data Engineering (CDE) service on public clouds, the impact or changes to the functionality, and the workaround.

DEX 12451: Service creation fails when "Enable Public Load Balancer" is selected in an environment with only private subnets
When creating a service with the Enable Public Load Balancer option selected, the service creation fails with the following error:
“CDE Service: 1.20.3-b15 ENV: dsp-storage-mow-priv (in mow-priv) and dsp-storage-aws-dev-newvpc (mw-dev) – Environment is configured only with private subnets, there are no public subnets. dex-base installation failed, events: [{"ResourceKind":"Service","ResourceName":"dex-base-nginx-56g288bq-controller","ResourceNamespace":"dex-base-56g288bq","Message":"Error syncing load balancer: failed to ensure load balancer: could not find any suitable subnets for creating the ELB","Timestamp":"2024-02-08T09:55:28Z"}]”
When creating a service and enabling a public load balancer, configure at least one public subnet in the environment. For more information, see Enabling a Cloudera Data Engineering service.
DEX 11498: Spark job failing with error: "Exception in thread "main" org.apache.hadoop.fs.s3a.AWSBadRequestException:"
When users in the Milan or Jakarta regions use the Hadoop s3a client to access AWS S3 storage, that is, use an s3a://bucket-name/key URI to access a file, an error may occur. This is a known issue in Hadoop.
Set the region manually, for example spark.hadoop.fs.s3a.endpoint.region=<region code>. For region codes, see https://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region.
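For example, the region can be supplied through the Spark configuration when the session is built. The following PySpark sketch is illustrative only; eu-south-1 (Milan) and the bucket path are placeholders to replace with your own values:
from pyspark.sql import SparkSession

# Sketch: pin the S3A region explicitly. Replace "eu-south-1" (Milan)
# with the region code of the bucket you are accessing.
spark = (
    SparkSession.builder
    .appName("s3a-region-example")
    .config("spark.hadoop.fs.s3a.endpoint.region", "eu-south-1")
    .getOrCreate()
)

# S3A reads now resolve against the configured region (bucket/key are placeholders).
df = spark.read.text("s3a://bucket-name/key")
df.show()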
ENGESC-22921: Livy pod failure (CrashLoopBackoff)
Insufficient Livy overhead memory causes the Livy service to crash and the pod to restart.
Increase the readiness/liveness probe timeout to 60 seconds so that the Livy pod can start.
DEX-11086: Cancelled statement(s) not canceled by Livy
Currently, Livy statements cannot be cancelled immediately using /sessions/{name}/statements/{id}/cancel. The status is returned as Cancelled, but the background job continues to run.
There is a limitation on what can be cancelled. For example, if something is running exclusively on the driver, such as Thread.sleep(), it cannot be cancelled.
DEX-9939: Tier 2 Node groups are created with the same size as Tier 1 node groups during service creation. They cannot be edited during service edit
If a service is created with 1 as the minimum on-demand scale limit, two nodes will run for Tier 1 and Tier 2. Even if a service is edited with the minimum reduced to 0, the Tier 2 node will still run. This will be fixed in the CDE 1.20 release.
You must manually edit the node group parameters from the AWS or Azure console. First, find the log entry titled "Started provisioning a managed cluster, provisionerID: liftie-xxxxxxxx" and note the Liftie ID for the cluster; you need it in the steps below.
In AWS:
  1. Log in to the AWS Management Console.
  2. Navigate to EC2 > Auto Scaling groups.
  3. Find <liftie-id>-spt2-<hash>-NodeGroup and click on the name to open the Instance Group Details page.
  4. Under Group Details, click Edit.
  5. Update Maximum capacity from 5 to 10, and click Update.
  6. Repeat the steps above to update the Maximum capacity of the <liftie-id>-cmp2-5<hash>-NodeGroup from 2 to 5. The Cluster Autoscaler implements the changes and the issue will resolve. The number of simultaneously running jobs will increase.
In Azure:
  1. Navigate to the Kubernetes Service Details > Log page.
  2. Navigate to the Node Pools page, and locate the Node Pool starting with cmp2.
  3. Click on Scale Node pools and edit the capacity.
DEX-9852: FreeIPA certificate mismatch issues for new Spark 3.3 Virtual Clusters
In CDE 1.19, when creating a new Virtual Cluster based on Spark 3.3, and submitting any job in the pods, the following error occurs: "start TGT gen failed for user : rpc error: code = Unavailable desc = Operation not allowed while app is in recovery state."
Manually copy over the certificate from the FreeIPA server.
DEX-10147: Grafana issue for virtual clusters with the same name
In CDE 1.19, when you have two different CDE services with the same name under the same environment, and you click the Grafana charts for the second CDE service, metrics for the Virtual Cluster in the first CDE service will display.
After upgrading CDE, verify the upgraded CDE service using something other than the data shown in Grafana. Once you have verified everything in the upgraded CDE service, delete the old CDE service; the Grafana issue is then resolved.
DEX-9932: Name length causes Pod creation error for CDE Sessions
In CDE 1.19, Kubernetes pod names are limited to 63 characters, which limits CDE Session names to a maximum of 56 characters.
Create a CDE Session with a name of less than 56 characters to avoid the pod creation error.
DEX-9895: CDE Virtual Cluster API response displays default Spark version as 2.4.7
In CDE 1.19, the Spark version 3.2.3 is the expected default in a CDE Spark Virtual Cluster, but Spark 2.4.7 displays instead. This issue will be fixed in CDE 1.20.
DEX-9112: VC deployment frequently fails when deployed through the CDP CLI
In CDE 1.19, when a Virtual Cluster is deployed using the CDP CLI, it fails frequently as the pods fail to start. However, creating a Virtual cluster using the UI is successful.
Ensure that you are using proper units for --memory-requests in the "cdp de" CLI, for example, "--memory-requests 10Gi".
DEX-9790: Single tab Session support for Virtual Cluster selection
In CDE 1.19, the Virtual Cluster selection in the Jobs, Resources, Runs, and Sessions page is not preserved if the user attempts to open CDE in another browser tab/window.
When you open CDE in another tab, you must re-select the Virtual Cluster that you want to use in the new tab.
DEX-10044: Handle adding tier 2 auto scaling groups during in-place upgrades
Since auto scaling groups (ASGs) are not added or updated during the upgrade, the Tier 2 ASGs are not created, and as a result pods cannot be scheduled. This error applies to services created in CDE 1.18 and then upgraded to 1.19.
Create a new CDE service; this issue does not occur on a new CDE 1.19 service because it affects only upgraded clusters.
DEX-10107: Spark 3.3 in CDE 1.19 has a limitation of characters for job names
Jobs with names longer than 23 characters can fail in Spark 3.3 with the following exception: 23-05-14 10:14:16 ERROR ExecutorPodsSnapshotsStoreImpl: Going to stop due to IllegalArgumentException java.lang.IllegalArgumentException: '$JOB_NAME' in spark.kubernetes.executor.podNamePrefix is invalid. must conform https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#dns-label-names and the value length <= 47
Change the name of the job:
  1. Clone the job with a new name using the CDE UI, CLI, or API.
  2. Set the app name in the job itself, for example conf.setAppName("Custom Job Name"); see the sketch below.
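A minimal PySpark sketch of setting a short application name through SparkConf (the name shown is a placeholder):
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Keep the application name short so that the executor pod name prefix
# derived from it stays within the Kubernetes name length limit.
conf = SparkConf().setAppName("short-job-name")  # placeholder name
spark = SparkSession.builder.config(conf=conf).getOrCreate()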
DEX-10055: Interacting with a killed session
When you interact with a long-running Spark session that has been killed, the session might become unresponsive. Refrain from interacting with killed long-running sessions. This will be fixed in a future release of CDE.
DEX-9879: Infinite while loops not working in CDE Sessions
If an infinite while loop is submitted as a statement, the session is stuck indefinitely: new statements cannot be submitted and the session stays in a busy state. Sample input:
while (true) {
  print("hello")
}
  1. Copy the DEX_API endpoint from the Virtual Cluster details page and use it to cancel the statement: POST $DEX_API/sessions/{session-name}/statements/{statement-id}/cancel. The statement ID can be found by running the cde sessions statements command from the CDE CLI. A sketch of this call follows these steps.
  2. Kill the Session and create a new one.
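A sketch of the cancel call using Python requests. The session name, statement ID, and the Bearer-token authentication are placeholders and assumptions; substitute the values for your Virtual Cluster and a valid access token:
import os
import requests

# DEX_API is the API endpoint copied from the Virtual Cluster details page.
dex_api = os.environ["DEX_API"]
session_name = "my-session"      # placeholder: the stuck session's name
statement_id = "1"               # placeholder: from `cde sessions statements`
token = os.environ["CDE_TOKEN"]  # assumption: a valid CDE access token

resp = requests.post(
    f"{dex_api}/sessions/{session_name}/statements/{statement_id}/cancel",
    headers={"Authorization": f"Bearer {token}"},
)
resp.raise_for_status()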
DEX-9898: CDE CLI input reads break after interacting with a Session
After interacting with a Session through the sessions interact command, input to the CDE CLI on the terminal breaks. In the example below, ^M displays instead of the input being processed:
> cde session interact --name sparkid-test-6
WARN: Plaintext or insecure TLS connection requested, take care before continuing. Continue? yes/no [no]: yes^M 
Open a new terminal and type your CDE commands.
DEX-9881: Multi-line command error for Spark-Scala Session types in the CDE CLI
In CDE 1.19, multi-line input into a Scala session on the CDE CLI does not work as expected in some cases. The CLI interaction throws an error before reading the complete input. Sample input:
scala> type
     |  
Use the UI to interact with Scala sessions. A newline is expected in the above situation. In CDE 1.19, only unbalanced brackets will generate a new line. In CDE 1.20, all valid Scala newline conditions will be handled:
scala> customFunc(
     | (
     | )
     | )
     | 
DEX-9756: Unable to run large raw Scala jobs
Scala code with more than 2000 lines could result in an error.
To avoid the error, increase the driver thread stack size in the job's Spark configuration, for example, "spark.driver.extraJavaOptions=-Xss4M", "spark.driver.extraJavaOptions=-Xss8M", and so forth.
DEX-8679: Job fails with permission denied on a RAZ environment
When a job that accesses files runs longer than the delegation token renewal time on a RAZ-enabled CDP environment, the job fails with the following error:
Failed to acquire a SAS token for get-status on /.../words.txt due to org.apache.hadoop.security.AccessControlException: Permission denied.
DEX-8769: The table entity type on Atlas is spark_tables instead of hive_tables on Spark3 Virtual Clusters
Tables that are created using a Spark3 Virtual Cluster on an AWS setup have the spark_tables entity type instead of hive_tables in Atlas.
On a Spark3 Virtual Cluster, enableHiveSupport() must be called in the following way: spark = SparkSession.builder.enableHiveSupport().getOrCreate(). You may also use Spark2 instead of Spark3, as this issue does not occur in Spark2.
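A minimal PySpark sketch of the workaround; the database and table names are placeholders:
from pyspark.sql import SparkSession

# Enable Hive support explicitly so that tables created from this session
# are registered through the Hive metastore.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("CREATE TABLE IF NOT EXISTS example_db.example_table (id INT)")  # placeholder table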
DEX-8774: Job and run cloning is not fully supported in CDE 1.17 through 1.18.1
Currently, cloning jobs and runs is not supported in CDE 1.17 through 1.18.1.
To clone jobs and runs, navigate to the Administration page and click View Jobs on the respective Virtual Cluster.
DEX-3706: The CDE home page not displaying for some users
The CDE home page does not display Virtual Clusters or the Quick Action bar if the user is part of hundreds of user groups or subgroups.
The user must access the Administration page and open the Virtual Cluster of choice to perform all job-related actions. This issue will be fixed in CDE 1.18.1.
DEX-8515: The Spark History Server user interface is not visible in CDE
During job execution in CDE 1.18, the Spark History Server user interface is not visible. This error will be fixed in CDE 1.18.1.
DEX-6163: Error message with Spark 3.2 and CDE
In CDE 1.16 through 1.18, you may see the error message "Service account may have been revoked" with Spark 3.2. Despite what the message states, this is not the core issue: the error is harmless and displays, as part of the shutdown process, only after a job has already failed for another reason. Look for other exceptions to find the actual cause. This issue will be fixed in CDE 1.18.1.
DEX-7653: Updating Airflow Job/Dag file throws a 404 error
A 404 error occurs when you update an Airflow job's DAG file with a modified DAG ID or name, as in the following steps:
  1. Create an Airflow job using a Simple Dag file. Use the Create Only option.
  2. Edit the Airflow Job and delete the existing DAG file.
  3. Upload the same DAG file with the DAG ID and name modified in its content.
  4. Choose a different Resource Folder.
  5. Use the Update and Run option.

The 404 error occurs.
To avoid this issue, ensure that you do not modify the DAG ID in step 3. If you must change the DAG ID in the DAG file, create a new file instead; see the sketch below for where the DAG ID is declared.
This issue will be fixed in CDE 1.18.1.
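A minimal sketch of a DAG file with hypothetical IDs; the point is that the dag_id passed to the DAG constructor must stay identical to the one used when the Airflow job was created:
import pendulum
from airflow import DAG
from airflow.operators.dummy import DummyOperator

# Keep dag_id ("example_dag") unchanged when editing and re-uploading this
# file; change it only by creating a new file and a new job.
dag = DAG(
    "example_dag",
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    schedule_interval=None,
)
op = DummyOperator(task_id="dummy", dag=dag)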

DEX-8283: False Positive Status is appearing for the Raw Scala Syntax issue
Raw Scala jobs that fail due to syntax errors are reported as succeeded by CDE as displayed in this example:
spark.range(3)..show() 
The job fails with the following error, which is logged in the driver stdout log:
/opt/spark/optional-lib/exec_invalid.scala:3: error: identifier expected but '.' found.
    spark.range(3)..show()
                   ^ 
This issue will be fixed in CDE 1.18.1.
DEX-8281: Raw Scala Scripts fail due to the use of the case class
Implicit conversions that involve implicit Encoders for case classes, which are usually enabled by importing spark.implicits._, do not work in raw Scala jobs in CDE. This affects conversions of Scala objects such as RDD, Dataset, DataFrame, and Column. For example, the following operations fail on CDE:
import org.apache.spark.sql.Encoders
import spark.implicits._
case class Case(foo:String, bar:String)

// 1: an attempt to obtain schema via the implicit encoder for case class fails
val encoderSchema = Encoders.product[Case].schema
encoderSchema.printTreeString()

// 2: an attempt to convert RDD[Case] to DataFrame fails
val caseDF = sc
	.parallelize(1 to 3)
	.map(i => Case(f"$i", "bar"))
	.toDF

// 3: an attempt to convert DataFrame to Dataset[Case] fails
val caseDS = spark
	.read
	.json(List("""{"foo":"1","bar":"2"}""").toDS)
	.as[Case]
Whereas conversions that involve implicit encoders for primitive types are supported:
val ds = Seq("I am a Dataset").toDS
val df = Seq("I am a DataFrame").toDF
Notice that List, Row, StructField, and createDataFrame are used below instead of case class and .toDF():
// Assumes bankText is an RDD[String] read earlier from a semicolon-delimited CSV,
// for example: val bankText = sc.textFile("/path/to/bank.csv")
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val bankRowRDD = bankText.map(s => s.split(";")).filter(s => s(0) != "\"age\"").map(
  s => Row(
    s(0).toInt,
    s(1).replaceAll("\"", ""),
    s(2).replaceAll("\"", ""),
    s(3).replaceAll("\"", ""),
    s(5).replaceAll("\"", "").toInt
  )
)

val bankSchema = List(
  StructField("age", IntegerType, true),
  StructField("job", StringType, true),
  StructField("marital", StringType, true),
  StructField("education", StringType, true),
  StructField("balance", IntegerType, true)
)

val bank = spark.createDataFrame(
  bankRowRDD,
  StructType(bankSchema)
)


bank.registerTempTable("bank")
DEX-7001: When Airflow jobs are run, the privileges of the user who created the job are applied, not those of the user who submitted the job
If you have an Airflow job (created by User A) that contains Spark jobs, and the Airflow job is run by another user (User B), the Spark jobs run as User A instead of the user who ran them. Regardless of who submits the Airflow job, it runs with the privileges of the user who created it. This causes issues when the job submitter has fewer privileges than the job owner. Cloudera recommends that Spark and Airflow jobs be created and run by the same user.
CDPD-40396 Iceberg migration fails on partitioned Hive table created by Spark without location
Iceberg provides a migrate procedure to migrate a Parquet/ORC/Avro Hive table to Iceberg. If the table was created using Spark without specifying a location and is partitioned, the migration fails.
If you are using Data Lake 7.2.15.2 or higher, this known issue does not occur. Otherwise, you need to unset the TRANSLATED_TO_EXTERNAL table property, which defaults to 'true', by completing the following steps (a Spark SQL sketch follows the list):
  1. Run ALTER TABLE ... UNSET TBLPROPERTIES ('TRANSLATED_TO_EXTERNAL') to unset the property.
  2. Run the migrate procedure.
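A Spark SQL sketch of the two steps; the database, table, and catalog names are placeholders:
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# 1. Unset the property that blocks the migration (placeholder table name).
spark.sql("ALTER TABLE example_db.example_table UNSET TBLPROPERTIES ('TRANSLATED_TO_EXTERNAL')")

# 2. Run the Iceberg migrate procedure; 'spark_catalog' is an assumption,
#    use the catalog configured for Iceberg in your job.
spark.sql("CALL spark_catalog.system.migrate('example_db.example_table')")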
DEX-5857 Persist job owner across CDE backup restores
Currently, the user who runs the cde backup restore command is, by default, the user with permission to run the restored jobs. This can cause CDE jobs to fail if that workload user differs from the user who ran the jobs on the source CDE service where the backup was taken, because the two users may have different privileges.
Ensure that the cde backup restore command is run by the same user who runs the CDE jobs in the source CDE cluster where the backup was taken. Alternatively, ensure that the user running the restore has the same set of permissions as the user running the jobs in the source CDE cluster.
DEX-7483 User interface bug for in-place upgrade (Tech Preview)
The user interface incorrectly states that the Data Lake version 7.2.15 and above is required. The correct minimum version is 7.2.14.
DEX-6873 Kubernetes 1.21 will fail service account token renewal after 90 days
Cloudera Data Engineering (CDE) on AWS running versions CDE 1.14 through 1.16 with Kubernetes 1.21 may experience failed jobs after 90 days of service uptime.
Restart specific components to force regenerate the token using one of the following options:

Option 1) Using kubectl:

  1. Set up kubectl for CDE.
  2. Delete calico-node pods.
    kubectl delete pods --selector k8s-app=calico-node --namespace kube-system
  3. Delete Livy pods for all Virtual Clusters.
    kubectl delete pods --selector app.kubernetes.io/name=livy --all-namespaces

    If for some reason only one Livy pod needs to be fixed:

    1. Find the virtual cluster ID through the UI under Cluster Details.
    2. Delete Livy pod:
      export VC_ID=<VC ID>
      kubectl delete pods --selector app.kubernetes.io/name=livy --namespace ${VC_ID}

Option 2) Using K8s dashboard

  1. On the Service Details page copy the RESOURCE SCHEDULER link.
  2. Replace the yunikorn part of the URL with dashboard and open the resulting link in the browser.
  3. In the top left corner find the namespaces dropdown and choose All namespaces.
  4. Search for calico-node.
  5. For each pod in the Pods table click the Delete option from the hamburger menu.
  6. Search for livy.
  7. For each pod in the Pods table click the Delete option from the hamburger menu.
  8. If for some reason only one Livy pod needs to be fixed, find the Virtual Cluster ID through the UI under Cluster Details and only delete the pod with the name starting with Virtual Cluster ID.
DEX-7286 In place upgrade (Technical Preview) issue: Certificate expired showing error in browser
Certificates fail after an in-place upgrade from CDE 1.14.
Start the certificate upgrade:

Get cluster ID

  1. Navigate to the Cloudera Data Engineering Overview page by clicking the Data Engineering tile in the Cloudera Data Platform (CDP) management console.
  2. Edit service details.
  3. Copy the cluster ID field to the clipboard.
  4. In a terminal, set the CID environment variable to this value:
    export CID=cluster-1234abcd

Get session token

  1. Navigate to the Cloudera Data Engineering Overview page by clicking the Data Engineering tile in the Cloudera Data Platform (CDP) management console.
  2. Right-click and select Inspect.
  3. Click the Application tab.

  4. Click Cookies and select the URL of the console.
  5. Select cdp-session-token.
  6. Double-click the displayed cookie value, right-click, and select Copy.
  7. Open a terminal and set the CST environment variable:
    export CST=<Paste value of cookie here>

Force TLS certificate update

curl -b cdp-session-token=${CST} -X 'PATCH' -H 'Content-Type: application/json' -d '{"status_update":"renewTLSCerts"}' "https://<URL OF CONSOLE>/dex/api/v1/cluster/${CID}"
DEX-7051 EnvironmentPrivilegedUser role cannot be used with CDE
The EnvironmentPrivilegedUser role cannot currently be used to access CDE. If a user has this role, they cannot interact with CDE; an "access denied" error occurs.
Cloudera recommends not using or assigning the EnvironmentPrivilegedUser role for accessing CDE.
Strict DAG declaration in Airflow 2.2.5
CDE 1.16 introduces Airflow 2.2.5, which is stricter about DAG declaration than the previously supported Airflow version in CDE. In Airflow 2.2.5, the DAG timezone must be a pendulum.tz.Timezone, not datetime.timezone.utc.
If you upgrade to CDE 1.16, make sure that you have updated your DAGs according to the Airflow documentation; otherwise, your DAGs cannot be created in CDE and the restore process cannot restore them.

Example of valid DAG:

import pendulum
from airflow import DAG
from airflow.operators.dummy import DummyOperator

dag = DAG("my_tz_dag", start_date=pendulum.datetime(2016, 1, 1, tz="Europe/Amsterdam"))
op = DummyOperator(task_id="dummy", dag=dag)

Example of invalid DAG:

from datetime import timezone
from dateutil import parser
dag = DAG("my_tz_dag", start_date=parser.isoparse('2020-11-11T20:20:04.268Z').replace(tzinfo=timezone.utc)) 
op = DummyOperator(task_id="dummy", dag=dag)
COMPX-5494: Yunikorn recovery intermittently deletes existing placeholders
On recovery, YuniKorn may intermittently delete existing placeholder pods, and some placeholder pods may remain after recovery. This can cause unexpected behavior during rescheduling.
There is no workaround for this issue. To avoid any unexpected behavior, Cloudera suggests removing all the placeholders manually before restarting the scheduler.
DWX-8257: CDW Airflow Operator does not support SSO

Although Virtual Warehouse (VW) in Cloudera Data Warehouse (CDW) supports SSO, this is incompatible with the CDE Airflow service as, for the time being, the Airflow CDW Operator only supports workload username/password authentication.

Disable SSO in the VW.
COMPX-7085: Scheduler crashes due to Out Of Memory (OOM) error in case of clusters with more than 200 nodes

The resource requirements of the YuniKorn scheduler pod depend on the cluster size, that is, the number of nodes and the number of pods. Currently, the scheduler is configured with a memory limit of 2Gi. When running on a cluster that has more than 200 nodes, the 2Gi memory limit may not be enough, which can cause the scheduler to crash with an OOM error.

Increase resource requests and limits for the scheduler. Edit the YuniKorn scheduler deployment to increase the memory limit to 16Gi.

For example:

resources:
  limits:
    cpu: "4"
    memory: 16Gi
  requests:
    cpu: "2"
    memory: 8Gi
COMPX-6949: Stuck jobs prevent cluster scale down

Because of hanging jobs, the cluster is unable to scale down even when there are no ongoing activities. This may happen when some unexpected node removal occurs, causing some pods to be stuck in Pending state. These pending pods prevent the cluster from downscaling.

Terminate the jobs manually.
DEX-3997: Python jobs using virtual environment fail with import error
Running a Python job that uses a virtual environment resource fails with an import error, such as:
Traceback (most recent call last):
  File "/tmp/spark-826a7833-e995-43d2-bedf-6c9dbd215b76/app.py", line 3, in <module>
    from insurance.beneficiary import BeneficiaryData
ModuleNotFoundError: No module named 'insurance'
Do not set the spark.pyspark.driver.python configuration parameter when using a Python virtual environment resource in a job.
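For reference, a minimal sketch of such a job's entry point; the insurance package comes from the attached virtual environment resource (as in the traceback above), and spark.pyspark.driver.python is left unset in the job configuration:
# app.py -- minimal sketch; do not set spark.pyspark.driver.python anywhere
# in this job's Spark configuration.
from pyspark.sql import SparkSession

from insurance.beneficiary import BeneficiaryData  # resolved from the virtual env resource

spark = SparkSession.builder.appName("venv-job-example").getOrCreate()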