Fixed issues in 1.5.5 SP1
Cloudera AI on premises 1.5.5 SP1 addresses the following issues, previously identified as known issues.
- DOCS-27138: Possible occurrence of incorrect status for successful workloads
A potential race condition in the reconciler service could cause the status of successful workloads to be incorrectly updated to an `unknown` or `failed` status. The issue occurred under high system load, which led to incorrect status reporting after pod deletion. This issue is now resolved.
- DSE-46395: Validation of roles for CAIR against UMS not working
An authorization bug prevented roles for the AI Registry (CAIR) from being validated against UMS in the control plane. This issue is now resolved.
- DSE-45793: Disable zero initial scale for autoscaler configuration
Previously, due to a known KServe issue (kserve/kserve#4471), all newly created model endpoints initially deployed with a single replica, regardless of the specified configuration. This issue is now resolved.
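The release note does not spell out the underlying mechanism, but KServe's Knative-based autoscaler controls the starting replica count through the `autoscaling.knative.dev/initial-scale` annotation (zero is only honored when the autoscaler's `allow-zero-initial-scale` setting is enabled). As a minimal illustrative sketch, assuming direct cluster access and the standard KServe `InferenceService` CRD, the annotation can be set with the Kubernetes Python client:

```python
from kubernetes import client, config

# Assumes a kubeconfig with access to the cluster hosting KServe.
config.load_kube_config()
api = client.CustomObjectsApi()

# Patch the Knative initial-scale annotation on an InferenceService so the
# endpoint starts with one replica instead of zero (names are illustrative).
patch = {
    "metadata": {
        "annotations": {"autoscaling.knative.dev/initial-scale": "1"}
    }
}
api.patch_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace="models",          # hypothetical namespace
    plural="inferenceservices",
    name="my-endpoint",          # hypothetical endpoint name
    body=patch,
)
```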
- DSE-40198: Resolve pain points with installation and updates of self-signed certificates
Previously, when rotating or updating the TLS certificate used by Cloudera AI, Cloudera AI did not automatically retrieve the new certificate from the Cloudera Control Plane. This issue is now resolved. You can now upload CA certificates directly through the Cloudera AI Workbench UI and use the Refresh certificate action to apply and propagate trust across all relevant Cloudera AI services.
- DSE-44238: Cannot create Cloudera AI Inference service application deployment using CDP CLI when Ozone credentials are passed
The Cloudera AI Inference service could not be created using the CDP CLI when Ozone credentials were passed. This issue is now resolved.
- DSE-44141: Failed to delete deployment when executing DeleteMLServingApp
The Cloudera AI Inference service failed to remove all namespaces if the service was deleted after an installation failure. This issue is now resolved.
- DSE-46352: KServe fails to pull images in air-gapped environments where the Docker registry is not on the trusted list
In air-gapped environments, model endpoints failed with an error because the Docker registry was not included in the trusted list. This issue is now resolved.
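The note does not identify the trust mechanism, but in Knative-based KServe deployments, image pulls in air-gapped clusters commonly fail during tag-to-digest resolution when the controller cannot reach the registry; upstream Knative Serving exposes a `registries-skipping-tag-resolving` key in its `config-deployment` ConfigMap for exactly this. A sketch, assuming that mechanism applies (the ConfigMap name and namespace may differ in a Cloudera AI deployment):

```python
from kubernetes import client, config

# Assumes kubeconfig access; ConfigMap name/key follow upstream Knative
# Serving conventions and are assumptions for a Cloudera AI cluster.
config.load_kube_config()
v1 = client.CoreV1Api()

# Add an internal registry to the list whose image tags are never resolved
# to digests, so the controller needs no network access to the registry.
patch = {"data": {"registries-skipping-tag-resolving": "registry.internal.example.com"}}
v1.patch_namespaced_config_map(
    name="config-deployment",
    namespace="knative-serving",
    body=patch,
)
```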
- DSE-44091: StartJobRun Kubernetes client failure error lost during processing
A Kubernetes client failure during the `StartJobRun` API call was not properly reported, resulting in misleading success messages. The fix ensures that engine start failures are now accurately reported with detailed error information, eliminating the misleading success notifications.
- DSE-44083: Web requests to other services have malformed UUID in logs
Web requests to other services were generating malformed UUIDs in logs. The issue is now resolved by increasing the `contextId` length to 36 characters, ensuring proper UUID formatting.
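For reference, the canonical textual form of a UUID is exactly 36 characters (32 hexadecimal digits plus four hyphens), which is why a shorter `contextId` field truncated the value:

```python
import uuid

# A canonical UUID string: 8-4-4-4-12 hex digits separated by hyphens.
context_id = str(uuid.uuid4())
print(context_id)       # e.g. 3f2b8c1a-9d4e-4c7b-a1f0-5e6d7c8b9a0b
print(len(context_id))  # 36, the minimum field width needed to hold it
```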
- DSE-44088: Operator pod start failure log missing ID
The operator pod start failure logs were missing critical identifiers, such as the `engineId` and the request UUID, which made error tracking difficult. The issue is now resolved by including these identifiers in the operator logs.
- DSE-44936: Pause job removes the assigned service account
Pausing a Cloudera AI job caused the assigned service account to be removed. The issue arose when a job was paused and the page was refreshed, unintentionally clearing the service account. The fix updates the job update logic to modify the `run_as` field only when it is explicitly specified, ensuring that the service account remains intact.
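The note describes the fix only at a high level; a minimal sketch of the "only touch `run_as` when explicitly provided" update pattern, with hypothetical field and function names, looks like this:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class JobUpdate:
    # None means "not specified": the field is left untouched on update.
    paused: Optional[bool] = None
    run_as: Optional[str] = None  # service account to run the job as

def apply_update(stored_job: dict, update: JobUpdate) -> dict:
    """Apply only the explicitly specified fields to the stored job."""
    if update.paused is not None:
        stored_job["paused"] = update.paused
    # Before the fix, run_as was overwritten unconditionally (clearing it
    # when a pause request omitted the field); now it changes only if set.
    if update.run_as is not None:
        stored_job["run_as"] = update.run_as
    return stored_job

job = {"paused": False, "run_as": "svc-account-reports"}
apply_update(job, JobUpdate(paused=True))   # pause only
assert job["run_as"] == "svc-account-reports"  # service account preserved
```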
- DSE-44563: Heterogeneous GPU support not available in APIv2 for model deployments
The lack of heterogeneous GPU support in APIv2 for model deployments is now resolved. Updates to the frontend and API code now include the accelerator label ID, enabling GPU selection for model deployments.
- DSE-44564: Model deployment fails to use selected GPU during deployment
Model deployments failed to utilize the selected GPU because the `accelerator_label_id` property was not properly persisted in the database. The issue is now resolved, ensuring that the correct GPU is used during model deployment.
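Both GPU fixes revolve around passing the accelerator label through APIv2. As an illustrative sketch only (the endpoint path and every field other than `accelerator_label_id`, which the release note names, are assumptions rather than the documented APIv2 schema), a deployment request carrying the accelerator label might look like:

```python
import json
import urllib.request

# Hypothetical APIv2-style payload; only accelerator_label_id is taken
# from the release note, the remaining names are assumptions.
payload = {
    "model_id": "model-123",
    "cpu": 2,
    "memory": 8,
    "nvidia_gpus": 1,
    "accelerator_label_id": 4,  # selects which GPU type/node group to use
}
req = urllib.request.Request(
    "https://workbench.example.com/api/v2/models/model-123/deployments",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json",
             "Authorization": "Bearer <API_KEY>"},
    method="POST",
)
# urllib.request.urlopen(req)  # submit the deployment request
```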
- DSE-41700: Team owner must be able to change team members' privileges or roles
The requirement for each team to have at least one Administrator is now removed for both local and synchronized teams. Team Administrators, when present, can manage membership and roles through Team Settings. If no Team Administrator is assigned, Site Administrators can manage membership and roles instead.
- DSE-25966: Memory leak detected in model proxy
A memory leak in the model proxy component was causing unplanned outages due to Kubernetes OOM (Out of Memory) terminations. The issue is now resolved by replacing the custom cache with a library already used by APIv2.
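The note does not name the replacement library, but the failure mode is a classic one: a hand-rolled cache with no eviction grows until the pod exceeds its memory limit. As a generic illustration (not Cloudera's actual implementation), a bounded LRU cache keeps memory use capped:

```python
from functools import lru_cache

# An unbounded dict cache grows forever under varied keys and eventually
# triggers a Kubernetes OOM kill; an LRU cache evicts old entries instead.
@lru_cache(maxsize=1024)
def lookup_model_route(model_id: str) -> str:
    # Hypothetical expensive lookup the proxy would otherwise repeat.
    return f"http://model-{model_id}.serving.svc.cluster.local"

for i in range(10_000):
    lookup_model_route(str(i))  # memory stays bounded at 1024 entries
```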
- DSE-13708: Main page -> Active workload counter problems (Session, Experiment, Job, Application)
Previously, active workload counters categorized workloads incorrectly, counting experiments as sessions and jobs as applications, and omitting applications entirely. Additionally, the counters displayed workspace-wide active workloads for jobs, experiments, and applications instead of filtering them by the logged-in user.
This issue is now fixed by correcting the counting logic and ensuring that the counters display only the workloads of the logged-in user.
- DSE-45166: New model build options have been introduced
The following model build options are now available (see the sketch after this list):
- `model_root_dir`: Allows users to set a custom build root directory, enabling deployment even when a `.git` structure is nested.
- `build_script_path`: Allows users to specify a custom path for the build script.
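A minimal sketch of supplying these options when creating a model build. The option names come from the release note, but the client call, request class, and remaining field names are assumptions modeled on Cloudera's Python API client (`cmlapi`), not a verified signature:

```python
import cmlapi  # Cloudera AI Python API client (assumed available)

client = cmlapi.default_client(
    url="https://workbench.example.com",  # hypothetical workbench URL
    cml_api_key="<API_KEY>",
)

# Hypothetical request: model_root_dir and build_script_path are the new
# options named in the release note; the other fields are assumptions.
build_request = cmlapi.CreateModelBuildRequest(
    file_path="serve.py",
    function_name="predict",
    model_root_dir="services/recommender",   # build root when .git is nested
    build_script_path="build/custom-build.sh",
)
build = client.create_model_build(build_request, "<project_id>", "<model_id>")
```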
- DSE-40393: Horizontal window or browser resize is not working
Previously, horizontal window resizing did not work as expected. The issue is now resolved by updating margin handling and width calculations, ensuring that the layout is restored correctly when the browser window is resized.
- DSE-40619: Allow emails to be updated through the UpdateJob API
The `UpdateJob` API previously did not allow email recipients to be modified. As a result, users had to delete and recreate jobs to update email notifications. Additionally, the API did not support modifying email criteria, such as Success, Failure, Stopped, or Timeout notifications.
This issue is now resolved by implementing logic in the `UpdateJob` APIv2 to allow updates to all recipient types and attachments. The behavior now aligns with the `CreateJob` API, enabling seamless notification management without the need to recreate jobs.
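A minimal sketch of updating a job's notification recipients through the Python API client. The call pattern follows `cmlapi` conventions, but the recipient field names are assumptions:

```python
import cmlapi  # Cloudera AI Python API client (assumed available)

client = cmlapi.default_client(
    url="https://workbench.example.com",  # hypothetical workbench URL
    cml_api_key="<API_KEY>",
)

# Fetch the job, change only its email notification settings, and push the
# update; recipient field names here are illustrative assumptions.
job = client.get_job("<project_id>", "<job_id>")
job.recipients = [
    {"email": "oncall@example.com", "notify_on_failure": True},
    {"email": "team@example.com", "notify_on_success": True},
]
client.update_job(job, "<project_id>", "<job_id>")
```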
- DSE-41734: Livelog cleaner granularity needs to be increased
The livelog cleaner retention period could previously be specified only in months. Since the livelog cleaner Kubernetes job runs daily, increasing the granularity to days provides greater flexibility and control over the livelog cleanup policy.
The livelog cleaner retention period is now configurable in days, with a default value of 180 days, allowing for more precise management of log retention.
- DSE-18866: Jupyter notebook editor initially redirects to OpenShift "Application not available" page
Previously, the Jupyter notebook editor redirected to an OpenShift "Application not available" page because the iframe loaded before the OpenShift route was fully created, causing a 503 error.
The issue, caused by delays in OpenShift route creation after ingress and pod readiness, is now resolved by adding a readiness probe in API v1. This ensures the editor loads only when it is fully ready. Periodic readiness checks for up to 10 minutes address scaling challenges and prevent CORS issues and race conditions.
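The note describes the fix as a readiness probe with periodic checks for up to 10 minutes; a generic sketch of that poll-until-ready-with-deadline pattern (the endpoint URL and intervals are assumptions):

```python
import time
import urllib.error
import urllib.request

def wait_until_ready(url: str, timeout_s: int = 600, interval_s: int = 5) -> bool:
    """Poll a readiness endpoint until it answers 200, up to a deadline."""
    deadline = time.monotonic() + timeout_s  # 600 s = the 10-minute window
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True  # route is up; safe to load the editor iframe
        except (urllib.error.URLError, OSError):
            pass  # route not created yet, or 503 from the router
        time.sleep(interval_s)
    return False

# Hypothetical readiness URL for a notebook session:
# wait_until_ready("https://session-abc123.workbench.example.com/readyz")
```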
- DSE-44009: Feature announcement request causes homepage loading timeout in air-gapped setup
A bug caused the homepage to hang indefinitely in air-gapped environments due to a failed feature announcement request. The issue occurred because the news feed API call could not complete, leaving the homepage in a loading state until the browser's default timeout of 300 seconds was reached.
This issue is now fixed by limiting the loading spinner to the feature announcement section. Additionally, feeds are now loaded from a local JSON file instead of external sources, resolving the timeout and ensuring the homepage works correctly in air-gapped setups.
- DSE-44784: Web crashing when very large files selected for job email attachment
Previously, job email notifications with file attachments could crash the web service in the following scenarios (see the sketch after this list):
- The specified file was large enough, for example 10 GB, to exceed the 60-second timeout.
- The Virtual File System (VFS) service became unresponsive during file streaming.
- The VFS service experienced artificial delays.
The issue is now resolved, and the web service remains stable in these scenarios.
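A generic sketch of the defensive pattern involved: streaming a large file in bounded chunks with a per-read timeout, so a slow or hung file service raises an error instead of stalling the whole worker (the size cap and timeouts are assumptions, not Cloudera's actual limits):

```python
import requests  # third-party HTTP client

MAX_ATTACHMENT_BYTES = 50 * 1024 * 1024  # hypothetical size cap

def fetch_attachment(url: str) -> bytes:
    """Stream a file with per-chunk timeouts and an overall size cap."""
    buf = bytearray()
    # timeout=(connect, read): each chunk must arrive within 30 s, so an
    # unresponsive VFS raises instead of hanging the web worker forever.
    with requests.get(url, stream=True, timeout=(5, 30)) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
            buf.extend(chunk)
            if len(buf) > MAX_ATTACHMENT_BYTES:
                raise ValueError("attachment exceeds size cap; refusing")
    return bytes(buf)
```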
- DSE-46572: Project Owner field in Project Settings page is showing user ID instead of name
Previously, the Project Owner field on the Project Settings page displayed user IDs instead of user names.
The issue was caused by the Project Owner field incorrectly referencing user IDs. It is now resolved by extracting the project owner name from the project object and implementing logic to fetch users by name, ensuring that user names are displayed correctly in the Project Owner field.
- DSE-46860, DSE-46859, DSE-44806, DSE-43876: Cloudera AI Workbench web and Postgres database performance optimizations
Improved Postgres database caching and indexing of frequently queried data, improved Cloudera AI Web auto-refresh intervals, and optimized the open WebSocket livelog connections used by the web service.
- DSE-46220, DSE-46219, DSE-46218: Improved error handling in Cloudera AI Workbench web service
Improved error handling for web service failure paths that could previously result in service crashes.
- DSE-44023: Fixed issue with stopped applications getting started after Cloudera AI Workbench upgrades
Previously, the reconciler service restarted applications that users had intentionally stopped when a workspace was upgraded. After the fix, stopped applications are not restarted after a workbench upgrade or when the reconciler service itself is restarted.
- DSE-43704: Rename custom tee binary to cml-tee
Previously, certain vulnerability scanners could incorrectly flag Cloudera AI as using a vulnerable version of the `coreutils` package, specifically the `tee` command.
Cloudera AI services included a custom `tee` binary, developed entirely in-house by Cloudera, which was not based on the open-source `coreutils` library. The custom `tee` command, version 0.9, was mistakenly identified by some scanners as the `tee` command from `coreutils`, which contains known vulnerabilities.
This issue is now resolved by renaming the Cloudera custom `tee` binary to `cml-tee`, ensuring that it is no longer misidentified.
- DSE-44367: Buildkitd pod CrashLoopBackOff due to port conflicts
During the creation or upgrade of a Cloudera AI Workbench, buildkitd pods could occasionally enter a `CrashLoopBackOff` state. This issue occurred when the port used by BuildKit was not properly released during pod restarts or was occupied by another process. Users might have encountered errors such as the following:
```
buildkitd: listen tcp 0.0.0.0:1234: bind: address already in use
```
This issue is now resolved to prevent port conflicts and to ensure the stable functioning of Cloudera AI Workbench.
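The bind error above is straightforward to check for defensively; a small sketch (the port number is taken from the logged error, everything else is illustrative):

```python
import socket

def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if the TCP port can be bound, i.e. no conflict exists."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind((host, port))
            return True
        except OSError:  # EADDRINUSE -> "address already in use"
            return False

# BuildKit's listener in the error message above uses port 1234.
print(port_is_free(1234))
```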
- DSE-44682: Model deployment failing at build stage due to TLS issues
Previously, TLS-related issues could occasionally arise during the model build process in Cloudera AI Workbench when using Cloudera Embedded Container Service clusters. These issues occurred specifically when pulling images from the container registry and were caused by missing registry certificates on the worker nodes. The required certificates were expected to be located at the `/etc/docker/certs.d/` path.
This issue is now resolved to ensure successful model deployment and to eliminate TLS-related errors during the build stage.
