Fixed issues in Cloudera AI on premises 1.5.5 SP2 CHF1

This section lists the issues that have been fixed since the last release of Cloudera AI on premises.

Cloudera AI Workbench

DSE-51975: Some of the Model Metrics charts do not show data

Previously, several Model Metrics charts, such as model replica memory usage, request and response size, and CPU usage, did not display data because Kubernetes-level metric names had changed and the corresponding queries were not updated. This issue is now fixed. The query configurations are updated so that data is collected accurately and the charts display properly on the Model Metrics page.

DSE-44227: Save As in the Cloudera AI editor does not switch to a new file

Previously, the Save As functionality in the Cloudera AI editor did not switch the view to a newly saved file. The problem occurred because the editor failed to run the open command after saving, leaving users on the original file.

This issue is now fixed. The Save As feature now correctly switches the editor view to the newly saved file.

DSE-53131: Expose RocksDB configurations for customer-specific tuning

Previously, essential RocksDB configuration parameters for the livelog service used hardcoded defaults. Because RocksDB is the internal database of the livelog service and its performance characteristics can significantly influence livelog service behavior, these hardcoded settings limited performance optimization.

This issue is now resolved. Essential RocksDB configuration parameters can be tuned through ConfigMaps, providing greater flexibility for performance optimization while ensuring that existing data and session histories remain unaffected.

The following key RocksDB tuning parameters can now be adjusted:

max_bytes_for_level_base
max_bytes_for_level_multiplier
Additional recommended options from the RocksDB Tuning Guide

This improvement enables safer, more controlled performance tuning aligned with workload requirements.
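To illustrate what these two parameters control, the following minimal Python sketch computes per-level target sizes under RocksDB's static leveled compaction, where level N targets max_bytes_for_level_base multiplied by max_bytes_for_level_multiplier raised to the power N-1. It assumes dynamic level sizing (level_compaction_dynamic_level_bytes) is not in effect. The function name and the sample values are illustrative assumptions, not settings shipped with the livelog service.

```python
def level_target_sizes(max_bytes_for_level_base, max_bytes_for_level_multiplier, num_levels=7):
    """Return {level: target size in bytes} for L1..L(num_levels - 1).

    Assumes static leveled compaction: target(L1) is the base size and
    each deeper level is `multiplier` times larger than the previous one.
    """
    sizes = {}
    for level in range(1, num_levels):
        factor = max_bytes_for_level_multiplier ** (level - 1)
        sizes[level] = int(max_bytes_for_level_base * factor)
    return sizes


MIB = 1024 * 1024

# Illustrative values only: a 256 MiB base with a multiplier of 10.
for level, size in level_target_sizes(256 * MIB, 10, num_levels=4).items():
    print(f"L{level}: {size // MIB} MiB")
```

A smaller base or multiplier keeps each level smaller and triggers compaction sooner, while larger values allow more data per level before compaction starts. Appropriate values depend on the workload.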

DSE-43983: Excessive memory usage in livelog resource during scale testing

Previously, the livelog service exhibited excessive memory usage during scale testing and deployments. This is now resolved by enabling RocksDB Direct I/O, replacing the glibc memory allocator with jemalloc for more efficient memory allocation, and capping the t.last cache in Go to prevent staircase-style memory growth. These improvements significantly reduce memory pressure and improve livelog service performance under load.

DSE-56081: Users cannot pass custom headers in application

Previously, applications could not send custom headers during cross-origin inter-application communication because of a CORS regression. Additionally, Browser Accessible Service (BAS) incorrectly set credentialed CORS headers for all subdomains. The issue is now fixed. Custom header support and cross-origin inter-application communication are now fully restored, and CORS handling in BAS is corrected to eliminate security vulnerabilities.

The fix includes the following key corrections and improvements:

Replaced the global CORS middleware with targeted handling scoped specifically to GET requests from the main Cloudera AI domain only.
Exempted OPTIONS requests from authentication to allow preflight checks to pass.
Restricted Access-Control-Allow-Origin and Access-Control-Allow-Credentials headers to main-domain GET requests only.
Added strict origin validation that requires an exact hostname match, preventing subdomain spoofing and attacks that add an attacker-controlled prefix or suffix to the hostname.
Restored support for custom headers, such as refresh-token and access-token, in frontend-to-backend application communication.

Securing applications on Cloudera AI Workbench is a shared responsibility between Cloudera and the users. While the platform enforces secure defaults and strict CORS behavior for core components, users are responsible for defining correct CORS policies and header handling for their own applications. Misconfigured or overly permissive CORS settings in user workloads can result in functional issues or security vulnerabilities.
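For user applications that define their own CORS policy, the exact-hostname check described above could look like the following minimal Python sketch. It is not the platform's implementation. The function names, the example hostname ml.example.com, and the HTTPS-only requirement are assumptions made for illustration.

```python
from urllib.parse import urlsplit


def is_allowed_origin(origin, allowed_host):
    """Return True only if the Origin header's hostname equals allowed_host exactly."""
    try:
        parts = urlsplit(origin)
    except ValueError:
        return False
    # Assumption: only HTTPS origins are trusted.
    if parts.scheme != "https" or parts.hostname is None:
        return False
    # Exact comparison: no startswith/endswith, so prefix and suffix tricks fail.
    return parts.hostname == allowed_host.lower()


def cors_headers(method, origin, allowed_host):
    """Return credentialed CORS headers for GET requests from the allowed host only."""
    if method.upper() != "GET" or not is_allowed_origin(origin, allowed_host):
        return {}
    return {
        "Access-Control-Allow-Origin": origin,
        "Access-Control-Allow-Credentials": "true",
        "Vary": "Origin",
    }


print(cors_headers("GET", "https://ml.example.com", "ml.example.com"))
print(cors_headers("GET", "https://ml.example.com.evil.net", "ml.example.com"))
```

Comparing the parsed hostname for equality, rather than matching substrings of the raw header, is what rejects origins such as ml.example.com.evil.net or evil-ml.example.com.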

Jobs and pipelines

DSE-51041: Job Retry changes are not persisting

Previously, job retry settings were saved but were not displayed correctly at the job level in the UI because of missing React properties. The issue is now fixed by restoring the missing properties. The UI now properly displays the saved job retry settings.

DSE-50513: Unable to terminate cron job

A bug prevented users from stopping stuck or problematic recurring job instances through the UI. Previously, a recurring job could be marked as Failed even if an existing instance was still running. In such cases, the Stop button would not be displayed, leaving users unable to terminate the stuck instances.

This issue is now fixed by adding a Stop button to the job history page, enabling users to stop non-terminal job runs directly from the interface. The Stop button is displayed only for job runs in non-terminal states and includes a confirmation modal along with permission checks to ensure safe and secure operation.

DSE-48646: Web pods crashing (OOM) when job emails are sent with larger attachments

Previously, web pods could crash with out-of-memory (OOM) errors when job notification emails included large attachments or excessive console logs embedded directly in the email body. This issue is now fixed. Only the last 50 lines of console output are included in the email body, while larger logs are sent as attachments when within allowed limits, preventing large logs from being inlined as HTML content.

The fix also introduces a configurable attachment size limit (25 MB by default, adjustable between 10 and 100 MB) in the site administration settings. Administrators can control the maximum total attachment size. If the combined size exceeds the configured limit, attachments are omitted and replaced with platform access links in the email body.

Additionally, email delivery is now more resilient to SMTP constraints. If attachments (even within the configured limit) are rejected by the mail server due to payload size restrictions, the system automatically falls back to sending a links-only email instead of failing the notification.
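The behavior described above can be summarized in a minimal Python sketch. It only illustrates the decision logic and is not the product code. The names plan_email, send_with_fallback, and the send callable are hypothetical, and treating 1 MB as 1024 * 1024 bytes is an assumption.

```python
import smtplib
from dataclasses import dataclass

MIB = 1024 * 1024        # assumption: 1 MB is counted as 1024 * 1024 bytes
BODY_TAIL_LINES = 50     # only the last 50 console lines go into the body


@dataclass
class EmailPlan:
    body_log: str   # console output tail to inline in the email body
    attach: bool    # True if attachments fit within the configured limit


def clamp_limit_mb(requested_mb, low=10, high=100):
    """Keep the administrator-configured limit within the 10 to 100 MB range."""
    return max(low, min(high, requested_mb))


def plan_email(console_log, attachment_sizes, limit_mb=25):
    """Decide what to inline and whether the attachments may be sent."""
    tail = "\n".join(console_log.splitlines()[-BODY_TAIL_LINES:])
    fits = sum(attachment_sizes) <= clamp_limit_mb(limit_mb) * MIB
    return EmailPlan(body_log=tail, attach=fits)


def send_with_fallback(send, plan):
    """Try attachments first; fall back to a links-only email on SMTP rejection.

    `send` is a hypothetical callable(body, with_attachments) that raises
    smtplib.SMTPException when the mail server rejects the message.
    """
    if plan.attach:
        try:
            send(plan.body_log, True)
            return "attachments"
        except smtplib.SMTPException:
            pass  # for example, the server rejected the payload size
    send(plan.body_log, False)
    return "links-only"
```

With this structure, a notification is still delivered when the attachments are too large for either the configured limit or the mail server.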

ML Runtime and Runtime add-ons

DSE-49301: Missing old topic cleanup for livelog container

Previously, the livelog cleaner service did not remove old topics efficiently. As a result, stale data could accumulate due to earlier livelog cleaner errors, manual deletions from the dashboard table, or prior livelog database compaction. This issue is now resolved. A new configurable cleanup endpoint is introduced to delete topics older than a specified retention period. The solution supports multiple timestamp formats across PBJ, WB, and Engine workloads and can remove workload‑related topics that the livelog cleaner failed to delete, including topics whose primary references are missing.
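As a rough illustration of retention-based cleanup across several timestamp formats, the following Python sketch selects topics older than a retention period. The formats handled here (epoch seconds, epoch milliseconds, and ISO 8601), the function names, and the topic names are assumptions for illustration. The actual formats used by PBJ, WB, and Engine workloads and the cleanup endpoint itself are not described in this document.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def parse_timestamp(raw: str) -> Optional[datetime]:
    """Parse epoch seconds, epoch milliseconds, or ISO 8601 into a UTC-aware datetime."""
    raw = raw.strip()
    if raw.isdigit():
        value = int(raw)
        # Heuristic assumption: 13 or more digits means milliseconds.
        seconds = value / 1000 if value >= 10**12 else value
        return datetime.fromtimestamp(seconds, tz=timezone.utc)
    try:
        parsed = datetime.fromisoformat(raw)
    except ValueError:
        return None  # unknown format: leave the topic alone
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def stale_topics(topics, retention_days, now):
    """Return the names of topics created before now minus the retention period."""
    cutoff = now - timedelta(days=retention_days)
    stale = []
    for name, raw_created in topics.items():
        created = parse_timestamp(raw_created)
        if created is not None and created < cutoff:
            stale.append(name)
    return sorted(stale)
```

Topics whose timestamp cannot be parsed are kept rather than deleted, which is the conservative choice for a cleanup job.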

DSE-51856: Jobs run with API v2 are missing Ozone Runtime add-ons

Previously, jobs run through API v2 were missing the Ozone runtime add-ons because API v2 did not load the latest Ozone runtime add-ons at the start of the workload.

This issue is now fixed. API v2 now correctly validates and includes the latest Ozone runtime add-ons in workloads.

DSE-36561: The Update application operation fails when enable_runtime_addons is disabled

Previously, updating an application failed when the Allow users to use ML runtime addons feature flag was disabled on the Site Administration page in the UI. The issue occurred because the backend expected an add-ons argument to always be present, leading to validation errors when the feature flag was turned off.

This issue is now fixed by modifying the backend handler to validate add-ons only when the feature flag is enabled.

Upgrade

DSE-50712: On premises 1.5.5 SP2 Cloudera AI Registry on OpenShift Container Service: 1.5.4 SP2 to 1.5.5 SP2 upgrade fails with knox Init:ImagePullBackOff failure

Previously, upgrading the on premises Cloudera AI Registry from 1.5.4 SP2 to 1.5.5 SP2 failed with an ImagePullBackOff error during Knox initialization. This issue occurred only on OpenShift Container Service 4.19 and higher versions due to strict requirements on the secret type used for the imagePullSecret.

The issue is now fixed. The secret creation process now either explicitly specifies the dockerconfigjson secret type or uses a similar API from the Cloudera AI Inference service codebase, which ensures compatibility with the stricter requirements.
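For reference, an image pull secret with an explicit type has the shape shown in the following minimal Python sketch, which builds a Kubernetes Secret manifest of type kubernetes.io/dockerconfigjson. The function name and all argument values are placeholders chosen for illustration. This is not the code used by the upgrade process.

```python
import base64
import json


def build_image_pull_secret(name, namespace, registry, username, password):
    """Build a Secret manifest (as a dict) with an explicit dockerconfigjson type."""
    auth = base64.b64encode(f"{username}:{password}".encode()).decode()
    docker_config = {
        "auths": {
            registry: {"username": username, "password": password, "auth": auth}
        }
    }
    encoded = base64.b64encode(json.dumps(docker_config).encode()).decode()
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        # Explicit type; this type requires the ".dockerconfigjson" data key.
        "type": "kubernetes.io/dockerconfigjson",
        "data": {".dockerconfigjson": encoded},
    }


# Placeholder values for illustration only.
secret = build_image_pull_secret(
    "example-pull-secret", "example-namespace",
    "registry.example.com", "example-user", "example-password",
)
print(json.dumps(secret, indent=2))
```

A secret created without an explicit type defaults to the Opaque type, so setting the type explicitly makes it clear that the secret holds registry credentials for use as an imagePullSecret.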

DSE-50441: Unable to register or deploy Model with the 1.5.4_h30 → 1.5.5 SP1 → 1.5.5 SP2 upgrade path and DSE-50487: Add workaround for upgrading AI registry from 1.5.4 - 1.5.5

Previously, an issue in the Cloudera AI Registry upgrade path (1.5.4_h30 → 1.5.5 SP1 → 1.5.5 SP2) could prevent model registration or deployment. During the upgrade, a new Cloudera AI Registry instance was provisioned with a fresh Persistent Volume (PV), which was later replaced with the legacy PV to preserve existing data. A race condition allowed the database schema migration to run on the fresh PV before the volume swap completed. When the legacy PV was attached afterward, it did not contain the required schema updates, resulting in query failures and deployment errors.

This issue is now fixed. The upgrade process is updated to automatically restart the v1 pod after the PV transfer so that schema migration runs against the correct volume, ensuring consistent storage state and successful model operations.