Cloudera Data Services on premises 1.5.5 SP2 Release Summary
This release summary covers Cloudera Data Services on premises 1.5.5 SP2, which introduces significant enhancements across the platform, focusing on Cloudera Observability, Cloudera AI, Cloudera Data Engineering, Cloudera Data Warehouse, and Cloudera Data Catalog.
Platform Cloudera Control Plane
- Certificate Management Support for OpenShift Container Platform (OCP): Cloudera now provides out-of-the-box support for Venafi Trust Protection Platform (TPP) and integrates with the open-source tool cert-manager to automate the provisioning, management, and renewal of TLS certificates for secure communication.
- Custom Annotation Support in Certificate Manager: This feature allows custom mandatory Venafi fields to be included in certificate issuance API requests for TPP by automatically injecting the required custom field annotations (e.g., venafi.cert-manager.io/custom-fields) into CertificateRequest objects.
- Dedicated Node Support in Embedded Container Service (ECS): New options have been added under the node_taint property for Cloudera Embedded Container Service Host Configuration to reserve nodes exclusively for:
  - Dedicated Cloudera Data Engineering Node for Data Engineering services.
  - Dedicated Cloudera AI Infra Node for Cloudera AI infrastructure services.
- Ingress Certificate Secret Synchronization: A reflector has been added to continuously mirror secrets from the openshift-ingress namespace to the istio-ingress namespace, ensuring that the default OpenShift certificate is used by the Istio ingress for TLS termination.
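As a rough illustration of the custom annotation support described above, the following sketch builds the JSON value that cert-manager's Venafi issuer reads from the venafi.cert-manager.io/custom-fields annotation. Only the annotation key comes from this summary. The helper name and the field names and values (Environment, CostCenter) are hypothetical, and the way Cloudera's Certificate Manager injects the annotation internally is not shown here.

```python
import json

# Annotation key named in the release summary.
ANNOTATION_KEY = "venafi.cert-manager.io/custom-fields"


def custom_fields_annotation(fields):
    """Return an annotations dict holding the custom fields as a JSON list.

    `fields` maps a Venafi custom field name to its value. The names used
    in the example call below are hypothetical placeholders.
    """
    payload = [{"name": name, "value": value} for name, value in fields.items()]
    return {ANNOTATION_KEY: json.dumps(payload)}


annotations = custom_fields_annotation(
    {"Environment": "Production", "CostCenter": "1234"}
)
print(annotations[ANNOTATION_KEY])
```

The resulting dictionary could be merged into the metadata.annotations of a certificate resource. Which fields are mandatory depends on the TPP policy in your organization.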
Cloudera Observability
- Cloudera AI on premises integration (non-air-gapped): Cloudera Observability now integrates with Cloudera AI on premises in non-air-gapped environments to help optimize AI workbench performance and manage resource utilization.
- Monitoring Cloudera Data Warehouse with Cloudera Observability: This integration allows for collecting query and job-level telemetry. The resulting dashboards can be used to optimize Hive and Impala queries, monitor performance, and ensure SLA compliance.
- Monitoring Cloudera Data Engineering with Cloudera Observability: Integrating Cloudera Data Engineering (CDE) with Cloudera Observability collects Spark job telemetry and monitors resource usage. You can then optimize Spark workloads, troubleshoot issues, and access historical analysis directly from the CDE job runs page.
Cloudera AI
Generative AI Readiness
- Cloudera AI Inference Service is now Generally Available (GA), enabling access to new features, enhanced performance, and updated components.
- Importing extra-large LLMs is more reliable: the model download process has been modified to prevent errors with very large files, and the Cloudera AI Registry service can be configured with a single replica to manage storage constraints.
- New documentation is available detailing the supported transformer model architectures (vLLM 0.8.5 and vLLM 0.8.4).
Usability and Performance
- Improved UI performance reduces waiting times with faster page loads for key landing, session launch, and listing pages.
- Hadoop and Spark runtimes are aligned with Cloudera Data Engineering (CDE) and Data Hub builds to ensure seamless code movement and integration.
- Traffic management is more robust with enhanced encryption, deeper observability, and the addition of Istio Gateway API support alongside NGINX for Embedded Container Service-based environments.
- Diagnostic bundle support is available for both Cloudera AI Registry and Inference Service to streamline troubleshooting.
Governance and Resource Control
- Workbench Quota Management is introduced as a Technical Preview, allowing administrators to define maximum limits for memory and compute usage per workbench.
- Team quotas can now be applied in mixed-GPU environments, enabling precise allocation of specific hardware types to different teams.
Security Improvements
- Non-transparent proxy support on the Cloudera AI Inference service ensures strict network control by enforcing all outbound communication through the proxy, which is crucial for air-gapped environments.
- Routine security improvements have been implemented to remediate Common Vulnerabilities and Exposures (CVEs).
Cloudera Data Engineering
- Monitoring with Cloudera Observability: You can now monitor Cloudera Data Engineering metrics with Cloudera Observability to collect service metrics, view resource usage for virtual clusters, optimize Spark workloads, and access historical job analysis.
- OpenShift Container Platform (OCP) Integration:
  - Support for integrating a third-party certificate manager (Venafi TPP).
  - Support for Privacy Settings for Virtual Clusters.
  - Support for User Access Management and Artifact access management.
- Dedicated ECS Nodes: New support is added for dedicating Cloudera Embedded Container Service nodes exclusively for Cloudera Data Engineering.
- Component Upgrades:
  - Support for Spark 3.5.x with Cloudera Runtime 7.1.9 SP1 (for security hardened images).
  - The underlying Kubernetes version used by Cloudera Data Engineering is upgraded to 1.32.
Important Upgrade and Configuration Changes
- Endpoint Stability Not Supported: Endpoint stability is not supported when upgrading CDE on OCP to 1.5.5 SP2 or higher. The links to your CDE Service and Virtual Cluster will change after the upgrade.
- Default Disabled Feature: GPU acceleration support is disabled by default. You must contact Cloudera Support to enable this feature.
- Command Change: The restore-service command is changed to clone-service for upgrades with OCP.
Cloudera Data Warehouse
- Secure Ozone S3A Configuration: Supports a Kubernetes-native security framework for authenticating Hive and Impala Virtual Warehouses with Apache Ozone, replacing plain-text credentials with an encrypted Java KeyStore (JCEKS) vault managed as a Kubernetes Secret.
- Diagnostic Bundle Support: Now supports generating diagnostic bundles for Cloudera Data Visualization workloads.
- Enforced Minimum Timeout: A minimum timeout of 120 seconds is enforced for Virtual Warehouse scale-down configurations (Trigger Shutdown Delay for Impala and Auto-suspend Timeout for Hive).
- Custom Kerberos Support: Allows administrators to specify a custom hostname for the Kerberos service principal during environment activation to ensure compatibility with enterprise DNS and security standards.
Hive
- Upgrading Calcite: Hive is upgraded to Calcite version 1.33, which introduces various query optimizations for improved performance.
- Hive on ARM Architecture: Hive is now fully supported on ARM architecture instances, including AWS Graviton and Azure ARM.
Impala
- Impala AES Encryption and Decryption: Supports AES encryption and decryption, with AES-GCM as the default mode.
- Query Cancellation Enhancement: Queries can now be cancelled even while they are in the Frontend stage (analysis and planning).
- Improved Memory Estimation: Uses a more realistic approach to memory estimation for large operations (SORT, AGGREGATION, HASH JOIN) by considering the operator’s ability to spill data to disk and capping the memory estimate based on cluster limits. A new query option, MEM_ESTIMATE_SCALE_FOR_SPILLING_OPERATOR, is introduced for control.
- Expose Query Cancellation Status to UDF: The query cancellation status is now exposed to the User-Defined Function (UDF) interface, allowing complex UDFs to stop work and fail fast if the query is cancelled.
- Legacy Timestamp Conversion Support: Impala now checks for and supports Hive’s legacy timestamp conversion method when reading Parquet or Avro files, ensuring consistent interpretation of historical timestamps.
- Impala-shell Output: Impala-shell now shows row count and elapsed time for most statements in HiveServer2 mode when using the --beeswax_compat_num_rows option.
- Arbitrary Encodings: Supports reading from and writing to Text and Sequence files using arbitrary character encodings (e.g., GBK) beyond the default UTF-8.
- Expanded Compression Levels: Extended configurable range of compression levels for ZSTD and ZLIB (GZIP/DEFLATE) codecs.
- Constant Folding Support: Constant folding optimization is now supported for non-ASCII and binary strings, which improves filter performance.
- Catalogd and Event Processor Improvements: Includes faster inserts for partitioned tables (using the batch insert API), quicker table administration (batching updates for DROP STATS/CACHED status), more reliable table renames, more efficient partition refreshes, and support for REFRESH on multiple partitions in a single command.
- New Query Options for Metadata Synchronization: New options, SYNC_HMS_EVENTS_WAIT_TIME_S and SYNC_HMS_EVENTS_STRICT_MODE, allow users to configure Impala to wait for metadata synchronization from the Hive Metastore (HMS) to ensure consistency.
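To make the two metadata synchronization options above more concrete, here is a minimal sketch that builds the corresponding SET statements, assuming the options are set per session like other Impala query options. The helper name and the 30-second wait are hypothetical; only the option names come from this summary.

```python
def hms_sync_statements(wait_time_s, strict_mode=False):
    """Build Impala SET statements for the two HMS sync query options.

    wait_time_s: seconds to wait for HMS events to be synced (assumed to
                 be a non-negative whole number).
    strict_mode: value for SYNC_HMS_EVENTS_STRICT_MODE.
    """
    if wait_time_s < 0:
        raise ValueError("wait_time_s must be non-negative")
    strict = "true" if strict_mode else "false"
    return [
        f"SET SYNC_HMS_EVENTS_WAIT_TIME_S={int(wait_time_s)};",
        f"SET SYNC_HMS_EVENTS_STRICT_MODE={strict};",
    ]


# Example: statements to run in a session before querying a table
# that was just modified through Hive.
for statement in hms_sync_statements(30, strict_mode=True):
    print(statement)
```

The statements would be issued in the same session as the query that needs up-to-date metadata, for example through impala-shell or a JDBC connection.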
Trino
General Availability (GA) of Trino in Cloudera Data Warehouse (Transitioned from Technical Preview in 1.5.5 SP1)
The GA release of Trino in Cloudera Data Warehouse introduces several key capabilities:
- Trino Virtual Warehouses: Full support is now available for creating and managing Trino Virtual Warehouses, enabling efficient querying across diverse, large datasets.
- Federation and Connectivity: Trino Federation Connectors let you connect to and manage various remote data sources, including the new Teradata custom connector. A dedicated connector management UI and backend facilitate the creation and configuration of these connectors.
- Security and Governance: Governance is enforced by default through Apache Ranger. Platform-wide Ranger policies are automatically applied and inherited by Trino, ensuring access authorization, column masking, and row filtering.
- Performance Optimization: Built-in capabilities for auto-suspend and auto-scaling are supported.
- Support for Ozone file system: You can now configure the Hive Metastore (HMS) in Database Catalogs to use the Ozone filesystem for Trino Virtual Warehouses. By default, HMS points to HDFS, but this feature allows you to set Ozone as the default storage system, enabling efficient and scalable data management.
- Support for Teradata connector (Technical Preview): Cloudera Data Warehouse now supports a read-only Trino-Teradata connector. The connector is designed for SELECT operations on Teradata sources, operates in ANSI Mode, and optimizes performance by pushing down filters and aggregates.
- Connection pooling for JDBC-based connectors: You can now configure connection pooling for JDBC-based Trino connectors, such as MySQL, PostgreSQL, MariaDB, Teradata, and Oracle. Connection pooling improves performance, resource utilization, and stability when querying different data sources with Trino.
- Cloudera Data Visualization Integration: Trino connector is available for use within Cloudera Data Visualization for interactive dashboarding and analytics.
Iceberg
- Iceberg Scan Metrics in Query Profiles: Impala query profiles now integrate Iceberg scan metrics (such as total planning time, file counts, and skipped files) into the Frontend section, providing deeper insight into query planning performance.
- Delete Orphan Files: New syntax is available to remove orphan files from Iceberg tables based on an older_than timestamp or time interval (e.g., now() - interval 5 days), helping to manage table directory size.
- Forced Predicate Pushdown: A new table property, impala.iceberg.push_down_hint, allows users to force predicate pushdown for specific columns, overriding the default optimization behavior to potentially improve file pruning and query performance.
- Optimized UPDATE Operations: The UPDATE statement for Iceberg tables is optimized to automatically skip rows that already contain the target value, which reduces unnecessary writes of new data and delete records.
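The effect of the optimized UPDATE described above can be pictured with a small in-memory sketch. This is a conceptual model only, not Impala's implementation: the function name and sample rows are hypothetical, and SQL NULL semantics are simplified to Python's inequality check.

```python
def rows_to_rewrite(rows, column, target, predicate=lambda row: True):
    """Select the rows an UPDATE actually needs to rewrite.

    A row is rewritten only if it matches the WHERE predicate and its
    current value in `column` differs from the target value. Rows that
    already hold the target value are skipped, so no new data record or
    delete record would be written for them.
    """
    return [
        row for row in rows
        if predicate(row) and row.get(column) != target
    ]


rows = [
    {"id": 1, "status": "active"},
    {"id": 2, "status": "inactive"},
    {"id": 3, "status": "active"},
]

# Models: UPDATE t SET status = 'active'
changed = rows_to_rewrite(rows, "status", "active")
print([row["id"] for row in changed])  # prints [2]
```

In this model, two of the three rows are skipped because they already contain the target value, which is the source of the reduced write volume.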
Cloudera Data Catalog
Tag Rules have been redesigned and expanded for Compute Cluster enabled environments within Cloudera Data Catalog:
- New Tagging Capability: You can now profile table names in addition to column values or column names.
- Granular Tags: Atlas classifications (Cloudera Data Catalog tags) can be used more granularly with the distinction between parent and child tags.
- Performance: The initial loading time of rules has been decreased.
- Configuration Flexibility:
  - You can specify weightage for column-value-based matching (where column weightage and column name weightage total 100%).
  - When profiling column values, you can now upload a sample set of column values instead of defining a regex pattern.
- Simplified Management:
  - You can upload regex patterns in CSV files for easier handling.
  - The new Tag Rules tab offers filters for faster searching and displays applied parent/child tags, tag rule status, and rule types.
- Deployment Safety:
  - You can review your configuration before finalizing a tag rule.
  - A Dry Run is required to test tag rules with actual table data before deploying them.
- API Access: New API calls related to these features are available.
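The weightage setting listed under Configuration Flexibility can be read as a weighted combination of two match results. The sketch below shows one plausible interpretation, assuming the two weightages apply to column-value matching and column-name matching respectively. The scoring formula, the function name, and the sample numbers are assumptions for illustration; this summary does not describe how Cloudera Data Catalog computes the final match.

```python
def combined_match_score(value_match, name_match, value_weight, name_weight):
    """Combine two match percentages using weights that must total 100.

    value_match:  percentage of sampled column values matching the rule.
    name_match:   percentage score for the column name match.
    value_weight: weightage given to the column value match.
    name_weight:  weightage given to the column name match.
    """
    if value_weight + name_weight != 100:
        raise ValueError("weightages must total 100")
    return (value_match * value_weight + name_match * name_weight) / 100


# 80% of values match, the name match scores 50%, weighted 70/30.
print(combined_match_score(80, 50, 70, 30))  # prints 71.0
```

A Dry Run against actual table data remains the way to confirm how a given weightage split behaves for a real tag rule.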
