CDP Public Cloud: November 2023 Release Summary

Data Warehouse

This release of the Cloudera Data Warehouse (CDW) service on CDP Public Cloud introduces these changes.

Cloudera Data Warehouse Public Cloud 1.8.1-b248 changes, described in more detail below:

Azure AKS 1.27 upgrade

Cloudera now supports Kubernetes version 1.27. In 1.8.1-b248 (released November 20, 2023), when you activate an environment, CDW automatically provisions Azure Kubernetes Service (AKS) 1.27. To upgrade to AKS 1.27 from CDW 1.7.3 or earlier, you must back up and restore CDW. To avoid compatibility issues between CDW and AKS, upgrade to version 1.27.

Using the Azure CLI or Azure portal to upgrade the AKS cluster is not supported. Doing so can cause the cluster to become unusable and can cause downtime. For more information about upgrading, see Upgrading an Azure Kubernetes Service cluster for CDW.

New AWS instance type

This release supports the r6id.4xlarge AWS compute instance type. You select this instance type, or another supported one, when you activate your environment in CDW.

Upgraded AWS and Azure environment security

This release upgrades Istio security required for future Kubernetes compatibility. Hive Virtual Warehouses you create in 1.8.1-b248 (released November 20, 2023) run Istio 1.19.0. Because the new Istio version supports only new versions of the Hive Helm charts, the following limitation exists: if you have the CDW_VERSIONED_DEPLOY entitlement, only new Hive image versions appear in the UI when you create a new Hive Virtual Warehouse. For more information, see the known issue about Limited Hive image versions.

Diagnostic bundles for troubleshooting Data Visualization problems

You can collect a diagnostic bundle for troubleshooting Data Visualization, in addition to bundles for a Virtual Warehouse, Database Catalog, and the environment/cluster. You can download the diagnostic bundle from the UI for troubleshooting. For more information, see Diagnostic bundles for CDW and Kubernetes.

Resizing a Data Visualization instance

The size of your Data Visualization instance is critical for achieving cost and performance goals. In CDW, you can change the size of a Data Visualization instance after creating it: open the instance for editing and, in Size, select the size you want. When you click Apply Changes, the new size takes effect.

Reduced permissions mode enhancement and tag change

The tag key that permissions depend on has been changed from envID to Cloudera-Resource-Name, which makes the permission conditions stricter. For more information, see Reduced permissions mode template and Minimum set of IAM permissions required for reduced permissions.

dbt adapters for using dbt with Hive, Impala and CDP

You can access the dbt adapters for Hive and Impala from the Cloudera Data Warehouse service; they enable you to use the dbt data management workflow with Cloudera Data Platform. For more information, see Using dbt with Hive, Impala and CDP.

Cloudera Data Warehouse Public Cloud Runtime 2023.0.16.0-150 changes:

Hue

Hue supports natural language query processing (Preview)

Hue leverages large language models (LLMs) to help you generate SQL queries from natural language prompts, and also provides options to optimize, explain, and fix queries, ensuring efficiency and accuracy in data retrieval and manipulation. You can use several AI services and models, such as OpenAI’s GPT service, Amazon Bedrock, and Azure’s OpenAI service, to run the Hue SQL AI assistant. See SQL AI Assistant in Data Warehouse Public Cloud.

Ability to deploy Hue at the environment level (Preview)

Previously, you could run Hue only at the Virtual Warehouse level. As a result, you would lose all query history when you deleted or shut down a Virtual Warehouse. CDW now allows you to deploy Hue at the environment level and to select a Virtual Warehouse from the Hue web interface. Query history and saved queries are retained as long as the environment is active. See Deploying Hue at Environment Level in Data Warehouse Public Cloud.

Iceberg

Enhancement of expiring Iceberg snapshots

In this release, you have more flexibility to expire snapshots. In addition to expiring snapshots older than a timestamp, you can now expire snapshots based on the following conditions:

  • A snapshot having a given ID
  • Snapshots having IDs matching a given list of IDs
  • Snapshots within the range of two timestamps

You can keep snapshots you are likely to need, for example recent snapshots, and expire old ones: you can keep daily snapshots for the last 30 days, weekly snapshots for the past year, and monthly snapshots for the last 10 years. You can also remove specific snapshots to meet GDPR right-to-be-forgotten requirements.
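
The following Hive statements sketch what these options can look like. The table name, snapshot IDs, and timestamps are hypothetical, and the exact argument forms may vary; see the Iceberg documentation in CDW for the precise syntax.

-- Expire snapshots older than a timestamp (existing behavior):
ALTER TABLE ice_sales EXECUTE EXPIRE_SNAPSHOTS('2023-10-01 00:00:00');

-- Expire a snapshot with a given ID, or a list of snapshot IDs:
ALTER TABLE ice_sales EXECUTE EXPIRE_SNAPSHOTS('3082457589381935903');
ALTER TABLE ice_sales EXECUTE EXPIRE_SNAPSHOTS('3082457589381935903,5426351367352734923');

-- Expire snapshots within the range of two timestamps:
ALTER TABLE ice_sales EXECUTE EXPIRE_SNAPSHOTS BETWEEN '2023-01-01 00:00:00' AND '2023-06-30 00:00:00';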

Truncate partition support for Iceberg

This release introduces the capability to truncate an Iceberg table. Truncation removes all rows from the table and creates a new snapshot. Truncation works for both partitioned and unpartitioned tables.
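
For example, the following statement empties a hypothetical Iceberg table. Because truncation creates a new snapshot, earlier snapshots remain available for time travel until they expire.

-- Remove all rows and record the truncation as a new snapshot:
TRUNCATE TABLE ice_sales;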

Insert into partition and insert overwrite partition support for Iceberg

From Hive you can insert into, or overwrite data in, Iceberg tables that are statically or dynamically partitioned. For syntax and limitations, see Insert into/overwrite partition support.
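
As a sketch, using standard Hive partition syntax on a hypothetical Iceberg table ice_emp partitioned by dept; see the linked topic for the Iceberg-specific syntax and limitations.

-- Static partition: the partition value is given in the clause.
INSERT INTO TABLE ice_emp PARTITION (dept = 'finance') VALUES ('alice', 101);

-- Dynamic partition overwrite: partition values come from the query itself.
INSERT OVERWRITE TABLE ice_emp PARTITION (dept) SELECT name, id, dept FROM emp_staging;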

Iceberg branching and tagging technical preview

From Hive, you can manage the lifecycle of snapshots using the Iceberg branching and tagging features. Branches are references to snapshots that have a lifecycle of their own. Tags identify snapshots you need for auditing and conforming to GDPR. Branching and tagging are available as a technical preview. Cloudera recommends that you use this feature in test and development environments; it is not recommended for production deployments.
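
A minimal sketch of the Hive DDL, assuming a hypothetical table ice_sales; consult the documentation for the full syntax, including snapshot selection and retention options.

-- Create a branch whose lifecycle is independent of the main table history:
ALTER TABLE ice_sales CREATE BRANCH fix_2023_11;

-- Create a tag that pins the current snapshot, for example for an audit:
ALTER TABLE ice_sales CREATE TAG audit_2023_11;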

Impala

Impala performance optimization for VALUES() expressions

Rewriting expressions in the Impala VALUES() clause can hurt performance. When Impala evaluates an expression only once, especially a constant expression, the overhead of a rewrite might outweigh its potential benefit. Consequently, in this release, Impala skips expression rewrites entirely for the VALUES() clause during the analysis phase, except for rewrites of expressions separated by the compound vertical bar (||). These expressions are instead evaluated and materialized in the backend rather than in the analysis phase.

The drop in performance caused by rewriting expressions might not follow a straightforward linear pattern. This drop becomes more pronounced as the number of columns increases. Using code generation for constant expressions in this context does not provide significant value. As part of this optimization, code generation is turned off for constant expressions within a UNION node if the UNION node is not within a subplan. This applies to all UNION nodes with constant expressions, not just those associated with a VALUES clause.

Here are some example queries for which expression rewrites and code generation are disabled for the UNION operator and VALUES clause.

select 1+2+3+4+5,2*1-1,3*3 union all select 1+2+3+4,5,6 union all select 7+1-2,8+1+1,9-1-1-1;

insert into test_values_codegen values
  (1+1, '2015-04-09 14:07:46.580465000', base64encode('hello world')),
  (CAST(1*2+2-5 as INT), CAST(1428421382 as timestamp),
   regexp_extract('abcdef123ghi456jkl','.*?(\\d+)',0));

Impala skips scheduling bloom filter from full-build scan

PK-FK joins between a dimension table and a fact table are common occurrences in a query. Such joins often do not involve any predicate filters in the dimension table. As a result, a bloom filter generated from this kind of dimension table scan (PK) will most likely contain all values from the fact table column (FK). It becomes ineffective to generate this filter because it is unlikely to reject any rows, especially if the bloom filter size is large and has a high false positive probability (FPP) estimate.

As part of this optimization, Impala skips scheduling a bloom filter from a join node whose build side is a full scan of the dimension table with no predicate filters, because such a filter is unlikely to eliminate any rows.
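
For example, in a hypothetical star-schema query such as the first one below, the dimension scan has no predicates, so a bloom filter built from dim_region.id would accept virtually every fact row and is skipped. With a selective predicate on the dimension table, as in the second query, the filter is worth generating because it can eliminate fact rows.

-- No predicate on the dimension table: the runtime bloom filter is skipped.
SELECT f.order_id, d.region_name
FROM fact_orders f JOIN dim_region d ON f.region_id = d.id;

-- Selective predicate on the dimension table: the filter can reject fact rows.
SELECT f.order_id, d.region_name
FROM fact_orders f JOIN dim_region d ON f.region_id = d.id
WHERE d.region_name = 'EMEA';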

Impala events marked as ‘Skip’ occur prior to a manual REFRESH

If a table has been manually refreshed, the event processor skips any events that occurred before the manual refresh. This optimization helps catalogd when it lags behind in processing events: event processing now determines whether a manual refresh was executed after an event's eventTime, which lets catalogd catch up with HMS events quickly. To activate this optimization, you must set enable_skipping_older_events to true.

Allow setting separate mem_limit for coordinators

The existing MEM_LIMIT query option applies to all Impala coordinators and executors, so the same amount of memory is reserved on each. Coordinators, however, typically only coordinate the query and may not require all of the estimated memory. Reserving the full estimate on coordinators reduces the memory available for use by other queries.

The new MEM_LIMIT_COORDINATORS query option functions similarly to the MEM_LIMIT option but sets the query memory limit only on coordinators. This new option addresses the issue with MEM_LIMIT and is recommended in scenarios where the query needs higher or lower memory on coordinators than the planner estimates. Note: The MEM_LIMIT_COORDINATORS query option does not work in conjunction with MEM_LIMIT. If you set both, only MEM_LIMIT is applied.
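
An illustrative session with hypothetical values:

-- Cap query memory on coordinators at 2 GB; executor memory still follows planner estimates.
SET MEM_LIMIT_COORDINATORS=2g;
-- Do not also set MEM_LIMIT: if both are set, only MEM_LIMIT is applied.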

DataFlow

This release (2.6.1-b92) of Cloudera DataFlow (CDF) on CDP Public Cloud introduces fine-grained access controls over resources within an environment and support for creating and managing reporting tasks through the CLI, along with other improvements and fixes.

Note: This release of CDF supports deployments running NiFi 1.18.0.2.3.7.0-100 or newer. If your DataFlow service has older NiFi versions, you can perform a Change NiFi Version action on each deployment to bring it into compliance, or choose to update to the latest version as part of a DataFlow upgrade.

What’s new

  • Support for granular access controls over resources has been introduced to the DF Service. On the new Projects page, users can define Projects that limit the visibility of Flow Drafts, Deployments, Inbound Connections, and Custom NAR Configurations within an Environment. The Dashboard, Flow Designs, and Workspace Resources pages now have new filters and controls to organize resources into Projects.

  • Added support for creating and deleting reporting tasks in a CDF deployment through the CLI. Listing reporting tasks is also available in the UI under the NiFi Configuration tab in Deployment Manager.

Changes and improvements

  • Third party and base images updated to address CVEs.

  • Improvements for larger CDF clusters of up to 50 nodes, including how Prometheus instances are monitored and vertical scaling of cluster CPU resources.

  • View NiFi UI is now a valid action when a deployment is in a failed-to-upgrade state.

  • Port 80 has been removed from the security group for the load balancer that DFX creates on AWS. It only redirected requests to port 443, so it did not need to be open.

Management Console

This release of the Management Console service introduces the following changes:

Scale an existing Data Lake from single-AZ to multi-AZ

As part of Data Lake scaling via the CDP CLI, you can optionally scale a Data Lake from single-AZ to multi-AZ by adding the --multi-az flag to the Data Lake resize command. This is available via the CDP CLI only. For more information, see Data lake scaling and Scaling the Data Lake through the CDP CLI.
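
A sketch of the invocation; apart from the --multi-az flag, the command and option names are assumptions based on the CDP CLI datalake resize command, so check the linked topics for the exact form.

# Hypothetical example: resize a Data Lake and scale it out to multiple availability zones.
cdp datalake resize \
  --datalake-name my-datalake \
  --target-size MEDIUM_DUTY_HA \
  --multi-az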