November 20, 2023

This release of the Cloudera Data Warehouse (CDW) service on CDP Public Cloud introduces these changes.

Cloudera Data Warehouse Public Cloud 1.8.1-b248 changes are described in more detail below.

Cloudera Data Warehouse Public Cloud Runtime 2023.0.16.0-150 changes:
  • Iceberg
  • Impala

Azure AKS 1.27 upgrade

Cloudera supports Kubernetes version 1.27. In 1.8.1-b248 (released November 20, 2023), when you activate an environment, CDW automatically provisions Azure Kubernetes Service (AKS) 1.27. To upgrade to AKS 1.27 from CDW 1.7.3 or earlier, you must back up and restore CDW. To avoid compatibility issues between CDW and AKS, upgrade to version 1.27.

Using the Azure CLI or Azure portal to upgrade the AKS cluster is not supported. Doing so can render the cluster unusable and cause downtime. For more information about upgrading, see Upgrading an Azure Kubernetes Service cluster for CDW.

New AWS instance type

This release supports the r6id.4xlarge AWS compute instance type. You select this instance type, or another supported one, when you activate your environment in CDW.

Private EKS API Server technical preview

In 1.8.1-b248 (released November 20, 2023), you can establish private connectivity between CDP services running on your cluster and AWS to prevent exposing data on the internet. To set up the Amazon Elastic Kubernetes Service (EKS) cluster in private mode and enable private EKS, run the following Beta CDP CLI command:
cdp dw create-cluster --aws-options "enablePrivateEKS=true" --environment-crn "xyz"

The setup configures awsOptions. In this mode, a private endpoint uses the cluster-proxy (ccmv2) networking for control plane to cluster communication. The FreeIPA security group is authorized in the EKS cluster's ingress rule, which opens the channel of communication from the control plane through the FreeIPA pod.

This feature is a technical preview and not recommended for production deployments. Cloudera recommends that you use this feature in test and development environments only.

Upgraded AWS and Azure environment security

This release upgrades the Istio security required for future Kubernetes compatibility. Hive Virtual Warehouses that you create in 1.8.1-b248 (released November 20, 2023) run Istio 1.19.0. Because the new Istio version supports only new versions of the Hive Helm charts, the following limitation exists: if you have CDW_VERSIONED_DEPLOY enabled, only new Hive image versions appear in the UI when you create a new Hive Virtual Warehouse. For more information, see the known issue about Limited Hive image versions.

Diagnostic bundles for troubleshooting Data Visualization problems

You can now collect a diagnostic bundle for troubleshooting Data Visualization, in addition to the bundles for a Virtual Warehouse, Database Catalog, and the environment/cluster. You can download the diagnostic bundle from the UI for troubleshooting. For more information, see Diagnostic bundles for CDW and Kubernetes.

Resizing a Data Visualization instance

The size of your Data Visualization instance is critical for achieving cost and performance goals. In CDW, you can change the size of a Data Visualization instance after creating it. Open the Data Visualization instance for editing, and in Size, select the size you want. When you click Apply Changes, the new size takes effect.

Reduced permissions mode enhancement and tag change

The dependent tag key envID has been renamed to Cloudera-Resource-Name, which increases the strictness of permission checks. For more information, see Reduced permissions mode template and Minimum set of IAM permissions required for reduced permissions.

dbt adapters for using dbt with Hive, Impala and CDP

You can access the dbt adapters for Hive and Impala from the Cloudera Data Warehouse service. These adapters enable you to use the dbt data management workflow with Cloudera Data Platform. For more information, see Using dbt with Hive, Impala and CDP.

Support for Hive external data sources using data connectors

You can use Hive data connectors to map databases present in external data sources to a local Hive Metastore (HMS). The external data sources can be of different types, such as MySQL, PostgreSQL, Oracle, Redshift, Derby, or other HMS instances. You can create external tables to represent the data, and then query the tables. For more information, see Using Hive data connectors to support external data sources.
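For illustration, the Hive SQL follows this general shape. The following is a minimal sketch assuming a MySQL source; the connector name, JDBC URL, credentials, and database names are hypothetical placeholders, and the exact syntax is covered in the linked documentation:

-- Define a connector that points at the external MySQL instance.
CREATE CONNECTOR mysql_conn
TYPE 'mysql'
URL 'jdbc:mysql://example-host:3306'
COMMENT 'example MySQL connector'
WITH DCPROPERTIES (
  "hive.sql.dbcp.username"="hiveuser",
  "hive.sql.dbcp.password"="hivepass");

-- Map a database from the external source into the local HMS.
CREATE REMOTE DATABASE sales_mirror USING mysql_conn
WITH DBPROPERTIES ("connector.remoteDbName"="sales_db");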

Enhancement of expiring Iceberg snapshots

In this release, you have more flexibility to expire snapshots. In addition to expiring snapshots older than a timestamp, you can now expire snapshots based on the following conditions:
  • A snapshot having a given ID
  • Snapshots having IDs matching a given list of IDs
  • Snapshots within the range of two timestamps

You can keep the snapshots you are likely to need, such as recent snapshots, and expire older ones. For example, you can keep daily snapshots for the last 30 days, weekly snapshots for the past year, and monthly snapshots for the last 10 years. You can also remove specific snapshots to meet GDPR right-to-be-forgotten requirements.
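From Hive, the queries follow this general shape. This is a minimal sketch in which the table name, snapshot IDs, and timestamps are hypothetical placeholders; see the Iceberg documentation for the exact syntax:

-- Expire snapshots older than a timestamp.
ALTER TABLE ice_t EXECUTE EXPIRE_SNAPSHOTS('2023-10-01 00:00:00.000000000');

-- Expire a snapshot having a given ID, or a list of IDs.
ALTER TABLE ice_t EXECUTE EXPIRE_SNAPSHOTS('3088747670581784990');
ALTER TABLE ice_t EXECUTE EXPIRE_SNAPSHOTS('3088747670581784990,6102006055446951904');

-- Expire snapshots within the range of two timestamps.
ALTER TABLE ice_t EXECUTE EXPIRE_SNAPSHOTS BETWEEN '2023-09-01 00:00:00.000000000' AND '2023-10-01 00:00:00.000000000';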

Truncate partition support for Iceberg

This release introduces the capability to truncate an Iceberg table. Truncation removes all rows from the table and creates a new snapshot. It works for both partitioned and unpartitioned tables.
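For example, a minimal sketch from Hive, assuming a hypothetical Iceberg table named ice_t:

-- Remove all rows; a new, empty snapshot is created.
TRUNCATE TABLE ice_t;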

Insert into partition and insert overwrite partition support for Iceberg

From Hive you can insert into, or overwrite data in, Iceberg tables that are statically or dynamically partitioned. For syntax and limitations, see "Insert into/overwrite partition support".
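For example, a minimal sketch assuming a hypothetical Iceberg table ice_t with an identity-partitioned dept column and a staging table staging_sales:

-- Insert into a static partition.
INSERT INTO ice_t PARTITION (dept = 'sales') VALUES ('Alice', 100);

-- Overwrite the data in that partition from a staging table.
INSERT OVERWRITE TABLE ice_t PARTITION (dept = 'sales')
SELECT name, amount FROM staging_sales;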

Iceberg branching and tagging technical preview

From Hive, you can manage the lifecycle of snapshots using the Iceberg branching and Iceberg tagging features. Branches are references to snapshots that have a lifecycle of their own. Tags identify snapshots you need for auditing and conforming to GDPR. Branching and tagging is available as a technical preview. Cloudera recommends that you use this feature in test and development environments. It is not recommended for production deployments.
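For example, the Hive DDL follows this general shape. This is a minimal sketch in which the table, branch, and tag names are hypothetical placeholders:

-- Create a branch and a tag that reference the table's current snapshot.
ALTER TABLE ice_t CREATE BRANCH dev_branch;
ALTER TABLE ice_t CREATE TAG audit_2023;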

Impala performance optimization for VALUES() expressions

Rewriting expressions in the Impala VALUES() clause can affect performance. If Impala evaluates an expression only once, especially a constant expression, the overhead of a rewrite might outweigh its potential benefits. Consequently, in this release, attempts to rewrite expressions within the VALUES() clause have no impact: Impala skips expression rewrites entirely for VALUES during the analysis phase, except rewrites of expressions separated by a double vertical bar (||). These expressions are ultimately evaluated and materialized in the backend instead of the analysis phase.

The drop in performance caused by rewriting expressions might not follow a straightforward linear pattern. This drop becomes more pronounced as the number of columns increases. Using code generation for constant expressions in this context does not provide significant value. As part of this optimization, code generation is turned off for constant expressions within a UNION node if the UNION node is not within a subplan. This applies to all UNION nodes with constant expressions, not just those associated with a VALUES clause.

Here are some example queries for which expression rewrites and code generation are disabled for the UNION operator and VALUES clause.

select 1+2+3+4+5,2*1-1,3*3 union all select 1+2+3+4,5,6 union all select 7+1-2,8+1+1,9-1-1-1;
insert into test_values_codegen values
  (1+1, '2015-04-09 14:07:46.580465000', base64encode('hello world')),
  (CAST(1*2+2-5 as INT), CAST(1428421382 as timestamp),
   regexp_extract('abcdef123ghi456jkl','.*?(\\d+)',0));

Impala skips scheduling bloom filter from full-build scan

PK-FK joins between a dimension table and a fact table are common in queries. Such joins often do not involve any predicate filters on the dimension table. As a result, a bloom filter generated from this kind of dimension table scan (PK) will most likely contain all values from the fact table column (FK). Generating this filter is ineffective because it is unlikely to reject any rows, especially if the bloom filter is large and has a high false positive probability (FPP) estimate.

As part of this optimization, Impala skips scheduling a bloom filter from a join node that has these characteristics.

Impala events occurring prior to a manual REFRESH are marked as 'Skip'

If a table has been manually refreshed, the event processor skips any events that occurred before the manual refresh. This optimization helps catalogd when it lags behind in processing events: event processing now determines whether a manual refresh was executed after an event's eventTime, which helps catalogd catch up with HMS events quickly. To activate this optimization, you must set enable_skipping_older_events to true.

Allow setting separate mem_limit for coordinators

The existing mem_limit query option applies to all Impala coordinators and executors, so the same amount of memory is reserved on each. However, coordinators typically only coordinate the query and may not require all the estimated memory. Reserving the full estimated memory on coordinators reduces the memory available to other queries.

The new MEM_LIMIT_COORDINATORS query option functions similarly to the MEM_LIMIT option but sets the query memory limit only on coordinators. This option addresses the issue with MEM_LIMIT and is recommended in scenarios where a query needs more or less memory on coordinators than the planner estimates.
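For example, you might set the option per session from a SQL client; a minimal sketch in which the 2gb value is an arbitrary placeholder:

-- Cap query memory on coordinators only; executor limits are unchanged.
SET MEM_LIMIT_COORDINATORS=2gb;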

Hue supports natural language query processing (Preview)

Hue leverages the power of Large Language Models (LLMs) to help you generate SQL queries from natural language prompts. It also provides options to optimize, explain, and fix queries, ensuring efficiency and accuracy in data retrieval and manipulation. You can use several AI services and models, such as OpenAI's GPT service, Amazon Bedrock, and the Azure OpenAI service, to run the Hue SQL AI assistant. See SQL AI Assistant in Data Warehouse Public Cloud.

Ability to deploy Hue at the environment level (Preview)

Previously, you could run Hue only at the Virtual Warehouse level. As a result, you would lose all query history when you deleted or shut down a Virtual Warehouse. CDW now allows you to deploy Hue at the environment level and select a Virtual Warehouse from the Hue web interface. Query history and saved queries are retained as long as the environment is active. See Deploying Hue at Environment Level in Data Warehouse Public Cloud.