November 20, 2023
This release of the Cloudera Data Warehouse (CDW) service on CDP Public Cloud introduces these changes.
Cloudera Data Warehouse Public Cloud 1.8.1-b248 changes, described in more detail below:
- Azure AKS 1.27 upgrade
- New AWS instance type
- Upgraded AWS and Azure environment security
- Diagnostic bundles for troubleshooting Data Visualization problems
- Resizing a Data Visualization instance
- Reduced permissions mode enhancement and tag change
- dbt adapters for using dbt with Hive, Impala and CDP
Cloudera Data Warehouse Public Cloud Runtime 2023.0.16.0-150 changes:
Iceberg
- Enhancement of expiring Iceberg snapshots
- Truncate partition support for Iceberg
- Insert into partition and insert overwrite partition support for Iceberg
- Iceberg branching and tagging technical preview
Impala
- Impala performance optimization for VALUES() expressions
- Impala skips scheduling bloom filter from full-build scan
- Impala events marked as 'Skip' occur prior to a manual REFRESH
- Allow setting separate mem_limit for coordinators
Azure AKS 1.27 upgrade
Cloudera now supports Kubernetes version 1.27. In 1.8.1-b248 (released November 20, 2023), when you activate an environment, CDW automatically provisions Azure Kubernetes Service (AKS) 1.27. To upgrade to AKS 1.27 from version 1.7.3 or earlier, you must back up and restore CDW. To avoid compatibility issues between CDW and AKS, upgrade to version 1.27.
Using the Azure CLI or Azure portal to upgrade the AKS cluster is not supported. Doing so can cause the cluster to become unusable and can cause downtime. For more information about upgrading, see Upgrading an Azure Kubernetes Service cluster for CDW.
New AWS instance type
This release supports the r6id.4xlarge AWS compute instance type. You select this instance type, or another supported one, when you activate your environment in CDW.
Private EKS API Server technical preview
You can enable this mode when creating a cluster with the CDP CLI:
cdp dw create-cluster --aws-options "enablePrivateEKS=true" --environment-crn "xyz"
This setup configures awsOptions. In this mode, a private endpoint uses the cluster-proxy (CCMv2) networking for control plane-to-cluster communication. The FreeIPA security group is authorized in the EKS cluster's ingress rule, which opens the channel of communication from the control plane through the FreeIPA pod.
This feature is a technical preview and not recommended for production deployments. Cloudera recommends that you use this feature in test and development environments only.
Upgraded AWS and Azure environment security
This release upgrades Istio security required for future Kubernetes compatibility. Hive Virtual Warehouses you create in 1.8.1-b248 (released Nov 20, 2023) will run Istio 1.19.0. Because the new Istio version supports only new versions of Hive helm charts, the following limitation exists: if you have the CDW_VERSIONED_DEPLOY entitlement, only new Hive image versions appear in the UI when you create a new Hive Virtual Warehouse. For more information, see the known issue about Limited Hive image versions.
Diagnostic bundles for troubleshooting Data Visualization problems
You can collect a diagnostic bundle for troubleshooting Data Visualization, as well as Virtual Warehouse, Database Catalog, and environment/cluster. The diagnostic bundle is available for downloading and troubleshooting from the UI. For more information, see Diagnostic bundles for CDW and Kubernetes.
Resizing a Data Visualization instance
The size of your Data Visualization instance is critical for achieving cost and performance goals. In CDW, after creating a Data Visualization instance, you can change its size. Open the Data Visualization instance for editing, and in Size, select the size you want. When you click Apply Changes, the new size takes effect.
Reduced permissions mode enhancement and tag change
The name of the dependent tag key envID has changed to Cloudera-Resource-Name, which increases strictness. For more information, see Reduced permissions mode template and Minimum set of IAM permissions required for reduced permissions.
dbt adapters for using dbt with Hive, Impala and CDP
You can access the dbt adapters for Hive and Impala from the Cloudera Data Warehouse service, which enable you to use the dbt data management workflow with Cloudera Data Platform. For more information, see Using dbt with Hive, Impala and CDP.
Support for Hive external data sources using data connectors
You can use Hive data connectors to map databases present in external data sources to a local Hive Metastore (HMS). The external data sources can be of different types, such as MySQL, PostgreSQL, Oracle, Redshift, Derby, or other HMS instances. You can create external tables to represent the data, and then query the tables. For more information, see Using Hive data connectors to support external data sources.
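As a minimal sketch of the workflow, assuming an external MySQL source (the connector name, host, database names, and credentials below are all illustrative):

```sql
-- Create a connector to an external MySQL instance
CREATE CONNECTOR mysql_conn
TYPE 'mysql'
URL 'jdbc:mysql://example-host:3306'
COMMENT 'connector to an external MySQL source'
WITH DCPROPERTIES (
  "hive.sql.dbcp.username"="hiveuser",
  "hive.sql.dbcp.password"="hivepassword");

-- Map a database from the external source to a local HMS database
CREATE REMOTE DATABASE mysql_sales USING mysql_conn
WITH DBPROPERTIES ("connector.remoteDbName"="sales");

-- Tables in the mapped database can then be queried as usual, for example:
-- SELECT * FROM mysql_sales.orders LIMIT 10;
```

See the linked documentation for the connector types and properties supported in this release.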
Enhancement of expiring Iceberg snapshots
This release adds the capability to expire the following Iceberg snapshots:
- A snapshot having a given ID
- Snapshots having IDs matching a given list of IDs
- Snapshots within the range of two timestamps
You can keep snapshots you are likely to need, for example recent snapshots, and expire old snapshots. For example, you can keep daily snapshots for the last 30 days, then weekly snapshots for the past year, then monthly snapshots for the last 10 years. You can remove specific snapshots to meet the GDPR right to be forgotten requirements.
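For illustration, Hive expresses snapshot expiry through ALTER TABLE ... EXECUTE EXPIRE_SNAPSHOTS; the table name and timestamps below are hypothetical, and the linked documentation covers the complete syntax for the new criteria:

```sql
-- Expire all snapshots older than the given timestamp
ALTER TABLE ice_t EXECUTE EXPIRE_SNAPSHOTS('2023-10-01 00:00:00');

-- Expire snapshots within the range of two timestamps
ALTER TABLE ice_t EXECUTE EXPIRE_SNAPSHOTS
  BETWEEN '2023-01-01 00:00:00' AND '2023-06-30 00:00:00';
```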
Truncate partition support for Iceberg
This release introduces the capability to truncate an Iceberg table. Truncation removes all rows from the table. A new snapshot is created. Truncation works for partitioned and unpartitioned tables.
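For example (the table and partition names are illustrative):

```sql
-- Remove all rows from the table; a new snapshot is created
TRUNCATE TABLE ice_t;

-- Truncate a single partition of a partitioned Iceberg table
TRUNCATE TABLE ice_t PARTITION (dept = 'hr');
```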
Insert into partition and insert overwrite partition support for Iceberg
From Hive you can insert into, or overwrite data in, Iceberg tables that are statically or dynamically partitioned. For syntax and limitations, see "Insert into/overwrite partition support".
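A short sketch of the supported forms, with hypothetical table and column names:

```sql
-- Insert into a static partition
INSERT INTO TABLE ice_t PARTITION (dept = 'hr') VALUES (1, 'ana');

-- Overwrite a static partition from a staging table
INSERT OVERWRITE TABLE ice_t PARTITION (dept = 'hr')
  SELECT id, name FROM staging WHERE dept = 'hr';

-- Dynamic partitioning: partition values are taken from the query
INSERT INTO TABLE ice_t PARTITION (dept)
  SELECT id, name, dept FROM staging;
```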
Iceberg branching and tagging technical preview
From Hive, you can manage the lifecycle of snapshots using the Iceberg branching and Iceberg tagging features. Branches are references to snapshots that have a lifecycle of their own. Tags identify snapshots you need for auditing and conforming to GDPR. Branching and tagging is available as a technical preview. Cloudera recommends that you use this feature in test and development environments. It is not recommended for production deployments.
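As an illustrative sketch of the Hive syntax (branch, tag, and table names are hypothetical; see the documentation for retention options):

```sql
-- Create a branch and a tag referencing the current snapshot
ALTER TABLE ice_t CREATE BRANCH dev_branch;
ALTER TABLE ice_t CREATE TAG eoy_2023;

-- Drop a branch when it is no longer needed
ALTER TABLE ice_t DROP BRANCH IF EXISTS dev_branch;
```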
Impala performance optimization for VALUES() expressions
Rewriting expressions in the Impala VALUES() clause can affect performance. If Impala evaluates an expression only once, especially a constant expression, the overhead might outweigh the potential benefits of a rewrite. Consequently, in this release, attempts to rewrite expressions within the VALUES() clause have no impact. Impala skips expression rewrites entirely for VALUES during the analysis phase, except rewrites of expressions separated by the concatenation operator (||). These expressions are ultimately evaluated and materialized in the backend instead of the analysis phase.
The drop in performance caused by rewriting expressions might not follow a straightforward linear pattern. This drop becomes more pronounced as the number of columns increases. Using code generation for constant expressions in this context does not provide significant value. As part of this optimization, code generation is turned off for constant expressions within a UNION node if the UNION node is not within a subplan. This applies to all UNION nodes with constant expressions, not just those associated with a VALUES clause.
Here are some example queries for which expression rewrites and code generation are disabled for the UNION operator and VALUES clause:
select 1+2+3+4+5,2*1-1,3*3 union all select 1+2+3+4,5,6 union all select 7+1-2,8+1+1,9-1-1-1;
insert into test_values_codegen values
(1+1, '2015-04-09 14:07:46.580465000', base64encode('hello world')),
(CAST(1*2+2-5 as INT), CAST(1428421382 as timestamp),
regexp_extract('abcdef123ghi456jkl','.*?(\\d+)',0));
Impala skips scheduling bloom filter from full-build scan
PK-FK joins between a dimension table and a fact table are common occurrences in a query. Such joins often do not involve any predicate filters in the dimension table. As a result, a bloom filter generated from this kind of dimension table scan (PK) will most likely contain all values from the fact table column (FK). It becomes ineffective to generate this filter because it is unlikely to reject any rows, especially if the bloom filter size is large and has a high false positive probability (FPP) estimate.
As part of this optimization, Impala skips scheduling a bloom filter from a join node that has these characteristics.
Impala events marked as 'Skip' occur prior to a manual REFRESH
If a table has been manually refreshed, the event processor skips any events occurring prior to the manual refresh. This optimization helps catalogd when it lags behind in processing events. Event processing now determines whether any manual refresh was executed after an event's eventTime, which in turn helps catalogd catch up swiftly with the HMS events. To activate this optimization, you must set enable_skipping_older_events to true.
Allow setting separate mem_limit for coordinators
The current MEM_LIMIT query option applies to all Impala coordinators and executors, so the same amount of memory gets reserved on each. Coordinators, however, typically only handle the task of coordinating the query and may not require all of the estimated memory; reserving it on coordinators reduces the memory available to other queries. The new MEM_LIMIT_COORDINATORS query option functions similarly to the MEM_LIMIT option but sets the query memory limit only on coordinators. It addresses this issue and is recommended in scenarios where a query needs higher or lower memory on coordinators than the planner estimates.
Hue supports natural language query processing (Preview)
Hue leverages large language models (LLMs) to help you generate SQL queries from natural language prompts. It also provides options to optimize, explain, and fix queries, ensuring efficiency and accuracy in data retrieval and manipulation. You can use several AI services and models, such as OpenAI's GPT service, Amazon Bedrock, and Azure's OpenAI service, to run the Hue SQL AI assistant. See SQL AI Assistant in Data Warehouse Public Cloud.
Ability to deploy Hue at the environment level (Preview)
Previously, you could run Hue only at the Virtual Warehouse level, so you lost all query history when you deleted or shut down a Virtual Warehouse. CDW now allows you to deploy Hue at the environment level and to select a Virtual Warehouse from the Hue web interface. Query history and saved queries are retained as long as the environment is active. See Deploying Hue at Environment Level in Data Warehouse Public Cloud.