February 29, 2024

This release of the Cloudera Data Warehouse (CDW) service on CDP Public Cloud introduces these changes.

Cloudera Data Warehouse Runtime 2024.0.17.0-73 includes changes in the following areas, described in more detail below:

  • Iceberg
  • Impala

AWS Elastic Kubernetes Service 1.28 support

New CDW clusters using AWS environments that you activate in this release of CDW, 1.8.4-TBD (released February 2024), will use Amazon Elastic Kubernetes Service (EKS) version 1.28. For more information, see Upgrading Amazon Kubernetes Service.

Accessing S3 buckets in an AWS environment

You can use the CDW UI to configure access to S3 buckets under certain conditions. You can also configure your own custom encryption key for read/write access from CDW Public Cloud on AWS to the external S3 bucket.

For more information, see Accessing S3 buckets.

Database Catalog size configuration

Using the following CDP CLI commands, you can configure the Java heap size for your Database Catalog as small (8 GB, the default), medium (16 GB), or large (24 GB):

Create a default Database Catalog that configures, for example, a large Java heap size:

cdp dw create-dbc --cluster-id=env-sb42vs --name=SSlarge --memory-tshirt-size=large

Create a Database Catalog that accepts the default Java heap size (small) by not configuring the size:

cdp dw create-dbc --cluster-id=env-sb42vs --name=SSdefault

List Database Catalog configuration information:

cdp dw list-dbcs --cluster-id=env-sb42vs

Using the CDW UI as described in Creating a Database Catalog, you can configure the Database Catalog size.

To avoid unnecessary cloud expenses, do not increase the size unless you experience Java heap issues.

General availability of a private CDW environment in Azure Kubernetes Service

You can now enable a private CDW environment in Azure Kubernetes Service (AKS). When you use the Private CDW option in the CLI, CDW deploys an AKS cluster with only private endpoints enabled, so the cluster can be accessed only from within your Azure network.

IAM policies moved from documentation to a public GitHub repository

To meet customer requests for tracking changes to IAM policies for CDW, the policies now reside in GitHub. CDW documentation, such as "Attaching the policy to your cross-account role", provides links to the policies.

Metrics for monitoring and troubleshooting Impala Virtual Warehouses

Additional global metrics and table level event metrics are available for debugging an Impala Virtual Warehouse.

CDP CLI commands for creating and updating AWS and Azure clusters

You can create and update a CDW cluster in AWS and Azure environments using the following commands:

  • create-aws-cluster --environment-crn <value> [options]
  • create-azure-cluster --environment-crn <value> --user-assigned-managed-identity <value> [options]
  • update-aws-cluster [options]
  • update-azure-cluster [options]

For more information about using the commands, including options, see the CDP CLI reference.
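
For instance, a minimal create request might look like the following, with placeholder values and all optional settings omitted (the commands belong to the cdp dw command group):

cdp dw create-aws-cluster --environment-crn <environment-crn>

cdp dw create-azure-cluster --environment-crn <environment-crn> --user-assigned-managed-identity <managed-identity-resource-id>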

CDP CLI commands for configuring Workload Aware Auto-Scaling

You can perform the following Workload Aware Auto-Scaling (WAAS) configurations as part of a Virtual Warehouse create request:
  • Create a Virtual Warehouse with executor group sets
  • Choose how many executor group sets to configure
  • Configure each executor group set

You can also update the executor group sets of a Virtual Warehouse: modify, add, or delete an executor group set as part of a Virtual Warehouse update request. To delete a group set, set the deleteGroupSet option to true for the group set.

For example:
cdp --profile ${CDP_PROFILE} \
  dw create-vw \
   --cluster-id ${CLUSTER_ID} \
   --dbc-id ${DBC_ID} \
   --vw-type impala \
   --name "impala-$(openssl rand -hex 6)" \
   --template xsmall \
   --impala-ha-settings highAvailabilityMode=ACTIVE_ACTIVE \
   --autoscaling "impalaExecutorGroupSets={small={execGroupSize=1,minExecutorGroups=1,maxExecutorGroups=1,autoSuspendTimeoutSeconds=301,disableAutoSuspend=true,triggerScaleUpDelay=21,triggerScaleDownDelay=21},custom1={execGroupSize=2,minExecutorGroups=0,maxExecutorGroups=1,autoSuspendTimeoutSeconds=302,disableAutoSuspend=true,triggerScaleUpDelay=22,triggerScaleDownDelay=22},custom2={execGroupSize=3,minExecutorGroups=0,maxExecutorGroups=1,autoSuspendTimeoutSeconds=303,disableAutoSuspend=true,triggerScaleUpDelay=23,triggerScaleDownDelay=23},custom3={execGroupSize=4,minExecutorGroups=0,maxExecutorGroups=1,autoSuspendTimeoutSeconds=304,disableAutoSuspend=true,triggerScaleUpDelay=24,triggerScaleDownDelay=24},large={execGroupSize=5,minExecutorGroups=0,maxExecutorGroups=1,autoSuspendTimeoutSeconds=305,disableAutoSuspend=true,triggerScaleUpDelay=25,triggerScaleDownDelay=25}}"

Impala and Hive support for Iceberg equality deletes

Cloudera supports row-level deletes, and in this release you can read equality deletes from Impala as well as Hive. For more information, see the Delete data feature.

Iceberg drop partition feature

You can easily remove a partition from an Iceberg table using an ALTER TABLE statement from Impala. Removing a partition does not affect the table schema; the partition column is not removed from the schema. This feature is offered on a general availability (GA) basis.
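
For illustration, a minimal sketch from Impala, assuming a hypothetical Iceberg table ice_sales with identity partitioning on a sale_date column:

-- Remove the partition whose sale_date value is 2024-01-01;
-- the sale_date column itself remains in the table schema.
ALTER TABLE ice_sales DROP PARTITION (sale_date = '2024-01-01');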

Iceberg branching and tagging GA

In this release, branching and tagging are offered on a general availability (GA) basis. From Hive, you can manage the lifecycle of snapshots using the Iceberg branching and tagging features. Branches are references to snapshots that have a lifecycle of their own. Tags identify snapshots that you need for auditing and for conforming to GDPR. Cloudera recommends that you use this feature in test and development environments. It is not recommended for production deployments.
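
As an illustration, a minimal sketch from Hive, assuming a hypothetical Iceberg table named ice_orders:

-- Create a branch and a tag that reference the current snapshot.
ALTER TABLE ice_orders CREATE BRANCH test_branch;
ALTER TABLE ice_orders CREATE TAG eoy_2023;
-- Drop the branch when it is no longer needed.
ALTER TABLE ice_orders DROP BRANCH IF EXISTS test_branch;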

Enhanced security of Iceberg metadata

In this release, you can prevent overrides of the Iceberg file metadata location by unauthorized users. To use this feature, accept the default (true) for the following property in the Virtual Warehouse:
  • Hive

    Configure the property hive.iceberg.allow.datafiles.in.table.location.only.

  • Impala

    Configure the catalogd property iceberg_restrict_data_file_location.

When set to true, all the data files being read must be within the table location; otherwise, an error occurs. When set to false, an unauthorized party who knows the underlying schema and file locations outside the table location can rewrite the manifest files within one table location to point to data files in another table location, and thereby read your data. For more information, see "Changing the table metadata location".
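
For reference, a minimal sketch of the defaults using the property names above; how you apply them depends on your Virtual Warehouse configuration workflow:

hive.iceberg.allow.datafiles.in.table.location.only=true
iceberg_restrict_data_file_location=true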

Impala support for changing the Iceberg table metadata location

In this release, you can change the Iceberg table metadata location from Impala as well as Hive.
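
As an illustration, a minimal sketch from Impala, assuming a hypothetical Iceberg table ice_orders and a hypothetical metadata file path; metadata_location is the Iceberg table property that points to the current metadata file:

-- Point the table at a different metadata JSON file.
ALTER TABLE ice_orders SET TBLPROPERTIES (
  'metadata_location'='s3a://example-bucket/warehouse/ice_orders/metadata/v2.metadata.json'
);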

Support for copy-on-write (COW)

Hive supports copy-on-write (COW) as well as merge-on-read (MOR) modes for handling Iceberg row-level updates and deletes. You configure COW or MOR based on your use case and the rate of data change.
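
For example, a minimal sketch assuming a hypothetical Iceberg table ice_orders; write.delete.mode, write.update.mode, and write.merge.mode are standard Iceberg table properties:

-- Configure the table to use copy-on-write for deletes, updates, and merges.
ALTER TABLE ice_orders SET TBLPROPERTIES (
  'write.delete.mode'='copy-on-write',
  'write.update.mode'='copy-on-write',
  'write.merge.mode'='copy-on-write'
);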

Query metadata tables feature

From Hive and Impala, you can query Iceberg metadata tables as you would query a Hive table. For example, you can use projections, joins, filters, and so on.
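
For instance, a minimal sketch assuming a hypothetical Iceberg table default.ice_orders; snapshots is one of the Iceberg metadata tables:

-- List the table's snapshots, most recent first.
SELECT snapshot_id, committed_at, operation
FROM default.ice_orders.snapshots
ORDER BY committed_at DESC;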

Directed distribution mode

This release implements directed distribution mode. The scheduler collects information about which Iceberg data file is scheduled on which host. Because the scan node for the data files is on the same host as the Iceberg join node, delete files are sent directly to that specific host. This mode can improve V2 table read performance.

SHOW VIEWS statement

This release introduces the SHOW VIEWS statement, which simplifies the task of listing all views within a specified schema or database. Using this command, you can quickly identify and review views, thereby enhancing performance by reducing metadata scan operations.
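
For example, the database name and pattern below are hypothetical:

-- List all views in the current database.
SHOW VIEWS;
-- List views in a specific database that match a pattern.
SHOW VIEWS IN sales_db LIKE '*daily*';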

Planner changes to improve cardinality estimation

Significant changes have been made to the query planner to improve cardinality estimation, a critical component of workload-aware autoscaling.

In previous versions, Impala would generate a plan first and then search for runtime filters based on the entire plan. In this release, selective runtime filters have been integrated. These filters aim to reduce the cardinality estimates of scan nodes and specific join nodes located above them. This refinement occurs after the generation of runtime filters and before the computation of resource requirements.

Distributed runtime filter aggregation

Aggregating runtime filters during runtime can impose significant memory overhead on the coordinator. To address this issue, we initially introduced local aggregation of runtime filters within a single executor node, aiming to alleviate strain on the coordinator by transmitting filter updates only after local aggregation. However, as Impala clusters scale up, the limitations of local filter aggregation become evident, especially in scenarios involving numerous nodes, which still places significant memory stress on the coordinator. To mitigate this challenge, we have implemented a solution that distributes runtime filter aggregation across specific Impala backends.

Improvement in catalog observability

This release introduces significant enhancements to the Impala Catalog Web UI, focusing on addressing performance issues associated with delays in processing Hive Metastore (HMS) events. These improvements aim to mitigate the risk of queries using outdated metadata.

Caching codegen functions

In Impala, "codegen" involves generating specialized machine code for each query based on query-specific information. When executing a standard query, the query optimizer generates an optimized query plan, which is then passed to the executor for processing. The codegen capability converts query-specific information into machine code, enhancing query performance through faster execution. In this release, generated codegen functions can be cached and reused, avoiding the cost of regenerating the same machine code for repeated queries.

Support ORDER BY for collections of variable length types in SELECT list

This release introduces support for collections of variable length types in the sorting tuple. While it's now possible to include these collection columns in the SELECT list alongside other columns used for sorting, direct sorting by these collection columns is not supported. Additionally, collections of variable-length types can now serve as non-passthrough children of UNION ALL nodes.

It is important to note that structs containing collections, whether of variable or fixed length, are still not supported in the SELECT list for ORDER BY queries.

Here are examples of supported queries:

select id, arr_string_1d from collection_tbl order by id;
select id, map_1d from collection_tbl order by id;     

However, queries such as the following are not supported:

select id, struct_contains_map from collection_struct_mix order by id;

Attempting to execute such queries will result in the error message "AnalysisException: Sorting is not supported if the select list contains collection(s) nested in struct(s)."