August 30, 2023

This release of the Cloudera Data Warehouse (CDW) service on CDP Public Cloud introduces these changes.

Cloudera Data Warehouse Public Cloud 1.7.1-b755 and Cloudera Data Warehouse Public Cloud Runtime 2023.0.15.0-243 changes are described in more detail below.
Amazon Machine Image updates

This release supports dynamically updating the Amazon Machine Image (AMI) to prevent potential problems running workloads on an old AMI. You can update the AMI of the CloudFormation stack while keeping the current Elastic Kubernetes Service (EKS) version.

AWS Elastic Kubernetes Service 1.26 support

New CDW clusters using AWS environments you activate in this release, 1.7.1-b755 (released August 30, 2023), of Cloudera Data Warehouse use Amazon Elastic Kubernetes Service (EKS) version 1.26. For more information, see Upgrading Amazon Kubernetes Service.

Amazon EKS upgrades to 1.23 and from 1.23 to 1.24

This release adds reduced permissions mode support for Amazon Elastic Kubernetes Service upgrades to EKS 1.23 and from EKS 1.23 to EKS 1.24.

AWS restricted policy updates

The AWS restricted policy has been updated to conform to AWS policy size requirements. The update divides the policy into two parts, as described in the documentation.

Automatic backup and restoration of Hue

The CDW backup and restore processes for Hue are now automated. Manually backing up and restoring Hue remains available but is optional.

Correcting the Virtual Warehouse Size

After creating an Impala Virtual Warehouse, you can tune, or correct, the T-shirt size of the executor groups that drive the Impala Virtual Warehouse. The size of the executor groups is critical for achieving cost and performance goals.

Deprecation of DAS

Hue now replaces Data Analytics Studio (DAS). DAS has been deprecated and is no longer available in CDW Public Cloud. DAS features for Hive and Tez, such as running queries, defining HPL/SQL, the Job Browser, the query explorer, query comparison, and more, have been migrated to Hue and the Hue Query Processor. After you upgrade to this release, the option to launch DAS from your Virtual Warehouse is no longer shown. Cloudera recommends using Hue for all use cases where you previously used DAS.

Instance types available for compute nodes

In this release, when you activate an AWS environment, you can select the compute instance type you want to use. This release also adds instance types you can select when you activate an Azure environment.

Java tool options now configurable with limitations

After creating an Impala Virtual Warehouse, you can change the -Xmx Java tool option, which sets the maximum JVM heap size, within limits.

Managing high partition loads

In this release, you can identify an error related to high partition workloads and tune your Hive Virtual Warehouse to run successfully.

Private Cluster control removed from UI

Enable Private Cluster has been removed from the environment activation dialog. Use the CDP CLI for advanced configurations.

Query results cache support

Unified Analytics now supports caching of Hive and Impala query results. Caching the results of repetitive queries can reduce the load on the warehouse.
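
A minimal sketch of using the cache, assuming the standard Hive property hive.query.results.cache.enabled; the table name is hypothetical:

set hive.query.results.cache.enabled=true;
-- a repeated, identical query can be answered from the results cache
select count(*) from web_logs;
select count(*) from web_logs;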

Histogram statistics

In this release, when you generate column statistics in a Hive Virtual Warehouse in Unified Analytics mode, you can create histogram statistics on columns. By default, histogram statistics are not created. You enable generation of histogram statistics by setting a Hive property:
set hive.stats.kll.enable=true;

You can then run the ANALYZE command as usual:
ANALYZE TABLE [table_name] COMPUTE STATISTICS for COLUMNS [comma_separated_column_list];

Histogram statistics are supported for numeric, date, timestamp, and boolean types, but not for string/varchar/char columns. Histograms are used to estimate the selectivity of range predicates (predicates involving <, <=, >, >=, and BETWEEN). Better selectivity estimates allow the optimizer to generate more optimal query plans and improve performance for such queries.

Hive to Iceberg table migration from Impala

In this release, you can use Impala, as well as Hive, to migrate a Hive table to an Iceberg table using the ALTER TABLE statement. The syntax is described in Migrate Hive table to Iceberg feature, and a step-by-step procedure is covered in Migrating a Hive table to Iceberg.
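
For example, a migration might look like the following (the table name is hypothetical, and this sketch assumes the Impala form of the statement; check the linked topics for the exact syntax in your release):

alter table legacy_sales convert to iceberg;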

Iceberg position delete feature support

In this release, Impala, in addition to Hive, can delete rows from Iceberg V2 tables using position delete files, a format defined by the Iceberg Spec. A position delete query evaluates rows from one table against a WHERE clause and deletes all the rows that match the WHERE conditions.
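
For example, a position delete is issued as an ordinary DELETE statement (the table name and predicate are hypothetical):

delete from ice_orders where status = 'CANCELLED';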

Impala support for complex types

Complex types are now supported in the SELECT list. Although collections and structs were previously supported, nesting and mixing of complex types was not. For more information, including limitations, see "Allowing embedding complex types into other complex types" in Complex types.
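
For example, a query can now return a struct that itself contains a collection. The table and column names below are hypothetical, with struct_with_arr assumed to be declared as STRUCT<f1:INT, arr:ARRAY<INT>>:

select id, struct_with_arr from complextypes_nested;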

Support ORDER BY for collections of fixed length types in SELECT list

This release supports collections of fixed-length types in the sorting tuple. You cannot sort by these collection columns themselves, but they can appear in the SELECT list along with the other columns by which you sort.
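
For example, using the complextypestbl table shown elsewhere in these notes, the collection column can appear in the SELECT list while the sort is on a scalar column:

select id, int_array from complextypestbl order by id;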

Support collections of fixed length types as non-passthrough children of unions

This release adds support for collections of fixed length types as non-passthrough children of unions. Plain UNIONs are not yet supported for any collections, but UNION ALL operations are supported, as in the following example:


select id, int_array from complextypestbl
union all select cast(id as tinyint), int_array from complextypestbl

Allow implicit casts between numeric and string types when inserting into table

Previously, inserts required explicit casts between numeric and string-based literals. This release relaxes the implicit casting rules for these cases, controlled through the query option allow_unsafe_casts, which is off by default. When enabled, this option allows implicit casting between some numeric types and string types.
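
A minimal sketch of enabling the option (the table name is hypothetical and is assumed to have a numeric column):

set allow_unsafe_casts=true;
-- the string literal is implicitly cast to the numeric column type
insert into metrics_tbl values ('42');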

Improved memory estimation for aggregates and pre-aggregates

This release introduces new query options to improve memory estimation for aggregation nodes. It also introduces better cardinality estimates that help cap memory limits early during query planning.

Improved CPU costing

This release introduces changes to the query planner to improve parallel sizing and resource estimation. These changes support workload-aware autoscaling and are available as query options for tuning purposes. This new functionality allows more customers to enable multi-threaded queries globally for improved performance.

Added processing cost to control scan parallelism

Before this release, when a user executed a query with COMPUTE_PROCESSING_COST=1, Impala relied on the MT_DOP option to decide the degree of parallelism of the scan fragment. This release introduces the scan node's processing cost as another factor that can raise scan parallelism beyond MT_DOP.

Scan node cost now includes the number of effective scan ranges. Each scan range is given a weight of (0.5% * min_processing_per_thread), which roughly means that one scan node instance can handle at most 200 scan ranges. This release also introduces a new query option MAX_FRAGMENT_INSTANCES_PER_NODE to cap the maximum number of fragment instances per node. This newly introduced query option works in conjunction with PROCESSING_COST_MIN_THREADS.
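
For example, the options named above can be set per session before running a query (the values shown are illustrative only):

set COMPUTE_PROCESSING_COST=1;
set MAX_FRAGMENT_INSTANCES_PER_NODE=16;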

Impala WebUI improvements

This release enhances the Impala daemon’s Web UI to display the following additional details:

  • Backends start time and version: In a large cluster, you can now use the Impala daemon’s Web UI to view the start time and version for all the backends.

  • Query performance characteristics: For a detailed report on how a query was executed and to understand the detailed performance characteristics of a query, you can use the built-in web server’s UI and look at the timeline shown in the Gantt chart. This chart is an alternative to the PROFILE command and is a graphical display in the WebUI that renders timing information and dependencies.

  • Export query plan and timeline: To understand the detailed performance characteristics of a query, you issue the PROFILE command in impala-shell immediately after executing it. As an alternative to the profile download page, this release adds support for exporting the graphical query plan and for downloading the timeline in SVG/HTML format. Once you export the query plan or the timeline, the memory consumed by the ObjectURLs is released.

  • Historical/in-flight query performance: You can now use the query list and query details page to analyze historical or in-flight query performance by viewing the memory consumed, the amount of data read, and other information about the query.

  • Aggregate CPU node utilization: You can now see the recent aggregate CPU node utilization samples for the different nodes.

  • Scaling of timeticks and fragment timing diagram for better accessibility: You can now use the query timeline display to scroll horizontally through the fragment timing diagram and utilization chart. You can also zoom by horizontally scaling through mouse wheel events in addition to increasing/decreasing the precision of timetick values.

Skip reloading file metadata for some ALTER_TABLE events

Before this release, EventProcessor ignored trivial ALTER_TABLE events that only modify tblproperties such as "transient_lastDdlTime," "totalSize," "numFilesErasureCoded," and "numFiles". For other non-rename ALTER_TABLE events, it triggered a full refresh of the table, which is expensive for tables with a large number of partitions or files.

From this release, to be more cost-efficient, the event processor skips reloading file metadata for some ALTER_TABLE events.

The following list contains the events that skip reloading file metadata:

  • changing table comment

  • adding/dropping columns

  • changing column definition (name/type/comment)

  • changing ownership

  • setting customized tblproperties

For interoperability purposes, this release introduces a new start-up flag, 'file_metadata_reload_properties', to list the table properties that require file metadata to be reloaded when they change.
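
A sketch of the flag; the comma-separated format and the property names shown here are assumptions for illustration only:

--file_metadata_reload_properties=serialization.format,external.table.purge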

JWT auth for Impala

Impala clients, such as Impala shell, can now authenticate to Impala using a JWT instead of a username/password. To connect to Impala using JWT authentication, specify the JWT command-line options to the impala-shell command interpreter and supply the JWT when prompted.
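
A hypothetical invocation (the host name is a placeholder, and the exact flag names are an assumption that can vary by impala-shell version; -j is assumed to request JWT authentication):

impala-shell --protocol='hs2-http' --ssl -i coordinator.example.com:443 -j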

Native High Availability (HA) for Impala Catalog Service

The High Availability (HA) mode of the catalog service in CDW reduces the outage duration of the Impala cluster when the primary catalog service fails. Before this release, catalog HA was supported using the Kubernetes leader-election mechanism; it is now natively supported in Impala.

Codegen for STRUCT type

Codegen uses query-specific information to generate specialized machine code for each query. As an Impala user, when you run a standard query, the query optimizer generates an optimized query plan and passes it to the executor for processing. With the codegen capability for STRUCT type in the SELECT list, the query specific information is converted to machine code for faster execution.

Before this release, structs in the select list were only supported with codegen turned off. This release lifts this restriction, adding full codegen support for structs in the select list, as in the following query:


select small_struct from complextypes_structs