Release Notes

August 30, 2023

This release of the Cloudera Data Warehouse (CDW) service on CDP Public Cloud introduces these changes.

Cloudera Data Warehouse Public Cloud 1.7.1-b755 changes, described in more detail below:

Cloudera Data Warehouse Public Cloud Runtime 2023.0.15.0-243 changes:

  • Hive

  • Iceberg

  • Impala

This release supports dynamically updating the Amazon Machine Image (AMI) to prevent potential problems running workloads on an old AMI. You can update the AMI of the CloudFormation stack while keeping the current Elastic Kubernetes Service (EKS) version.

New CDW clusters in AWS environments that you activate in release 1.7.1-b755 of Cloudera Data Warehouse (released August 30, 2023) use Amazon Elastic Kubernetes Service (EKS) version 1.26. For more information, see Upgrading Amazon Kubernetes Service.

The release includes reduced permissions mode support for the following Amazon Elastic Kubernetes Service upgrades:

The AWS restricted policy has been updated to conform to AWS file size requirements. The update divides the policy into two parts as described in the documentation.

The CDW backup and restore processes for Hue have been automated. Manually backing up and restoring Hue is still available, but optional.

After creating an Impala Virtual Warehouse, you can tune, or correct, the T-shirt size of the executor groups that drive it. The size of the executor groups is critical for achieving cost and performance goals.

Hue now replaces Data Analytics Studio (DAS). DAS has been deprecated and is no longer available in CDW Public Cloud. The DAS features that support Hive and Tez, such as running queries, defining HPL/SQL, the Job Browser, query explorer, and query compare, have been migrated to Hue and the Hue Query Processor. After you upgrade to this release, you will not see the option to launch DAS from your Virtual Warehouse. Cloudera recommends that you use Hue for all use cases where you might previously have used DAS.

In this release, when you activate an AWS environment, you can select the compute instance type you want to use. This release also adds more instance types that you can select when you activate an Azure environment.

After creating an Impala Virtual Warehouse, you can change the XMX Java tool option.

In this release, you can identify an error related to high partition workloads and tune your Hive Virtual Warehouse to run successfully.

Enable Private Cluster has been removed from the environment activation dialog. Use CDP CLI for advanced configurations.

Unified Analytics now supports the caching of Hive/Impala query results. Caching results of repetitive queries can reduce the load.

In this release, when you generate column statistics in a Hive Virtual Warehouse in Unified Analytics mode, you can create histogram statistics on columns. By default, histogram statistics are not created. You enable generation of histogram statistics by setting a Hive property:
set hive.stats.kll.enable = true;

You can then run the ANALYZE command as usual:
ANALYZE TABLE [table_name] COMPUTE STATISTICS FOR COLUMNS [comma_separated_column_list];

Histogram statistics are supported for numeric, date, timestamp, and boolean types, but not for string, varchar, or char columns. Histograms are used to estimate the selectivity of range predicates (predicates involving <, <=, >, >=, and BETWEEN). A better selectivity estimate allows the optimizer to generate better query plans and improves performance for such queries.
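As a rough illustration of the workflow above, the following sketch enables histogram generation, gathers column statistics, and then inspects the plan of a range query. The table name sales and the columns order_date and amount are hypothetical. Only the property name and the ANALYZE syntax come from this release note.

```sql
-- Hypothetical table: sales(order_date DATE, amount DECIMAL(10,2), ...)

-- Enable histogram generation for this session; it is off by default.
SET hive.stats.kll.enable=true;

-- Gather column statistics. Histograms are built only for supported
-- (numeric, date, timestamp, boolean) column types.
ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS order_date, amount;

-- A range predicate of the kind whose selectivity estimate can use a histogram.
EXPLAIN
SELECT order_date, amount
FROM sales
WHERE amount BETWEEN 100 AND 500;
```

Comparing the EXPLAIN row estimates before and after gathering statistics is one way to see whether the histogram changed the plan.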

In this release, you can use Impala, as well as Hive, to migrate a Hive table to an Iceberg table. You use the ALTER TABLE statement. The syntax is described in Migrate Hive table to Iceberg feature, and a step-by-step procedure is covered in Migrating a Hive table to Iceberg.
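A minimal sketch of what an in-place migration might look like in each engine, assuming an existing external Hive table with the hypothetical name legacy_orders. The statements reflect the commonly documented Impala and Hive forms. Treat the referenced documentation as the authority on syntax, prerequisites, and supported file formats.

```sql
-- Hypothetical table name: legacy_orders. Run ONE of the following,
-- in the engine you are connected to.

-- Impala: convert the table to Iceberg in place.
ALTER TABLE legacy_orders CONVERT TO ICEBERG;

-- Hive: convert the table by setting the Iceberg storage handler.
ALTER TABLE legacy_orders
SET TBLPROPERTIES ('storage_handler'='org.apache.iceberg.mr.hive.HiveIcebergStorageHandler');
```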

In this release, Impala, in addition to Hive, can delete rows from Iceberg V2 tables using position delete files, a format defined by the Iceberg spec. A position delete query evaluates the rows of a table against a WHERE clause and deletes all the rows that match the WHERE conditions.
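For example, a row-level delete on an Iceberg V2 table could look like the sketch below. The table ice_orders and its columns are hypothetical, and the table is assumed to have been created with the table property 'format-version'='2'.

```sql
-- Hypothetical Iceberg V2 table: ice_orders(id INT, status STRING)
-- Rows that match the WHERE clause are marked as deleted.
DELETE FROM ice_orders WHERE status = 'cancelled';
```

Rather than rewriting the affected data files, a position delete records the data file path and the row position of each deleted row, and readers skip those positions.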

Complex types are now supported in the SELECT list. Although collections and structs were previously supported, nesting and mixing of complex types was not. For more information, including limitations, see "Allowing embedding complex types into other complex types" in Complex types.

This release supports collections of fixed length types in the sorting tuple. You cannot sort by these collection columns, but they can appear in the SELECT list along with the other columns by which you sort.
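A short sketch of the supported case, reusing the complextypestbl table and its id and int_array columns from the UNION ALL example in these notes. The query sorts by the scalar column while the collection column is carried along in the sorting tuple.

```sql
-- Sort by the scalar column id; the collection column int_array is only
-- carried along in the SELECT list and is not a sort key.
select id, int_array from complextypestbl order by id;
```

Using int_array itself in the ORDER BY clause is the case that remains unsupported.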

This release adds support for collections of fixed length types as non-passthrough children of unions. Plain UNIONs are not supported yet for any collections, but UNION ALL operations are supported.

Example:

select id, int_array from complextypestbl
union all select cast(id as tinyint), int_array from complextypestbl;

By default, Impala requires explicit casts between numeric and string-based literals. This release lets you relax the implicit casting rules for these cases through the query option allow_unsafe_casts, which is turned off by default. When enabled, the option allows implicit casting between some numeric types and string types.
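A hedged sketch of how the option might be used in a session. The table events and its columns are hypothetical, and the assumption that an INSERT of a string literal into an INT column is one of the covered cases should be checked against the Impala query option documentation.

```sql
-- Allow implicit casts between some numeric and string types for this session.
SET ALLOW_UNSAFE_CASTS=true;

-- Hypothetical table: events(event_id INT, label STRING).
-- Assumption: with the option on, the string literal '42' is cast
-- implicitly to INT instead of the statement failing analysis.
INSERT INTO events (event_id, label) VALUES ('42', 'signup');

-- Restore the default behavior.
SET ALLOW_UNSAFE_CASTS=false;
```

Because such casts can silently lose or change data, enabling the option per session rather than globally keeps its effect easy to audit.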

This release introduces new query options to improve memory estimation for aggregation nodes. It also introduces better cardinality estimates, which help cap memory limits early in query planning.

This release introduces changes to the query planner that improve parallel sizing and resource estimation. The changes support workload-aware autoscaling and are exposed as additional query options for tuning purposes. This new functionality will allow more customers to enable multi-threaded queries globally for improved performance.

Before this release, when a user executed a query with COMPUTE_PROCESSING_COST=1, Impala relied on the MT_DOP option to decide the degree of parallelism of the scan fragment. This release introduces the scan node's processing cost as another factor that can raise scan parallelism beyond MT_DOP.

Scan node cost now includes the number of effective scan ranges. Each scan range is given a weight of 0.5% of min_processing_per_thread, so 200 scan ranges (1 / 0.005) add up to the minimum processing cost of one thread. In other words, one scan node instance can handle roughly 200 scan ranges at most. This release also introduces a new query option, MAX_FRAGMENT_INSTANCES_PER_NODE, to cap the maximum number of fragment instances per node. This option works in conjunction with PROCESSING_COST_MIN_THREADS.
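The options named above could be combined in a session as in the following sketch. The numeric values are illustrative assumptions, not recommendations, and the query reuses the complextypestbl table from the examples in these notes.

```sql
-- Enable processing-cost-based planning for this session.
SET COMPUTE_PROCESSING_COST=1;

-- Illustrative values only: a lower bound on threads per fragment and
-- an upper bound on fragment instances per node.
SET PROCESSING_COST_MIN_THREADS=2;
SET MAX_FRAGMENT_INSTANCES_PER_NODE=8;

-- Run the workload and inspect the query profile to see the chosen parallelism.
select count(*) from complextypestbl;
```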

This release enhances the Impala daemon’s Web UI to display the following additional details:

  • Backends start time and version: In a large cluster, you can now use the Impala daemon’s Web UI to view the start time and version for all the backends.

  • Query performance characteristics: For a detailed report on how a query was executed and to understand its performance characteristics, you can use the built-in web server’s UI and look at the timeline shown in the Gantt chart. This chart is an alternative to the PROFILE command and is a graphical display in the Web UI that renders timing information and dependencies.

  • Export query plan and timeline: To understand the detailed performance characteristics of a query, you issue the PROFILE command in impala-shell immediately after executing the query. As an alternative to the profile download page, this release adds support for exporting the graphical query plan and for downloading the timeline in SVG/HTML format. After you export the query plan or the timeline, the memory held by the ObjectURLs is released.

  • Historical/in-flight query performance: You can now use the query list and query details page to analyze historical or in-flight query performance by viewing the memory consumed, the amount of data read, and other information about the query.

  • Aggregate CPU node utilization: You can now see the recent aggregate CPU node utilization samples for the different nodes.

  • Scaling of timeticks and fragment timing diagram for better accessibility: You can now use the query timeline display to scroll horizontally through the fragment timing diagram and utilization chart. You can also zoom by horizontally scaling through mouse wheel events in addition to increasing/decreasing the precision of timetick values.

Before this release, the event processor ignored trivial ALTER_TABLE events that only modify tblproperties such as "transient_lastDdlTime", "totalSize", "numFilesErasureCoded", and "numFiles". For other non-rename ALTER_TABLE events, it triggered a full refresh of the table, which is expensive for tables with a large number of partitions or files.

From this release, to be more cost-efficient, the event processor skips reloading file metadata for some ALTER_TABLE events.

The following list contains the events that skip reloading file metadata:

  • changing table comment

  • adding/dropping columns

  • changing column definition (name/type/comment)

  • changing ownership

  • setting customized tblproperties
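To make the list concrete, the following sketch shows Hive DDL statements that would each produce an ALTER_TABLE event of one of the kinds above. The table web_logs, its columns, the user etl_user, and the property team.owner are all hypothetical.

```sql
-- Hypothetical table web_logs, altered from Hive.

-- Changing the table comment
ALTER TABLE web_logs SET TBLPROPERTIES ('comment' = 'Raw web server logs');

-- Adding a column
ALTER TABLE web_logs ADD COLUMNS (referrer STRING);

-- Changing a column definition (name, type, comment)
ALTER TABLE web_logs CHANGE COLUMN referrer referrer_url STRING COMMENT 'HTTP referrer';

-- Changing ownership
ALTER TABLE web_logs SET OWNER USER etl_user;

-- Setting a customized table property
ALTER TABLE web_logs SET TBLPROPERTIES ('team.owner' = 'analytics');
```

None of these statements adds, removes, or rewrites data files, which is why reloading file metadata for them is unnecessary work.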

For interoperability purposes, this release introduces a new start-up flag 'file_metadata_reload_properties' to list the table properties that need the file metadata reloaded when the properties are changed.

Impala clients, such as Impala shell, can now authenticate to Impala using a JWT instead of a username/password. To connect to Impala using JWT authentication, specify JWT command-line options to the impala-shell command interpreter and enter the password when prompted.

The High Availability (HA) mode of the catalog service in CDW reduces the outage duration of the Impala cluster when the primary catalog service fails. Before this release, catalog HA relied on the Kubernetes leader election mechanism. It is now natively supported in Impala.

Codegen uses query-specific information to generate specialized machine code for each query. When you run a standard query, the query optimizer generates an optimized query plan and passes it to the executor for processing. With the codegen capability for the STRUCT type in the SELECT list, the query-specific information is converted to machine code for faster execution.

Before this release, having structs in the select list was only supported with codegen turned off. This release lifts this restriction, adding full codegen support for structs in the select list.

Example:

select small_struct from complextypes_structs;
