August 30, 2023
This release of the Cloudera Data Warehouse (CDW) service on CDP Public Cloud introduces these changes.
Cloudera Data Warehouse Public Cloud 1.7.1-b755 changes, described in more detail below:
- Amazon Machine Image updates
- AWS Elastic Kubernetes Service 1.26 support
- Amazon EKS 1.23 and from EKS 1.23 to 1.24 upgrades
- AWS restricted policy updates
- Automatic backup and restoration of Hue
- Correcting the Virtual Warehouse Size
- Deprecation of DAS
- Instance types available for compute nodes
- Java tool options now configurable with limitations
- Managing high partition loads
- Private Cluster control removed from UI
- Query results cache support
Cloudera Data Warehouse Public Cloud Runtime 2023.0.15.0-243 changes:
- Impala support for complex types
- Support ORDER BY for collections of fixed length types in SELECT list
- Support collections of fixed length types as non-passthrough children of unions
- Allow implicit casts between numeric and string types when inserting into table
- Improved memory estimation for aggregates and pre aggregates
- Improved CPU costing
- Added processing cost to control scan parallelism
- Impala WebUI improvements
- Skip reloading file metadata for some ALTER_TABLE events
- JWT auth for Impala
- Native High Availability (HA) for Impala Catalog Service
- Codegen for STRUCT type
Amazon Machine Image updates
This release supports dynamically updating the Amazon Machine Image (AMI) to prevent potential problems running workloads on an old AMI. You can update the AMI of the CloudFormation stack while keeping the current Elastic Kubernetes Service (EKS) version.
AWS Elastic Kubernetes Service 1.26 support
New CDW clusters in AWS environments that you activate in release 1.7.1-b755 of Cloudera Data Warehouse (released August 30, 2023) use Amazon Elastic Kubernetes Service (EKS) version 1.26. For more information, see Upgrading Amazon Kubernetes Service.
Amazon EKS 1.23 and from EKS 1.23 to 1.24 upgrades
AWS restricted policy updates
The AWS restricted policy has been updated to conform to AWS file size requirements. The update divides the policy into two parts as described in the documentation.
Automatic backup and restoration of Hue
Correcting the Virtual Warehouse Size
After creating an Impala Virtual Warehouse, you can tune or correct the T-shirt size of the executor groups that drive it. The size of the executor groups is critical for achieving cost and performance goals.
Deprecation of DAS
Hue now replaces Data Analytics Studio (DAS). DAS has been deprecated and is no longer available in CDW Public Cloud. DAS features to support Hive and Tez such as running queries, defining HPL/SQL, the Job Browser, query explorer, query compare, and more, have been migrated to Hue, and the Hue Query Processor. After you upgrade to this release, you will not see the option to launch DAS from your Virtual Warehouse. Cloudera recommends you use Hue for all use cases where you might have previously used DAS.
Instance types available for compute nodes
In this release, when you activate an AWS environment, you can select the compute instance type you want to use. This release also adds instance types you can select when you activate an Azure environment.
Java tool options now configurable with limitations
After creating an Impala Virtual Warehouse, you can change the XMX Java tool option.
Managing high partition loads
In this release, you can identify an error related to high partition workloads and tune your Hive Virtual Warehouse to run successfully.
Private Cluster control removed from UI
Enable Private Cluster has been removed from the environment activation dialog. Use CDP CLI for advanced configurations.
Query results cache support
Unified Analytics now supports the caching of Hive/Impala query results. Caching results of repetitive queries can reduce the load.
In this release, when you generate column statistics in a Hive Virtual Warehouse in Unified Analytics mode, you can create histogram statistics on columns. By default, the histogram stats are not created. You enable generation of histogram statistics by setting a Hive property:
set hive.stats.kll.enable = true;
ANALYZE TABLE [table_name] COMPUTE STATISTICS for COLUMNS [comma_separated_column_list];
Histogram statistics are supported for numeric, date, timestamp, and boolean types, but not for string, varchar, or char columns. Histograms are used to estimate the selectivity of range predicates (predicates involving <, <=, >, >=, and BETWEEN). The better selectivity estimate allows the optimizer to generate better query plans and improve performance for such queries.
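As an illustration of the steps above, the following sketch enables histogram generation, computes column statistics, and then inspects the plan of a range query. The table name sales and its columns amount and sale_date are hypothetical and stand in for your own numeric and date columns.

```sql
-- Hypothetical table and columns; substitute your own.
-- Enable histogram (KLL) statistics for this session.
set hive.stats.kll.enable = true;

-- Collect column statistics, including histograms, for a numeric and a date column.
ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS amount, sale_date;

-- A range predicate of the kind that histograms help estimate.
EXPLAIN
SELECT count(*)
FROM sales
WHERE amount BETWEEN 100 AND 500;
```

Comparing the row estimates in the EXPLAIN output before and after collecting the statistics is one way to see whether the histogram changed the plan.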
Hive to Iceberg table migration from Impala
In this release, you can use Impala, as well as Hive, to migrate a Hive table to Iceberg tables. You use the ALTER TABLE statement. Syntax is described in Migrate Hive table to Iceberg feature and a step-by-step procedure is covered in Migrating a Hive table to Iceberg.
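A minimal sketch of the Impala form of the migration, assuming a hypothetical Hive table named hive_sales in a file format that Iceberg supports. Refer to the linked syntax topic for the authoritative statement and its options.

```sql
-- Hypothetical table name; run from an Impala Virtual Warehouse.
-- Converts the existing Hive table to an Iceberg table in place.
ALTER TABLE hive_sales CONVERT TO ICEBERG;

-- Optional check that the table is now reported as an Iceberg table.
DESCRIBE FORMATTED hive_sales;
```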
Iceberg position delete feature support
In this release, Impala, in addition to Hive, can delete rows from Iceberg V2 tables using position delete files, a format defined by the Iceberg Spec. A position delete query evaluates rows from one table against a WHERE clause and deletes all the rows that match the WHERE conditions.
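The following sketch shows what such a delete might look like from Impala. The table ice_orders and its columns are hypothetical; the sketch assumes the table is created as an Iceberg V2 table through the format-version table property.

```sql
-- Hypothetical Iceberg V2 table; names are for illustration only.
CREATE TABLE ice_orders (id INT, status STRING)
STORED BY ICEBERG
TBLPROPERTIES ('format-version' = '2');

INSERT INTO ice_orders VALUES (1, 'open'), (2, 'cancelled'), (3, 'open');

-- Rows matching the WHERE clause are marked as deleted through position delete files.
DELETE FROM ice_orders WHERE status = 'cancelled';

-- The deleted row no longer appears in query results.
SELECT id, status FROM ice_orders ORDER BY id;
```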
Impala support for complex types
Complex types are now supported in the SELECT list. Although collections and structs were previously supported, nesting and mixing of complex types was not. For more information, including limitations, see "Allowing embedding complex types into other complex types" in Complex types.
Support ORDER BY for collections of fixed length types in SELECT list
This release supports collections of fixed length types in the sorting tuple. You cannot sort by these collection columns themselves, but they can appear in the SELECT list along with the columns by which you sort.
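For example, assuming the complextypestbl table and its id and int_array columns used elsewhere in these notes, a query of the following shape sorts by a scalar column while carrying the array column through the sort:

```sql
-- int_array is a collection of a fixed length type (INT).
-- It appears in the SELECT list, while the sort key is the scalar column id.
select id, int_array from complextypestbl order by id;

-- Not supported: using the collection itself as the sort key.
-- select id, int_array from complextypestbl order by int_array;
```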
Support collections of fixed length types as non-passthrough children of unions
This release adds support for collections of fixed length types as non-passthrough children of unions. Plain UNIONs are not supported yet for any collections, but UNION ALL operations are supported.
select id, int_array from complextypestbl union all select cast(id as tinyint), int_array from complextypestbl
Allow implicit casts between numeric and string types when inserting into table
Before this release, Impala required explicit casts for numeric and string-based literals. This release relaxes the implicit casting rules for these cases. The behavior is controlled through the query option allow_unsafe_casts, which is turned off by default. This query option allows implicit casting between some numeric types and string types.
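A hedged sketch of how the option might be used in an insert. The table cast_demo is hypothetical, and the exact set of type combinations that the option accepts is not listed here, so treat the literals below as an assumption to verify in your environment.

```sql
-- Hypothetical table for illustration.
CREATE TABLE cast_demo (n INT, s STRING);

-- The option is off by default; enable it for the current session.
SET ALLOW_UNSAFE_CASTS=true;

-- A string literal is supplied for the INT column and a numeric literal
-- for the STRING column, relying on the relaxed implicit casts.
INSERT INTO cast_demo VALUES ('42', 7);

-- With the option off, the same intent needs explicit casts.
SET ALLOW_UNSAFE_CASTS=false;
INSERT INTO cast_demo VALUES (CAST('42' AS INT), CAST(7 AS STRING));
```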
Improved memory estimation for aggregates and pre aggregates
This release introduces new query options to improve memory estimation for aggregation nodes. It also introduces better cardinality estimates that help cap memory limits early during query planning.
Improved CPU costing
This release introduces changes to the query planner that improve parallel sizing and resource estimation. The changes support workload-aware autoscaling and are exposed as additional query options for tuning purposes. This new functionality will allow more customers to enable multi-threaded queries globally for improved performance.
Added processing cost to control scan parallelism
Before this release, when a user executed a query with COMPUTE_PROCESSING_COST=1, Impala relied on the option to decide the degree of parallelism of the scan fragment. This release introduces the scan node's processing cost as another factor to consider raising scan parallelism beyond
Scan node cost now includes the number of effective scan ranges. Each scan range is given a weight of (0.5% * min_processing_per_thread), which roughly means that one scan node instance can handle at most 200 scan ranges. This release also introduces a new query option MAX_FRAGMENT_INSTANCES_PER_NODE to cap the maximum number of fragment instances per node. This newly introduced query option works in conjunction with
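The two query options named above can be set per session, as in the following sketch. The cap of 8 and the table name sales are illustrative assumptions, not recommended or default values. The comment spells out where the figure of 200 scan ranges comes from.

```sql
-- Turn on processing-cost-based planning for this session.
SET COMPUTE_PROCESSING_COST=1;

-- Illustrative cap on fragment instances per node; choose a value that suits your nodes.
SET MAX_FRAGMENT_INSTANCES_PER_NODE=8;

-- Each scan range weighs 0.5% of min_processing_per_thread, so one scan node
-- instance reaches that threshold after 1 / 0.005 = 200 scan ranges.

-- Hypothetical query; the plan shows the chosen number of scan instances.
EXPLAIN SELECT count(*) FROM sales;
```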
Impala WebUI improvements
This release enhances the Impala daemon’s Web UI to display the following additional details:
- Backends start time and version: In a large cluster, you can now use the Impala daemon’s Web UI to view the start time and version for all the backends.
- Query performance characteristics: For a detailed report on how a query was executed and to understand the detailed performance characteristics of a query, you can use the built-in web server’s UI and look at the timeline shown in the Gantt chart. This chart is an alternative to the PROFILE command and is a graphical display in the WebUI that renders timing information and dependencies.
- Export query plan and timeline: To understand the detailed performance characteristics for a query, you issue the PROFILE command in impala-shell immediately after executing a query. As an alternative to the profile download page, this release added support for exporting the graphical query plan and also for downloading the timeline in SVG/HTML format. Once you export the query plan or the timeline, memory resources consumed from the ObjectURLs get cleared.
- Historical/in-flight query performance: You can now use the query list and query details page to analyze historical or in-flight query performance by viewing the memory consumed, the amount of data read, and other information about the query.
- Aggregate CPU node utilization: You can now see the recent aggregate CPU node utilization samples for the different nodes.
- Scaling of timeticks and fragment timing diagram for better accessibility: You can now use the query timeline display to scroll horizontally through the fragment timing diagram and utilization chart. You can also zoom by horizontally scaling through mouse wheel events in addition to increasing/decreasing the precision of timetick values.
Skip reloading file metadata for some ALTER_TABLE events
Before this release, EventProcessor ignored trivial ALTER_TABLE events that only modify tblproperties like "transient_lastDdlTime," "totalSize," "numFilesErasureCoded," and "numFiles". For other non-rename ALTER_TABLE events, it triggered a full refresh on the table, which becomes expensive for tables with a large number of partitions or files.
From this release, to be more cost-efficient, the event processor skips reloading file metadata for some ALTER_TABLE events.
The following list contains the events that skip reloading file metadata:
- changing table comment
- changing column definition (name/type/comment)
- setting customized tblproperties
For interoperability purposes, this release introduces a new start-up flag 'file_metadata_reload_properties' to list the table properties that need the file metadata reloaded when the properties are changed.
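As a sketch, statements of the following kinds, run from Hive, produce ALTER_TABLE events in the three categories listed above. The table web_logs, its columns, and the owner_team property are hypothetical, and the column is assumed to already be a TIMESTAMP so that only its name and comment change.

```sql
-- Changing the table comment.
ALTER TABLE web_logs SET TBLPROPERTIES ('comment' = 'Raw web server logs');

-- Changing a column definition (here the name and the comment).
ALTER TABLE web_logs CHANGE COLUMN ts event_ts TIMESTAMP COMMENT 'Event time';

-- Setting a customized table property.
ALTER TABLE web_logs SET TBLPROPERTIES ('owner_team' = 'analytics');
```

If a customized property should still trigger a file metadata reload, it would need to be listed in the file_metadata_reload_properties start-up flag described above.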
JWT auth for Impala
Impala clients, such as Impala shell, can now authenticate to Impala using a JWT instead of a username/password. To connect to Impala using JWT authentication, specify JWT command-line options to the impala-shell command interpreter and enter the password when prompted.
Native High Availability (HA) for Impala Catalog Service
The High Availability (HA) mode of catalog service in CDW reduces the outage duration of the Impala cluster when the primary catalog service fails. Before this release, catalog HA was supported using the K8s leader election mechanism, and now it is natively supported in Impala.
Codegen for STRUCT type
Codegen uses query-specific information to generate specialized machine code for each query. As an Impala user, when you run a standard query, the query optimizer generates an optimized query plan and passes it to the executor for processing. With the codegen capability for STRUCT type in the SELECT list, the query-specific information is converted to machine code for faster execution.
Before this release, having structs in the select list was only supported with codegen turned off. This release lifts this restriction, adding full codegen support for structs in the select list.
select small_struct from complextypes_structs