May 5, 2023
This release of the Cloudera Data Warehouse (CDW) service on CDP Public Cloud introduces these changes.
Cloudera Data Warehouse Public Cloud 1.6.3-b319 changes, described in more detail below:
- AWS Kubernetes version 1.24 support
- Selective support for AWS Kubernetes version 1.22 upgrade
- Changes to the standard required IAM permissions and AWS restricted policy
- Synchronized metadata across Impala Virtual Warehouses
- Upgrade and rebuild functionality
- New requirements in Azure to provide managed identity
- Ranger authorization (RAZ) in CDW
- Accessing buckets in a RAZ environment
- Configuring token-based authentication in CDW
- Impala coordinator shutdown in Unified Analytics
- Anti join support in Unified Analytics
- Debugging enhancement: HiveServer OOM dump
- Kubernetes dashboard (Preview)
- Workload Aware Auto-Scaling (Preview)
Cloudera Data Warehouse Public Cloud Runtime 2023.0.14.0-155 changes:
- Iceberg manifest caching
- Iceberg Execute Rollback feature from Impala
- Incremental rebuild of Iceberg materialized views
- Query hint for Impala table cardinality
- Improvement in compaction and Impala metadata refresh
- Optimize Refresh/Invalidate event processing
- Support for custom hash schema for Kudu range tables
- New shell option "hs2_fp_format"
- Support for complex types
- Ability to view Impala query details in Hue
- Ability to access S3 buckets from Hue with RAZ is GA
- Ability to access ADLS Gen2 containers from Hue with RAZ is GA
- The “enable_queries_list” configuration has been removed from Hue jobbrowser safety valve section
AWS Kubernetes version 1.24 support
New CDW clusters using AWS environments you activate in this release 1.6.3-b319 (released May 5, 2023) of Cloudera Data Warehouse will use Amazon Kubernetes (EKS) version 1.24.
Selective support for AWS Kubernetes version 1.22 upgrade
If you activated your pre-existing AWS environment in CDW version 1.4.2-b118 (released Aug 4, 2022) or later, you can upgrade to EKS 1.22 in this release 1.6.3-b319 (released May 5, 2023). AWS environments activated in earlier versions are not supported for upgrade to EKS 1.22 due to incompatibility of some components with EKS 1.22. To determine the activation version of your pre-existing environment, in the Data Warehouse service, expand Environments. In Environments, search for and locate the environment that you want to view. Click Edit. In Environment Details, you see the CDW version.
Changes to the standard required IAM permissions and AWS restricted policy
- AWS restricted policy
In
"Sid": "gocode",
add"
iam:ListAttachedRolePolicies"
, only if you have a Ranger Authorization (RAZ) environment. - Minimum set of IAM permissions required for reduced
permissions mode
Add "
autoscaling:UpdateAutoScalingGroup
".Add "
iam:DeleteRolePolicy
".Add "
iam:ListAttachedRolePolicies
".
Synchronized metadata across Impala Virtual Warehouses
Using Impala Virtual Warehouses that share a Database Catalog is easier in this CDW release. In past releases, after making changes to data and then refreshing tables or invalidating metadata from your Virtual Warehouse, only the catalog metadata and coordinator metadata for that particular Virtual Warehouse were affected. You had to rerun the commands from each Virtual Warehouse to synchronize metadata across multiple Virtual Warehouses that share a Database Catalog.
This release introduces an enhancement that raises events in the Hive metastore. Catalog daemons process events synchronously across all Virtual Warehouses that share metadata. Metadata is refreshed/invalidated in parallel across all your Virtual Warehouses. You need to run the commands only once in any one of your Virtual Warehouse.
To get this feature, you must upgrade your Virtual Warehouse to this release 1.6.3-b319 that
has runtime 2023.0.14.0-155 (released May 5, 2023). You must also set the Impala catalogd
flagfile key enable_reload_events
to true. Newly created Virtual Warehouses
use Impala version 2023.0.14.0-155, which has this feature enabled by default. For more
information, including how to disable this feature, see Disabling metadata synchronization.
This enhancement does not synchronize metadata when you refresh tables or invalidate metadata from a Data Hub cluster.
Upgrade and rebuild functionality
The generic Database Catalog and Virtual Warehouse version upgrade process has changed. When you click the Upgrade button to upgrade your Virtual Warehouse or Database Catalog version, the upgrade that occurs includes the rebuild functionality, which cleans up and redeploys your workload. For more information, Rebuilding a Database Catalog and Rebuilding a Virtual Warehouse.
New requirements in Azure to provide managed identity
During the Azure environment activation and private AKS activation, you must provide the resource ID of a user-assigned managed identity to create the Azure Kubernetes cluster resource.
Ranger authorization (RAZ) in CDW
In this release, if your Data Lake enables Ranger Authorization (RAZ), CDW will be RAZ-enabled to provide authentication in Hue, mainly. If you have an AWS environment that predates the option to enable RAZ for Cloudera Data Warehouse, you might want to manually enable RAZ in CDW.
Accessing buckets in a RAZ environment
In an environment that enables Ranger Authorization (RAZ), you must configure permissions to access an S3 bucket. "Accessing buckets in a RAZ environment" describes how to configure the permissions depending on the AWS account that owns the bucket.
Configuring token-based authentication in CDW
Using a JSON web token (JWT), your Virtual Warehouse client user can sign on to your Virtual Warehouse for a period of time instead of entering single-sign on (SSO) credentials every time your user wants to run a query. You use Impyla to access an Impala Virtual Warehouse and a JDBC connection to access a Hive Virtual Warehouse. For more information, see "Configuring token-based authentication".
Impala coordinator shutdown in Unified Analytics
Cloudera Data Warehouse has supported Impala coordinator shutdown for some time, but not in Unified Analytics until this release. When you create an Impala Virtual Warehouse, and enable Unified Analytics, you can configure Impala coordinator shutdown in Unified Analytics.
Anti join support in Unified Analytics
In this release, Unified Analytics enables an optimization for converting a join having a null filter to perform a SQL anti join.
Debugging enhancement: HiveServer OOM dump
In this release, when a hive-server pod out-of-memory (OOM) event occurs, a corresponding dump will be available in the HiveServer dumps directory:
<external-bucket>/clusters/<env-id>/<dbc-id>/warehouse/debug-artifacts/hive/<compute-id>/hive-dumps/<node-name>/heapdump/dump.hprof
sharecdwdev-bucket-7cv9-dwx-external/clusters/env-9d7cv9/warehouse-1675918732-q9m2/warehouse/debug-artifacts/hive/compute-1675918927-q8hs/hive-dumps/ip-192-168-223-136.us-west-2.compute.internal/heapdump/hiveserver2-oom-dump.hprof
In the example above, ip-192-168-223-136.us-west-2.compute.internal is the name of the
HiveServer node on which the pod was running.Kubernetes dashboard (Preview)
A technical preview of the K8S dashboard is available in this release. You can view the state of resources, such as CPU and memory usage, see the status of pods, and download logs. The dashboard can provide insights into the performance and health of a CDW cluster.
Workload Aware Auto-Scaling (Preview)
Workload Aware Auto-Scaling (WAAS) is available as a technical preview. WAAS allocates Impala Virtual Warehouse resources based on the workload that is running. You choose an executor group set, analogous to a range of nodes instead of the fixed size of the previous auto-scaling implementation. Configuring WAAS requires an entitlement. Contact your account team if you are interested in previewing the feature.
Iceberg manifest caching
Apache Iceberg provides a mechanism to cache the contents of Iceberg manifest files in memory. The manifest caching feature helps to reduce repeated reads of small Iceberg manifest files by Impala Coordinators and Catalogd. You configure manifest caching in CDW. You can enable or disable caching and set a few other caching properties.
Iceberg Execute Rollback feature from Impala
In the event of a problem with your table, you can reset a table to a good state as long as the snapshot of the good table is available. You can roll back the table data based on a snapshot id or a timestamp. The Describe History feature output includes information that helps you use snapshot-related features.
Incremental rebuild of Iceberg materialized views
In this release, the materialized view is automatically updated under certain conditions. After inserting data into a source table of a materialized view, query rewriting is automatically enabled. An incremental rebuild, which updates only the changed part of the view, occurs. Incremental rebuilds tend to update the view much faster than full rebuilds.
Query hint for Impala table cardinality
This release supports a new TABLE_NUM_ROWS query hint to specify a table cardinality for cases where statistics are missing or invalid.
Improvement in compaction and Impala metadata refresh
In this release, handling of the COMMIT_COMPACTION_EVENT has been improved. During compaction, HMS raises events for ACID tables. Impala metadata is refreshed automatically.
Optimize Refresh/Invalidate event processing
Before this release, some metadata consistency issues led to query failures. These failures happen because the metadata updates from multiple coordinators cannot differentiate between self-generated events and those that are generated by a different coordinator. This issue is resolved in this release by adding a coordinator flag to each event, and when processing these events, the coordinator flag is checked to decide whether to ignore the event or consider it.
Support for custom hash schema for Kudu range tables
Impala now includes CREATE TABLE
and ALTER TABLE
syntax
to allow a Kudu custom hash schema. HASH syntax within a partition is similar to the
table-level syntax except that HASH clauses must follow the PARTITION clause, and commas are
not allowed within a partition. You can use the SHOW HASH SCHEMA
statement
to view the hash schema information for each partition.
CREATE TABLE t1 (id int, c2 int, PRIMARY KEY(id, c2))
PARTITION BY HASH(id) PARTITIONS 3 HASH(c2) PARTITIONS 4
RANGE (c2)
(
PARTITION 0 <= VALUES < 10
PARTITION 10 <= VALUES < 20
HASH(id) PARTITIONS 2 HASH(c2) PARTITIONS 3
PARTITION 20 <= VALUES < 30
)
STORED AS KUDU;
ALTER TABLE t1 ADD RANGE PARTITION 30 <= VALUES < 40
HASH(id) PARTITIONS 3 HASH(c2) PARTITIONS 4;
New shell option "hs2_fp_format"
This new option sets the printing format specification for floating point values when using HS2 protocol. The default behavior makes the values handled by Python's str() built-in method. Use '16G' to match Beeswax protocol's floating-point output format.
Support for complex types
A SELECT *
statement did not expand to complex types to be compatible with
earlier versions of Impala that did not support complex types in the result set. In the
older versions of Impala, queries using SELECT *
skip complex types
by default and only expanded to primitive types even when the table contained complex-typed
columns. This release adds a new query option EXPAND_COMPLEX_TYPES
to
include complex types in the SELECT *
list.
Ability to view Impala query details in Hue
You can now view Impala query details, query plan, execution summary, and query metrics on the new Impala Queries tab on the Job Browser page using Hue in CDW, and use this information to tune and optimize your queries. You can also compare two queries and view all query details side-by-side. For more information, see Viewing Impala query details.
Ability to access S3 buckets from Hue with RAZ is GA
Granting fine-grained access to the per-user home directories in Amazon S3 and accessing them from the S3 File Browser in Hue using RAZ is GA. See Accessing S3 bucket from Hue in CDW with RAZ.
Ability to access ADLS Gen2 containers from Hue with RAZ is GA
Granting fine-grained access to the per-user home directories on ADLS Gen2 containers and accessing them from the ABFS File Browser in Hue using RAZ is GA. See Accessing ADLS Gen2 containers from Hue in CDW with RAZ.
The “enable_queries_list” configuration has been removed from Hue jobbrowser safety valve section
The enable_queries_list
configuration in the CDW Hue safety valve was used
to display or hide the Queries tab on the Job
Browser page. This configuration has been removed. The
Queries tab is displayed by default based on the type of Virtual
Warehouse, whether it is Hive or Impala. You can override the query_store
configuration and hide the Queries tab. For more information, see Hue configurations in Cloudera Data Warehouse.