What's new in Cloudera Data Warehouse on Public Cloud
This release of the Cloudera Data Warehouse (CDW) service on CDP Public Cloud introduces these changes.
- AWS Elastic Kubernetes Service 1.28 support
- Accessing S3 buckets in an AWS environment
- Database Catalog size configuration
- General availability of a private CDW environment in Azure Kubernetes Service
- IAM policies moved from documentation to a public GitHub repository
- Metrics for monitoring and troubleshooting Impala Virtual Warehouses
- CDP CLI commands for creating and updating an AWS and Azure cluster
- CDP CLI commands for configuring Workload Aware Auto-Scaling
Cloudera Data Warehouse Runtime 2024.0.17.0-73 includes the following changes, described in more detail below:
Iceberg
- Iceberg drop partition feature
- Iceberg branching and tagging GA
- Enhanced security of Iceberg metadata
- Impala support for changing the Iceberg table metadata location
- Support for copy-on-write (COW)
- Query metadata tables feature
- Directed distribution mode
Impala
- SHOW VIEWS statement
- Planner changes to improve cardinality estimation
- Distributed runtime filter aggregation
- Improvement in catalog observability
- Caching codegen functions
AWS Elastic Kubernetes Service 1.28 support
New CDW clusters using AWS environments that you activate in this release of CDW, 1.8.4-b35 (released February 2024), use Amazon Elastic Kubernetes Service (EKS) version 1.28. For more information, see Upgrading Amazon Kubernetes Service.
Accessing S3 buckets in an AWS environment
You can use the CDW UI to configure access to S3 buckets under certain conditions. You can also configure your own custom encryption key for read/write access from CDW Public Cloud on AWS to an external S3 bucket.
For more information, see Accessing S3 buckets.
Database Catalog size configuration
Using the following CDP CLI commands, you can configure the Java heap size for your Database Catalog as small (8 GB, the default), medium (16 GB), or large (24 GB):
Create a default Database Catalog that configures, for example, a large Java heap size.
dw create-dbc --cluster-id=env-sb42vs --name=SSlarge --memory-tshirt-size=large
Create a Database Catalog that accepts the default Java heap size, small, by not configuring the size.
dw create-dbc --cluster-id=env-sb42vs --name=SSdefault
List your Database Catalogs to verify the configuration.
dw list-dbcs --cluster-id=env-sb42vs
You can also configure the Database Catalog size using the CDW UI, as described in Creating a Database Catalog.
To avoid unnecessary cloud expenses, do not increase the size unless you experience Java heap issues.
General availability of a private CDW environment in Azure Kubernetes Service
You can now enable a private CDW environment in Azure Kubernetes Service (AKS). When you use the Private CDW option in the CLI, CDW deploys an AKS cluster with only private endpoints enabled. The cluster can be accessed only from your Azure network.
IAM policies moved from documentation to a public GitHub repository
To meet customer requests for tracking changes to the IAM policies for CDW, the policies now reside in GitHub. CDW documentation, such as "Attaching the policy to your cross-account role", provides links to the policies.
Metrics for monitoring and troubleshooting Impala Virtual Warehouses
Additional global metrics and table level event metrics are available for debugging an Impala Virtual Warehouse.
CDP CLI commands for creating and updating an AWS and Azure cluster
You can create and update a CDW cluster in Amazon and Azure environments using the following commands:
- create-aws-cluster --environment-crn <value> [options]
- create-azure-cluster --environment-crn <value> --user-assigned-managed-identity <value> [options]
- update-aws-cluster [options]
- update-azure-cluster [options]
For more information about using the commands, including options, see the CDP CLI reference.
CDP CLI commands for configuring Workload Aware Auto-Scaling
You can use CDP CLI commands to perform the following Workload Aware Auto-Scaling tasks:
- Create a Virtual Warehouse with executor group sets
- Choose how many executor group sets to configure
- Configure each executor group set
- Update executor group sets of a Virtual Warehouse
You can modify, add, or delete an executor group set as part of a Virtual Warehouse update request. To delete a group set, set the deleteGroupSet option to true for that group set.
For example, the following command creates an Impala Virtual Warehouse with five executor group sets:
cdp --profile ${CDP_PROFILE} \
dw create-vw \
--cluster-id ${CLUSTER_ID} \
--dbc-id ${DBC_ID} \
--vw-type impala \
--name "impala-$(openssl rand -hex 6)" \
--template xsmall \
--impala-ha-settings highAvailabilityMode=ACTIVE_ACTIVE \
--autoscaling "impalaExecutorGroupSets={small={execGroupSize=1,minExecutorGroups=1,maxExecutorGroups=1,autoSuspendTimeoutSeconds=301,disableAutoSuspend=true,triggerScaleUpDelay=21,triggerScaleDownDelay=21},custom1={execGroupSize=2,minExecutorGroups=0,maxExecutorGroups=1,autoSuspendTimeoutSeconds=302,disableAutoSuspend=true,triggerScaleUpDelay=22,triggerScaleDownDelay=22},custom2={execGroupSize=3,minExecutorGroups=0,maxExecutorGroups=1,autoSuspendTimeoutSeconds=303,disableAutoSuspend=true,triggerScaleUpDelay=23,triggerScaleDownDelay=23},custom3={execGroupSize=4,minExecutorGroups=0,maxExecutorGroups=1,autoSuspendTimeoutSeconds=304,disableAutoSuspend=true,triggerScaleUpDelay=24,triggerScaleDownDelay=24},large={execGroupSize=5,minExecutorGroups=0,maxExecutorGroups=1,autoSuspendTimeoutSeconds=305,disableAutoSuspend=true,triggerScaleUpDelay=25,triggerScaleDownDelay=25}}"
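In this example, the names and values are illustrative: five executor group sets are defined with increasing group sizes, from small (execGroupSize=1) through custom1, custom2, and custom3, to large (execGroupSize=5), letting the Virtual Warehouse match compute capacity to the size of the incoming workload.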
Iceberg drop partition feature
Using an ALTER TABLE statement from Impala, you can easily remove a partition from an Iceberg table. Removing a partition does not affect the table schema; the partition column is not removed from the schema. This feature is offered on a general availability (GA) basis.
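For example, the following statement is a minimal sketch that assumes a hypothetical Iceberg table named ice_sales with an identity partition on a year column:
-- Drop the year=2023 partition; its data files are removed,
-- but the year column remains in the table schema.
ALTER TABLE ice_sales DROP PARTITION (year = 2023);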
Iceberg branching and tagging GA
In this release, branching and tagging are offered on a general availability (GA) basis. From Hive, you can manage the lifecycle of snapshots using the Iceberg branching and Iceberg tagging features. Branches are references to snapshots that have a lifecycle of their own. Tags identify snapshots you need for auditing and for conforming to GDPR. Cloudera recommends that you use this feature in test and development environments; it is not recommended for production deployments.
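As a sketch, assuming a hypothetical Iceberg table named web_logs, you can create a branch and a tag from Hive as follows:
-- Create a branch whose lifecycle is independent of the main table history.
ALTER TABLE web_logs CREATE BRANCH test_branch;
-- Create a tag that pins the current snapshot, for example for auditing.
ALTER TABLE web_logs CREATE TAG audit_snapshot;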
Enhanced security of Iceberg metadata
You can restrict Iceberg data file reads to the table location by configuring the following properties:
- Hive
Configure the property hive.iceberg.allow.datafiles.in.table.location.only.
- Impala
Configure the catalogd property iceberg_restrict_data_file_location.
When the property is set to true, all the data files being read must be within the table location; otherwise, an error occurs. When it is set to false, an unauthorized party who knows the underlying schema and the location of files outside the table location can rewrite the manifest files in one table location to point to data files in another table location, and thereby read your data. For more information, see "Changing the table metadata location".
Impala support for changing the Iceberg table metadata location
In this release, you can change the Iceberg table metadata location from Impala as well as Hive.
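As an illustration, the metadata location can be changed by setting the metadata_location table property; the table name and path below are hypothetical:
-- Point the table at a specific metadata.json file.
ALTER TABLE ice_orders SET TBLPROPERTIES
('metadata_location'='s3a://example-bucket/warehouse/ice_orders/metadata/v2.metadata.json');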
Support for copy-on-write (COW)
Hive supports copy-on-write (COW) as well as merge-on-read (MOR) for handling Iceberg row-level updates and deletes. You configure COW or MOR based on your use case and the rate of data change.
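As a sketch, the write mode is controlled by standard Iceberg table properties; the table name is hypothetical:
-- Switch deletes, updates, and merges to copy-on-write.
ALTER TABLE ice_orders SET TBLPROPERTIES
('write.delete.mode'='copy-on-write',
'write.update.mode'='copy-on-write',
'write.merge.mode'='copy-on-write');
COW rewrites the affected data files at write time, which favors read-heavy workloads, while MOR writes delete files and reconciles them at read time, which favors frequent, small changes.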
Query metadata tables feature
From Hive and Impala, you can query Iceberg metadata tables as you would query a Hive table. For example, you can use projections, joins, filters, and so on.
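For example, assuming a hypothetical Iceberg table named ice_orders, you can filter and sort its snapshot history:
-- Query the snapshots metadata table like a regular table.
SELECT snapshot_id, operation, committed_at
FROM default.ice_orders.snapshots
WHERE operation = 'append'
ORDER BY committed_at DESC;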
Directed distribution mode
This release implements directed distribution mode. The scheduler collects information about which Iceberg data file is scheduled on which host. Because the scan node for a data file runs on the same host as the Iceberg join node that applies the deletes, delete files can be sent directly to that specific host. This mode can improve V2 table read performance.
SHOW VIEWS statement
This release introduces the SHOW VIEWS statement, which simplifies the task of listing all views within a specified schema or database. Using this command, you can quickly identify and review views, thereby enhancing performance by reducing metadata scan operations.
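For example, the database name and pattern below are hypothetical:
-- List all views in a database, optionally filtered by a pattern.
SHOW VIEWS IN sales_db;
SHOW VIEWS IN sales_db LIKE '*monthly*';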
Planner changes to improve cardinality estimation
Significant changes have been made to the query planner to improve cardinality estimation, a critical component of workload-aware autoscaling.
In previous versions, Impala would generate a plan first and then search for runtime filters based on the entire plan. In this release, selective runtime filters have been integrated. These filters aim to reduce the cardinality estimates of scan nodes and specific join nodes located above them. This refinement occurs after the generation of runtime filters and before the computation of resource requirements.
Distributed runtime filter aggregation
Aggregating runtime filters at runtime can impose significant memory overhead on the coordinator. To address this issue, local aggregation of runtime filters within a single executor node was initially introduced, alleviating strain on the coordinator by transmitting filter updates only after local aggregation. However, as Impala clusters scale up, the limitations of local filter aggregation become evident: with numerous nodes, the coordinator still comes under significant memory stress. To mitigate this challenge, runtime filter aggregation is now distributed across specific Impala backends.
Improvement in catalog observability
This release introduces significant enhancements to the Impala Catalog Web UI, focusing on addressing performance issues associated with delays in processing Hive Metastore (HMS) events. These improvements aim to mitigate the risk of queries using outdated metadata.
Caching codegen functions
In Impala, "codegen" involves generating specialized machine code for each query based on query-specific information. When executing a standard query, the query optimizer generates an optimized query plan, which is then passed to the executor for processing. The codegen capability converts query-specific information into machine code, enhancing query performance through faster execution.
Support ORDER BY for collections of variable length types in SELECT list
This release introduces support for collections of variable-length types in the sorting tuple. While it is now possible to include these collection columns in the SELECT list alongside other columns used for sorting, direct sorting by these collection columns is not supported. Additionally, collections of variable-length types can now serve as non-passthrough children of UNION ALL nodes.
Note that structs containing collections, whether of variable or fixed length, are still not supported in the SELECT list of ORDER BY queries.
Here are examples of supported queries:
select id, arr_string_1d from collection_tbl order by id;
select id, map_1d from collection_tbl order by id;
However, queries such as the following are not supported:
select id, struct_contains_map from collection_struct_mix order by id;
Attempting to execute such queries results in the error message "AnalysisException: Sorting is not supported if the select list contains collection(s) nested in struct(s)."