Why use Cloudera Lakehouse Optimizer
Using Cloudera Lakehouse Optimizer to maintain Iceberg tables has several advantages. Learn about a use case that demonstrates how Cloudera Lakehouse Optimizer table maintenance helps data practitioners, and about the characteristics of this service.
Use case
Depending on their use cases, data practitioners, including data engineers and data architects, ingest data into Iceberg tables using streaming or batch methods. They then process the tables as necessary, which might include row-level mutations using insert, update, delete, and merge operations.
Over time, these tasks create numerous data files, delete files, snapshots, metadata files, and orphan files. This increases storage costs and degrades performance, because the query engines must open, read, and close an ever-growing number of files. Eventually, someone must intervene and run optimizations on these tables.
In these circumstances, Data Hub administrators and Cloudera Data Warehouse administrators can use Cloudera Lakehouse Optimizer to optimize the Iceberg tables.
Using Cloudera Lakehouse Optimizer policies, they can automate several table maintenance tasks to gain the following benefits:
- Improving query performance.
- Reducing spending on cloud resources.
- Freeing data practitioners from manual table maintenance so that they can focus on other value-add activities.
Characteristics
- Centralized approach – You can identify all the Iceberg tables that you want to maintain, and then add them at the catalog, namespace, or table level to one or more Cloudera Lakehouse Optimizer policies to start the table maintenance tasks.
- Engine-agnostic – Cloudera Iceberg supports Spark, Hive, Impala, and Trino engines. Regardless of the engine used in your environment, you can leverage the policies for table maintenance.
- Self-driving data optimization – You can choose to maintain the tables based on Hive Metastore (HMS) events or run the policies at scheduled intervals. After you add the tables to the policies, no manual intervention is required, unless you want to pause table maintenance before an environment maintenance activity, an upgrade, or a similar activity.
- Data lifecycle management service – You can use Cloudera Lakehouse Optimizer for table maintenance as part of the broader Iceberg data lifecycle management framework.
