Best practices to use Cloudera Lakehouse Optimizer
Cloudera recommends following a few best practices to ensure error-free Iceberg table maintenance while using Cloudera Lakehouse Optimizer.
The following are the recommended best practices:
- Enabling autoscaling for compute nodes.
- Using `ClouderaAdaptive` as a template to create your policies.
- Creating only one policy per table to avoid policy conflict.
- Using policy names that show the actions the policies run, the tables they apply to, or both. This best practice enhances policy management and governance.
- Ensuring the Cloudera Lakehouse Optimizer policy file or policy template size does not exceed 10 MB.
- Scheduling policy runs so that they do not interfere with the Iceberg-related replication tasks.
  When you have Iceberg replication policies running on the Iceberg tables that are also slated for Cloudera Lakehouse Optimizer table maintenance, you must ensure that the schedules for the replication policies and Cloudera Lakehouse Optimizer policies do not overlap. This is because these concurrent policy jobs interfere with each other, and one or both of the policies might fail or create issues.
- Reviewing and cross-checking the arguments and table associations before you use or trigger the orphan file removal, rewrite positional delete, and snapshot expiration maintenance operations. Cloudera recommends this best practice because these operations might delete data.
- Using a `where` clause to filter the data files when the number of data files to be compacted is very large. The `where` clause is part of the rewrite data file argument.
- Enabling partial progress for compaction to prevent out-of-memory issues.
- Performing a dry run on the policies before you trigger manual evaluations.
- Performing manual table maintenance on a static Iceberg table using REST APIs instead of including it in a Cloudera Lakehouse Optimizer policy.
  Static tables are tables that have not undergone any modifications, such as insert, update, delete operations, or schema evolution, since the time of table creation.
- Using event-based policies to prevent excessive evaluation for tables that are not updated frequently.
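
The scheduling practice in the list above comes down to checking that two time windows do not intersect. The following is a minimal sketch of that check, not a Cloudera Lakehouse Optimizer feature. The function name `windows_overlap`, the start times, and the run durations are all hypothetical values chosen for illustration; in practice you would take them from your own replication and maintenance schedules.

```python
from datetime import datetime, timedelta


def windows_overlap(start_a, duration_a, start_b, duration_b):
    """Return True if two half-open windows [start, start + duration) intersect."""
    end_a = start_a + duration_a
    end_b = start_b + duration_b
    return start_a < end_b and start_b < end_a


# Hypothetical schedules: replication runs at 01:00 for about two hours,
# and table maintenance is planned for 02:30 for about one hour.
replication_start = datetime(2024, 1, 1, 1, 0)
replication_duration = timedelta(hours=2)
maintenance_duration = timedelta(hours=1)

conflicting = windows_overlap(
    replication_start, replication_duration,
    datetime(2024, 1, 1, 2, 30), maintenance_duration,
)

# Moving maintenance to 03:00, when replication is expected to have
# finished, removes the overlap.
rescheduled = windows_overlap(
    replication_start, replication_duration,
    datetime(2024, 1, 1, 3, 0), maintenance_duration,
)

print(conflicting, rescheduled)
```

Because run durations vary, it may be worth adding a safety margin to the estimated duration of each job before comparing the windows.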
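
To illustrate the `where` clause and partial progress practices in the list above, the sketch below builds a call to the Apache Iceberg Spark procedure `rewrite_data_files`, which accepts a `where` filter and the `partial-progress.enabled` and `partial-progress.max-commits` options. This shows the underlying Iceberg procedure only; how a Cloudera Lakehouse Optimizer policy passes these arguments is not covered here and is assumed to differ. The helper name, catalog, table, and filter are hypothetical.

```python
def build_rewrite_data_files_call(catalog, table, where=None,
                                  partial_progress=True, max_commits=10):
    """Build a Spark SQL CALL statement for Iceberg's rewrite_data_files procedure."""
    args = [f"table => '{table}'"]
    if where:
        # Keep the sketch simple: reject filters that would need quote escaping.
        if "'" in where:
            raise ValueError("use double quotes for string literals in the filter")
        args.append(f"where => '{where}'")
    if partial_progress:
        # Partial progress commits file groups as they finish instead of
        # holding the whole rewrite for a single commit at the end.
        args.append(
            "options => map('partial-progress.enabled', 'true', "
            f"'partial-progress.max-commits', '{max_commits}')"
        )
    joined = ", ".join(args)
    return f"CALL {catalog}.system.rewrite_data_files({joined})"


# Hypothetical catalog, table, and partition filter.
statement = build_rewrite_data_files_call(
    "my_catalog", "db.events", where='event_date >= "2024-01-01"'
)
print(statement)
```

Restricting the rewrite to a subset of partitions keeps each compaction run small, and partial progress lets completed file groups be committed even if a later group fails.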
