Cloudera Lakehouse Optimizer features

Cloudera Lakehouse Optimizer provides features for scheduling policies, running Iceberg table maintenance actions, and monitoring the results. Some features are available only through REST APIs. You can create Cloudera Lakehouse Optimizer policies in the UI or through REST APIs, depending on your requirements.

The following table lists the supported features and the methods available to use each feature:

Table 1. Supported feature list
Features | Available methods to use the feature | Description
Event-based policy
  • UI – at table and namespace level
  • REST API – at table, namespace, and catalog level
Schedules the policies to be evaluated when an HMS event is triggered, such as an insert, update, or delete operation on the table.

You can create only one version of the policy definition at the catalog level in the UI. However, you can create multiple versions of the policy definition at catalog, namespace, or table level using REST APIs.

For example, when you create policy P1 in the UI, its definition is created at the catalog level. Using REST APIs, you can then create another definition for P1 at the namespace level or table level.

Schedule-based policy
  • UI – at table and namespace level
  • REST API – at table, namespace, and catalog level
Schedules policies to be evaluated at regular intervals.

You can create only one version of the policy definition at the catalog level in the UI. However, you can create multiple versions of the policy definition at catalog, namespace, or table level using REST APIs.

For more information, see Creating a Cloudera Lakehouse Optimizer policy in Lakehouse Optimizer UI and Defining Cloudera Lakehouse Optimizer resources using REST APIs.

Manual (ad hoc) evaluation
  • REST API
Runs the policies manually to optimize the Iceberg tables when required.

For more information, see Performing manual Iceberg table maintenance using Cloudera Lakehouse Optimizer REST APIs.

Dry-run policies
  • REST API
Generates the table maintenance actions but does not initiate them. Dry run the policies to verify that they run without failure.

For more information, see Preparing and defining Cloudera Lakehouse Optimizer policies using REST APIs.

Small file compaction options include:
  • Target file size
  • Minimum number of input files
  • Delete file threshold
  • Maximum concurrent file group rewrites
  • Enable partial progress
  • Maximum number of commits during partial progress
  • Use starting sequence number of snapshot
  • Rewrite all

  • UI
  • REST API
Automates the Iceberg data file compaction maintenance actions.

In the Apache Iceberg documentation, this procedure is called rewrite_data_files. It accepts the table, strategy (binpack or sort), sort_order (zorder, sort direction, and null order), options, and where arguments, all of which Cloudera Lakehouse Optimizer also supports.
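For reference, the following minimal sketch assembles an Iceberg rewrite_data_files call in Spark SQL form from the arguments named above. It only illustrates the shape of the Iceberg procedure; it is not how Cloudera Lakehouse Optimizer invokes compaction internally. The helper name build_rewrite_data_files_call, the catalog name spark_catalog, and the table db.sample are placeholders chosen for this example.

```python
def build_rewrite_data_files_call(catalog, table, strategy="binpack",
                                  sort_order=None, options=None, where=None):
    """Return an Iceberg `rewrite_data_files` CALL statement as a string.

    Hypothetical helper for illustration; it only formats text and does
    not connect to Spark.
    """
    if strategy not in ("binpack", "sort"):
        raise ValueError("strategy must be 'binpack' or 'sort'")

    def quote(value):
        # Keep the sketch simple: refuse values that would need escaping.
        text = str(value)
        if "'" in text:
            raise ValueError("single quotes are not supported in this sketch")
        return "'" + text + "'"

    args = ["table => " + quote(table), "strategy => " + quote(strategy)]
    if sort_order is not None:
        # For example 'id DESC NULLS LAST' or 'zorder(c1, c2)'.
        args.append("sort_order => " + quote(sort_order))
    if options:
        # Iceberg expects options as a map of string keys and values.
        pairs = ", ".join(quote(k) + ", " + quote(v) for k, v in options.items())
        args.append("options => map(" + pairs + ")")
    if where is not None:
        args.append("where => " + quote(where))
    return "CALL " + catalog + ".system.rewrite_data_files(" + ", ".join(args) + ")"


if __name__ == "__main__":
    print(build_rewrite_data_files_call(
        "spark_catalog",
        "db.sample",
        strategy="sort",
        sort_order="id DESC NULLS LAST",
        options={"min-input-files": "2"},
        where="id > 10",
    ))
```

The min-input-files key is an Iceberg rewrite option that corresponds to the "Minimum number of input files" setting listed above. The names Cloudera Lakehouse Optimizer uses for these settings in a policy definition may differ.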

Orphan file removal options include:
  • Delete older than

  • UI
  • REST API
Automates the Iceberg orphan file removal maintenance action.

Snapshot expiration options include:
  • Maximum snapshot age
  • Retain last
  • Expire snapshot ID
  • Clean expired files

  • UI
  • REST API
Automates the Iceberg snapshot management maintenance actions.

Rewrite manifest options include:
  • Target file size
  • Use caching

  • UI
  • REST API
Automates the Iceberg manifest rewrite maintenance actions.

Positional delete rewrite options include:
  • Rewrite job order
  • Enable partial progress
  • Maximum number of commits during partial progress
  • Minimum number of input files
  • Maximum concurrent group rewrites
  • Target file size

  • UI
  • REST API
Automates the Iceberg positional delete rewrite maintenance actions.

Pause and resume table maintenance manually
  • UI
  • REST API
Pauses and resumes table maintenance on demand.
Table maintenance is paused in the following scenarios:
  • You manually pause the table maintenance.
  • Recurring failures during the execution phase of the policy exceed the retry value.
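
The two pause scenarios above can be pictured with the following minimal sketch. It is a simplified model rather than the service's implementation: the class and field names are hypothetical, and it assumes that "recurring" means consecutive failures and that a successful run resets the count.

```python
from dataclasses import dataclass


@dataclass
class MaintenanceState:
    """Hypothetical model of a table's maintenance state."""
    retry_limit: int               # assumed to correspond to the retry value
    consecutive_failures: int = 0
    paused: bool = False

    def record_run(self, succeeded: bool) -> None:
        """Record the outcome of one execution-phase run."""
        if self.paused:
            return  # nothing runs while maintenance is paused
        if succeeded:
            # Assumption: a success resets the failure count.
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        if self.consecutive_failures > self.retry_limit:
            # Scenario 2: failures exceeded the retry value.
            self.paused = True

    def pause(self) -> None:
        """Scenario 1: a user pauses maintenance manually."""
        self.paused = True

    def resume(self) -> None:
        """Resume maintenance and start counting failures afresh."""
        self.paused = False
        self.consecutive_failures = 0


if __name__ == "__main__":
    state = MaintenanceState(retry_limit=2)
    for _ in range(3):
        state.record_run(succeeded=False)
    print(state.paused)
```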

For more information, see Pausing and resuming table maintenance.

CLO event logging
  • REST API
Ingests the maintenance task metadata, also called an event, into the sys.task_events Iceberg table. You can use the table to analyze event logs, troubleshoot issues, perform root cause analysis, and generate reports.

For more information, see Viewing logs for Cloudera Lakehouse Optimizer.

Monitoring policy jobs
  • UI
  • REST API
  • Cloudera Consumption dashboard
Monitor the policy jobs using one of the following methods:
  • View the latest status of the recent tasks that ran for the table or policy in the UI.
  • Use the GET /tasks or GET /tasks/id/{id} APIs.
  • Monitor the Spark jobs on the Cloudera Consumption dashboard.
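
As a rough illustration of the REST method, the following sketch prepares requests for the GET /tasks and GET /tasks/id/{id} endpoints using the Python standard library. The base URL is a placeholder, the helper name tasks_request is hypothetical, and authentication and the response format are not covered here because they depend on your deployment.

```python
from urllib.parse import quote
from urllib.request import Request


def tasks_request(base_url, task_id=None):
    """Build a GET request for all tasks, or for one task when task_id is set.

    Hypothetical helper; base_url must point at your own
    Cloudera Lakehouse Optimizer REST endpoint.
    """
    if task_id is None:
        path = "/tasks"
    else:
        # Percent-encode the ID so it is safe to place in the URL path.
        path = "/tasks/id/" + quote(str(task_id), safe="")
    return Request(
        base_url.rstrip("/") + path,
        method="GET",
        headers={"Accept": "application/json"},
    )


if __name__ == "__main__":
    # Placeholder host; replace with the real service address.
    request = tasks_request("https://clo.example.com/api", task_id=42)
    print(request.get_method(), request.full_url)
```

To send a prepared request, pass it to urllib.request.urlopen together with whatever credentials your environment requires.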

For more information, see Viewing table maintenance status and Monitoring table maintenance tasks on Cloudera Observability dashboard.

Backup policies and associations
  • REST API
Backs up all the existing policies and associations to a TAR file. You can restore the backup to another Data Hub when required.

You can use this feature when you want to delete the current Cloudera Lakehouse Optimizer Data Hub and provision another Data Hub.

For more information, see Cloudera Lakehouse Optimizer REST APIs.

The following additional features are also available:
  • Viewing metrics, such as data read and data written, for each task on the Cloudera Consumption dashboard.
  • Viewing the real-time analysis of the infrastructure, jobs, users, and services for the Data Hub hosting the Cloudera Lakehouse Optimizer service in Cloudera Management Console.