Cloudera Lakehouse Optimizer features

The following table lists the features that Cloudera Lakehouse Optimizer supports and the methods available to use each feature:

Table 1. Supported feature list
Feature | Available methods | Description
Event-based policy | REST API – at table, namespace, and catalog level | Schedules policies to be evaluated when a Hive Metastore (HMS) event is triggered, such as an insert, update, or delete operation on the table.

You can create only one version of the policy definition at the catalog level in the UI. However, you can create multiple versions of the policy definition at catalog, namespace, or table level using REST APIs.

For example, when you create policy P1 in the UI, its definition is created at the catalog level. Using REST APIs, you can then create another definition for P1 at the namespace level or table level.

Schedule-based policy | REST API – at table, namespace, and catalog level | Schedules policies to be evaluated at regular intervals.

You can create only one version of the policy definition at the catalog level in the UI. However, you can create multiple versions of the policy definition at catalog, namespace, or table level using REST APIs.

For more information, see Creating a policy in Lakehouse Optimizer UI and Defining Cloudera Lakehouse Optimizer resources using REST APIs.

Manual (ad hoc) evaluation | REST API | Runs policies manually to optimize Iceberg tables when required.

For more information, see Performing manual Iceberg table maintenance using Cloudera Lakehouse Optimizer REST APIs.

Dry-run policies | REST API | Dry runs existing policies to verify that they run without failure. A dry run generates the table maintenance actions but does not initiate them.

For more information, see Preparing and defining policies using REST APIs.

Small file compaction | REST API | Automates the Iceberg data file compaction maintenance actions.
Options include:
  • Target file size
  • Minimum number of input files
  • Delete file threshold
  • Maximum concurrent file group rewrites
  • Enable partial progress
  • Maximum number of commits during partial progress
  • Use starting sequence number of snapshot
  • Rewrite all

In the Apache Iceberg documentation, this procedure is called rewrite_data_files. It accepts the table, strategy (binpack or sort), sort_order (zorder, sort direction, null order), options, and where arguments, all of which Cloudera Lakehouse Optimizer also supports.
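To make the arguments above concrete, the following minimal sketch builds the Spark SQL CALL statement for the Iceberg rewrite_data_files procedure. It illustrates the underlying Iceberg procedure only, not the Cloudera Lakehouse Optimizer REST API. The helper name, the catalog name spark_catalog, and the table name db.sample are placeholders chosen for illustration. The option keys are the Iceberg names that appear to correspond to the options listed above, for example min-input-files and partial-progress.enabled.

```python
def _quote(value):
    """Render a value as a single-quoted SQL string literal (no escaping in this sketch)."""
    text = str(value).lower() if isinstance(value, bool) else str(value)
    if "'" in text:
        raise ValueError("values containing single quotes are not handled in this sketch")
    return f"'{text}'"


def rewrite_data_files_sql(catalog, table, strategy="binpack",
                           sort_order=None, options=None, where=None):
    """Build a CALL statement for Iceberg's rewrite_data_files Spark procedure."""
    if strategy not in ("binpack", "sort"):
        raise ValueError("strategy must be 'binpack' or 'sort'")
    if sort_order and strategy != "sort":
        raise ValueError("sort_order is only meaningful with the sort strategy")

    args = [f"table => {_quote(table)}", f"strategy => {_quote(strategy)}"]
    if sort_order:
        args.append(f"sort_order => {_quote(sort_order)}")
    if options:
        # Iceberg expects the options as a map of string keys to string values.
        pairs = ", ".join(f"{_quote(k)}, {_quote(v)}" for k, v in options.items())
        args.append(f"options => map({pairs})")
    if where:
        args.append(f"where => {_quote(where)}")
    return f"CALL {catalog}.system.rewrite_data_files({', '.join(args)})"


# Placeholder catalog and table names; the option keys are Iceberg option names.
example_sql = rewrite_data_files_sql(
    "spark_catalog",
    "db.sample",
    options={"min-input-files": 2, "partial-progress.enabled": True},
)
```

With a Spark session that has the Iceberg extensions enabled, the resulting string could be passed to spark.sql(). The sketch rejects values that contain single quotes rather than attempting to escape them.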

Orphan file removal | REST API | Automates the Iceberg orphan file removal maintenance action.
Options include:
  • Delete older than
Snapshot expiration | REST API | Automates the Iceberg snapshot management maintenance actions.
Options include:
  • Maximum snapshot age
  • Retain last
  • Expire snapshot ID
  • Clean expired files
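As an illustration of how a maximum snapshot age and a retain-last count might translate into the underlying Iceberg maintenance call, the following sketch builds a CALL statement for the Iceberg expire_snapshots Spark procedure. The mapping from the option names above to the older_than and retain_last arguments is an assumption, as are the helper name and the placeholder catalog and table names. The sketch also assumes that the Spark session time zone is UTC.

```python
from datetime import datetime, timedelta, timezone


def expire_snapshots_sql(catalog, table, max_age, retain_last=None, now=None):
    """Build a CALL statement for Iceberg's expire_snapshots Spark procedure.

    max_age is a timedelta; snapshots older than (now - max_age) are candidates
    for expiration. retain_last, if given, is the minimum number of snapshots to keep.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = (now - max_age).strftime("%Y-%m-%d %H:%M:%S")

    args = [f"table => '{table}'", f"older_than => TIMESTAMP '{cutoff}'"]
    if retain_last is not None:
        if retain_last < 1:
            raise ValueError("retain_last must be at least 1")
        args.append(f"retain_last => {retain_last}")
    return f"CALL {catalog}.system.expire_snapshots({', '.join(args)})"


# Fixed "now" so that the example is reproducible; names are placeholders.
example_sql = expire_snapshots_sql(
    "spark_catalog",
    "db.sample",
    max_age=timedelta(days=7),
    retain_last=5,
    now=datetime(2024, 1, 8, 12, 0, 0, tzinfo=timezone.utc),
)
```

Passing an explicit now value keeps the cutoff deterministic, which is convenient when the statement is generated by a scheduled job and needs to be logged or compared.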
Rewrite manifest | REST API | Automates the Iceberg manifest rewrite maintenance actions.
Options include:
  • Target file size
  • Use caching
Positional delete rewrite | REST API | Automates the Iceberg positional delete rewrite maintenance actions.
Options include:
  • Rewrite job order
  • Enable partial progress
  • Maximum number of commits during partial progress
  • Minimum number of input files
  • Maximum concurrent group rewrites
  • Target file size
Pause and resume table maintenance manually | REST API | Pauses table maintenance.
Table maintenance is paused in the following scenarios:
  • You manually paused the table maintenance.
  • Recurring failures during the execution phase of the policy exceeded the retry value.

For more information, see Pausing and resuming table maintenance.

CLO event logging | REST API | Ingests the maintenance task metadata, also called an event, into the sys.clo_events Iceberg table. You can use the table to analyze event logs, troubleshoot issues, perform root cause analysis, and generate reports.

For more information, see Viewing logs for Cloudera Lakehouse Optimizer.
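Because sys.clo_events is an Iceberg table, it can be inspected with ordinary SQL. The following minimal sketch only builds the query text. The helper name is hypothetical, and no column names are assumed because the table schema is not described here.

```python
def clo_events_query(limit=20):
    """Return a SQL statement that samples rows from the sys.clo_events table."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(limit, bool) or not isinstance(limit, int) or limit < 1:
        raise ValueError("limit must be a positive integer")
    return f"SELECT * FROM sys.clo_events LIMIT {limit}"


example_query = clo_events_query(10)
```

With an existing Spark session named spark, the statement could be run as spark.sql(example_query).show(). After the columns are known, the query can be narrowed to specific tables, policies, or time ranges.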

Monitoring policy jobs | REST API | Monitor the policy jobs using one of the following methods:
  • View the latest status for the recent tasks that ran for the table or policy on the UI.
  • Use the GET /tasks or GET /tasks/id/{id} APIs.
  • Monitor the Spark jobs on the Cloudera Observability dashboard.

For more information, see Viewing table maintenance status and Monitoring table maintenance tasks on Cloudera Observability dashboard.
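The two task endpoints mentioned above can be sketched as follows. The sketch uses only the Python standard library and prepares the requests without sending them. The base URL https://clo.example.com and the helper name are placeholders, and authentication is deliberately left out because it depends on the deployment.

```python
from urllib.parse import quote
from urllib.request import Request


def tasks_request(base_url, task_id=None):
    """Prepare a GET request for /tasks, or for /tasks/id/{id} when task_id is given."""
    if task_id is None:
        path = "/tasks"
    else:
        # Percent-encode the identifier so that it is safe inside a URL path segment.
        path = "/tasks/id/" + quote(str(task_id), safe="")
    return Request(base_url.rstrip("/") + path, method="GET")


# Placeholder host name; replace it with the address of your service instance.
all_tasks = tasks_request("https://clo.example.com/")
one_task = tasks_request("https://clo.example.com", task_id=42)
```

The prepared request could then be sent with urllib.request.urlopen() after the headers required by your environment are added with Request.add_header().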

Backup policies and associations | REST API | Backs up all existing policies and associations to a TAR file. You can use the backup file to restore these configurations to any other Cloudera Lakehouse Optimizer service instance when required.

This feature is useful when you plan to delete the current Cloudera Lakehouse Optimizer service instance.

For more information, see Cloudera Lakehouse Optimizer REST APIs.

Fine-grained access to namespaces | Ranger UI | Creates Ranger policies that grant the required access to groups or users at the namespace level.