Preparing and defining Cloudera Lakehouse Optimizer policies using REST APIs

Before you create or define the Cloudera Lakehouse Optimizer policies using REST APIs, the administrator must complete the necessary prerequisites. After defining the policy, the administrator must onboard the required namespace, associate the required tables to the policy, and reschedule the namespace.

Consider the best practices before you create or define the policy.

The following section explains the prerequisites, defining the policy, and post-requisites that you must complete using REST APIs for Cloudera Lakehouse Optimizer to initiate table maintenance.

To avoid conflicts during the optimization process, ensure that no other optimization service, such as AWS S3 Table Optimization, is enabled alongside Cloudera Lakehouse Optimizer.
The CDP Environment administrator must perform the following steps to install and configure Cloudera Lakehouse Optimizer:
1. Contact your Cloudera account team to enable the Cloudera Lakehouse Optimizer service for your Cloudera Open Data Lakehouse environment.
2. Ensure that the environment has the following minimum configuration to support Cloudera Lakehouse Optimizer:
  - The AWS environment must have 1x m5.4xlarge (Master Node), 2x r5d.xlarge (Worker Nodes), and r5d.xlarge (Compute Nodes, 0 by default).
  - The Azure environment must have 1x Standard_D16d_v5 (Master Node), 2x Standard_E8ds_v5 (Worker Nodes), and Standard_E8ds_v5 (Compute Nodes, 0 by default).
  Important: Enable autoscaling on the compute nodes before you use Cloudera Lakehouse Optimizer.
3. Provision only one Cloudera Lakehouse Optimizer Data Hub for your AWS or Azure environment. For instructions, see Provisioning the Lakehouse Optimizer Data Hub.
4. Assign roles to Cloudera Lakehouse Optimizer users. For instructions, see Configuring roles for Lakehouse Optimizer users.
5. If required, modify the default values for Spark Executor Memory (8 GB) and Spark Driver Memory (4 GB). The default memory settings are sufficient for most use cases, but heavy workloads might need higher values.
  To modify the spark.driver.memory and spark.executor.memory settings, go to Cloudera Manager > Clusters > cloudera_lakehouse_optimizer > Configuration, and edit the conf/dlm-client.properties_role_safety_valve property.
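As an illustration of step 5, the following minimal Python sketch renders the two memory overrides as text that could be pasted into the safety valve. It assumes that the safety valve accepts Java-properties-style key=value lines and that the values use Spark size strings such as 16g; neither assumption is confirmed by this page. The helper name safety_valve_lines is hypothetical, and the example values simply double the documented defaults.

```python
import re

# Conservative subset of the size strings Spark understands, for example "512m" or "16g".
_SIZE = re.compile(r"^\d+[kmgt]$")


def safety_valve_lines(driver_memory, executor_memory):
    """Render the two Spark memory overrides as key=value lines.

    Assumption: the safety valve takes properties-style key=value lines.
    """
    for value in (driver_memory, executor_memory):
        if not _SIZE.match(value):
            raise ValueError(f"not a recognized size string: {value!r}")
    return "\n".join([
        f"spark.driver.memory={driver_memory}",
        f"spark.executor.memory={executor_memory}",
    ])


if __name__ == "__main__":
    # Example only: driver raised from 4 GB to 8 GB, executor from 8 GB to 16 GB.
    print(safety_valve_lines("8g", "16g"))
```

Validating the strings before pasting them helps catch typos such as "16 GB", which this sketch rejects.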
Cloudera Lakehouse Optimizer administrators must perform the following actions to verify whether the REST APIs are accessible and the service is available:
1. Generate an Apache Knox token, and then verify that you can access the Lakehouse Optimizer REST APIs. For more information, see Generating tokens and access Lakehouse Optimizer.
2. Perform a health check, and then verify whether the ClouderaAdaptive default policy is available. For more information, see Verifying Lakehouse Optimizer health and policy script.
Cloudera Lakehouse Optimizer administrators must perform the following actions to initiate the Iceberg table maintenance activity:
1. Onboard a namespace using the PUT /namespaces/{namespace} API. This call tells Cloudera Lakehouse Optimizer to include the tables in the namespace for maintenance.
2. Define or create the policy. For more information, see Defining Lakehouse Optimizer resources.
3. Associate (subscribe) the tables in the onboarded namespace to the required policy using the POST /policies/{policyName}/tables/{tableName}/subs API. This action modifies or appends the mapping of the Iceberg tables to the policies in the association file.
  For subsequent associations, you must run the PUT /policies/{policyName}/tables/{tableName}/subs API instead. For more information, see Understanding table-policy associations.
  Note: A table can also be associated with a set of policies using the dlm_policies table property. For example, ALTER TABLE t1 SET TBLPROPERTIES ('dlm_policies' = 'p1'); where p1 is the policy name. After you update the table properties, run a manual evaluation on the IcebergPropertiesTable policy. To do so, run the POST /policies/IcebergPropertiesTable/tables/t1/evaluation API, and then reschedule the namespace.
4. Reschedule the namespace to update the policies for all the tables in an existing namespace using the PATCH /namespaces/{namespace} API.
5. Optionally, perform a dry run of the policy. During a dry run, Cloudera Lakehouse Optimizer only generates the table maintenance actions and does not initiate them. For instructions, see Step 2 of Performing manual table maintenance.
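The API calls in steps 1, 3, and 4 can be sketched in Python as follows. This is a minimal illustration, not the product's client: the base URL, the bearer-token header, and the helper names build_request and maintenance_setup_requests are assumptions, and the sketch omits any request bodies or table-name formats that the APIs might require. The code only builds the requests so that you can inspect them; it does not send anything.

```python
from urllib.parse import quote
from urllib.request import Request

# Hypothetical placeholders: replace with your own gateway URL and Knox token.
BASE_URL = "https://optimizer.example.com/api"
KNOX_TOKEN = "<knox-token>"


def build_request(method, path, base_url=BASE_URL, token=KNOX_TOKEN):
    """Build (but do not send) an authenticated request for one API call."""
    return Request(
        url=base_url.rstrip("/") + path,
        method=method,
        # Assumption: the Knox token is passed as a bearer token.
        headers={"Authorization": f"Bearer {token}"},
    )


def maintenance_setup_requests(namespace, policy_name, table_names,
                               first_association=True):
    """Return the requests for steps 1, 3, and 4, in that order."""
    ns = quote(namespace, safe="")
    policy = quote(policy_name, safe="")
    # POST creates the first association; PUT is used for subsequent ones.
    subs_method = "POST" if first_association else "PUT"

    # Step 1: onboard the namespace.
    requests = [build_request("PUT", f"/namespaces/{ns}")]
    # Step 3: associate each table with the policy.
    for table in table_names:
        tbl = quote(table, safe="")
        requests.append(
            build_request(subs_method, f"/policies/{policy}/tables/{tbl}/subs")
        )
    # Step 4: reschedule the namespace so that the associations take effect.
    requests.append(build_request("PATCH", f"/namespaces/{ns}"))
    return requests


if __name__ == "__main__":
    # "sales" and "orders" are made-up names used only for illustration.
    for req in maintenance_setup_requests("sales", "ClouderaAdaptive", ["orders"]):
        print(req.get_method(), req.full_url)
```

To send a request, you could pass it to urllib.request.urlopen once the base URL and token are valid for your environment. Building the full list first makes it easy to review the call order before anything reaches the service.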

Wait for Cloudera Lakehouse Optimizer to initiate the table maintenance based on the CRON schedule in the policies.

You can also initiate manual table maintenance when required. Ensure that you perform a dry run of the policy before manual maintenance. For more information, see Performing manual table maintenance.

To monitor the policies, use one of the following methods:

  - View the policy run details or the chosen tables’ maintenance details on the Policies tab or Tables tab respectively in the Lakehouse Optimizer UI.
  - Monitor the policy jobs as Spark jobs on the Cloudera Observability dashboard. For more information, see Monitoring table maintenance tasks on Cloudera Observability dashboard.
View the various methods that are available to manage and monitor the table maintenance tasks.