Preparing and defining Cloudera Lakehouse Optimizer policies using REST APIs

Before you create or define Cloudera Lakehouse Optimizer policies using REST APIs, the administrator must complete the necessary prerequisites. The administrator must then onboard the required namespace, define the policy, associate the required tables with the policy, and reschedule the namespace.

Consider the best practices before you create or define a policy.
The following section explains the prerequisites, the policy definition steps, and the post-requisites that you must complete using REST APIs so that Cloudera Lakehouse Optimizer can initiate table maintenance.
  1. Contact your Cloudera account team to enable the Cloudera Lakehouse Optimizer service for your Cloudera Open Data Lakehouse environment.
  2. Use an existing Cloudera Base on premises cluster, or create a dedicated Cloudera Lakehouse Optimizer cluster.
    Ensure that the cluster contains the Zookeeper, HDFS, Hive, YARN, Spark3_on_yarn, Oozie, Hue, Livy_for_spark3, Ranger, Knox, and Meteringv2 services.

    Add the Cloudera Lakehouse Optimizer service to the cluster.

  3. Assign roles to Cloudera Lakehouse Optimizer users. For instructions, see Configuring roles for Lakehouse Optimizer users.
  4. If required, enable the Ranger service for Cloudera Lakehouse Optimizer, and then create the Ranger policies to provide the fine-grained access to a user or group. For instructions, see Providing fine-grained access to namespaces using Ranger.
  5. If required, modify the default values for Spark Executor Memory and Spark Driver Memory. The default values are 8 GB and 4 GB respectively, which are sufficient for most use cases. However, for heavy workloads you might want to increase these values. For more information, see Troubleshooting: Configuring the Spark engine to optimize the rewrite compaction job.
    To modify the spark.driver.memory and spark.executor.memory settings, go to the Cloudera Manager > Clusters > [***CLOUDERA LAKEHOUSE OPTIMIZER***] > Configuration > conf/dlm-client.properties_role_safety_valve property.
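    For example, the safety valve could contain a fragment like the following. The values shown are illustrative only; tune them for your workload:

    ```properties
    # Illustrative values -- appended through the
    # conf/dlm-client.properties_role_safety_valve property.
    spark.driver.memory=8g
    spark.executor.memory=16g
    ```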
  6. Optionally, configure custom YARN queues to use for Cloudera Lakehouse Optimizer workloads. This ensures that the policy runs do not overlap with other workloads.
  7. For all node groups, configure the yarn.scheduler.maximum-allocation-mb and yarn.nodemanager.resource.memory-mb advanced configuration snippets on the Cloudera Manager > Clusters > YARN service > Configuration tab to more than 8 GB depending on your requirements.
  8. By default, the dlm_admin, dlm_operator, and dlm_monitor groups are assigned to the Cloudera Lakehouse Optimizer administrator, operator, and monitor roles respectively. You can add one or more comma-separated groups to perform the administrator, operator, or monitor roles.
    To view the default groups or to add more groups, perform the following steps:
    1. Go to the Cloudera Manager > Clusters > [***CLOUDERA LAKEHOUSE OPTIMIZER***] > Configuration tab.
    2. For the administrator role, search for the DLM Security Role Admin property. By default, the property contains the dlm_admin group. You can append more groups that are comma separated. For example, dlm_admin, clo_admin.
    3. For the operator role, search for the DLM Security Role Operator property. By default, the property contains the dlm_operator group. You can append more groups that are comma separated.
    4. For the monitor role, search for the DLM Security Role Monitor property. By default, the property contains the dlm_monitor group. You can append more groups that are comma separated.

  9. Cloudera Lakehouse Optimizer administrators must perform the following actions to verify whether the REST APIs are accessible and the service is available:
    1. Verify whether you can access Lakehouse Optimizer REST APIs after you generate an Apache Knox token. For more information, see Generating tokens and access Lakehouse Optimizer.
    2. Perform a health check, and then verify whether the ClouderaAdaptive default policy is available. For more information, see Verifying Lakehouse Optimizer health and policy script.
  10. Cloudera Lakehouse Optimizer administrators must perform the following actions to initiate the Iceberg table maintenance activity:
    1. Onboard a namespace using the PUT /namespaces/{namespace} API. The API informs Cloudera Lakehouse Optimizer to include the associated tables in the namespace for maintenance.
    2. Define or create the policy. For more information, see Defining Lakehouse Optimizer resources.
    3. Associate or subscribe the tables in the onboarded namespace to the required policy using the POST /policies/{policyName}/tables/{tableName}/subs API. The action modifies or appends the mapping of the Iceberg tables to the policies in the association file.
      You must run the PUT /policies/{policyName}/tables/{tableName}/subs API for subsequent associations. For more information, see Understanding table-policy associations.
    4. Reschedule the namespace to update the policies for all the tables in an existing namespace using the PATCH /namespaces/{namespace} API.
    5. Optionally, dry run the policy, during which Cloudera Lakehouse Optimizer only generates the table maintenance actions and does not initiate the maintenance actions. For instructions, see Step 2 of Performing manual table maintenance.
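The onboarding steps above can be sketched as a sequence of REST calls. The base URL, token, and object names in this sketch are hypothetical placeholders; substitute the Knox gateway endpoint, Apache Knox token, and the namespace, policy, and table names from your environment:

```shell
# Hypothetical placeholders -- replace with your environment's values.
CLO_BASE="https://clo.example.com/gateway/clo/api"   # assumed Knox-proxied base URL
TOKEN="KNOX_JWT_TOKEN"                               # Apache Knox token
NS="sales_db"                                        # namespace to onboard
POLICY="ClouderaAdaptive"                            # default policy
TABLE="orders"                                       # table to associate

# Compose each call; remove the leading 'echo' to run against a live service.
# 1. Onboard the namespace.
echo curl -X PUT -H "Authorization: Bearer ${TOKEN}" "${CLO_BASE}/namespaces/${NS}"
# 3. First association of the table to the policy.
echo curl -X POST -H "Authorization: Bearer ${TOKEN}" "${CLO_BASE}/policies/${POLICY}/tables/${TABLE}/subs"
# 4. Reschedule the namespace so the updated associations take effect.
echo curl -X PATCH -H "Authorization: Bearer ${TOKEN}" "${CLO_BASE}/namespaces/${NS}"
```

For subsequent associations of the same table, use PUT instead of POST on the /subs endpoint, as described in step 3.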
Wait for Cloudera Lakehouse Optimizer to initiate the table maintenance based on the CRON schedule in the policies.

Optionally, you can initiate manual table maintenance when required. Ensure that you dry run the policy before manual maintenance. For more information, see Performing manual table maintenance.

To monitor the policies, review the various methods available for managing and monitoring the table maintenance tasks.