Preparing and defining Cloudera Lakehouse Optimizer policies using REST APIs
Before you create or define the Cloudera Lakehouse Optimizer policies using
REST APIs, the administrator must complete the necessary prerequisites. After defining the
policy, the administrator must onboard the required namespace, associate the required tables
to the policy, and reschedule the namespace.
Consider the best practices before you create or define the policy. The following section explains the prerequisites, the policy definition, and the
post-requisites that you must complete using REST APIs for Cloudera Lakehouse Optimizer to initiate table maintenance.
Ensure that you do not have any other optimization services enabled alongside
Cloudera Lakehouse Optimizer, such as AWS S3 Table Optimization, to
avoid conflicts during the optimization process.
The CDP Environment administrator must perform the
following steps to install and configure Cloudera Lakehouse Optimizer:
Contact your Cloudera account team to enable the Cloudera Lakehouse Optimizer service for your Cloudera Open Data
Lakehouse environment.
Ensure that the environment has the following minimum configuration to
support Cloudera Lakehouse Optimizer:
The AWS environment must have 1x m5.4xlarge (Master Node), 2x r5d.xlarge
(Worker Nodes), and r5d.xlarge (Compute Nodes, 0 by default).
The Azure environment must have 1x Standard_D16d_v5 (Master Node),
2x Standard_E8ds_v5 (Worker Nodes), and Standard_E8ds_v5 (Compute Nodes,
0 by default).
If required, modify the default values for Spark Executor Memory and Spark
Driver Memory, which are 8 GB and 4 GB respectively. The default memory
settings are sufficient for most use cases. However, you might want to
increase these values for heavy workloads.
To modify the spark.driver.memory and
spark.executor.memory settings, go to Cloudera Manager > Clusters > cloudera_lakehouse_optimizer > Configuration and edit the conf/dlm-client.properties_role_safety_valve property.
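The following is a minimal Python sketch of how the two property lines for the safety valve might be assembled before you paste them into Cloudera Manager. The property names come from the text above. The helper name build_safety_valve_lines, the "<number>g" value format, and the example values are assumptions for illustration only, so confirm the accepted value format for your deployment before you use them.

```python
import re

# Property names come from the text above. The "<number>g" value format is an
# assumption made for this sketch, not a documented requirement.
_MEMORY_PATTERN = re.compile(r"^[1-9][0-9]*g$")


def build_safety_valve_lines(driver_memory: str, executor_memory: str) -> str:
    """Return the two property lines to paste into the safety valve field."""
    for name, value in (("driver", driver_memory), ("executor", executor_memory)):
        if not _MEMORY_PATTERN.match(value):
            raise ValueError(f"unexpected {name} memory value: {value!r}")
    return "\n".join(
        [
            f"spark.driver.memory={driver_memory}",
            f"spark.executor.memory={executor_memory}",
        ]
    )


if __name__ == "__main__":
    # Hypothetical values that double the defaults (4 GB driver, 8 GB executor).
    print(build_safety_valve_lines("8g", "16g"))
```

Validating the values before pasting them helps catch typing mistakes, such as a stray unit suffix, before the service configuration is changed.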
Cloudera Lakehouse Optimizer administrators must perform the following
actions to verify whether the REST APIs are accessible and the service is
available:
Cloudera Lakehouse Optimizer administrators must perform the following
actions to initiate the Iceberg table maintenance activity:
Onboard a namespace using the PUT
/namespaces/{namespace} API. This call instructs Cloudera Lakehouse Optimizer to include the associated tables in
the namespace for maintenance.
Associate, or subscribe, the tables in the onboarded namespace to the
required policy using the POST
/policies/{policyName}/tables/{tableName}/subs API. This
action modifies or appends the mapping of the Iceberg tables to the
policies in the association file.
For subsequent associations, you must run the PUT
/policies/{policyName}/tables/{tableName}/subs API. For
more information, see Understanding table-policy
associations.
Reschedule the namespace to update the policies for all the tables in
an existing namespace using the PATCH
/namespaces/{namespace} API.
Optionally, perform a dry run of the policy. During a dry run, Cloudera Lakehouse Optimizer only generates the table maintenance
actions and does not initiate them. For instructions,
see Step 2 of Performing
manual table maintenance.
Wait for Cloudera Lakehouse Optimizer to initiate the table maintenance based
on the CRON schedule in the policies.
Optionally, you can initiate manual table maintenance when required.
Ensure that you perform a dry run of the policy before manual
maintenance. For more information, see Performing manual table maintenance.
To monitor the policies, use one of the following methods:
View the policy run details on the Policies tab, or the maintenance
details of the chosen tables on the Tables tab, in the Lakehouse
Optimizer UI.