Preparing and defining Cloudera Lakehouse Optimizer policies using REST APIs
Before you create or define the Cloudera Lakehouse Optimizer policies using
REST APIs, the administrator must complete the necessary prerequisites. After defining the
policy, the administrator must onboard the required namespace, associate the required tables
to the policy, and reschedule the namespace.
Consider the best practices before you create or define the policy. The following section explains the prerequisites, the policy definition, and the
post-requisites that you must complete using REST APIs so that Cloudera Lakehouse Optimizer can initiate table maintenance.
Contact your Cloudera account team to enable the Cloudera Lakehouse Optimizer service for your Cloudera Open Data
Lakehouse environment.
Use an existing Cloudera Base on premises cluster, or create a dedicated Cloudera Lakehouse Optimizer cluster.
Ensure that the cluster contains the Zookeeper, HDFS, Hive, YARN,
Spark3_on_yarn, Oozie, Hue, Livy_for_spark3, Ranger, Knox, and Meteringv2
services.
Add the Cloudera Lakehouse Optimizer service to the cluster.
If required, enable the Ranger service for Cloudera Lakehouse Optimizer, and
then create the Ranger policies that provide fine-grained access to a user or
group. For instructions, see Providing fine-grained access to namespaces using Ranger.
If required, modify the default values for Spark Executor Memory and Spark Driver Memory.
The default values are 8 GB and 4 GB respectively. The default memory settings
are usually enough for most use cases. However, for heavy workloads
you might want to increase these values. For more information, see Troubleshooting: Configuring the Spark engine to
optimize the rewrite compaction job.
To modify the spark.driver.memory and
spark.executor.memory settings, go to Cloudera Manager > Clusters > [***CLOUDERA LAKEHOUSE OPTIMIZER***] > Configuration,
and edit the conf/dlm-client.properties_role_safety_valve property.
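The following is a minimal sketch of how the two override lines for the safety valve might be generated. It assumes that the safety valve accepts standard key=value properties lines and that memory sizes use the Spark "g" suffix; the helper name render_spark_memory_overrides and the sample sizes are hypothetical and do not come from the product documentation.

```python
def render_spark_memory_overrides(executor_gb: int, driver_gb: int) -> str:
    """Return properties lines that override the Spark memory settings.

    Assumption: the safety valve takes plain key=value lines, one per row.
    """
    if executor_gb <= 0 or driver_gb <= 0:
        raise ValueError("memory sizes must be positive integers (in GB)")
    lines = [
        f"spark.executor.memory={executor_gb}g",
        f"spark.driver.memory={driver_gb}g",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Example: double the documented defaults (8 GB executor, 4 GB driver).
    print(render_spark_memory_overrides(executor_gb=16, driver_gb=8))
```

Paste the printed lines into the safety valve only after you confirm the expected format for your release.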
Optionally, configure custom YARN queues to use for Cloudera Lakehouse
Optimizer workloads. This ensures that the policy runs do not overlap with other
workloads.
For all node groups, set the
yarn.scheduler.maximum-allocation-mb and
yarn.nodemanager.resource.memory-mb advanced configuration
snippets to more than 8 GB, depending on your requirements, on the Cloudera Manager > Clusters > YARN service > Configuration tab.
By default, the dlm_admin, dlm_operator, and
dlm_monitor groups are available and are assigned to the
Cloudera Lakehouse Optimizer administrator, operator, and monitor roles
respectively. You can add one or more comma-separated groups to each of
these roles.
To view the default groups or to add more groups, perform the following steps:
Go to the Cloudera Manager > Clusters > [***CLOUDERA LAKEHOUSE OPTIMIZER***] > Configuration tab.
For the administrator role, search for the DLM Security Role Admin property.
By default, the property contains the dlm_admin group. You can append more
comma-separated groups. For example, dlm_admin, clo_admin.
For the operator role, search for the DLM Security Role Operator property.
By default, the property contains the dlm_operator group. You can
append more comma-separated groups.
For the monitor role, search for the DLM Security Role Monitor property.
By default, the property contains the dlm_monitor group. You can
append more comma-separated groups.
Cloudera Lakehouse Optimizer administrators must perform the following
actions to verify whether the REST APIs are accessible and the service is
available:
Cloudera Lakehouse Optimizer administrators must perform the following
actions to initiate the Iceberg table maintenance activity:
Onboard a namespace using the PUT
/namespaces/{namespace} API. The API instructs Cloudera Lakehouse Optimizer to include the associated tables in
the namespace for maintenance.
Associate or subscribe the tables in the onboarded namespace to the
required policy using the POST
/policies/{policyName}/tables/{tableName}/subs API. The
action modifies or appends the mapping of the Iceberg tables to the
policies in the association file.
You must run the PUT
/policies/{policyName}/tables/{tableName}/subs API for
subsequent associations. For more information, see Understanding table-policy
associations.
Reschedule the namespace to update the policies for all the tables in
an existing namespace using the PATCH
/namespaces/{namespace} API.
Optionally, perform a dry run of the policy. During a dry run, Cloudera Lakehouse Optimizer only generates the table maintenance
actions and does not initiate them. For instructions,
see Step 2 of Performing manual table maintenance.
Wait for Cloudera Lakehouse Optimizer to initiate the table maintenance based
on the CRON schedule in the policies.
Optionally, initiate manual table maintenance when required.
Ensure that you dry run the policy before you start manual
maintenance. For more information, see Performing manual table maintenance.
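The order of the onboarding, association, and rescheduling calls described above can be sketched as follows. This is an illustration only: the base URL, the sample namespace, policy, and table names, and the helper names are hypothetical. Authentication, request bodies, and query parameters are not covered in this section and are omitted. The sketch builds the requests without sending them, so you can review the method and path of each call first.

```python
from urllib.parse import quote
from urllib.request import Request

# Hypothetical base URL; replace it with the endpoint of your own deployment.
BASE_URL = "https://clo.example.com/api"


def build_request(method: str, path: str) -> Request:
    """Build (but do not send) a request for the given endpoint path."""
    return Request(BASE_URL + path, method=method)


def maintenance_requests(namespace, policy, tables, first_association=True):
    """Return the requests in the order described in this section.

    1. PUT /namespaces/{namespace} onboards the namespace.
    2. POST (first association) or PUT (subsequent associations) on
       /policies/{policyName}/tables/{tableName}/subs subscribes each table.
    3. PATCH /namespaces/{namespace} reschedules the namespace.
    """
    ns = quote(namespace, safe="")
    pol = quote(policy, safe="")
    subs_method = "POST" if first_association else "PUT"

    requests = [build_request("PUT", f"/namespaces/{ns}")]
    for table in tables:
        tbl = quote(table, safe="")
        requests.append(
            build_request(subs_method, f"/policies/{pol}/tables/{tbl}/subs")
        )
    requests.append(build_request("PATCH", f"/namespaces/{ns}"))
    return requests


if __name__ == "__main__":
    # Sample names are placeholders, not values from the product.
    for req in maintenance_requests("sales", "daily_compaction", ["orders", "returns"]):
        print(req.get_method(), req.full_url)
```

To send a request, pass it to urllib.request.urlopen after you add the authentication headers and any payload that your environment requires.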