A data architect wants to perform data file compaction and snapshot management using
Cloudera Lakehouse Optimizer whenever a Hive Metastore (HMS) event is triggered within a namespace.
The HMS events include insert, update, and delete operations on the tables in the
namespace.
Understanding the use case
Use case
A namespace is heavily used, so the data architect must manually run
data file compaction and snapshot management on the namespace multiple times,
depending on the usage. These actions ensure that users querying the tables get
fast responses without degradation, and that the correct snapshots are available
for potential rollbacks and replication activities. However, this recurring manual
activity consumes time and keeps the data architect from working on
higher-value tasks.
Cloudera Lakehouse Optimizer is already deployed in the environment
and the data architect can use this tool to help with this issue.
Desired outcome
The data architect wants to automate the compaction and snapshot management
activities for the tables so as to deliver optimum performance and value. These
tables are business critical and are frequently updated, so the Cloudera Lakehouse Optimizer policy must evaluate and execute the maintenance
actions alongside the table updates.
The data architect performs the following steps to automate the table maintenance
actions in the Lakehouse Optimizer UI:
-
Action 1 – Policy creation
Creates a Cloudera Lakehouse Optimizer policy that captures all the
previously manual actions as automated actions:
-
Go to the page.
-
Select the Policies tab.
-
Click Create Policy.
-
Enter the following details on the General
page:
- Select Namespace scope. This
ensures the policy is applied to an entire namespace.
- Provide a unique name, for example
Critical_Snapshot_and_Compaction. The
name indicates which tables the policy applies to
and which actions it runs.
- Enter the description for the policy as
Event-based compaction and snapshot management
actions for frequently updated, business critical
tables. This description helps other Cloudera Lakehouse Optimizer users to understand the
policy's goal.
- Review the details, and click
Next.
-
Enter the following details on the Associations
page:
- Select the namespaces to subscribe to the policy.
- Review the selected tables and namespaces, and click
Next.
-
Enter the following details on the Policy Actions
page:
- Select the Cloudera Adaptive
Policy as the policy template.
- Select Event Based as the table
maintenance schedule because the policy must be evaluated in
response to table events.
- Select Compaction in the
Automated Actions section.
- Change the Target File
Size value to
419400000, which is approximately 400 MB expressed in
bytes.
- Enable Partial
Progress.
- Select
ExpireSnapshot, and
perform the following steps:
- Choose the default value for Expire
Older Than.
- Choose the default value for Retain
Last.
- Enable Clean Expired
Files.
- Click Next.
-
Review the details on the Review page, and click
Create Policy.
The policy is created successfully.
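The choices made in the wizard above can be written down as a plain record, which also makes it easy to sanity-check the byte value entered for Target File Size. The following Python sketch is illustrative only: the dictionary keys and the `to_bytes` helper are hypothetical names, not part of any Cloudera Lakehouse Optimizer API or file format, and the snapshot defaults are recorded as the string "default" because this walkthrough does not state their values.

```python
# Illustrative record of the wizard choices. Key names are hypothetical;
# this is not a Cloudera Lakehouse Optimizer API or configuration format.

MIB = 1024 * 1024  # bytes in one binary megabyte (MiB)


def to_bytes(mebibytes: int) -> int:
    """Convert a size in binary megabytes (MiB) to bytes."""
    return mebibytes * MIB


policy = {
    "scope": "Namespace",
    "name": "Critical_Snapshot_and_Compaction",
    "template": "Cloudera Adaptive Policy",
    "schedule": "Event Based",
    "compaction": {
        "target_file_size_bytes": 419400000,  # value entered in the wizard
        "partial_progress": True,
    },
    "expire_snapshot": {
        "expire_older_than": "default",  # default value not stated here
        "retain_last": "default",        # default value not stated here
        "clean_expired_files": True,
    },
}

exact_400_mib = to_bytes(400)
entered = policy["compaction"]["target_file_size_bytes"]

print(exact_400_mib)            # 419430400
print(exact_400_mib - entered)  # 30400
```

The entered value is 30,400 bytes below an exact 400 MiB, so it reads as a rounded form of that figure; a difference of this size is unlikely to matter for a compaction target.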
-
Action 2 – Review policy implementation
-
To ensure that the policy has been applied as intended, the data
architect performs the following steps:
- Ensures the “Critical_Snapshot_and_Compaction” policy appears on
the Policies tab.
- Reviews the actions, namespaces, and the tables that the policy
is associated with on the Policy Details
page.
-
To ensure that all of the intended tables have the policy applied as
intended, the data architect performs the following steps:
- Confirms on the Tables tab that all the required
tables and namespaces are subscribed to the
policy.
- Confirms that the policy appears on the side panel
that opens when a table name is clicked on the Tables
tab.
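The review in Action 2 amounts to checking that every intended table lists the policy. A minimal sketch of that check follows, assuming the table-to-policy associations have been copied by hand from the Tables tab into a dictionary. The table names and the `missing_policy` helper are hypothetical and do not come from Cloudera Lakehouse Optimizer.

```python
from typing import Dict, List, Set


def missing_policy(subscriptions: Dict[str, Set[str]], policy: str) -> List[str]:
    """Return the tables, sorted by name, that do not list the given policy."""
    return sorted(
        table for table, policies in subscriptions.items()
        if policy not in policies
    )


# Hypothetical associations, as they might be read off the Tables tab.
observed = {
    "sales.orders": {"Critical_Snapshot_and_Compaction"},
    "sales.returns": {"Critical_Snapshot_and_Compaction"},
    "sales.audit_log": set(),
}

print(missing_policy(observed, "Critical_Snapshot_and_Compaction"))
# ['sales.audit_log']
```

An empty result means every listed table is subscribed; any table returned needs to be added on the Associations page of the policy.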