Use case: Automating manual maintenance steps using Cloudera Lakehouse Optimizer UI

A data architect wants to perform data file compaction and snapshot management using Cloudera Lakehouse Optimizer when an HMS event is triggered within a namespace. The HMS events include insert, update, and delete operations on the tables in the namespace.

Understanding the use case

Use case

A namespace is heavily used, so the data architect must manually run data file compaction and snapshot management on the namespace several times, depending on the usage. These actions ensure that users querying the tables get fast responses without degradation, and that the correct snapshots are available for potential rollbacks and replication activities. However, this recurring manual activity consumes time and prevents the data architect from working on higher-value tasks.

Cloudera Lakehouse Optimizer is already deployed in the environment and the data architect can use this tool to help with this issue.
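For context, the recurring manual work described above might look like the following sketch. It assumes the tables are Apache Iceberg tables maintained through Iceberg's Spark SQL procedures (rewrite_data_files and expire_snapshots); the source does not state the table format or how the manual steps are run. The catalog name, table names, retention values, and helper functions are hypothetical and only build the SQL text.

```python
from datetime import datetime, timezone


def compaction_sql(catalog: str, table: str, target_bytes: int, partial_progress: bool) -> str:
    """Build a CALL statement for Iceberg's rewrite_data_files procedure."""
    options = (
        f"map('target-file-size-bytes', '{target_bytes}', "
        f"'partial-progress.enabled', '{str(partial_progress).lower()}')"
    )
    return f"CALL {catalog}.system.rewrite_data_files(table => '{table}', options => {options})"


def expire_snapshots_sql(catalog: str, table: str, older_than: datetime, retain_last: int) -> str:
    """Build a CALL statement for Iceberg's expire_snapshots procedure."""
    ts = older_than.strftime("%Y-%m-%d %H:%M:%S")
    return (
        f"CALL {catalog}.system.expire_snapshots(table => '{table}', "
        f"older_than => TIMESTAMP '{ts}', retain_last => {retain_last})"
    )


# Hypothetical catalog and tables; the architect would repeat this per table, per run.
CATALOG = "spark_catalog"
TABLES = ["sales.orders", "sales.invoices"]
CUTOFF = datetime(2024, 1, 1, tzinfo=timezone.utc)  # assumed retention cutoff

for table in TABLES:
    print(compaction_sql(CATALOG, table, 419_400_000, True))
    print(expire_snapshots_sql(CATALOG, table, CUTOFF, retain_last=5))
```

Running statements like these by hand for every table after every burst of updates is the repetitive effort that the policy in the following steps is meant to replace.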

Desired outcome

The data architect wants to automate the compaction and snapshot management activities for the tables so as to deliver optimum performance and value. These tables are business critical and are frequently updated, so the Cloudera Lakehouse Optimizer policy must evaluate and execute the maintenance actions alongside the table updates.

The data architect performs the following steps to automate the table maintenance actions in the Lakehouse Optimizer UI:

  1. Action 1 – Policy creation
    Creates a Cloudera Lakehouse Optimizer policy that captures all the previously manual actions as automated actions:
    1. Go to the Cloudera Management Console > Lakehouse Optimizer page.
    2. Select the Policies tab.
    3. Click Create Policy.
    4. Enter the following details on the General page:
      1. Select Namespace scope. This ensures the policy is applied to an entire namespace.
      2. Provide a unique name, Critical_Snapshot_and_Compaction. The name indicates the tables that the policy applies to and the actions that it runs.
      3. Enter the description for the policy as Event-based compaction and snapshot management actions for frequently updated, business critical tables. This description helps other Cloudera Lakehouse Optimizer users to understand the policy's goal.
      4. Review the details, and click Next.
    5. Enter the following details on the Associations page:
      1. Select the namespaces to subscribe to the policy.
      2. Review the selected tables and namespaces, and click Next.
    6. Enter the following details on the Policy Actions page:
      1. Select the Cloudera Adaptive Policy as the policy template.
      2. Select Event Based as the table maintenance schedule because the policy must evaluate based on the table events.
      3. Select Compaction in the Automated Actions section.
        1. Change the Target File Size value to 419400000 bytes, which is approximately 400 MB.
        2. Enable Partial Progress.
      4. Select ExpireSnapshot and perform the following steps:
        1. Choose the default value for Expire Older Than.
        2. Choose the default value for Retain Last.
        3. Enable Clean Expired Files.
      5. Click Next.
    7. Review the details on the Review page, and click Create Policy.
    The policy is created successfully.
  2. Action 2 – Review policy implementation
    1. To ensure that the policy has been applied as intended, the data architect performs the following steps:
      1. Ensures the “Critical_Snapshot_and_Compaction” policy appears on the Policies tab.
      2. Reviews the actions, namespaces, and the tables that the policy is associated with on the Policy Details page.
    2. To ensure that all of the intended tables have the policy applied as intended, the data architect performs the following steps:
      1. Confirms that all the required tables and namespaces are subscribed to the policy on the Tables tab.
      2. Confirms that the policy appears on the side panel when the table name is clicked on the Tables tab.
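
The settings chosen in the steps above can be summarized as a small data structure, which may help when reviewing the policy or documenting it for other users. This is a minimal sketch only: the class names, field names, and the review helper are hypothetical and do not represent a Cloudera Lakehouse Optimizer API or export format. The values come from the steps above, and None stands for "UI default kept".

```python
from dataclasses import dataclass, field
from typing import List, Optional

MIB = 1024 * 1024  # bytes per binary megabyte


@dataclass(frozen=True)
class CompactionAction:
    target_file_size_bytes: int = 419_400_000  # value entered in the UI
    partial_progress: bool = True


@dataclass(frozen=True)
class ExpireSnapshotAction:
    expire_older_than: Optional[str] = None  # None = default kept in the UI
    retain_last: Optional[int] = None        # None = default kept in the UI
    clean_expired_files: bool = True


@dataclass(frozen=True)
class PolicySummary:
    name: str
    scope: str
    template: str
    schedule: str
    compaction: CompactionAction = field(default_factory=CompactionAction)
    expire_snapshot: ExpireSnapshotAction = field(default_factory=ExpireSnapshotAction)


def review(policy: PolicySummary) -> List[str]:
    """Return a list of mismatches against the walkthrough; empty means consistent."""
    problems = []
    if policy.scope != "Namespace":
        problems.append("scope should be Namespace")
    if policy.schedule != "Event Based":
        problems.append("schedule should be Event Based")
    if round(policy.compaction.target_file_size_bytes / MIB) != 400:
        problems.append("target file size should be about 400 MB")
    if not policy.expire_snapshot.clean_expired_files:
        problems.append("Clean Expired Files should be enabled")
    return problems


policy = PolicySummary(
    name="Critical_Snapshot_and_Compaction",
    scope="Namespace",
    template="Cloudera Adaptive Policy",
    schedule="Event Based",
)
print(review(policy))
```

The size check divides by 1024 * 1024 because 419400000 bytes rounds to 400 only when a megabyte is counted as 1,048,576 bytes.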