Managing Metadata Storage with Purge

The volume of metadata maintained by Navigator Metadata Server can grow quickly and exceed the capacity of the Solr instance that processes the index, which can slow search and increase the time needed to display data lineage. In addition, stale metadata may show relationships that no longer exist, and lineage may take longer to display than necessary because the system processes extraneous details.

Cloudera Navigator's Purge function removes metadata for files that have been deleted or that are older than the specified timeframe. The result is faster search and more precise (up-to-date) lineage diagrams.

In addition, using Purge before upgrading Cloudera Navigator to a new release can speed up the upgrade process and reduce the chance of out-of-memory errors.

The Purge function can be used in a few different ways:

Scheduling the Purge Process

Use the Cloudera Navigator console to configure a schedule for a regular weekly Purge of deleted and stale metadata from the Navigator Metadata Server and its associated database.

To configure the Purge schedule:
  1. Log in to the Cloudera Navigator console using an account that has either Cloudera Manager Full Administrator or Navigator Administrator privileges. To access the console directly (rather than from within Cloudera Manager), use the default port (7187) on the host running the Navigator Metadata Server role, for example:
    http://fqdn-1.example.com:7187/login.html
  2. Enter your administrator user account and password at the login page.
  3. Click the Purge Settings tab. The current Metadata and Lineage purge schedule displays, along with up to five upcoming scheduled purges and up to five of the most recently completed purges.
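
The list of upcoming purges follows directly from the weekly schedule. As a rough illustration only, the sketch below computes the next five run times from a day of week and an hour. The helper `upcoming_purges` is hypothetical and is not part of Cloudera Navigator; the defaults (Saturday, 12 Midnight) are taken from the settings table, and the handling of a run time that has already passed today is an assumption.

```python
from datetime import datetime, timedelta

# Order matches datetime.weekday(), where Monday is 0.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def upcoming_purges(now, day="Saturday", hour=0, count=5):
    """Return the next `count` weekly purge start times after `now`.

    Hypothetical helper for illustration; Navigator computes its own schedule.
    """
    target = WEEKDAYS.index(day)
    days_ahead = (target - now.weekday()) % 7
    first = (now + timedelta(days=days_ahead)).replace(
        hour=hour, minute=0, second=0, microsecond=0)
    # Assumption: if this week's slot has already started, use next week's.
    if first <= now:
        first += timedelta(weeks=1)
    return [first + timedelta(weeks=i) for i in range(count)]


if __name__ == "__main__":
    # Example "current time": Wednesday, 12 June 2024, 09:30.
    for run in upcoming_purges(datetime(2024, 6, 12, 9, 30)):
        print(run.strftime("%A %Y-%m-%d %H:%M"))
```

With the default day and time, each entry falls on a Saturday at midnight, one week apart.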

To change the existing schedule:
  1. Click the Edit button.
  2. Set the day, time, maximum purge duration, and retention period for deleted entities (Purge entities deleted more than*) to values that suit your environment. See the descriptions and usage notes for these settings in the table below.

    Property: How often
    Default: Weekly
    Selectable values and usage note: Weekly. Not configurable. The Purge runs weekly per your specifications for Day and Time. It is enabled by default.

    Property: Day
    Default: Saturday
    Selectable values and usage note: Days of week, Sunday through Saturday. Select a day for the purge that will have minimal impact on your user community.

    Property: Time
    Default: 12 Midnight
    Selectable values and usage note: Hourly time, from 12 Midnight through 11 PM. Select a time that will have minimal impact on production.

    Property: Maximum purge duration
    Default: 12 hours
    Selectable values and usage note: 10 minutes, 1 hour through 10 hours, 12 hours, 14 hours, 16 hours, 18 hours, 20 hours, 22 hours, 24 hours, 36 hours, 48 hours, 3 days through 7 days, inclusive. Set the amount of time you want to allow for the Purge process to run. The process will not run beyond your specified duration even if it has not completed the purge. Entities purged to that point remain purged. No other Cloudera Navigator operations can occur during the Purge process.

    Property: Purge HDFS entities deleted more than*
    Default: 60 days
    Selectable values and usage note: Select 1 day through 10 days, 20 days through 100 days (in 10-day increments), 150 days, 365 days. This is the number of days that must elapse after an entity is deleted before the purge process removes it. For example, a setting of 1 day purges entities deleted before yesterday but retains entities deleted yesterday.

    Property: Purge SELECT operations*
    Default: Enabled
    Selectable values and usage note: When enabled, Hive and Impala SELECT operations older than the number of days specified in Only Purge SELECT operations older than are purged.

    Property: Purge operations older than*
    Default: 60 days
    Selectable values and usage note: Select 10 days through 100 days (in 10-day increments), 150 days, 365 days. YARN, Sqoop, and Pig operations older than the specified number of days are purged. If Purge SELECT operations is enabled, Hive and Impala SELECT operations older than the specified number of days are also purged.

  3. If your system processes Hive and Impala queries, you can have these purged on a regular basis as well. Set appropriate thresholds for your use cases.
  4. Click Save when finished.
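
The retention rule for deleted entities can be checked with simple date arithmetic. The sketch below is illustrative only and is not Navigator's implementation: `is_purge_eligible` is a hypothetical helper, and it assumes the comparison is made on whole calendar days, which is consistent with the 1-day example given in the table.

```python
from datetime import date, timedelta


def is_purge_eligible(deleted_on, purge_date, retention_days):
    """Return True if the entity was deleted more than `retention_days` ago.

    Hypothetical helper; assumes whole calendar days are compared.
    """
    return (purge_date - deleted_on).days > retention_days


if __name__ == "__main__":
    purge_date = date(2024, 6, 15)  # arbitrary purge day for the example
    yesterday = purge_date - timedelta(days=1)
    day_before = purge_date - timedelta(days=2)

    # With a 1-day setting, entities deleted yesterday are retained...
    print(is_purge_eligible(yesterday, purge_date, 1))   # False
    # ...and entities deleted before yesterday are purged.
    print(is_purge_eligible(day_before, purge_date, 1))  # True
```

Under the same assumption, the default setting of 60 days would retain an entity deleted exactly 60 days before the purge and remove one deleted 61 days before it.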

Here is an example of a revised schedule: