Cleaning up old queries, DAG information, and reports data in a CDP Private Cloud Base deployment using Ambari

The DAS Postgres database stores all the queries that you run from the DAS UI or beeline, and all the data that is used to generate the DAG information and reports. Over a period of time, this can grow in size. To optimize the available capacity, DAS has a cleanup mechanism that, by default, purges all the queries and DAG information older than 30 days and purges old reports after 365 days. However, you can customize the cleanup frequency by adding the cleanup.query-info.interval, and cleanup.report-info.interval configurations, and the Cron expression: cleanup.cron.expression in the das-event-processor.json file from Ambari.

  1. From the Ambari UI, go to Data Analytics Studio > CONFIGS > Advanced data_analytics_studio-event_processor-properties.
  2. To customize the cleanup intervals, under Data Analytics Studio Event Processor config file template, add the three new configurations in the event-processing section as shown in the following example:
    "event-processing": {
            "hive.hook.proto.base-directory": "{{data_analytics_studio_event_processor_hive_base_dir}}",
            "tez.history.logging.proto-base-dir": "{{data_analytics_studio_event_processor_tez_base_dir}}",
            "meta.info.sync.service.delay.millis": 5000,
            "actor.initialization.delay.millis": 20000,
            "close.folder.delay.millis": 600000,
            "reread.event.max.retries": -1,
            "reporting.scheduler.initial.delay.millis": 30000,
            "reporting.scheduler.interval.delay.millis": 300000,
            "reporting.scheduler.weekly.initial.delay.millis": 60000,
            "reporting.scheduler.weekly.interval.delay.millis": 600000,
            "reporting.scheduler.monthly.initial.delay.millis": 90000,
            "reporting.scheduler.monthly.interval.delay.millis": 900000,
            "reporting.scheduler.quarterly.initial.delay.millis": 120000,
            "reporting.scheduler.quarterly.interval.delay.millis": 1200000,
            “cleanup.query-info.interval”: 2592000,
            “cleanup.report-info.interval”: 31536000,
            “cleanup.cron.expression: "0 0 2 * * ?"
      },
    In this example, the query data will be cleaned up after 2592000 seconds (which is equal to 30 days), the report data will be cleaned up after 31536000 seconds (which is equal to 365 days), and the clean up jobs will be triggered to run at 02:00:00 hours (or 2 AM), as per the server timezone.
  3. Click Save.
  4. Restart all the required services.