Cleaning up old data to improve performance

Some tables in Hue retain data indefinitely, which can slow performance or cause the application to crash. Hue does not clean up these tables automatically. You can configure Hue to retain the data for a specific number of days and schedule a cron job to clean up the tables at regular intervals.

Consider cleaning up old data from the backend Hue database if you face the following problems while using Hue:
  • Upgrade times out
  • Performance is slower than expected
  • Logging in to Hue takes a long time
  • A SQL query shows a large number of documents in the tables
  • Hue crashes while trying to access saved documents
Back up your database before starting the cleanup. Check the saved documents, such as Queries and Workflows, for a few users to prevent data loss. You can also note the sizes of the tables that you want to clean up, for reference, by running the following query:
select count(*) from desktop_document;
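The row count also gives a rough idea of how long the first cleanup run may take, based on the figure of about 1 minute per 1000 entries quoted in step 3. The following is a minimal sketch of that arithmetic; estimate_minutes is a hypothetical helper and is not part of hue_scripts, and the row count used is only an example value.

```shell
#!/usr/bin/env bash
# Rough estimate only: assumes about 1 minute per 1000 rows per table.
# estimate_minutes is a hypothetical helper, not part of hue_scripts.
estimate_minutes() {
  # Divide the row count by 1000, rounding up to the next whole minute.
  echo $(( ($1 + 999) / 1000 ))
}

# Example: a desktop_document table with 250000 rows (illustrative value).
estimate_minutes 250000   # prints 250
```

Actual run times depend on the database and host, so treat the result as an order-of-magnitude guide when planning a maintenance window.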
  1. Sign in to a Hue instance that is active. The hue_scripts script needs an active Hue configuration to run.
  2. Download the hue_scripts file by any of the following methods:
    git clone https://github.com/<git-username>/hue_scripts.git /opt/cloudera/hue_scripts
    or
    wget -q -O /tmp/hue_scripts.zip https://github.com/<git-username>/hue_scripts/archive/master.zip && unzip -d /tmp /tmp/hue_scripts.zip
    mv /tmp/hue_scripts-master /opt/cloudera/hue_scripts
  3. Run the script as the root user:
    DESKTOP_DEBUG=True /opt/cloudera/hue_scripts/script_runner hue_desktop_document_cleanup --keep-days 30
    The logs are displayed on the console because DESKTOP_DEBUG is set to True. Alternatively, you can view the logs from the following location:
    /var/log/hue/hue_desktop_document_cleanup.log
    The first run typically takes about 1 minute per 1000 entries in each table.
  4. Check whether the table size has decreased by running the following query:
    select count(*) from desktop_document;
    If the hue_scripts script has run successfully, the table size should decrease, and you can now set up a cron job for scheduled cleanups.
  5. Copy the wrapper script for cron by running the following command:
    cp /opt/cloudera/hue_scripts/hue_history_cron.sh /etc/cron.daily
  6. Specify the retention period with the --keep-days option in the hue_history_cron.sh file, as shown in the following example:
    ${SCRIPT_DIR}/script_runner hue_desktop_document_cleanup --keep-days 120
    In this case, the data will be retained in the tables for 120 days.
  7. Change the permissions on the script so that only the root user can run it:
    chmod 700 /etc/cron.daily/hue_history_cron.sh
Setting up a cron job that runs at regular intervals automates the database cleanup. For example, a cron job that runs daily can purge data older than a specified number of days.
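As an illustration of what such a wrapper might contain, here is a minimal sketch. It is not the hue_history_cron.sh file shipped with hue_scripts, whose contents may differ. The cleanup_cmd function and the KEEP_DAYS and DRY_RUN variables are assumptions made for this example; only the script_runner command line comes from the steps above.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; the hue_history_cron.sh shipped with hue_scripts may differ.
SCRIPT_DIR="${SCRIPT_DIR:-/opt/cloudera/hue_scripts}"
KEEP_DAYS="${KEEP_DAYS:-120}"   # retention period in days (assumed variable)

# Print the cleanup command for a given retention period, or fail if the
# value is not a positive whole number.
cleanup_cmd() {
  case "$1" in
    ''|*[!0-9]*) echo "keep-days must be a positive integer" >&2; return 1 ;;
  esac
  if [ "$1" -le 0 ]; then
    echo "keep-days must be a positive integer" >&2
    return 1
  fi
  echo "${SCRIPT_DIR}/script_runner hue_desktop_document_cleanup --keep-days $1"
}

# DRY_RUN=1 (the default in this sketch) only prints the command.
if [ "${DRY_RUN:-1}" = "1" ]; then
  cleanup_cmd "$KEEP_DAYS"
else
  # Validate first, then run the real cleanup.
  cleanup_cmd "$KEEP_DAYS" >/dev/null &&
    "${SCRIPT_DIR}/script_runner" hue_desktop_document_cleanup --keep-days "$KEEP_DAYS"
fi
```

Defaulting to a dry run lets you confirm the retention value before any rows are deleted; in this sketch, setting DRY_RUN=0 runs the actual cleanup.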