Cleaning up old data to improve performance in Cloudera Data Warehouse

Some tables in Hue retain data indefinitely, resulting in slower performance or application crashes. Hue does not automatically clean up data from these tables. You can configure Hue to retain the data for a specific number of days and then schedule a cron job to clean up these tables at regular intervals for improved performance.

Consider cleaning up old data from the backend Hue database if you face the following problems while using Hue:
  • Upgrade times out
  • Performance is slower than expected
  • Long time to log in to Hue
  • SQL query shows a large number of documents in tables
  • Hue crashes while trying to access saved documents
Back up your database before starting the cleanup activity. Check the saved documents, such as queries for a few users, to prevent data loss.
  1. SSH into an active Hue instance.
  2. Run the following command as the root user.
    kubectl exec -it hue-0 -n <namespace> – /bin/bash
    This command provides interactive access to the Hue pod (hue-0) within the Kubernetes cluster.
  3. Enter the Hue Database Shell.
    /opt/hive/build/env/bin/hue dbshell
    This provides access to the Hue backend database shell. You can also note the sizes of the tables you want to clean up as a reference by running the following queries:
    select count(*) from desktop_document;
    select count(*) from desktop_document2;
    select count(*) from beeswax_session;
    select count(*) from beeswax_savedquery;
    select count(*) from beeswax_queryhistory;
    After completing the queries, exit the Hue database shell to continue.
  4. Navigate to the required directory.
    cd /opt/hive/build/env/bin
    This ensures you are in the correct directory before running additional commands.
  5. Remove outdated documents.
    ./hue desktop_document_cleanup --keep-days <x>
    Removes outdated documents according to the specified retention period (x days).
  6. To confirm document cleanup, repeat step 3 to validate if the table size has reduced.
    SELECT COUNT(*) FROM desktop_document;
    SELECT COUNT(*) FROM desktop_document2;
    SELECT COUNT(*) FROM beeswax_session;
    SELECT COUNT(*) FROM beeswax_savedquery;
    SELECT COUNT(*) FROM beeswax_queryhistory;
  7. Restart the Virtual Warehouse.