Monitoring/Reducing CDSW disk space

If the /var/lib/cdsw disk on the CDSW master node fills up to 80% or more, CDSW starts to behave erratically: sessions fail to start, Kerberos tickets stop working, projects cannot be created, and pods crash. There are a number of maintenance chores that Admin teams should perform monthly on CDSW to keep this disk space under control.
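
Before and after any cleanup, it is worth checking where the disk actually stands. The sketch below is a minimal example (not part of CDSW itself) that reports usage of the /var/lib/cdsw mount and warns once it crosses the 80% mark described above; it can be run ad hoc on the master node or dropped into cron.

{{
#!/bin/bash
# Minimal usage check for the /var/lib/cdsw mount on the master node.
# The 80% figure comes from the threshold described above; adjust as needed.
THRESHOLD=80
USAGE=$(df --output=pcent /var/lib/cdsw | tail -1 | tr -dc '0-9')
if [ "$USAGE" -ge "$THRESHOLD" ]; then
  echo "WARNING: /var/lib/cdsw is at ${USAGE}% - time to run the cleanup steps below"
else
  echo "/var/lib/cdsw is at ${USAGE}%"
fi
}}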

Possible symptoms that CDSW storage is filling up include but are not limited to:
  • unable to start sessions
  • sessions randomly get killed with "Engine exited with status 1"
  • unable to build models
  • unable to "kinit" through Hadoop Authentication
  • the UI is slow

Any of these symptoms go away temporarily when CDSW is restarted. This is a strong hint that the problem is storage space, since restarting CDSW frees up a lot of temporary space.

Remove Deleted Orphan Projects

When projects are deleted from the CDSW UI, their files are left on disk. These projects cannot be recovered in the CDSW UI, since they have been removed from the database. Because nobody can view these files any more, they are completely safe to delete. The following command lists the directories that can be safely removed:

{{
# Walk every project directory ID found on disk and check whether it still
# exists in the CDSW metadata database (the "sense" database on the db pod).
for project_id in $(ls /var/lib/cdsw/current/projects/projects/*/)
do
  echo "Processing $project_id"
  # "0 rows" from psql means the project has been deleted from the UI.
  rows=$(kubectl exec $(kubectl get pods -l role=db --no-headers=true -o custom-columns=NAME:.metadata.name) -ti -- psql -U sense -P pager=off -c "SELECT * FROM projects WHERE ID = $project_id" | grep -c '0 row')
  if [ "$rows" -gt 0 ]
  then
    echo -n "Project with ID $project_id has been deleted, you can archive its directory: " $(ls -d /var/lib/cdsw/current/projects/projects/*/$project_id); echo
  fi
done
}}

This prints out all of the projects that can safely be removed from the disk. These projects have already been deleted from the UI, so there is no risk of collaborators losing access to the files.

After you run this command and clean up these orphaned projects, check the disk usage again to see if you are back below 70%.
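
If you prefer to archive rather than delete outright, the following is a rough sketch of one way to handle a single orphaned directory reported by the script above. The project ID (1234) and the /backup destination are placeholders; pick a destination that is not on the /var/lib/cdsw mount, or removing the original will not actually reclaim space.

{{
# Hypothetical example: archive one orphaned project directory, then remove it.
# "1234" and "/backup" are placeholders; use the path printed by the script above
# and an archive location outside of /var/lib/cdsw.
ORPHAN=/var/lib/cdsw/current/projects/projects/0/1234
tar -czf /backup/cdsw-orphan-project-1234.tar.gz -C "$(dirname "$ORPHAN")" "$(basename "$ORPHAN")"
rm -rf "$ORPHAN"
}}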

Identify Huge Projects on Disk

If you still need to free up disk space, the next best option is to look through the file system and find huge projects on disk. Projects can grow huge for a number of reasons, and it is important to understand why so that the Admin team can educate end users and prevent it from happening again. The biggest reasons for projects growing to such sizes are:
  • Users are doing ML training on huge training data sets, and instead of using the data in HDFS, they pull massive training sets into their projects.

    Resolution: Pull the data from Hadoop, run your training, and then delete the data from the project. Alternatively, train the algorithms on the CDH cluster via Spark or some other method.

  • Users have created ML jobs that run on a scheduled basis and these generate output artifacts. After enough time, this can grow to substantial sizes.

    Resolution: Add another job that deletes the old output data (see the sketch after this list).
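
As an illustration of that second resolution, here is a minimal sketch of a cleanup job that could be scheduled in CDSW alongside the job producing the artifacts. The output/ directory and the 30-day retention are assumptions, not anything CDSW prescribes; point it at wherever your scheduled jobs actually write their artifacts.

{{
#!/bin/bash
# Hypothetical scheduled cleanup job: delete job output artifacts older than
# 30 days. Both the directory (output/ under the project root, mounted at
# /home/cdsw in the engine) and the retention window are placeholders.
RETENTION_DAYS=30
ARTIFACT_DIR="/home/cdsw/output"
find "$ARTIFACT_DIR" -type f -mtime +"$RETENTION_DAYS" -print -delete
}}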

You can use the following script to list the projects in the system, along with the project name, user name, email, and size. The output is CSV, so it can be loaded into Excel and used, for example, to email the top ten users and ask whether they can reduce the size of their projects.
echo "Project Path, Size, Project Name, User name, email"
for project_id in `du -hx --max-depth=2 /var/lib/cdsw/current/projects/projects/ | sort -hr | awk 'NR>2 {print $2}'`
do
 echo -n $project_id ","
 fileNameWithFullPath="${project_id%/}";
 pid="${fileNameWithFullPath##*/}" 
 rows=`kubectl exec $(kubectl get pods -l role=db --no-headers=true -o custom-columns=NAME:.metadata.name) -ti -- psql -P pager=off -U sense -c "SELECT * FROM projects WHERE ID = $pid" | grep '0 row' | wc -l`
 if [ $rows -gt 0 ]
 then
 echo -n "Project with ID $pid has been deleted, you can archive its directory: " `ls -d /var/lib/cdsw/current/projects/projects/*/$pid`;echo
 else
 size=`du -sh /var/lib/cdsw/current/projects/projects/0/$pid | awk '{print $1}'`
 echo -n "$size,"
 user_data=`kubectl exec $(kubectl get pods -l role=db --no-headers=true -o custom-columns=NAME:.metadata.name) -ti -- psql -U sense -t -A -c "select p.name,u.name, u.email from projects p, users u where p.user_id=u.id and p.id=$pid;" | tr '|' ','`;
 echo "$user_data"
 fi
done

You should put this inside a file, make it executable (for example, chmod +x get-large-projects.sh), and run it like get-large-projects.sh > large-projects.csv
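
To pull out the ten largest projects from that CSV, something like the one-liner below works, assuming the file name used above. The lines the script prints for already-deleted projects are not valid CSV rows, so they are filtered out first; the sort relies on the human-readable sizes produced by du -sh.

{{
# Hypothetical follow-up: show the ten largest projects from the generated CSV.
grep -v 'has been deleted' large-projects.csv | sort -t, -k2,2 -hr | head -10
}}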

This can make a significant difference. In practice, the top ten projects can take up 30-40% of the free space.