Monitoring/Reducing CDSW disk space
If the /var/lib/cdsw disk on the CDSW master node fills up to 80% or more, CDSW starts to behave very poorly. Common symptoms include:
- unable to start sessions
- sessions randomly killed with "Engine exited with status 1"
- unable to build models
- unable to create projects
- unable to "kinit" through Hadoop Authentication (Kerberos tickets not working)
- pods crashing
- the UI is slow
There are a number of maintenance chores that admin teams should perform monthly on CDSW to reclaim this disk space.
Any of these symptoms "go away" temporarily when CDSW is restarted. This is a strong indication that the problem is disk space, since restarting CDSW frees up a large amount of temporary storage.
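To catch the problem before users hit these symptoms, it helps to watch the disk usage proactively. Below is a minimal sketch of a check that could be run from cron on the master node; the 80% threshold, the mail command, and the recipient address are assumptions to adapt to your environment.
{{
#!/bin/bash
# Warn when /var/lib/cdsw usage crosses a threshold (80% here, matching
# the point at which CDSW starts misbehaving).
THRESHOLD=80
USAGE=$(df --output=pcent /var/lib/cdsw | tail -1 | tr -dc '0-9')
if [ "$USAGE" -ge "$THRESHOLD" ]; then
    # "mail" and the recipient are placeholders; swap in whatever
    # alerting mechanism your team already uses.
    echo "/var/lib/cdsw is at ${USAGE}% on $(hostname)" \
        | mail -s "CDSW disk space warning" cdsw-admins@example.com
fi
}}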
Remove Deleted Orphan Projects
When projects are deleted from the CDSW UI, their files are left on disk. These projects cannot be recovered through the UI, since they have already been removed from the database; because nobody can view these files any more, they are completely safe to delete. The following script lists the directories that can be safely removed:
{{
#!/bin/bash
# Walk every project directory on disk and check whether a matching row
# still exists in the "projects" table of the CDSW database.
for project_id in `ls /var/lib/cdsw/current/projects/projects/*/`
do
    echo "Processing $project_id"
    # Query the database pod; "0 rows" means the project no longer exists in CDSW.
    rows=`kubectl exec $(kubectl get pods -l role=db --no-headers=true -o custom-columns=NAME:.metadata.name) -ti -- psql -U sense -P pager=off -c "SELECT * FROM projects WHERE ID = $project_id" | grep '0 row' | wc -l`
    if [ $rows -gt 0 ]
    then
        echo -n "Project with ID $project_id has been deleted, you can archive its directory: " `ls -d /var/lib/cdsw/current/projects/projects/*/$project_id`;echo
    fi
done
}}
This prints out all of the project directories that can safely be removed from disk. These projects have already been deleted in the UI, so there is no risk of taking files away from collaborators.
After you run this script and clean up the orphaned projects, check the disk usage again to see whether you are back below 70%.
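As an illustration only (the project ID 123 and the /backup path below are made up), archiving an orphaned directory before removing it could look like this; verify the tarball before deleting anything.
{{
# Hypothetical example: project ID 123 was reported as deleted by the
# script above and lives under the "0" subdirectory. Adjust the paths
# to match what the script actually printed.
tar -czf /backup/cdsw-project-123.tar.gz -C /var/lib/cdsw/current/projects/projects/0 123
# Only after confirming the archive is readable:
rm -rf /var/lib/cdsw/current/projects/projects/0/123
}}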
Identify Huge Projects on Disk
There are two common reasons for very large projects:
- Users are training ML models on huge data sets and, instead of reading the data from HDFS, they pull massive training sets into their projects.
Resolution: Pull the data from Hadoop, run the training, then delete the local copy. Alternatively, train the models on the CDH cluster via Spark or some other method.
- Users have created ML jobs that run on a schedule and generate output artifacts. Over time these can grow to substantial sizes.
Resolution: Add another scheduled job that deletes the old artifacts (a minimal sketch follows this list).
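As a minimal sketch of such a cleanup job (the output/ directory name and the 30-day retention are assumptions), a scheduled job in the same project could run something like:
{{
#!/bin/bash
# Inside a CDSW session or job the project is mounted at /home/cdsw.
# Delete generated artifacts older than 30 days from an assumed
# "output" directory; adjust the path and retention to your project.
find /home/cdsw/output -type f -mtime +30 -print -delete
}}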
echo "Project Path, Size, Project Name, User name, email"
for project_id in `du -hx --max-depth=2 /var/lib/cdsw/current/projects/projects/ | sort -hr | awk 'NR>2 {print $2}'`
do
echo -n $project_id ","
fileNameWithFullPath="${project_id%/}";
pid="${fileNameWithFullPath##*/}"
rows=`kubectl exec $(kubectl get pods -l role=db --no-headers=true -o custom-columns=NAME:.metadata.name) -ti -- psql -P pager=off -U sense -c "SELECT * FROM projects WHERE ID = $pid" | grep '0 row' | wc -l`
if [ $rows -gt 0 ]
then
echo -n "Project with ID $pid has been deleted, you can archive its directory: " `ls -d /var/lib/cdsw/current/projects/projects/*/$pid`;echo
else
size=`du -sh /var/lib/cdsw/current/projects/projects/0/$pid | awk '{print $1}'`
echo -n "$size,"
user_data=`kubectl exec $(kubectl get pods -l role=db --no-headers=true -o custom-columns=NAME:.metadata.name) -ti -- psql -U sense -t -A -c "select p.name,u.name, u.email from projects p, users u where p.user_id=u.id and p.id=$pid;" | tr '|' ','`;
echo "$user_data"
fi
done
Save this script to a file (for example get-large-projects.sh), make it executable with chmod +x, and run it like: ./get-large-projects.sh > large-projects.csv
Cleaning up the largest projects can make a significant difference: in practice, the top ten projects can take up 30-40% of the free space.
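As a quick follow-up (assuming the large-projects.csv produced above), you can pull out the ten largest projects that still exist in CDSW along with their owners; rows for already-deleted projects and the header contain no "@", so filtering on it skips them.
{{
# Show the ten largest live projects with owner and email.
# The script already emits rows largest-first, so head is enough.
grep '@' large-projects.csv | head -10
}}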