Monitoring ML Workspaces

This topic shows you how to monitor resource usage on your ML workspaces.

Cloudera Machine Learning leverages Prometheus and Grafana to provide a dashboard that allows you to monitor how CPU, memory, storage, and other resources are being consumed by ML workspaces. Prometheus is an internal data source that is auto-populated with resource consumption data for each workspace. Grafana is a monitoring dashboard that allows you to create visualizations for resource consumption data from Prometheus.

Each ML workspace has its own Grafana dashboard.

Required Role: EnvironmentAdmin

You need the EnvironmentAdmin role to view the Workspace details page.

  1. Log in to the web interface.
  2. Click ML Workspaces.
  3. For the workspace you want to monitor, click Actions > Open Grafana.
CML provides you with several default Grafana dashboards:
  • K8s Cluster: Shows cluster health, deployments, and pods
  • K8s Containers: Shows pod info, CPU and memory usage
  • K8s Node: Shows node CPU and memory usage, disk usage, and network conditions
  • Models: Shows response times, requests per second, and CPU and memory usage for model replicas
You might choose to add new dashboards or create more panels for other metrics. For more information, see the Grafana documentation.
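As a rough illustration of adding a dashboard, the following Python sketch builds a dashboard definition with two panels and prints it as JSON. It is a minimal sketch under several assumptions, not a CML-specific procedure: the helper names build_panel and build_dashboard are hypothetical, the metric names are standard cAdvisor metrics that may or may not be present in your workspace's Prometheus, and the panels rely on the default Grafana data source. The outer structure follows the payload shape accepted by Grafana's dashboard HTTP API (POST /api/dashboards/db); check the Grafana documentation for the fields your Grafana version expects.

```python
import json


def build_panel(panel_id, title, expr, y_pos):
    """Return a minimal time-series panel backed by a single PromQL query.

    Hypothetical helper for illustration. No data source is set, so the
    panel uses whichever data source is the Grafana default.
    """
    return {
        "id": panel_id,
        "type": "timeseries",
        "title": title,
        # Full-width panel, 8 grid rows tall, stacked vertically by y_pos.
        "gridPos": {"h": 8, "w": 24, "x": 0, "y": y_pos},
        "targets": [{"refId": "A", "expr": expr}],
    }


def build_dashboard(title, panels):
    """Wrap panels in the payload shape used by POST /api/dashboards/db."""
    return {
        "dashboard": {
            "id": None,   # None asks Grafana to create a new dashboard
            "uid": None,  # None lets Grafana generate a unique identifier
            "title": title,
            "panels": panels,
        },
        "overwrite": False,  # do not replace an existing dashboard
    }


# Assumed metric names (standard cAdvisor metrics); verify that they
# exist in your workspace before relying on these queries.
CPU_EXPR = "sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)"
MEM_EXPR = "sum(container_memory_working_set_bytes) by (namespace)"

payload = build_dashboard(
    "Workspace resource usage (example)",
    [
        build_panel(1, "CPU usage by namespace", CPU_EXPR, y_pos=0),
        build_panel(2, "Memory working set by namespace", MEM_EXPR, y_pos=8),
    ],
)

print(json.dumps(payload, indent=2))
```

To try the result in the Grafana UI, paste the contents of the inner "dashboard" object into the dashboard import dialog. Building the definition in code rather than by hand makes it easier to keep the same panels consistent across several workspaces, since each ML workspace has its own Grafana dashboard.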