Monitoring ML Workspaces

This topic shows you how to monitor resource usage on your ML workspaces.

Cloudera Machine Learning leverages Prometheus and Grafana to provide a dashboard that allows you to monitor how CPU, memory, storage, and other resources are being consumed by ML workspaces. Prometheus is an internal data source that is auto-populated with resource consumption data for each workspace. Grafana is a monitoring dashboard that allows you to create visualizations for resource consumption data from Prometheus.

Each ML workspace has its own Grafana dashboard.

Required Role: MLAdmin

You need the MLAdmin role to view the Workspace details page.

Log in to the CDP web interface.
Click ML Workspaces.
For the workspace you want to monitor, click Actions > Open Grafana.

CML provides you with several default Grafana dashboards:

K8s Cluster: Shows cluster health, deployments, and pods
K8s Containers: Shows pod info, cpu and memory usage
K8s Node: Shows node cpu and memory usage, disk usage and network conditions
Models: Shows response times, requests per second, cpu and memory usage for model replicas.

You might choose to add new dashboards or create more panels for other metrics. For more information, see the Grafana documentation.

We want your opinion

How can we improve this page?

What kind of feedback do you have?