Monitoring with Prometheus

Learn about the example files Cloudera provides related to Prometheus monitoring. Additionally, learn about recommended metrics, alerts, and the Kafka Exporter.

Cloudera provides various example files related to the setup and configuration of Prometheus monitoring. These example files configure Kafka and other cluster components to expose recommended metrics. Additionally, they set up a Prometheus instance that scrapes exposed metrics and publishes recommended alerts.

The example files are hosted on the Cloudera Archive. They are located in /csm-operator/1.0/examples/metrics/.

The example files related to Prometheus are as follows.

  • kafka-metrics.yaml – A configuration file that includes Kafka, KafkaNodePool, and ConfigMap resource examples. You can use this configuration example to deploy a Kafka cluster that exposes the recommended metrics that Prometheus can scrape.

    If you already have a cluster and want to configure your existing Kafka deployment to expose metrics, review the ConfigMap manifest in this example file. The ConfigMap specifies which Kafka and ZooKeeper metrics are exposed. For comprehensive steps on how to configure Kafka and ZooKeeper to expose Prometheus-compatible metrics, see Configuring Kafka for Prometheus monitoring.

  • /prometheus-install/
    • alertmanager.yaml – An example AlertManager resource for deploying and configuring the Prometheus Alertmanager.
    • prometheus.yaml – A configuration file that you can use to deploy a Prometheus server.
    • prometheus-rules.yaml – An example PrometheusRule resource that includes alert rules recommended by Cloudera.
    • strimzi-pod-monitor.yaml – Includes examples of PodMonitor resources, which define Prometheus jobs that scrape metrics directly from the Kafka, ZooKeeper, Operator, Kafka Bridge, and Cruise Control pods.
  • /prometheus-additional-properties/
    • prometheus-additional.yaml – A Secret resource that stores additional Prometheus configuration for scraping CPU, memory, and disk volume usage metrics. These metrics are reported by the Kubernetes cAdvisor agent and kubelet on the nodes.
  • /prometheus-alertmanager-config/
    • alert-manager-config.yaml – A Secret resource containing additional configuration for the Prometheus Operator. The configuration specifies hook definitions for sending notifications from your Kafka cluster through the Alertmanager.
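
To connect these pieces, the kafka-metrics.yaml example follows the Strimzi pattern of pointing spec.kafka.metricsConfig at a ConfigMap key that holds JMX Prometheus Exporter rules. The following is a minimal sketch of that pattern; the resource names, the key name, and the single sample rule are illustrative, not the full rule set shipped in the example.

```yaml
# Minimal sketch of the metricsConfig pattern; names and the sample
# rule are illustrative, not the full set from kafka-metrics.yaml.
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          name: kafka-metrics           # ConfigMap holding exporter rules
          key: kafka-metrics-config.yml
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: kafka-metrics
data:
  kafka-metrics-config.yml: |
    lowercaseOutputName: true
    rules:
      # Example rule: expose broker metrics as kafka_server_* gauges
      - pattern: kafka.server<type=(.+), name=(.+)><>Value
        name: kafka_server_$1_$2
        type: GAUGE
```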

Prometheus metrics

To expose metrics recommended by Cloudera, set up and configure both Prometheus and Kafka instances using the example configuration files provided by Cloudera.

The example files are hosted on the Cloudera Archive. They are located in /csm-operator/1.0/examples/metrics/. When you deploy using these examples, your deployment exposes and monitors the metrics recommended by Cloudera.

The specific metrics that you need to monitor depend on your use case and operational objectives. As a result, no single set of metrics can be highlighted as essential for every deployment. Start with the provided example files and make changes as necessary.

Prometheus alerts

Cloudera provides a Prometheus alert configuration example that contains recommended alert rules. Learn about the highlighted alerts defined in this example. Additionally, learn about configuring custom alerts for Kafka components and the Strimzi Cluster Operator.

Default (recommended) alerts

The prometheus-rules.yaml file is an example PrometheusRule resource that specifies various alert rules. The alerts in this example are generally useful for most use cases, and all of them are recommended by Cloudera.

The following is a list of the highlighted alerts defined in the example. Ensure that you always have these alerts configured as they give good insight into the state and health of your cluster.

  • KafkaRunningOutOfSpace – Kafka is running out of free disk space. Reported for each Persistent Volume Claim.
  • UnderReplicatedPartitions – Kafka has underreplicated partitions. Reported for each Kafka pod.
  • OfflinePartitions – One or more partitions have no leader. Reported for each Kafka pod.
  • OfflineLogDirectoryCount – Reports the number of offline log directories on the Kafka pod.
  • KafkaNetworkProcessorAvgIdle – Less than 30% of network processor capacity is available on the Kafka pod. You can avoid this alert by increasing the number of network threads (the num.network.threads broker property).
  • KafkaRequestHandlerAvgIdle – Less than 30% of request handler capacity is available on the Kafka pod. You can avoid this alert by increasing the number of I/O threads (the num.io.threads broker property).
  • ClusterOperatorContainerDown – The Strimzi Cluster Operator has been down for longer than 90 seconds.
  • AvgRequestLatency – The ZooKeeper average request latency is high on the pod.
  • ZookeeperRunningOutOfSpace – ZooKeeper is running out of free disk space.
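
The two *AvgIdle alerts above are tied to broker thread pool sizing. As a hedged sketch, assuming the standard Kafka broker properties num.network.threads and num.io.threads, the thread counts can be raised through the Kafka resource. The values below are examples only; size them to your workload.

```yaml
# Illustrative only: raise broker thread pool sizes via spec.kafka.config.
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    config:
      num.network.threads: 6   # affects KafkaNetworkProcessorAvgIdle (Kafka default: 3)
      num.io.threads: 16       # affects KafkaRequestHandlerAvgIdle (Kafka default: 8)
```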

Custom alerts and groups

In addition to the default alerts, you can define custom ones as well. To do this, you extend your PrometheusRule resource (prometheus-rules.yaml) with additional alert rules.

Alert rules are grouped and the prometheus-rules.yaml example contains the following default groups.

  • kafka
  • zookeeper
  • entityOperator
  • kafkaExporter

For example, to monitor Cruise Control, you must introduce a Cruise Control group that contains valid alert rules. Specifying a new group is also useful if you want to identify host machine related problems. You can find more information on defining alert rules in the Prometheus documentation.
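
For illustration, a new cruiseControl group added to the PrometheusRule resource might look like the following. The alert name, expression, and threshold are assumptions for this sketch, not rules shipped in prometheus-rules.yaml; the expression assumes that the cAdvisor container metrics described earlier are being scraped.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: prometheus-rules
  labels:
    role: alert-rules
spec:
  groups:
    # The default groups (kafka, zookeeper, entityOperator, kafkaExporter)
    # would precede this custom entry.
    - name: cruiseControl
      rules:
        - alert: CruiseControlContainerDown
          # Fires when no Cruise Control container has been seen by cAdvisor
          expr: absent(container_last_seen{container="cruise-control"})
          for: 90s
          labels:
            severity: warning
          annotations:
            summary: "Cruise Control container is down"
```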

Alerts for the Strimzi Cluster Operator

By default, prometheus-rules.yaml contains a single alert related to the Strimzi Cluster Operator. This alert monitors whether the container of the operator is down. You can define additional alerts using the following metrics.

  • strimzi_reconciliations_already_enqueued_total – Number of reconciliations skipped because another reconciliation for the same resource was already enqueued.
  • strimzi_reconciliations_duration_seconds – The time reconciliation takes to complete.
  • strimzi_reconciliations_duration_seconds_max – The maximum time a reconciliation takes to complete.
  • strimzi_reconciliations_failed_total – Number of reconciliations done by the operator for individual resources that failed.
  • strimzi_reconciliations_locked_total – Number of reconciliations skipped because another reconciliation for the same resource was still running.
  • strimzi_reconciliations_max_batch_size – Max size recorded for a single event batch.
  • strimzi_reconciliations_periodical_total – Number of periodical reconciliations done by the operator.
  • strimzi_reconciliations_successful_total – Number of reconciliations done by the operator for individual resources which were successful.
  • strimzi_reconciliations_total – Number of reconciliations done by the operator for individual resources.
  • strimzi_resources – Number of custom resources the operator sees.
  • strimzi_resources_paused – Number of custom resources with paused reconciliations.
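
For example, a rule built on these metrics could fire when reconciliations start failing. This fragment would be added under an existing or new group in prometheus-rules.yaml; the alert name, threshold, duration, and labels are illustrative.

```yaml
# Illustrative alert rule fragment; add under a group in prometheus-rules.yaml.
- alert: StrimziReconciliationsFailing
  # Fires if any reconciliations failed over the last 5 minutes,
  # sustained for 10 minutes.
  expr: rate(strimzi_reconciliations_failed_total[5m]) > 0
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Strimzi Cluster Operator reconciliations are failing"
```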

Kafka Exporter

You can use Kafka Exporter to publish additional Kafka metrics related to brokers and clients.

Kafka Exporter is an open source project to enhance the monitoring of Apache Kafka brokers and clients. Kafka Exporter extracts additional metrics data from Kafka brokers related to offsets, consumer groups, consumer lag, and topics.

If Kafka Exporter is deployed, it is typically deployed with its default configuration (spec.kafkaExporter: {}). Cloudera recommends that you deploy Kafka Exporter and customize its configuration based on your cluster and operational objectives.

Cloudera recommends that, at minimum, you capture additional metrics for your mission-critical topics and groups, such as metrics related to latest offsets and consumer lag.

The following example configures the Kafka Exporter to collect additional metrics from all topics and groups.

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafkaExporter:
    topicRegex: ".*"
    groupRegex: ".*"
This configuration snippet is included in the kafka-metrics.yaml example provided by Cloudera, which is the recommended baseline example for a Kafka deployment that has metric collection enabled.