Monitoring and diagnostics
Learn about the diagnostic tool shipped with CSA Operator, the built-in health endpoint, and a number of useful kubectl commands that you can use to gather diagnostic information.
In addition to the built-in health endpoint of the Flink Operator and the generic kubectl command, Cloudera provides a separate command line tool that you can use to capture diagnostic information about your CSA Operator installation. You can use these tools when contacting Cloudera support, or when troubleshooting issues.
Diagnostic bundle
The diagnostic tool is a Python package that collects all relevant resources and logs managed by the CSA Operator and connects to the REST API of running Flink clusters to fetch additional metrics. It generates a zip file that can be shared with Cloudera support or examined for troubleshooting.
The diagnostic tool is available in the /csa-operator/1.0/tools/ directory on the Cloudera Archive. Download it, and use the following steps to install the tool and create the diagnostic bundle:
- Create a Python virtual environment.
mkdir venv
python3 -m venv venv
cd venv
source bin/activate
- Install the CSA diagnostic tool with pip install.
pip install ../csaop-diagnostics-1.0.0.tar.gz
- Run the diagnostic tool.
csaop-generate-bundle
The following optional arguments can be provided to the diagnostic tool:
- By default, the diagnostic tool generates the zip file in the current working directory, but you can provide the path of a custom directory using the -o [OUTPUT_DIR] argument, as shown in the example after this list.
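For example, a run that writes the bundle to a custom directory could look like the following; the /tmp/csa-diagnostics output path is only an illustration:
# Generate the diagnostic bundle into /tmp/csa-diagnostics instead of the current directory
csaop-generate-bundle -o /tmp/csa-diagnostics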
Pod status with kubectl
You can check the status and recent events of the pods with kubectl describe:
kubectl describe pods --namespace [***NAMESPACE***]
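If you do not know the exact pod names yet, you can list the pods first and then describe a single pod; the flink namespace and pod name below are only illustrations:
# List all pods in the operator namespace
kubectl get pods --namespace flink
# Describe one specific pod by name
kubectl describe pod flink-kubernetes-operator-7b9c5d6f4-abcde --namespace flink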
Operator log with kubectl
You can access the Flink Operator log with kubectl logs:
kubectl logs [***FLINK OPERATOR POD***] --namespace [***NAMESPACE***]
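To follow the log while reproducing an issue, or to inspect the log of a previously crashed container, the standard kubectl logs flags can be added; the pod name and namespace below are only illustrations:
# Stream the last 200 lines and keep following new entries
kubectl logs flink-kubernetes-operator-7b9c5d6f4-abcde --namespace flink --tail=200 --follow
# Show the log of the previous, crashed container instance
kubectl logs flink-kubernetes-operator-7b9c5d6f4-abcde --namespace flink --previous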
Health endpoint
The Flink Operator provides a built-in health endpoint that serves as the information source for Kubernetes liveness and startup probes. The health probes are enabled by default in the Helm chart as shown in the following example:
operatorHealth:
port: 8085
livenessProbe:
periodSeconds: 10
initialDelaySeconds: 30
startupProbe:
failureThreshold: 30
periodSeconds: 10
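You can also query the health endpoint manually during troubleshooting. The following sketch assumes the default port 8085 shown above and that the endpoint responds on the root path:
# Forward the operator health port to your machine, then query it
kubectl port-forward [***FLINK OPERATOR POD***] 8085:8085 --namespace [***NAMESPACE***]
curl http://localhost:8085/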
The health endpoint catches startup and informer errors that are exposed by the Java Operator SDK (JOSDK) framework. By default, if one of the watched namespaces becomes inaccessible, the health endpoint will report an error and the Flink Operator restarts.
If the Flink Operator needs to keep running even when some namespaces are not accessible, you can set the kubernetes.operator.startup.stop-on-informer-error configuration property to false to disable the automatic restart behavior. This way the Flink Operator starts even if some namespaces cannot be watched.
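A minimal sketch of disabling the restart behavior through Helm values, assuming the chart exposes the operator configuration through a defaultConfiguration section as the upstream Flink Kubernetes Operator chart does:
defaultConfiguration:
  append: true
  flink-conf.yaml: |+
    # Keep the operator running even if some watched namespaces are inaccessible
    kubernetes.operator.startup.stop-on-informer-error: false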