Monitoring and diagnostics

Learn about collecting diagnostic information, the diagnostic tool shipped with CSA Operator, and a number of useful kubectl commands that you can use to gather diagnostic information.

In addition to the built-in health endpoint of the Flink Operator and the generic kubectl commands, Cloudera provides a separate command line tool that you can use to capture diagnostic information about your CSA Operator installation. You can use these tools when contacting Cloudera support, or when troubleshooting issues.

Diagnostic bundle

The diagnostic tool is a Python package that collects all relevant resources and logs managed by the CSA Operator and connects to the REST API of running Flink clusters to fetch additional metrics. It generates a zip file that can be shared with Cloudera support or examined for troubleshooting.

By default, the diagnostic tool is not downloaded, deployed, or installed when you install CSA Operator and its components. To use it, download the Python package located in the /csa-operator/1.0/tools/ directory on Cloudera Archive, and complete the following steps to install the tool and create the diagnostic bundle:
  1. Create a Python virtual environment.
    mkdir venv
    python3 -m venv venv
    cd venv
    source bin/activate
  2. Install the CSA diagnostic tool with pip install.
    pip install ../csaop-diagnostics-1.0.0.tar.gz
  3. Run the diagnostic tool.
    csaop-generate-bundle
    By default, the diagnostic tool generates the zip file in the current working directory. You can optionally provide the path of a custom directory using the -o [OUTPUT_DIR] argument.
    The path to the generated zip file is displayed when the diagnostic tool runs successfully.
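
For example, assuming the tool is installed in the active virtual environment, you can write the bundle to a custom directory (the directory name below is illustrative):

```shell
# Write the diagnostic bundle to a custom directory instead of the
# current working directory (directory name is illustrative)
mkdir -p /tmp/csa-diagnostics
csaop-generate-bundle -o /tmp/csa-diagnostics
```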

Pod status with kubectl

After applying a change to the deployment configuration, you can check the status of the pods using kubectl describe:
kubectl describe pods --namespace [***NAMESPACE***]
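
For example, to find the pods of interest first and then inspect one of them (pod and namespace names are placeholders):

```shell
# List all pods in the namespace to find the one to inspect
kubectl get pods --namespace [***NAMESPACE***]

# Show events, container states, and restart counts of a single pod
kubectl describe pod [***POD NAME***] --namespace [***NAMESPACE***]
```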

Operator log with kubectl

The Flink Operator log contains useful information about the tasks that the operator performs and details for failed operations. You can check the Flink Operator logs with kubectl logs:
kubectl logs [***FLINK OPERATOR POD***] --namespace [***NAMESPACE***]
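
If the Flink Operator pod has restarted, the log of the previous container instance often contains the relevant error. For example (pod and namespace names are placeholders):

```shell
# Stream the log of the running Flink Operator pod
kubectl logs -f [***FLINK OPERATOR POD***] --namespace [***NAMESPACE***]

# Fetch the log of the previous container instance after a restart
kubectl logs --previous [***FLINK OPERATOR POD***] --namespace [***NAMESPACE***]
```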

Health endpoint

The Flink Operator provides a built-in health endpoint that serves as the information source for Kubernetes liveness and startup probes. The health probes are enabled by default in the Helm chart as shown in the following example:

operatorHealth:
  port: 8085
  livenessProbe:
    periodSeconds: 10
    initialDelaySeconds: 30
  startupProbe:
    failureThreshold: 30
    periodSeconds: 10
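
You can also query the endpoint manually, for example by forwarding the health port of the Flink Operator pod to your local machine. The port number comes from the configuration above; the pod name is a placeholder:

```shell
# Forward the health port of the Flink Operator pod
kubectl port-forward [***FLINK OPERATOR POD***] 8085:8085 --namespace [***NAMESPACE***] &

# Query the health endpoint; a healthy operator is expected to return HTTP 200
curl -i http://localhost:8085
```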

The health endpoint catches startup and informer errors that are exposed by the Java Operator SDK (JOSDK) framework. By default, if one of the watched namespaces becomes inaccessible, the health endpoint will report an error and the Flink Operator restarts.

If the Flink Operator needs to keep running even when some namespaces are not accessible, you can set the kubernetes.operator.startup.stop-on-informer-error configuration property to false to disable the automatic restart behavior. This way the Flink Operator starts even if some namespaces cannot be watched.
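
A minimal sketch of setting this property through Helm values, assuming the chart exposes a defaultConfiguration section like the upstream Flink Kubernetes Operator chart:

```yaml
# values.yaml fragment (key names assumed from the upstream operator chart)
defaultConfiguration:
  create: true
  flink-conf.yaml: |+
    # Keep the Flink Operator running even when some watched
    # namespaces cannot be accessed
    kubernetes.operator.startup.stop-on-informer-error: false
```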