Scaling issues in logging and diagnostic bundle collection

The logging and diagnostic bundle collection pipeline cannot process large volumes of data.

Condition

If the cdp-fluentd-aggregator pod repeatedly restarts with an OOMKilled exit status, the aggregator's memory limit might be insufficient for the current log volume or number of active workers.

Log messages such as "buffer flush took too long" or "retry flush" in the aggregator pod logs indicate that the underlying Longhorn storage is not keeping pace with the incoming log rate.
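
The following is a minimal sketch of how both conditions could be checked from the command line. It assumes the cdp namespace and the cdp-fluentd-aggregator-0 pod name used elsewhere in this topic. The function name check_fluentd_aggregator is illustrative, and the grep pattern is derived from the messages quoted above, so it might need adjusting to match the exact wording in your logs.

```shell
# Sketch only: namespace and pod name are taken from the commands in this topic;
# the function name is hypothetical.
check_fluentd_aggregator() {
  local ns="cdp"
  local pod="cdp-fluentd-aggregator-0"

  # Reason for the last container termination.
  # Prints OOMKilled if the container was killed for exceeding its memory limit.
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
  echo

  # Number of log lines in the running container that mention slow or retried buffer flushes.
  kubectl logs "$pod" -n "$ns" 2>/dev/null \
    | grep -c -E 'buffer flush took|retry flush' || true
}
```

Running check_fluentd_aggregator prints the last termination reason on the first line (empty if the container has never been terminated) and the count of matching log lines on the second.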

Cause

Scaling issues occur because the logging and diagnostic bundle collection pipeline does not scale dynamically to handle large volumes of logs.

Remedy

Contact Cloudera Support and share the following information:

  • The output of the kubectl describe pod cdp-fluentd-aggregator-0 -n cdp command.
  • The output of the kubectl logs cdp-fluentd-aggregator-0 -n cdp --previous command if the pod has restarted.
  • The current PVC usage, as reported by the kubectl get pvc -n cdp | grep logs-rwx command.
  • The number of active Fluentd aggregator replicas and your approximate log ingestion rate.
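
The first three items can be gathered in one pass. The sketch below wraps the commands listed above in a shell function and writes each output to a file. The function name collect_fluentd_diagnostics, the default directory, and the output file names are assumptions for illustration. The replica count and log ingestion rate are not collected here and must be added to the support case separately.

```shell
# Sketch only: wraps the kubectl commands from the list above.
# Function name, directory, and file names are hypothetical.
collect_fluentd_diagnostics() {
  local out_dir="${1:-fluentd-diagnostics}"
  local ns="cdp"
  local pod="cdp-fluentd-aggregator-0"

  mkdir -p "$out_dir" || return 1

  # Pod description, including restart count and last termination reason.
  kubectl describe pod "$pod" -n "$ns" > "$out_dir/describe-pod.txt" 2>&1

  # Logs from the previous container instance.
  # This command fails if the pod has not restarted; the error text is kept in the file.
  kubectl logs "$pod" -n "$ns" --previous > "$out_dir/previous-logs.txt" 2>&1 || true

  # PVC listing filtered to the logs-rwx claim.
  kubectl get pvc -n "$ns" 2>&1 | grep logs-rwx > "$out_dir/pvc.txt" || true

  echo "Diagnostics written to $out_dir"
}
```

For example, collect_fluentd_diagnostics ./support-case writes describe-pod.txt, previous-logs.txt, and pvc.txt into the support-case directory, ready to attach to the case.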