out_kafka: emit a metric when librdkafka signals an error #9588

Open
wants to merge 1 commit into
base: master

@seveas (Contributor) commented Nov 12, 2024


Enter [N/A] in the box if an item is not applicable to your change.

Testing
Before we can approve your change, please submit the following in a comment:

  • [N/A] Example configuration file for the change
  • [N/A] Debug log output from testing the change
  • Attached Valgrind output that shows no leaks or memory corruption was found

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • [N/A] Run local packaging test showing all targets (including any new ones) build.
  • [N/A] Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • [N/A] Documentation required for this feature

Backporting

  • Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

@seveas (Contributor, Author) commented Nov 12, 2024

Curl output showing the metric, using a config that's intentionally broken (bad auth). As you can see, the existing metrics report no errors and no dropped records, but the new metric fluentbit_output_kafka_errors_total shows the errors.

```
$ curl --silent  http://localhost:2020/api/v2/metrics/prometheus | grep azure
fluentbit_output_proc_records_total{name="azure"} 5744
fluentbit_output_proc_bytes_total{name="azure"} 7000396
fluentbit_output_errors_total{name="azure"} 0
fluentbit_output_retries_total{name="azure"} 0
fluentbit_output_retries_failed_total{name="azure"} 0
fluentbit_output_dropped_records_total{name="azure"} 0
fluentbit_output_retried_records_total{name="azure"} 0
fluentbit_output_kafka_errors_total{name="azure"} 1607
fluentbit_output_upstream_total_connections{name="azure"} 0
fluentbit_output_upstream_busy_connections{name="azure"} 0
fluentbit_output_chunk_available_capacity_percent{name="azure"} 100
```
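
For anyone who wants to check for this condition programmatically, here is a minimal sketch in C that reads Prometheus text output like the above and flags the case where Kafka errors are reported while nothing counts as dropped. The helper names `metric_value` and `has_silent_kafka_errors` are hypothetical and not part of Fluent Bit. The sketch assumes integer counter values and label values that do not contain `}`.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return the value of the first sample of `metric`
 * in the Prometheus text `text`, or -1 if the metric is absent.
 * Labels are skipped, not matched. */
static long long metric_value(const char *text, const char *metric)
{
    size_t len = strlen(metric);
    const char *line = text;

    while (line != NULL && *line != '\0') {
        /* The name must start the line and be followed by '{' or ' ',
         * so a metric whose name merely shares a prefix is not matched. */
        if (strncmp(line, metric, len) == 0 &&
            (line[len] == '{' || line[len] == ' ')) {
            const char *value = line + len;
            if (*value == '{') {
                value = strchr(value, '}');
                if (value == NULL) {
                    return -1;
                }
                value++;
            }
            return strtoll(value, NULL, 10);
        }
        /* Advance to the start of the next line, if there is one. */
        line = strchr(line, '\n');
        if (line != NULL) {
            line++;
        }
    }
    return -1;
}

/* Hypothetical check: non-zero when Kafka errors were counted while the
 * dropped-records counter is still zero. */
static int has_silent_kafka_errors(const char *text)
{
    return metric_value(text, "fluentbit_output_kafka_errors_total") > 0 &&
           metric_value(text, "fluentbit_output_dropped_records_total") == 0;
}
```

Fed the output above, `has_silent_kafka_errors` would return non-zero, since the Kafka error counter is 1607 while the dropped-records counter is 0.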

@seveas changed the title from "Kafka errors metric" to "out_kafka: emit a metric when librdkafka signals an error" on Nov 12, 2024

The async nature of the kafka output makes
fluentbit_output_dropped_records_total insufficient to determine whether there
are problems sending messages to kafka. fluent-bit considers a message
delivered as soon as it has been added to the librdkafka buffers. If librdkafka
subsequently fails to deliver the message, the only feedback is a log message
such as:

```
[2024/11/12 07:56:45] [ warn] [output:kafka:azure] message delivery failed: Local: Message timed out
```

So let's add a metric that exposes how often librdkafka signals that it has
problems talking to kafka.

Signed-off-by: Dennis Kaarsemaker <[email protected]>
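
To illustrate the idea in the commit message, here is a minimal sketch in C of counting failures at the point where the client library reports a message's final outcome. This is not the code from this PR: `delivery_status`, `kafka_output_ctx`, and `on_delivery_report` are hypothetical stand-ins, and librdkafka itself is not used so the sketch stays self-contained. With real librdkafka, the outcome would arrive through a callback such as the one registered with `rd_kafka_conf_set_dr_msg_cb`.

```c
#include <stdint.h>

/* Hypothetical stand-in for the client library's error codes.
 * Zero means the message was delivered. */
enum delivery_status {
    DELIVERY_OK = 0,
    DELIVERY_TIMED_OUT = 1
};

/* Hypothetical per-output context that owns the error counter. */
struct kafka_output_ctx {
    const char *name;             /* output instance name, e.g. "azure" */
    uint64_t kafka_errors_total;  /* errors signalled after enqueueing */
};

/* Hypothetical callback, invoked once per message after it has left the
 * output plugin and sits in the client library's buffers. A failure here
 * is invisible to the dropped-records counter, so it is counted
 * separately. */
static void on_delivery_report(struct kafka_output_ctx *ctx,
                               enum delivery_status status)
{
    if (status != DELIVERY_OK) {
        ctx->kafka_errors_total++;
    }
}
```

Because the counter only changes inside the callback, it tracks what the client library reports after the fact, independent of whether the record was accepted into the buffer.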