Kafka lag metrics gives incorrect value #21134
Comments
Interesting, thanks for this report @fpytloun. Given the values don't match up with kafka-lag-exporter, it does seem like a potential bug in Vector, rust-rdkafka, or librdkafka. Vector just publishes the metrics returned by rust-rdkafka: vector/src/internal_events/kafka.rs, lines 119 to 152 at fe2cc26
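For anyone following along, a minimal sketch of where that data comes from (not Vector's actual code): rust-rdkafka parses librdkafka's periodic statistics and hands them to the `ClientContext::stats` callback, and each partition entry carries a `consumer_lag` field. The context type and the `println!` below are purely illustrative, and librdkafka only emits statistics when `statistics.interval.ms` is set:

```rust
use rdkafka::client::ClientContext;
use rdkafka::statistics::Statistics;

// Hypothetical context type for illustration; Vector wires its own context
// into the consumer.
struct StatsContext;

impl ClientContext for StatsContext {
    // rust-rdkafka deserializes each statistics payload from librdkafka and
    // passes it to this callback.
    fn stats(&self, statistics: Statistics) {
        for (topic_name, topic) in &statistics.topics {
            for partition in topic.partitions.values() {
                // `consumer_lag` is librdkafka's own measurement (see
                // STATISTICS.md); this is the value Vector ends up publishing.
                println!(
                    "topic={} partition={} consumer_lag={}",
                    topic_name, partition.partition, partition.consumer_lag
                );
            }
        }
    }
}
```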
I wonder if the measurement librdkafka uses is different from what kafka-lag-exporter measures? librdkafka documents it in:
https://github.com/confluentinc/librdkafka/blob/master/STATISTICS.md

If it would be helpful, I could add a log message around here to log the data coming from librdkafka: lines 175 to 185 at fe2cc26
Vector just ends up using the return values to emit metrics: https://github.com/vectordotdev/vector/blob/master/src/internal_events/kafka.rs#L119-L152
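For illustration, here is roughly what such a log message could look like, sketched as a hypothetical standalone helper rather than the actual change (assuming the `tracing` crate for logging; exact field availability depends on the rust-rdkafka version in use):

```rust
use rdkafka::statistics::Statistics;
use tracing::debug;

// Hypothetical helper: dump the raw per-partition lag data that librdkafka
// reports, so it can be compared against what kafka-lag-exporter shows.
fn log_librdkafka_lag(statistics: &Statistics) {
    for (topic_name, topic) in &statistics.topics {
        for partition in topic.partitions.values() {
            debug!(
                topic = %topic_name,
                partition = partition.partition,
                consumer_lag = partition.consumer_lag,
                consumer_lag_stored = partition.consumer_lag_stored,
                committed_offset = partition.committed_offset,
                hi_offset = partition.hi_offset,
                "Raw partition statistics from librdkafka."
            );
        }
    }
}
```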
Aaah, interesting, nice find @fpytloun
I think this is caused by #22006
Problem
I am working on Vector dashboards and I noticed that Kafka lag at some point spiked up, probably due to some Kafka or Elasticsearch glitch. But everything is processing just fine; the lag is also not growing, it just settled at a different level. And I don't see any delay in log delivery either. I also dug deeper into the per-partition metrics and it is similar for all partitions (I wanted to make sure it is not just some partitions being stuck).
This chart shows the amount of time (given the current processing rate) needed to process all unconsumed messages. It would mean we have a 1-hour delay, which is not true; we have less than 5 minutes.
When I restarted one of the Vector instances, it went back down to the 5-minute level.
I tried to confirm it is a Vector/rdkafka issue by matching against kafka-lag-exporter metrics, which show the correct value.
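(For reference, the chart above is simply the outstanding lag divided by the current processing rate; the numbers in the sketch below are made up only to show how an inflated lag reading turns into an apparent one-hour backlog.)

```rust
// Illustrative only: the "time to drain" calculation behind such a chart.
fn estimated_drain_secs(consumer_lag_messages: f64, processed_per_sec: f64) -> f64 {
    consumer_lag_messages / processed_per_sec
}

fn main() {
    // Made-up numbers: an inflated lag of 3,600,000 messages at 1,000 msg/s
    // renders as a one-hour backlog, even though the real backlog drains in
    // well under five minutes.
    println!("{} s to drain", estimated_drain_secs(3_600_000.0, 1_000.0));
}
```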
Another interesting thing I found is metrics for partition_id -1, i.e. a negative partition value 😯

Configuration
No response
Version
0.39.0
Debug Output
No response
Example Data
No response
Additional Context
No response
References
No response