Understanding Vector acknowledgements and Kafka source configuration options for reliability #21949
Unanswered
kgorskowski asked this question in Q&A
Replies: 0 comments
Hi Vector Community,
I'm looking for guidance on how Vector's acknowledgements work, especially in relation to the Kafka source and librdkafka settings.
My goal is to maximize reliability and ensure no messages are lost, especially during restarts or scaling events while the topic has a high lag.
I recently lost events after I had to restart the Vector consumer group with an updated configuration while we had significant lag in the topic, and I am not as deep into the subject yet as I would like to be.
At that point the source configuration was pretty basic, relying mostly on the Vector defaults:
```toml
[sources.kafka]
type = "kafka"
bootstrap_servers = "kafka:9092"
group_id = "infra-vector"
```
There were no specific librdkafka options aside from the SSL settings.
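For comparison, the more explicit variant I'm now considering looks roughly like this (the values are guesses on my part, not something I've validated against the docs):

```toml
[sources.kafka]
type = "kafka"
bootstrap_servers = "kafka:9092"
group_id = "infra-vector"
# Guesses on my part:
auto_offset_reset = "earliest"   # reprocess from the start if no committed offset exists?
commit_interval_ms = 5000        # the default, as I understand it
```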
One of the downstream sinks encountered an issue that stopped the entire event flow and led to a topic lag of a few million events. So I reconfigured the transform/sink and restarted the consumer group. My understanding was that, with global acknowledgements and the default consumer group offset settings, Vector would reprocess the events in the topic, even with dynamic group member IDs, as we have a non-static deployment scaled by an HPA.
But after the restart the topic lag dropped instantly, and the events were neither reprocessed nor indexed into the target Elasticsearch.
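My assumption was that end-to-end acknowledgements, enabled on the sink roughly like this, would tie offset commits to successful delivery (the sink name and endpoint below are placeholders, and I'm not sure this assumption is correct):

```toml
[sinks.elastic]
type = "elasticsearch"
inputs = ["kafka"]
endpoints = ["https://elasticsearch:9200"]   # placeholder endpoint
acknowledgements.enabled = true              # my assumption: the source only commits acked events
```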
My questions are:
- Does the Kafka source wait for downstream components (e.g., sinks) to confirm processing before committing offsets, or does it commit offsets independently of sink success?

Optimizing Kafka Source for Reliability:
I would be over the moon to get insights into the recommended settings for maximum reliability, as I am not sure how librdkafka options in particular interact with the acknowledgement mechanism.
- Does `commit_interval_ms` interact with `enable.auto.commit`?
- Is it still recommended to set `auto.offset.reset` to "earliest" if the priority is reliability over possible duplicate events in Elasticsearch?
- Would increasing the commit interval improve reliability, or is it primarily a performance optimization?
- If Vector crashes or a downstream sink fails (e.g., Elasticsearch or HTTP), how does Vector handle acknowledgements and ensure messages are not lost?
- What are the best practices for configuring Vector and librdkafka so that unprocessed messages are not acknowledged prematurely?

General Best Practices:
- Are there any additional librdkafka settings or Vector configurations you recommend for achieving maximum reliability in a high-throughput Kafka-to-Vector pipeline?
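To make that last question concrete, this is the kind of override I'm unsure about (both options and values are assumptions on my part, and I don't know whether Vector tolerates them being overridden):

```toml
# Hypothetical overrides - I'm not sure whether Vector's own commit
# handling expects to control these, so treat them as assumptions.
[sources.kafka.librdkafka_options]
"enable.auto.commit" = "false"    # let Vector decide when to commit?
"queued.min.messages" = "10000"   # smaller prefetch to limit in-flight data?
```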
Any insights, detailed explanations, or recommendations would be greatly appreciated!
And sorry for the wall of text and the possibly naive questions. I am kind of an "advocate" for Vector in our current project (trying to get rid of as much Logstash as possible), but I don't have an extensive Kafka background.
Thanks again!