Experiences with Vector as a Logstash replacement in a high-throughput Filebeat/Auditbeat environment #21545
kgorskowski asked this question in Q&A
One thing you could try is, instead of using the …
Hi everyone,
I'm currently exploring Vector as a potential drop-in replacement for Logstash in a central log shipping environment.
The customer has about 1000 clients running Filebeat and Auditbeat that continuously send logs to two Logstash "aggregators".
For numerous reasons we would like to get rid of the Logstash instances, and we are looking at Vector for help.
I've encountered some performance and stability issues related to the high number of connections being opened and closed by the clients, and I’m looking for advice on tuning the setup.
Current Setup:
Log Sources: Around 1000 clients running Filebeat and Auditbeat with the Logstash output configured, so we expected about 2000 connections to Vector (+/- overhead)
Log Format: JSON
Vector Source: the logstash source, receiving data via the Logstash/Lumberjack protocol
Downstream: Events sent via HTTP over a "one-way" gateway to a Vector HTTP ingest
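For reference, the Logstash/Lumberjack protocol that the Beats speak to the `logstash` source is a small framed binary protocol. In it, the sender announces a window of N events, sends them as sequenced frames, and waits for an ACK frame carrying the highest processed sequence number. As I understand it, unacknowledged events are kept and resent, which would fit the retry loops described here. The sketch below is my own reconstruction of the Lumberjack v2 (go-lumber) wire format: version byte `2`, frame types `W`, `J`, `C`, `A`, big-endian 32-bit integers. It is not code from Vector or Beats, and the function names are hypothetical.

```python
import json
import struct
import zlib

# Frame codes as I understand the Lumberjack v2 (go-lumber) wire format (assumption).
VERSION = b"2"
WINDOW, JSON_FRAME, COMPRESSED, ACK = b"W", b"J", b"C", b"A"


def encode_batch(events, first_seq=1, compress=False):
    """Encode events as a window frame followed by one JSON frame per event."""
    body = b""
    for offset, event in enumerate(events):
        payload = json.dumps(event).encode("utf-8")
        # JSON frame: version, 'J', uint32 sequence, uint32 length, payload.
        body += VERSION + JSON_FRAME + struct.pack(">II", first_seq + offset, len(payload)) + payload
    if compress:
        # Compressed frame: version, 'C', uint32 length, zlib data holding more frames.
        packed = zlib.compress(body)
        body = VERSION + COMPRESSED + struct.pack(">I", len(packed)) + packed
    # Window frame: how many events the sender expects to have acknowledged.
    return VERSION + WINDOW + struct.pack(">I", len(events)) + body


def encode_ack(seq):
    """ACK frame: version, 'A', uint32 sequence number of the last processed event."""
    return VERSION + ACK + struct.pack(">I", seq)


def _take(data, pos, n):
    """Return n bytes starting at pos, raising ValueError if the input is short."""
    chunk = data[pos:pos + n]
    if len(chunk) != n:
        raise ValueError("truncated frame")
    return chunk, pos + n


def decode_frames(data):
    """Yield (code, value) pairs; raise ValueError on anything malformed."""
    pos = 0
    while pos < len(data):
        header, pos = _take(data, pos, 2)
        version, code = header[:1], header[1:]
        if version != VERSION:
            raise ValueError(f"unexpected version byte {version!r}")
        if code in (WINDOW, ACK):
            raw, pos = _take(data, pos, 4)
            yield code, struct.unpack(">I", raw)[0]
        elif code == JSON_FRAME:
            raw, pos = _take(data, pos, 8)
            seq, length = struct.unpack(">II", raw)
            payload, pos = _take(data, pos, length)
            yield code, (seq, json.loads(payload))
        elif code == COMPRESSED:
            raw, pos = _take(data, pos, 4)
            (length,) = struct.unpack(">I", raw)
            packed, pos = _take(data, pos, length)
            # Frames inside a compressed block are decoded recursively.
            yield from decode_frames(zlib.decompress(packed))
        else:
            raise ValueError(f"unknown frame type {code!r}")
```

In this sketch, a stream that is cut mid-frame or starts at the wrong byte offset raises `ValueError`. That is presumably the kind of condition a real receiver reports as a framing error when a connection is reset halfway through a batch.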
The Problem:
The connections between the Beats clients and the Vector instance seem very noisy, unstable and flappy. Many connections are opened and closed within short periods of time, to the point that we hit the file descriptor limit before we capped max_connections at 4000.
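Each accepted connection costs the Vector process one file descriptor. About 2000 Beats connections, plus any half-closed sockets the process has not yet closed, all count against its open-files limit. The sketch below compares current usage with the soft limit. It is Linux-only, assumes permission to read `/proc/<pid>`, and uses hypothetical helper names.

```python
import os


def _parse_limit(value):
    """Convert a limits-file value to int, with None meaning 'unlimited'."""
    return None if value == "unlimited" else int(value)


def open_files_limit(limits_text):
    """Return (soft, hard) 'Max open files' values from /proc/<pid>/limits text."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            soft, hard = line[len("Max open files"):].split()[:2]
            return _parse_limit(soft), _parse_limit(hard)
    raise ValueError("no 'Max open files' line found")


def fd_usage(pid):
    """Return (descriptors in use, soft limit) for a process via /proc."""
    with open(f"/proc/{pid}/limits") as f:
        soft, _hard = open_files_limit(f.read())
    # Every entry in /proc/<pid>/fd is one open descriptor (sockets included).
    in_use = len(os.listdir(f"/proc/{pid}/fd"))
    return in_use, soft
```

Call `fd_usage(pid)` with the PID of the Vector process. Suppose the first value keeps climbing toward the soft limit while the number of clients stays flat. Descriptors are then more likely being held by connections that are never closed than consumed by genuine load.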
Many of these connections stay in the CLOSE_WAIT state for extended periods. Both the Beats and Vector log multiple warnings and errors, from "connection reset by peer" and "framing errors" to failures to write acknowledgements. The whole monty.
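CLOSE_WAIT means the remote side has already sent its FIN and the kernel is waiting for the local application to close the socket. If the CLOSE_WAIT sockets sit on the Vector host, that would suggest Vector is not closing connections the Beats have already abandoned, rather than a network problem. The sketch below counts socket states for one local port by parsing `/proc/net/tcp` and `/proc/net/tcp6`. Port 5044 is an assumption (the conventional Beats/Logstash port), and `ss -tan` shows the same information interactively.

```python
from collections import Counter

# Hex state codes used in /proc/net/tcp{,6}.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV", "04": "FIN_WAIT1",
    "05": "FIN_WAIT2", "06": "TIME_WAIT", "07": "CLOSE", "08": "CLOSE_WAIT",
    "09": "LAST_ACK", "0A": "LISTEN", "0B": "CLOSING",
}


def count_states(proc_net_tcp_text, port):
    """Count TCP states of sockets whose local port equals `port`."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 4:
            continue
        # local_address looks like "0100000A:13B4" (IPv4) or 32 hex chars + port (IPv6).
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        if local_port == port:
            counts[TCP_STATES.get(fields[3], fields[3])] += 1
    return counts


def listener_states(port=5044):
    """Aggregate state counts for IPv4 and IPv6 sockets on the given port."""
    total = Counter()
    for path in ("/proc/net/tcp", "/proc/net/tcp6"):
        try:
            with open(path) as f:
                total += count_states(f.read(), port)
        except FileNotFoundError:
            pass  # e.g. IPv6 disabled on the host
    return total
```

Sampling `listener_states()` every few seconds would show whether CLOSE_WAIT grows steadily or comes in bursts that line up with the errors in the logs.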
Many events do come through, but in a lot of cases the Beats clients build up a huge backlog, with thousands of events stuck in a seemingly endless retry loop.
This was less of a problem when using Logstash, but it seems more pronounced after switching to Vector.
Both Vector and the Filebeat instances are running in the same network and can connect directly, so issues with infrastructure in between are unlikely.
I’ve experimented with the connection, timeout and buffer-related parameters in the Vector config and started looking into the TCP settings on the machine (a RHEL host), but the large number of connections being opened and closed remains a mystery to me.
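When reviewing kernel TCP settings on the RHEL host, it can help to snapshot the relevant values before and after each change. The list below is an assumed starting selection for hosts with many long-lived connections, not a recommended set. As far as I know, no kernel timer clears CLOSE_WAIT on its own (`tcp_fin_timeout` governs FIN_WAIT2), so sysctl changes alone are unlikely to drain those sockets.

```python
from pathlib import Path

# Assumed selection of settings to snapshot; not a tuning recommendation.
SYSCTLS = (
    "net.core.somaxconn",
    "net.ipv4.tcp_max_syn_backlog",
    "net.ipv4.tcp_keepalive_time",
    "net.ipv4.tcp_keepalive_intvl",
    "net.ipv4.tcp_keepalive_probes",
    "net.ipv4.tcp_fin_timeout",
    "fs.file-max",
)


def sysctl_path(name, root="/proc/sys"):
    """Map a dotted sysctl name such as net.core.somaxconn to its /proc/sys file."""
    return Path(root, *name.split("."))


def read_sysctls(names=SYSCTLS, root="/proc/sys"):
    """Return {name: value}, with None for settings that cannot be read."""
    values = {}
    for name in names:
        try:
            values[name] = sysctl_path(name, root).read_text().strip()
        except OSError:
            values[name] = None
    return values
```

Diffing two `read_sysctls()` snapshots makes it easy to tie each behavior change to the specific setting that was altered.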
Question:
Does anyone have experience scaling Vector to handle a similar number of Beats clients? Specifically, I’m looking for guidance on Lumberjack-specific protocol quirks or TCP settings I could look at.
BTW, this is Vector 0.41.1, the latest release.
I'm thankful for any insights; otherwise we will probably be stuck with Logstash for now.