Experiences with vector as Logstash replacement in a high throughput Filebeat/Auditbeat env #21545

kgorskowski · 2024-10-17T22:35:25Z


Hi everyone,

I'm currently exploring Vector as a potential drop-in replacement for Logstash in a central log shipping environment.
The customer has about 1000 clients running Filebeat and Auditbeat that continuously send logs to two Logstash "aggregators".
For numerous reasons we would like to get rid of the Logstash instances and are looking at Vector for help.
I've encountered some performance and stability issues related to the high number of connections being opened and closed by the clients, and I’m looking for advice on tuning the setup.

Current Setup:
Log Sources: Around 1000 Filebeat and Auditbeat clients configured with the Logstash output, so we expected about 2000 connections to Vector (+/- overhead)
Log Format: JSON
Vector Source: logstash source receiving data using the Logstash/Lumberjack protocol
Downstream: Events sent via HTTP over a "one way" gateway to a Vector HTTP ingest

The Problem:
The connections between the Beats clients and the Vector instance seem very noisy, unstable and flappy. Many connections are created and closed within short periods of time, to the point that we hit the file descriptor limits before we capped max_connections at 4000.
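
For anyone wanting to check how close a process is to its file descriptor limit, here is a minimal Python sketch. It assumes a Linux host with a standard `/proc` layout and enough privileges to read the target process's entries; the PID is passed on the command line, and `parse_max_open_files` is a hypothetical helper name, not something Vector provides.

```python
import os
import sys


def parse_max_open_files(limits_text):
    """Return (soft, hard) from the text of /proc/<pid>/limits.

    A value of None means the limit is reported as 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Columns: "Max open files", soft limit, hard limit, units
            soft, hard = line.split()[3:5]
            convert = lambda v: None if v == "unlimited" else int(v)
            return convert(soft), convert(hard)
    raise ValueError("no 'Max open files' line found")


if __name__ == "__main__" and len(sys.argv) > 1:
    pid = sys.argv[1]  # PID of the process to inspect (e.g. the Vector process)
    with open(f"/proc/{pid}/limits") as f:
        soft, hard = parse_max_open_files(f.read())
    # Each entry in /proc/<pid>/fd is one open descriptor (sockets included)
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))
    print(f"open fds: {open_fds}, soft limit: {soft}, hard limit: {hard}")
```

Running it repeatedly while the clients reconnect shows whether the descriptor count keeps climbing toward the soft limit.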

Many of these connections are stuck in the CLOSE_WAIT state for extended periods, and there are multiple warnings and errors on both the Beats and Vector: connection reset by peer, "framing errors", being unable to write acknowledgements. The whole monty.
A lot of events are actually coming through, but in many cases we have a huge backlog in the Beats clients, with thousands of events in a seemingly endless retry loop.
This was less of a problem when using Logstash, but it seems more pronounced after switching to Vector.
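
To put numbers on the connection churn, a small Python sketch like the following can count TCP states per listening port by parsing `/proc/net/tcp` on Linux. The port 5044 is an assumption (the usual Beats/Logstash default; the actual port is not stated here), and `count_states` is a hypothetical helper name.

```python
from collections import Counter

# Kernel TCP state codes as they appear in the "st" column of /proc/net/tcp
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_states(proc_net_tcp_text, local_port):
    """Count sockets per TCP state for one local port."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        # local_address looks like "0100007F:13B4" (hex IP : hex port)
        port = int(fields[1].rsplit(":", 1)[1], 16)
        if port == local_port:
            counts[TCP_STATES.get(fields[3], fields[3])] += 1
    return counts


if __name__ == "__main__":
    total = Counter()
    for path in ("/proc/net/tcp", "/proc/net/tcp6"):
        try:
            with open(path) as f:
                total += count_states(f.read(), 5044)  # assumed port
        except OSError:
            pass  # file missing, e.g. IPv6 disabled or not on Linux
    print(dict(total))
```

A CLOSE_WAIT count that keeps growing means the peer has closed its side while the local process has not yet closed the socket, which is useful to compare between the Logstash and Vector runs.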

Both Vector and the Filebeat instances are running in the same network and can connect directly, so issues with infrastructure in between are unlikely.
I’ve experimented with the connection, timeout and buffer-related parameters in the Vector config and started looking into the TCP settings on the machine (RHEL host), but the number of connections being opened and closed remains a mystery to me.

Question:
Does anyone have experience scaling Vector to handle a similar number of Beats clients? Specifically, I’m looking for guidance on Lumberjack-specific protocol quirks or TCP settings I could look at.
By the way, this is Vector 0.41.1, so the latest release.
I am thankful for any insights; otherwise we will probably be stuck with Logstash for the moment.

jszwedko · 2024-10-18T22:55:39Z

Maintainer

One thing you could try is, instead of using the logstash Vector source, use the http_server source and the http beats output. I've seen better throughput like that.

1 reply

kgorskowski Oct 19, 2024
Author

Thanks for your answer.
Yeah, the logstash/lumberjack protocol has somewhat been forced on us. On the one hand, we have no direct control over the sending agents, and implementing a change there is probably not an option at the moment. For that reason we tried to use Vector as a drop-in replacement that needed no change up- or downstream. On the other hand, the plan is to migrate to Fleet-managed Elastic Agents in the near future, and with those you can only define Elasticsearch, Kafka, and Logstash outputs. No HTTP option available, unfortunately.
Yesterday I switched back to Logstash and all Beats connected with a fairly stable TCP connection, as I would expect. So I guess the OS TCP settings are not entirely to blame.
So for now we stay with Logstash in this place (but I have already replaced a good part of the infra on our side with Vector and I am loving it).
Thanks again
