Bug Report
Describe the bug
On a Kubernetes cluster with Fluent Bit installed, enabling Fluent Bit's debug logging together with a CloudWatch output appears to create an infinite loop of sending logs from Fluent Bit's own pods to CloudWatch. This can accidentally incur a very large AWS bill if it is not caught early and disabled.
To Reproduce
Example log message:
The vast majority of the logs were literally just this:
2025-01-02T19:06:09.722947399Z stderr F [2025/01/02 19:06:09] [debug] [output:cloudwatch_logs:cloudwatch_logs.6] Using stream=ip-XX-XX-XX-XX.us-west-2.compute.internal-application.var.log.containers.fluent-bit-mycluster-gf6ch_logging_fluent-bit-958d3960d1e3f0583ac7dde51da3b4a5a0a39f88a21cfbe9fd996edadaefd5af.log, group=/eks/mycluster/applications/logging.fluent-bit
Steps to reproduce the problem:
On an EKS cluster, set up Fluent Bit to collect Fluent Bit's own pod logs and forward them to CloudWatch while debug logging is enabled.
Expected behavior
It was expected that there would be extra log verbosity and that this would increase the amount of data sent to CloudWatch (and, ergo, the bill from AWS). Roughly 500 GB of logs every hour just for 9 or so Fluent Bit pods was not expected.
The logs for non-Fluent-Bit pods were not a large issue, and the debug logging was helpful in determining the cause of the issue I was debugging at the time. It was only the logs from the Fluent Bit pods themselves that produced such a significant increase in volume.
Screenshots
Your Environment
Version used: 3.2.2 (Docker URI: cr.fluentbit.io/fluent/fluent-bit:3.2.2)
Configuration:
[SERVICE]
    flush 5
    log_level debug
    daemon off
    Parsers_File /fluent-bit/etc/parsers.conf
    Parsers_File /fluent-bit/etc/conf/custom_parsers.conf
    http_server on
    http_listen ::
    storage.path /var/fluent-bit/state/flb-storage/
    storage.sync normal
    storage.checksum off
    storage.backlog.mem_limit 5M

[INPUT]
    Name tail
    Tag application.*
    Exclude_Path /var/log/containers/fluent-bit*, /var/log/containers/aws-node*, /var/log/containers/kube-proxy*
    Path /var/log/containers/*.log
    Parser docker
    DB /var/fluent-bit/state/flb_container.db
    Mem_Buf_Limit 50MB
    Skip_Long_Lines On
    Refresh_Interval 10
    Rotate_Wait 30
    storage.type filesystem
    Read_from_Head On

[INPUT]
    Name tail
    Tag application.*
    Path /var/log/containers/fluent-bit*
    Parser docker
    DB /var/fluent-bit/state/flb_log.db
    Mem_Buf_Limit 50MB
    Skip_Long_Lines On
    Refresh_Interval 10
    Read_from_Head On

[FILTER]
    Name kubernetes
    Match application.*
    Kube_URL {{.server}}:443
    Kube_Tag_Prefix application.var.log.containers.
    Merge_Log On
    Merge_Log_Key log_processed
    K8S-Logging.Parser On
    K8S-Logging.Exclude Off
    Labels Off
    Annotations Off
    Use_Kubelet On
    Kubelet_Port 10250
    Buffer_Size 0
    tls.verify On

[OUTPUT]
    Name cloudwatch_logs
    Match application.*
    region ${AWS_REGION}
    log_group_name /eks/${CLUSTER_NAME}/fluent-bit/fallback-group
    log_group_template /eks/${CLUSTER_NAME}/applications/$kubernetes['namespace_name'].$kubernetes['container_name']
    auto_create_group true
    log_stream_prefix ${HOST_NAME}-
    workers 1
    log_retention_days 731
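Note on the config above: the second [INPUT] tails /var/log/containers/fluent-bit*, i.e. Fluent Bit's own container logs. With log_level debug, every delivery by the cloudwatch_logs output writes fresh [debug] lines (like the "Using stream=..." message above) to Fluent Bit's stderr; those lines land back in /var/log/containers/fluent-bit*, get tailed, and are shipped again, which is presumably the feedback loop. As a minimal sketch of one way to keep that debug chatter out of CloudWatch while leaving debug logging on (assuming the raw container line ends up under the log key, and using the stock grep filter; this was not part of the deployed config):

# Hypothetical addition, not in the original config: drop Fluent Bit's own
# [debug] records before they reach the cloudwatch_logs output. The tail
# input's "Tag application.*" expands to the file path, so Fluent Bit's own
# logs carry tags like application.var.log.containers.fluent-bit-...
[FILTER]
    Name grep
    Match application.var.log.containers.fluent-bit*
    Exclude log \[debug\]

The debug output would still be readable with kubectl logs on the Fluent Bit pods; it just would not be fed back into CloudWatch at debug volume.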
Environment name and version (e.g. Kubernetes? What version?): AWS EKS Kubernetes 1.30
Server type and version:
Operating System and version: Bottlerocket OS 1.29
Filters and plugins: Tail Input, Kubernetes Filter, AWS Cloudwatch Output
Additional context
This was installed via an ArgoCD Application Set, using v0.48.3 of the Fluent-Bit Helm chart.
When setting up Fluent Bit on this development cluster, I was attempting to debug some log delivery issues and determine what the applied tags for certain inputs were. After testing, I forgot to turn off debug logging and didn't notice the significant daily AWS CloudWatch spend until several days later. It seems like Fluent Bit kept looping while debug logging was on: its own [debug] messages about delivering to the log stream were themselves collected and shipped, which generated more [debug] messages to collect and ship.
We have a similar configuration (without debug logging) enabled in several production clusters, and the combined average daily spend across all of them is a fraction of what this small, barely used development cluster averaged over those few days. This cluster only had 9 nodes, and yet it somehow produced about 140 MB/s of logs, roughly 40 GB every 5 minutes (consistent with the ~500 GB/hour above), compared to kilobytes per second with the same config but debug logging disabled.
This could have been avoided by detecting earlier that debug logging was still enabled, as well as by using a much more limited Fluent Bit config while it was enabled (a sketch of what that could look like follows below). I think I drastically underestimated just how many logs would be produced.
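For illustration, a "much more limited" debug-session variant could look like the following sketch (purely hypothetical, not what was deployed): give Fluent Bit's own logs a separate tag that the cloudwatch_logs output (Match application.*) never matches, and discard them with the stock null output, so they stay readable via kubectl logs but never loop back through CloudWatch.

# Hypothetical debug-session variant of the second [INPUT] above:
# a separate tag keeps these records away from "Match application.*".
[INPUT]
    Name tail
    Tag selflog.*
    Path /var/log/containers/fluent-bit*
    Parser docker
    DB /var/fluent-bit/state/flb_log.db
    Skip_Long_Lines On

# Explicitly discard them so nothing is shipped to CloudWatch.
[OUTPUT]
    Name null
    Match selflog.*

Keeping the input pointed at the same DB file means the tail offsets keep advancing, so switching the tag back to application.* after the debug session should not replay the accumulated backlog; simply removing the second [INPUT] while debugging would work too.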