Bug Report
Describe the bug
On a Kubernetes cluster with Fluent Bit installed, enabling Fluent Bit's debug logging together with a CloudWatch output appears to create an infinite loop of sending logs from Fluent Bit's own pods to CloudWatch. This can accidentally incur a very large AWS bill if it is not caught early and disabled.
To Reproduce
Example log message:
The vast majority of the logs were literally just this:
2025-01-02T19:06:09.722947399Z stderr F [2025/01/02 19:06:09] [debug] [output:cloudwatch_logs:cloudwatch_logs.6] Using stream=ip-XX-XX-XX-XX.us-west-2.compute.internal-application.var.log.containers.fluent-bit-mycluster-gf6ch_logging_fluent-bit-958d3960d1e3f0583ac7dde51da3b4a5a0a39f88a21cfbe9fd996edadaefd5af.log, group=/eks/mycluster/applications/logging.fluent-bit
Steps to reproduce the problem:
On an EKS cluster, set up Fluent Bit to collect Fluent Bit's own pod logs and forward them to CloudWatch while debug logging is enabled.
Expected behavior
It was expected that there would be extra log verbosity and that this would increase the amount of data sent to CloudWatch (and, ergo, the bill from AWS). Roughly 500 GB of logs every hour just for 9 or so Fluent Bit pods was not expected.
The logs for non-Fluent-Bit pods were not a large issue, and the debug logging was helpful in determining the cause of the issue I was debugging at the time. It was only the logs from the Fluent Bit pods themselves that produced such a significant increase in volume.
Screenshots
Your Environment
Version used: 3.2.2 (Docker URI: cr.fluentbit.io/fluent/fluent-bit:3.2.2)
Configuration:
[SERVICE]
    flush 5
    log_level debug
    daemon off
    Parsers_File /fluent-bit/etc/parsers.conf
    Parsers_File /fluent-bit/etc/conf/custom_parsers.conf
    http_server on
    http_listen ::
    storage.path /var/fluent-bit/state/flb-storage/
    storage.sync normal
    storage.checksum off
    storage.backlog.mem_limit 5M

[INPUT]
    Name tail
    Tag application.*
    Exclude_Path /var/log/containers/fluent-bit*, /var/log/containers/aws-node*, /var/log/containers/kube-proxy*
    Path /var/log/containers/*.log
    Parser docker
    DB /var/fluent-bit/state/flb_container.db
    Mem_Buf_Limit 50MB
    Skip_Long_Lines On
    Refresh_Interval 10
    Rotate_Wait 30
    storage.type filesystem
    Read_from_Head On

[INPUT]
    Name tail
    Tag application.*
    Path /var/log/containers/fluent-bit*
    Parser docker
    DB /var/fluent-bit/state/flb_log.db
    Mem_Buf_Limit 50MB
    Skip_Long_Lines On
    Refresh_Interval 10
    Read_from_Head On

[FILTER]
    Name kubernetes
    Match application.*
    Kube_URL {{.server}}:443
    Kube_Tag_Prefix application.var.log.containers.
    Merge_Log On
    Merge_Log_Key log_processed
    K8S-Logging.Parser On
    K8S-Logging.Exclude Off
    Labels Off
    Annotations Off
    Use_Kubelet On
    Kubelet_Port 10250
    Buffer_Size 0
    tls.verify On

[OUTPUT]
    Name cloudwatch_logs
    Match application.*
    region ${AWS_REGION}
    log_group_name /eks/${CLUSTER_NAME}/fluent-bit/fallback-group
    log_group_template /eks/${CLUSTER_NAME}/applications/$kubernetes['namespace_name'].$kubernetes['container_name']
    auto_create_group true
    log_stream_prefix ${HOST_NAME}-
    workers 1
    log_retention_days 731
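Note on the config above: the second [INPUT] tails /var/log/containers/fluent-bit*, i.e. Fluent Bit's own container logs. With log_level debug, every delivery by the cloudwatch_logs output writes fresh [debug] lines (like the "Using stream=..." message above) to Fluent Bit's stderr; those lines land back in /var/log/containers/fluent-bit*, get tailed, and are shipped again, which is presumably the feedback loop. As a minimal sketch of one way to keep that debug chatter out of CloudWatch while leaving debug logging on (assuming the raw container line ends up under the log key, and using the stock grep filter; this was not part of the deployed config):

# Hypothetical addition, not in the original config: drop Fluent Bit's own
# [debug] records before they reach the cloudwatch_logs output. The tail
# input's "Tag application.*" expands to the file path, so Fluent Bit's own
# logs carry tags like application.var.log.containers.fluent-bit-...
[FILTER]
    Name grep
    Match application.var.log.containers.fluent-bit*
    Exclude log \[debug\]

The debug output would still be readable with kubectl logs on the Fluent Bit pods; it just would not be fed back into CloudWatch at debug volume.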
Environment name and version (e.g. Kubernetes? What version?): AWS EKS Kubernetes 1.30
Server type and version:
Operating System and version: Bottlerocket OS 1.29
Filters and plugins: Tail Input, Kubernetes Filter, AWS Cloudwatch Output
Additional context
This was installed via an ArgoCD Application Set, using v0.48.3 of the Fluent-Bit Helm chart.
When setting up Fluent Bit on this development cluster, I was attempting to debug some log delivery issues and determine what the applied tags for certain inputs were. After testing, I forgot to turn off debug logging and didn't notice the significant daily AWS CloudWatch spend until several days later. It seems like Fluent Bit kept looping while debug logging was on: its own [debug] messages about delivering to the log stream were themselves collected and shipped, which generated more [debug] messages to collect and ship.
We have a similar configuration (without debug logging) enabled in several production clusters, and the combined average daily spend across all of them is a fraction of what this small, barely used development cluster averaged over those few days. This cluster only had 9 nodes, and yet it somehow produced about 140 MB/s of logs, roughly 40 GB every 5 minutes (consistent with the ~500 GB/hour above), compared to kilobytes per second with the same config but debug logging disabled.
This could have been avoided by detecting earlier that debug logging was still enabled, as well as by using a much more limited Fluent Bit config while it was enabled (a sketch of what that could look like follows below). I think I drastically underestimated just how many logs would be produced.
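For illustration, a "much more limited" debug-session variant could look like the following sketch (purely hypothetical, not what was deployed): give Fluent Bit's own logs a separate tag that the cloudwatch_logs output (Match application.*) never matches, and discard them with the stock null output, so they stay readable via kubectl logs but never loop back through CloudWatch.

# Hypothetical debug-session variant of the second [INPUT] above:
# a separate tag keeps these records away from "Match application.*".
[INPUT]
    Name tail
    Tag selflog.*
    Path /var/log/containers/fluent-bit*
    Parser docker
    DB /var/fluent-bit/state/flb_log.db
    Skip_Long_Lines On

# Explicitly discard them so nothing is shipped to CloudWatch.
[OUTPUT]
    Name null
    Match selflog.*

Keeping the input pointed at the same DB file means the tail offsets keep advancing, so switching the tag back to application.* after the debug session should not replay the accumulated backlog; simply removing the second [INPUT] while debugging would work too.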