Skip to content

Conversation

pwhelan
Copy link
Contributor

@pwhelan pwhelan commented Aug 26, 2025

Summary

The ASAN (enabled with -DSANITIZE_ADDRESS=On) reports errors when pthread_join is called on the same thread twice. This happens during two different tests:

  • in_podman_metrics
  • in_netif

This is because both tests have configurations that fail on purpose, causing do_start to call pthread_join, which is called again when flb_stop is called on the same context.

This patch simply makes the call to flb_stop conditional on the return value of flb_start during the test.


Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change; please submit the following in a comment:

  • Example configuration file for the change
  • Debug log output from testing the change
  • Attached Valgrind output that shows no leaks or memory corruption was found

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • Run local packaging test showing all targets (including any new ones) build.
  • Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • Documentation required for this feature

Backporting

  • Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

Summary by CodeRabbit

  • Tests

    • Improved runtime test teardown to handle startup failures gracefully, stopping components only when startup succeeds and always cleaning up resources.
    • Updated test cases to propagate startup results through teardown, increasing reliability and reducing false negatives.
  • Chores

    • No user-facing changes; internal test stability improved.
  • Bug Fixes

    • Prevented unnecessary stop operations during failed startup scenarios in tests, avoiding misleading errors.

…ctly.

Signed-off-by: Phillip Whelan <phillip.whelan@chronosphere.io>
…art correctly.

Signed-off-by: Phillip Whelan <phillip.whelan@chronosphere.io>
Copy link

coderabbitai bot commented Aug 26, 2025

Walkthrough

Test teardown logic changed to stop the Fluent Bit engine only when start succeeds. Functions now accept an int ret parameter from flb_start and conditionally call flb_stop based on ret == 0; destruction/freeing always occurs. Call sites in affected tests were updated accordingly.

Changes

Cohort / File(s) Summary of Changes
Conditional teardown in runtime tests
tests/runtime/in_netif.c
test_ctx_destroy updated to accept int ret; calls flb_stop(ctx->flb) only when ret == 0. All call sites now pass the start result to test_ctx_destroy(ctx, ret).
Podman metrics teardown update
tests/runtime/in_podman_metrics.c
do_destroy signature changed to do_destroy(flb_ctx_t *ctx, int ret); stops only when ret == 0, always destroys. All tests capture ret = flb_start(ctx), assert, and call do_destroy(ctx, ret). No other test logic changed.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant T as Test
  participant FB as Fluent Bit Context

  T->>FB: flb_start(ctx)
  FB-->>T: int ret

  rect rgba(230, 245, 255, 0.5)
  note over T: Teardown
  alt ret == 0 (start succeeded)
    T->>FB: flb_stop(ctx)
  else ret != 0 (start failed)
    note over T,FB: Skip stop
  end
  T->>FB: flb_destroy(ctx)
  end
Loading

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

I twitch my ears at start and stop,
If engines purr, I gently drop;
If they refuse, I will not press—
I clean the burrow nonetheless.
Hop-safe teardown, neat and tight—
Carrots logged, all systems right. 🥕✨

Tip

🔌 Remote MCP (Model Context Protocol) integration is now available!

Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats.

✨ Finishing Touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch pwhelan-fix-flb-stop-failure-tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Status, Documentation and Community

  • Visit our Status Page to check the current availability of CodeRabbit.
  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
tests/runtime/in_netif.c (1)

76-80: Avoid unnecessary delay on failure path; move sleep(1) under the success guard

When start fails, the extra one-second sleep isn’t needed. Move it inside the ret == 0 block so only successful-start paths wait before stopping.

Apply this diff inside test_ctx_destroy:

-    sleep(1);
-    if (ret == 0) {
-        flb_stop(ctx->flb);
-    }
+    if (ret == 0) {
+        /* allow pending flush before stopping; only relevant if engine started */
+        sleep(1);
+        flb_stop(ctx->flb);
+    }
tests/runtime/in_podman_metrics.c (1)

86-91: Document the guard to aid future maintenance

A short comment here will save future churn when someone wonders why stop is conditional.

Apply this doc-only tweak:

 void do_destroy(flb_ctx_t *ctx, int ret) {
-    if (ret == 0) {
+    /* Guard flb_stop(): on start failure, do_start() may have already joined the worker. */
+    if (ret == 0) {
         flb_stop(ctx);
     }
     flb_destroy(ctx);
 }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 3254b9a and 855f617.

📒 Files selected for processing (2)
  • tests/runtime/in_netif.c (5 hunks)
  • tests/runtime/in_podman_metrics.c (9 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
tests/runtime/in_netif.c (1)
src/flb_lib.c (1)
  • flb_stop (942-985)
tests/runtime/in_podman_metrics.c (1)
src/flb_lib.c (3)
  • flb_stop (942-985)
  • flb_destroy (223-258)
  • flb_start (914-925)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (28)
  • GitHub Check: pr-windows-build / call-build-windows-package (Windows 32bit, x86, x86-windows-static, 3.31.6)
  • GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit (Arm64), amd64_arm64, -DCMAKE_SYSTEM_NAME=Windows -DCMA...
  • GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit, x64, x64-windows-static, 3.31.6)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_COVERAGE=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, clang, clang++)
  • GitHub Check: PR - fuzzing test
  • GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, clang, clang++, ubuntu-22.04, clang-12)
  • GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, gcc, g++, ubuntu-22.04, clang-12)
  • GitHub Check: pr-compile-centos-7
  • GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, clang, clang++, ubuntu-24.04, clang-14)
  • GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, gcc, g++, ubuntu-24.04, clang-14)
🔇 Additional comments (3)
tests/runtime/in_netif.c (1)

72-80: Guarding flb_stop() by start result prevents double pthread_join — good fix

This directly addresses the ASAN double-join by skipping flb_stop() when flb_start() returned non-zero. flb_stop() joins the worker even for FLB_LIB_NONE/ERROR, so calling it after a failed start that already joined the thread would indeed double-join. Solid change.

tests/runtime/in_podman_metrics.c (2)

86-91: Stop only when start succeeded — fixes duplicate pthread_join in failing tests

Conditionally calling flb_stop(ctx) on ret == 0 aligns teardown with flb_start()’s outcome and prevents the ASAN-reported double join. Order (stop → destroy) is correct.


93-109: Incomplete teardown propagation across test suite

The verification scan shows that call-site updates for the new teardown contract were not applied consistently:

  • Hundreds of tests still invoke test_ctx_destroy(ctx) with only the context, without propagating the ret value from flb_start() (e.g. in_kubernetes_events.c, filter_expect.c, many others).
  • Numerous tests continue to call flb_stop(ctx) (unconditionally) instead of relying on the conditional stop logic inside the updated do_destroy(ctx, ret) or test_ctx_destroy(ctx, ret) helpers.

Because of this inconsistency, the original approval of “uniform teardown enforcement” does not hold. Please propagate the new teardown pattern across all tests:

  • Update every test_ctx_destroy to accept and use the ret parameter, replacing direct flb_stop(ctx) calls.
  • Ensure all do_destroy helpers are called with both ctx and ret.
  • Remove any remaining unconditional flb_stop() from the test suite so that teardown behavior is centralized and ASAN duplicate-join reports are eliminated.

Likely an incorrect or invalid review comment.

@@ -226,7 +228,7 @@ static void flb_test_normal()
TEST_MSG("expect: >=1 got: %d", got);
}

test_ctx_destroy(ctx);
test_ctx_destroy(ctx, ret);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

All call sites now propagate flb_start()’s return into teardown — consistent and correct

Passing ret into test_ctx_destroy(ctx, ret) ensures we only stop when the engine actually started. Call-site updates look complete for this file.

To ensure there are no remaining tests with the old pattern, run this repo-wide sweep:

Also applies to: 256-256, 290-290, 343-343


🏁 Script executed:

#!/usr/bin/env bash
set -euo pipefail

echo "1) Any direct flb_stop() in tests without an obvious guard?"
rg -n -C2 -g 'tests/runtime/**' --type=c '\bflb_stop\s*\('

echo
echo "2) Common teardown helpers calling flb_stop(); check they accept start ret and guard the stop"
rg -nU -g 'tests/runtime/**' --type=c '(?s)void\s+\w*(do_destroy|_destroy|_ctx_destroy)\s*\([^)]*\)\s*\{.*?\bflb_stop\s*\(' -C0

echo
echo "3) Call sites of teardown helpers; verify they pass the flb_start() result"
rg -n -g 'tests/runtime/**' --type=c '\b(do_destroy|test_ctx_destroy)\s*\(' -C2

echo
echo "4) Places that start but might still do unconditional flb_stop()"
rg -n -C3 -g 'tests/runtime/**' --type=c '\bret\s*=\s*flb_start\s*\('

Length of output: 263775


Update Remaining Tests to Propagate flb_start() Return Into Teardown

While tests/runtime/in_netif.c has been updated correctly—its test_ctx_destroy(ctx, ret) guard ensures we only call flb_stop() when ret == 0—the repo-wide sweep still shows numerous test files using the old pattern. These need to be updated to maintain consistency and avoid unguarded calls to flb_stop().

Key areas requiring changes:

  • Any test in tests/runtime/** with an unconditional

      flb_stop(ctx);       // or flb_stop(ctx->flb);
      flb_destroy(ctx);    // (or flb_destroy(ctx->flb);)

    should instead

    1. Use a teardown helper that accepts the ret value (e.g., test_ctx_destroy(ctx, ret) or do_destroy(ctx, ret)), and
    2. Guard the call to flb_stop() inside that helper with if (ret == 0).
  • Teardown helper definitions (e.g., static void test_ctx_destroy(struct test_ctx *ctx)) must be refactored to accept an int ret parameter and wrap flb_stop() accordingly.

Examples of files to update (non-exhaustive):

  • tests/runtime/filter_parser.c (multiple flb_stop(ctx); at lines 162, 249, 326, …)
  • tests/runtime/out_flowcounter.c (lines 64, 100, 135, …)
  • tests/runtime/out_loki.c (lines 106, 165, 225, …)
  • tests/runtime/in_tail.c (lines 241, 551, 671, …)
  • tests/runtime/processor_metrics_selector.c (lines 144, 220, 302, …)
  • And many more under tests/runtime/**

Please extend the pattern used in in_netif.c across the suite:

-    flb_stop(ctx);
-    flb_destroy(ctx);
+    test_ctx_destroy(ctx, ret);   // or do_destroy(ctx, ret)

And update the helper:

-static void test_ctx_destroy(struct test_ctx *ctx)
+static void test_ctx_destroy(struct test_ctx *ctx, int ret)
 {
     if (ret == 0) {
         flb_stop(ctx->flb);
     }
     flb_destroy(ctx->flb);
 }

With these changes, all tests will uniformly propagate and respect the return value of flb_start().

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant