tests: fix pthread_join on stopped threads in tests. #10787

pwhelan · 2025-08-26T15:00:51Z

Summary

The ASAN (enabled with -DSANITIZE_ADDRESS=On) reports errors when pthread_join is called on the same thread twice. This happens during two different tests:

in_podman_metrics
in_netif

This is because both tests have configurations that fail on purpose, causing do_start to call pthread_join, which is called again when flb_stop is called on the same context.

This patch simply makes the call to flb_stop conditional on the return value of flb_start during the test.

Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change; please submit the following in a comment:

Example configuration file for the change
Debug log output from testing the change

Attached Valgrind output that shows no leaks or memory corruption was found

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

Run local packaging test showing all targets (including any new ones) build.
Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

Documentation required for this feature

Backporting

Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

Summary by CodeRabbit

Tests
- Improved runtime test teardown to handle startup failures gracefully, stopping components only when startup succeeds and always cleaning up resources.
- Updated test cases to propagate startup results through teardown, increasing reliability and reducing false negatives.
Chores
- No user-facing changes; internal test stability improved.
Bug Fixes
- Prevented unnecessary stop operations during failed startup scenarios in tests, avoiding misleading errors.

…ctly. Signed-off-by: Phillip Whelan <phillip.whelan@chronosphere.io>

…art correctly. Signed-off-by: Phillip Whelan <phillip.whelan@chronosphere.io>

coderabbitai · 2025-08-26T15:01:00Z

Walkthrough

Test teardown logic changed to stop the Fluent Bit engine only when start succeeds. Functions now accept an int ret parameter from flb_start and conditionally call flb_stop based on ret == 0; destruction/freeing always occurs. Call sites in affected tests were updated accordingly.

Changes

Cohort / File(s)	Summary of Changes
Conditional teardown in runtime tests `tests/runtime/in_netif.c`	`test_ctx_destroy` updated to accept `int ret`; calls `flb_stop(ctx->flb)` only when `ret == 0`. All call sites now pass the start result to `test_ctx_destroy(ctx, ret)`.
Podman metrics teardown update `tests/runtime/in_podman_metrics.c`	`do_destroy` signature changed to `do_destroy(flb_ctx_t *ctx, int ret)`; stops only when `ret == 0`, always destroys. All tests capture `ret = flb_start(ctx)`, assert, and call `do_destroy(ctx, ret)`. No other test logic changed.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant T as Test
  participant FB as Fluent Bit Context

  T->>FB: flb_start(ctx)
  FB-->>T: int ret

  rect rgba(230, 245, 255, 0.5)
  note over T: Teardown
  alt ret == 0 (start succeeded)
    T->>FB: flb_stop(ctx)
  else ret != 0 (start failed)
    note over T,FB: Skip stop
  end
  T->>FB: flb_destroy(ctx)
  end

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

I twitch my ears at start and stop,
If engines purr, I gently drop;
If they refuse, I will not press—
I clean the burrow nonetheless.
Hop-safe teardown, neat and tight—
Carrots logged, all systems right. 🥕✨

Tip

🔌 Remote MCP (Model Context Protocol) integration is now available!

Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats.

✨ Finishing Touches

📝 Generate Docstrings

🧪 Generate unit tests

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch pwhelan-fix-flb-stop-failure-tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Status, Documentation and Community

Visit our Status Page to check the current availability of CodeRabbit.
Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

tests/runtime/in_netif.c (1)
76-80: Avoid unnecessary delay on failure path; move sleep(1) under the success guard

When start fails, the extra one-second sleep isn’t needed. Move it inside the ret == 0 block so only successful-start paths wait before stopping.

Apply this diff inside test_ctx_destroy:
-    sleep(1);
-    if (ret == 0) {
-        flb_stop(ctx->flb);
-    }
+    if (ret == 0) {
+        /* allow pending flush before stopping; only relevant if engine started */
+        sleep(1);
+        flb_stop(ctx->flb);
+    }
tests/runtime/in_podman_metrics.c (1)
86-91: Document the guard to aid future maintenance

A short comment here will save future churn when someone wonders why stop is conditional.

Apply this doc-only tweak:
 void do_destroy(flb_ctx_t *ctx, int ret) {
-    if (ret == 0) {
+    /* Guard flb_stop(): on start failure, do_start() may have already joined the worker. */
+    if (ret == 0) {
         flb_stop(ctx);
     }
     flb_destroy(ctx);
 }

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 3254b9a and 855f617.

📒 Files selected for processing (2)

tests/runtime/in_netif.c (5 hunks)
tests/runtime/in_podman_metrics.c (9 hunks)

🧰 Additional context used

🧬 Code graph analysis (2)

tests/runtime/in_netif.c (1)

src/flb_lib.c (1)

flb_stop (942-985)

tests/runtime/in_podman_metrics.c (1)

src/flb_lib.c (3)

flb_stop (942-985)

flb_destroy (223-258)

flb_start (914-925)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (28)

GitHub Check: pr-windows-build / call-build-windows-package (Windows 32bit, x86, x86-windows-static, 3.31.6)
GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit (Arm64), amd64_arm64, -DCMAKE_SYSTEM_NAME=Windows -DCMA...
GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit, x64, x64-windows-static, 3.31.6)
GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, clang, clang++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, gcc, g++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, gcc, g++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, gcc, g++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, clang, clang++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, clang, clang++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, clang, clang++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, gcc, g++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, clang, clang++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, gcc, g++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, gcc, g++)
GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, clang, clang++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, clang, clang++)
GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, gcc, g++)
GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, clang, clang++)
GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, gcc, g++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, gcc, g++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_COVERAGE=On, 3.31.6, gcc, g++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, clang, clang++)
GitHub Check: PR - fuzzing test
GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, clang, clang++, ubuntu-22.04, clang-12)
GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, gcc, g++, ubuntu-22.04, clang-12)
GitHub Check: pr-compile-centos-7
GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, clang, clang++, ubuntu-24.04, clang-14)
GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, gcc, g++, ubuntu-24.04, clang-14)

🔇 Additional comments (3)

tests/runtime/in_netif.c (1)

72-80: Guarding flb_stop() by start result prevents double pthread_join — good fix

This directly addresses the ASAN double-join by skipping flb_stop() when flb_start() returned non-zero. flb_stop() joins the worker even for FLB_LIB_NONE/ERROR, so calling it after a failed start that already joined the thread would indeed double-join. Solid change.

tests/runtime/in_podman_metrics.c (2)

86-91: Stop only when start succeeded — fixes duplicate pthread_join in failing tests

Conditionally calling flb_stop(ctx) on ret == 0 aligns teardown with flb_start()’s outcome and prevents the ASAN-reported double join. Order (stop → destroy) is correct.

93-109: Incomplete teardown propagation across test suite

The verification scan shows that call-site updates for the new teardown contract were not applied consistently:

Hundreds of tests still invoke test_ctx_destroy(ctx) with only the context, without propagating the ret value from flb_start() (e.g. in_kubernetes_events.c, filter_expect.c, many others).

Numerous tests continue to call flb_stop(ctx) (unconditionally) instead of relying on the conditional stop logic inside the updated do_destroy(ctx, ret) or test_ctx_destroy(ctx, ret) helpers.

Because of this inconsistency, the original approval of “uniform teardown enforcement” does not hold. Please propagate the new teardown pattern across all tests:

Update every test_ctx_destroy to accept and use the ret parameter, replacing direct flb_stop(ctx) calls.

Ensure all do_destroy helpers are called with both ctx and ret.

Remove any remaining unconditional flb_stop() from the test suite so that teardown behavior is centralized and ASAN duplicate-join reports are eliminated.

Likely an incorrect or invalid review comment.

coderabbitai · 2025-08-26T15:07:45Z

tests/runtime/in_netif.c

@@ -226,7 +228,7 @@ static void flb_test_normal()
        TEST_MSG("expect: >=1 got: %d", got);
    }

-    test_ctx_destroy(ctx);
+    test_ctx_destroy(ctx, ret);


💡 Verification agent

🧩 Analysis chain

All call sites now propagate flb_start()’s return into teardown — consistent and correct

Passing ret into test_ctx_destroy(ctx, ret) ensures we only stop when the engine actually started. Call-site updates look complete for this file.

To ensure there are no remaining tests with the old pattern, run this repo-wide sweep:

Also applies to: 256-256, 290-290, 343-343

🏁 Script executed:

#!/usr/bin/env bash set -euo pipefail echo "1) Any direct flb_stop() in tests without an obvious guard?" rg -n -C2 -g 'tests/runtime/**' --type=c '\bflb_stop\s*$' echo echo "2) Common teardown helpers calling flb_stop(); check they accept start ret and guard the stop" rg -nU -g 'tests/runtime/**' --type=c '(?s)void\s+\w*(do_destroy|_destroy|_ctx_destroy)\s*\([^)]*$\s*\{.*?\bflb_stop\s*\(' -C0 echo echo "3) Call sites of teardown helpers; verify they pass the flb_start() result" rg -n -g 'tests/runtime/**' --type=c '\b(do_destroy|test_ctx_destroy)\s*\(' -C2 echo echo "4) Places that start but might still do unconditional flb_stop()" rg -n -C3 -g 'tests/runtime/**' --type=c '\bret\s*=\s*flb_start\s*\('

Length of output: 263775

Update Remaining Tests to Propagate flb_start() Return Into Teardown

While tests/runtime/in_netif.c has been updated correctly—its test_ctx_destroy(ctx, ret) guard ensures we only call flb_stop() when ret == 0—the repo-wide sweep still shows numerous test files using the old pattern. These need to be updated to maintain consistency and avoid unguarded calls to flb_stop().

Key areas requiring changes:

Any test in tests/runtime/** with an unconditional

flb_stop(ctx); // or flb_stop(ctx->flb); flb_destroy(ctx); // (or flb_destroy(ctx->flb);)

should instead

Use a teardown helper that accepts the ret value (e.g., test_ctx_destroy(ctx, ret) or do_destroy(ctx, ret)), and

Guard the call to flb_stop() inside that helper with if (ret == 0).

Teardown helper definitions (e.g., static void test_ctx_destroy(struct test_ctx *ctx)) must be refactored to accept an int ret parameter and wrap flb_stop() accordingly.

Examples of files to update (non-exhaustive):

tests/runtime/filter_parser.c (multiple flb_stop(ctx); at lines 162, 249, 326, …)

tests/runtime/out_flowcounter.c (lines 64, 100, 135, …)

tests/runtime/out_loki.c (lines 106, 165, 225, …)

tests/runtime/in_tail.c (lines 241, 551, 671, …)

tests/runtime/processor_metrics_selector.c (lines 144, 220, 302, …)

And many more under tests/runtime/**

Please extend the pattern used in in_netif.c across the suite:

- flb_stop(ctx); - flb_destroy(ctx); + test_ctx_destroy(ctx, ret); // or do_destroy(ctx, ret)

And update the helper:

-static void test_ctx_destroy(struct test_ctx *ctx) +static void test_ctx_destroy(struct test_ctx *ctx, int ret) { if (ret == 0) { flb_stop(ctx->flb); } flb_destroy(ctx->flb); }

With these changes, all tests will uniformly propagate and respect the return value of flb_start().

pwhelan added 2 commits August 26, 2025 10:54

tests/runtime: only call flb_stop for in_netif tests that start corre…

3d72321

…ctly. Signed-off-by: Phillip Whelan <phillip.whelan@chronosphere.io>

tests/runtime: only call flb_stop for in_podman_metrics tests that st…

855f617

…art correctly. Signed-off-by: Phillip Whelan <phillip.whelan@chronosphere.io>

pwhelan requested review from edsiper, fujimotos and koleini as code owners August 26, 2025 15:00

github-actions bot added the docs-required label Aug 26, 2025

pwhelan temporarily deployed to pr August 26, 2025 15:01 — with GitHub Actions Inactive

coderabbitai bot reviewed Aug 26, 2025

View reviewed changes

pwhelan temporarily deployed to pr August 26, 2025 15:18 — with GitHub Actions Inactive

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

tests: fix pthread_join on stopped threads in tests. #10787

tests: fix pthread_join on stopped threads in tests. #10787

Uh oh!

pwhelan commented Aug 26, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Aug 26, 2025 •

edited

Loading

Chat

Support

CodeRabbit Commands (Invoked using PR/Issue comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Status, Documentation and Community

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Aug 26, 2025

Uh oh!

Uh oh!

tests: fix pthread_join on stopped threads in tests. #10787

Are you sure you want to change the base?

tests: fix pthread_join on stopped threads in tests. #10787

Uh oh!

Conversation

pwhelan commented Aug 26, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Aug 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

Chat

Support

CodeRabbit Commands (Invoked using PR/Issue comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Status, Documentation and Community

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Aug 26, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

pwhelan commented Aug 26, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Aug 26, 2025 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)