Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH enable faulthandler traceback reporting on worker crash by SIGSEV #419

Conversation

ogrisel
Copy link
Contributor

@ogrisel ogrisel commented Aug 1, 2024

Automatically call faulthandler.enable() by default to dump extra debug info to crashed workers' stderr, typically on SIGSEV, SIGBUS and co.

I also expanded the error message of TerminatedWorkerError to point the users to look at stderr for more details. This should help us debug stability problems on our CI but also assist our users and interact more productively with them when they report problems.

@ogrisel
Copy link
Contributor Author

ogrisel commented Aug 1, 2024

The pypy build deadlocked on commit 9d868e7 before reaching the new faulthandler test. Triggering CI again to see if it's likely to be related to this PR or not.

@ogrisel
Copy link
Contributor Author

ogrisel commented Aug 1, 2024

The pypy deadlock is probably unrelated to this PR, but let's push on more time to make sure.

@ogrisel
Copy link
Contributor Author

ogrisel commented Aug 1, 2024

I think this is ready for review @tomMoral and/or @lesteve.

@ogrisel
Copy link
Contributor Author

ogrisel commented Mar 3, 2025

I updated this PR and against the current master and it should be ready for merge. I think it can be helpful to get more informative feedback from users reporting crashes that are difficult to reproduce.

cc @jeremiedbb @lesteve @tomMoral.

Copy link
Contributor

@tomMoral tomMoral left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this! Just a few comment for the form but otherwise LGTM.

A potential concern is with the file descriptor, but I guess this will not cause extra complexity and will work in most cases so I would say let's merge this one.

@ogrisel
Copy link
Contributor Author

ogrisel commented Mar 3, 2025

A potential concern is with the file descriptor, but I guess this will not cause extra complexity and will work in most cases so I would say let's merge this one.

Indeed, let's keep that in mind in case the expected output does not show up in case of a reported crash.

@ogrisel
Copy link
Contributor Author

ogrisel commented Mar 3, 2025

The b9f7f25 run failed with a timeout (deadlock?) in tests/test_process_executor_forkserver.py::TestsProcessPoolForkserverShutdown::test_shutdown_with_sys_exit_at_pickle:

https://github.com/joblib/loky/actions/runs/13635452625/job/38112993126?pr=419#step:5:200

This in turned caused the teardown_method to error for the 5 subsequent tests in the same module.

I have the feeling that this problem is unrelated to this PR, but let me push a few empty commits to see if this is more likely to happen here than on main.

@ogrisel
Copy link
Contributor Author

ogrisel commented Mar 3, 2025

@tomMoral I removed the debug logging in the worker in test because it broke the windows build in the CI of 52a1ff2.

@ogrisel
Copy link
Contributor Author

ogrisel commented Mar 4, 2025

This happened again on linux with Python 3.11 in the CI run for commit 925e522, so there might be something fishy with this PR.

Let's not merge it while we do not understand the source of the deadlock.

@ogrisel ogrisel marked this pull request as draft March 4, 2025 09:10
@ogrisel ogrisel force-pushed the automatically_enabled_faulthandler_on_loky_workers branch from f6d4a47 to bc43cdd Compare March 4, 2025 10:29
@ogrisel
Copy link
Contributor Author

ogrisel commented Mar 4, 2025

#438 confirms that the problem could be triggered with high probability on master by slightly changing the tests and that 67144d3 is the proper fix for this problem.

I believe this PR is ready for merge, by ideally after merging #438 and re-merging master into it.

@ogrisel ogrisel marked this pull request as ready for review March 4, 2025 15:30
@tomMoral tomMoral merged commit 36701db into joblib:master Mar 4, 2025
11 of 12 checks passed
@tomMoral
Copy link
Contributor

tomMoral commented Mar 4, 2025

Merging, thanks a lot for the PR, this will help debugging the next deadlocks! :)

@ogrisel ogrisel deleted the automatically_enabled_faulthandler_on_loky_workers branch March 10, 2025 13:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants