Bugfix - predict_from_data_iterator tensor serialization issue #2728
I have encountered several fatal SIGBUS errors when running TotalSegmentator, which uses nnUNetv2 under the hood, on large inputs.
After some digging, I was able to boil the issue down to the predict_from_data_iterator function of nnUNet, which passes the (in my case quite large) prediction result tensors to worker processes. For this step, the tensors need to be serialized, which apparently can cause memory alignment issues and trigger SIGBUS errors.
I noticed that the error disappeared when I removed the multiprocessing logic entirely. Converting the predictions to NumPy arrays, which appear to be much more robust to serialization, also fixed the SIGBUS errors while keeping multiprocessing, hence this PR.
Here's a summary from Claude AI with slightly more detail:
SIGBUS Error in PyTorch Multiprocessing: Claude (AI) Summary
The SIGBUS error occurred when passing PyTorch tensors between processes via multiprocessing. Here's what happened:
Root Cause
In the original function, we were serializing PyTorch tensors directly for multiprocessing:
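(The code snippet from the summary did not survive here; the following is a minimal sketch of the pattern only. `predictor`, `export_pool`, `export_prediction_from_logits`, `properties`, `ofile` and `save_probabilities` are illustrative stand-ins for the names used around `predict_from_data_iterator`, not verbatim nnUNetv2 code.)

```python
# Sketch of the old pattern: the prediction is still a torch.Tensor when it is
# dispatched to the export worker pool, so it has to be pickled and sent
# across the process boundary as a tensor (names are illustrative).
prediction = predictor.predict_logits_from_preprocessed_data(data).cpu()
export_pool.starmap_async(
    export_prediction_from_logits,
    ((prediction, properties, ofile, save_probabilities),),
)
```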
When PyTorch tensors are serialized and sent to child processes, their complex memory layout isn't perfectly preserved. This leads to memory alignment issues that trigger SIGBUS errors.
Solution
The fix is to convert the tensor to a NumPy array before passing it to the worker process:
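(Again a hedged sketch with the same illustrative names, not the exact diff of this PR.)

```python
# Same dispatch, but the tensor is detached and converted to a plain NumPy
# array before it crosses the process boundary.
prediction = predictor.predict_logits_from_preprocessed_data(data).detach().cpu().numpy()
export_pool.starmap_async(
    export_prediction_from_logits,
    ((prediction, properties, ofile, save_probabilities),),
)
```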
This fixes the issue because NumPy arrays have a simple, contiguous memory layout that pickles and unpickles reliably, so the worker process reconstructs the data without the alignment problems seen with serialized tensors.
The SIGBUS error only happens with multiprocessing because tensors must cross process boundaries through serialization/deserialization, which can't perfectly preserve their memory layout.
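For completeness, here is a small self-contained script (not nnUNet code) that demonstrates the general pattern of converting a tensor to a NumPy array before handing it to a worker process:

```python
import multiprocessing as mp

import numpy as np
import torch


def export_worker(segmentation: np.ndarray, output_file: str) -> None:
    # In nnUNet this step would resample and write the segmentation to disk;
    # here we only inspect the array to show it arrived intact in the worker.
    print(output_file, segmentation.shape, segmentation.dtype)


if __name__ == "__main__":
    prediction = torch.rand(4, 64, 64, 64)        # stand-in for the logits tensor
    as_numpy = prediction.detach().cpu().numpy()  # convert before pickling
    with mp.get_context("spawn").Pool(processes=2) as pool:
        pool.starmap(export_worker, [(as_numpy, "case_0000.nii.gz")])
```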