Bugfix - predict_from_data_iterator tensor serialization issue #2728

Open · wants to merge 3 commits into master
Conversation

Kenneth-Schroeder

I have encountered several fatal SIGBUS errors while using TotalSegmentator, which uses nnUNetv2 under the hood.
The errors occurred when running TotalSegmentator on large inputs.
After some digging, I was able to narrow the issue down to nnUNet's predict_from_data_iterator function, which hands the (in my case quite large) prediction result tensors to worker processes. For this step the tensors need to be serialized, which apparently can cause memory alignment issues and trigger SIGBUS errors.
The error disappeared when I removed the multiprocessing logic entirely. NumPy arrays, however, turned out to be much more robust to this serialization step and fixed the SIGBUS errors even with multiprocessing enabled, hence this PR.
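To make the pattern concrete, here is a minimal, self-contained sketch of the change (not the actual nnUNet code; export_prediction is a placeholder standing in for export_prediction_from_logits):

from concurrent.futures import ProcessPoolExecutor

import numpy as np
import torch


def export_prediction(prediction: np.ndarray) -> float:
    # Stand-in for the real export function: just touch the data in the worker.
    return float(prediction.mean())


if __name__ == "__main__":
    # Large CPU tensor standing in for the nnUNet prediction logits.
    logits = torch.randn(4, 128, 128, 128)

    with ProcessPoolExecutor(max_workers=1) as executor:
        # Before this PR, the torch.Tensor itself was submitted to the worker.
        # Converting to a NumPy array first (as below) sends a plain by-value
        # pickled copy instead, which avoided the SIGBUS crashes on large inputs.
        future = executor.submit(export_prediction, logits.cpu().numpy())
        print(future.result())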

Here's a summary from Claude AI with slightly more details:

SIGBUS Error in PyTorch Multiprocessing: Claude (AI) Summary

The SIGBUS error occurred when passing PyTorch tensors between processes via multiprocessing. Here's what happened:

Root Cause

In the original function, we were serializing PyTorch tensors directly for multiprocessing:

# This line produces a PyTorch tensor
prediction = self.predict_logits_from_preprocessed_data(data).cpu()

# Problematic: Passing PyTorch tensor directly to worker process
future = executor.submit(
    export_prediction_from_logits,
    prediction, properties, self.configuration_manager, 
    # ...other arguments...
)

When PyTorch tensors are sent to child processes, they do not travel as a plain by-value copy: PyTorch uses its own inter-process transfer mechanism for tensor storage (shared memory rather than ordinary pickling). In this setup that transfer did not survive the process boundary cleanly, leading to misaligned or invalid memory accesses in the worker and ultimately to SIGBUS errors.
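
To illustrate the difference in transfer mechanisms (my own sketch, not part of the PR; the exact behaviour depends on platform and sharing strategy): a tensor put on a torch.multiprocessing queue arrives in the child backed by shared memory, while a NumPy array arrives as an ordinary pickled copy.

import numpy as np
import torch
import torch.multiprocessing as mp


def inspect(queue):
    # Receive two objects and report how each one arrived.
    for _ in range(2):
        obj = queue.get()
        if isinstance(obj, torch.Tensor):
            # torch rebuilds the tensor on top of a shared-memory segment.
            print("torch.Tensor, is_shared() =", obj.is_shared())
        else:
            print("np.ndarray, plain pickled copy, shape =", obj.shape)


if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    queue = ctx.Queue()
    worker = ctx.Process(target=inspect, args=(queue,))
    worker.start()
    queue.put(torch.randn(2, 64, 64))     # transferred via torch's shared-memory reduction
    queue.put(np.random.rand(2, 64, 64))  # transferred via ordinary pickle
    worker.join()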

Solution

The fix is to convert the tensor to a NumPy array before passing it to the worker process:

# Convert to NumPy for safer serialization
prediction = self.predict_logits_from_preprocessed_data(data).cpu().numpy()

# Now safe: Passing NumPy array to worker process
future = executor.submit(
    export_prediction_from_logits,
    prediction, properties, self.configuration_manager, 
    # ...other arguments...
)

This fixes the issue by:

  • Converting the prediction to a plain NumPy array, which is serialized for the worker by ordinary pickle (a small sketch follows this list)
  • Giving the worker a fresh, properly aligned copy of the data once the array is unpickled
  • Bypassing PyTorch's tensor-specific inter-process transfer path, whose memory handling is what broke down here
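
As a side note (my own observation, not part of the PR description): on a CPU tensor, .cpu().numpy() itself does not copy anything; the returned array is a view of the same buffer, and the by-value copy only happens when that array is pickled for the worker process.

import torch

t = torch.zeros(3)
a = t.numpy()   # `a` shares the same CPU buffer as `t`; no data is copied here
t[0] = 42.0
print(a[0])     # 42.0 -> same underlying memory; the copy happens later, at pickle time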

The SIGBUS error only shows up with multiprocessing because that is the only code path where the prediction has to cross a process boundary; within a single process the data is never serialized or re-mapped at all.

FabianIsensee self-assigned this on Mar 4, 2025
Kenneth-Schroeder (Author)

Hi @FabianIsensee, thanks for self-assigning.
This is a high-priority PR for us and should be quite simple to review.
If you could find some time to look at it soon, that would be great.
