Bugfix - predict_from_data_iterator tensor serialization issue #2728
I have encountered several fatal SIGBUS errors when running TotalSegmentator, which uses nnUNetv2 under the hood, on large inputs.
After some digging, I was able to boil the issue down to the predict_from_data_iterator function of nnUNet, which passes the (in my case quite large) prediction result tensors to worker processes. For this step, the tensors need to be serialized, which apparently can cause memory alignment issues and trigger SIGBUS errors.
I noticed that the error disappeared when I removed the multiprocessing logic entirely. Converting the predictions to NumPy arrays, which appear to be much more robust to serialization, also fixed the SIGBUS errors while keeping multiprocessing, hence this PR.
Here's a summary from Claude AI with slightly more detail:
SIGBUS Error in PyTorch Multiprocessing: Claude (AI) Summary
The SIGBUS error occurred when passing PyTorch tensors between processes via multiprocessing. Here's what happened:
Root Cause
In the original function, we were serializing PyTorch tensors directly for multiprocessing:
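(The code snippet from the summary did not survive here; the following is a minimal sketch of the pattern only. `predictor`, `export_pool`, `export_prediction_from_logits`, `properties`, `ofile` and `save_probabilities` are illustrative stand-ins for the names used around `predict_from_data_iterator`, not verbatim nnUNetv2 code.)

```python
# Sketch of the old pattern: the prediction is still a torch.Tensor when it is
# dispatched to the export worker pool, so it has to be pickled and sent
# across the process boundary as a tensor (names are illustrative).
prediction = predictor.predict_logits_from_preprocessed_data(data).cpu()
export_pool.starmap_async(
    export_prediction_from_logits,
    ((prediction, properties, ofile, save_probabilities),),
)
```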
When PyTorch tensors are serialized and sent to child processes, their complex memory layout isn't perfectly preserved. This leads to memory alignment issues that trigger SIGBUS errors.
Solution
The fix is to convert the tensor to a NumPy array before passing it to the worker process:
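(Again a hedged sketch with the same illustrative names, not the exact diff of this PR.)

```python
# Same dispatch, but the tensor is detached and converted to a plain NumPy
# array before it crosses the process boundary.
prediction = predictor.predict_logits_from_preprocessed_data(data).detach().cpu().numpy()
export_pool.starmap_async(
    export_prediction_from_logits,
    ((prediction, properties, ofile, save_probabilities),),
)
```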
This fixes the issue because NumPy arrays have a simple, contiguous memory layout that pickles and unpickles reliably, so the worker process reconstructs the data without the alignment problems seen with serialized tensors.
The SIGBUS error only happens with multiprocessing because tensors must cross process boundaries through serialization/deserialization, which can't perfectly preserve their memory layout.
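For completeness, here is a small self-contained script (not nnUNet code) that demonstrates the general pattern of converting a tensor to a NumPy array before handing it to a worker process:

```python
import multiprocessing as mp

import numpy as np
import torch


def export_worker(segmentation: np.ndarray, output_file: str) -> None:
    # In nnUNet this step would resample and write the segmentation to disk;
    # here we only inspect the array to show it arrived intact in the worker.
    print(output_file, segmentation.shape, segmentation.dtype)


if __name__ == "__main__":
    prediction = torch.rand(4, 64, 64, 64)        # stand-in for the logits tensor
    as_numpy = prediction.detach().cpu().numpy()  # convert before pickling
    with mp.get_context("spawn").Pool(processes=2) as pool:
        pool.starmap(export_worker, [(as_numpy, "case_0000.nii.gz")])
```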