Fix synchronization bug for GPU stream async CPU work #1768

Merged
merged 1 commit into from
Jan 15, 2025

Member

@awni awni commented Jan 14, 2025

Fixes synchronization bugs for asynchronous CPU work that runs in the GPU stream.

The underlying problem is that output events were being signaled before all of the preceding work in the GPU stream had actually completed.

For example:

a = mx.ones(5,)    # a has event value 1
x = mx.load()      # x has event value 2
z = mx.all_sum(a)  # waits on event value 1

The load would signal the event with value 2 without ensuring that a had actually finished, so the wait on value 1 was satisfied early and the all_sum could start prematurely.

The fix is to create a new event for primitives like Load and the distributed ops that need to synchronize with the GPU.
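
To make the race concrete, here is a minimal sketch in plain Python. It is not MLX code: `TimelineEvent` and `all_sum_starts_early` are hypothetical names, and the sketch assumes an event is a monotonic counter where a wait on value v is satisfied as soon as the counter reaches v or higher. Under that assumption, sharing one event between `a` and the load reproduces the premature start, and giving the load its own event avoids it.

```python
class TimelineEvent:
    """Hypothetical monotonic counter event (an assumption, not the MLX class).

    A wait on `value` is considered satisfied once the counter is >= `value`.
    """

    def __init__(self):
        self.value = 0

    def signal(self, value):
        # The counter only ever moves forward.
        self.value = max(self.value, value)

    def is_signaled(self, value):
        return self.value >= value


def all_sum_starts_early(load_uses_own_event):
    """Return True if all_sum would see `a` as ready before `a` has finished."""
    stream_event = TimelineEvent()

    # `a` is GPU work that signals value 1 on the stream's event when done.
    a_event, a_value = stream_event, 1
    a_finished = False  # the GPU work for `a` is still in flight

    if load_uses_own_event:
        # Fixed behavior: the load signals a fresh event of its own.
        x_event, x_value = TimelineEvent(), 1
    else:
        # Buggy behavior: the load reuses the stream's event with value 2.
        x_event, x_value = stream_event, 2

    # The asynchronous CPU load completes first and signals its output event.
    x_event.signal(x_value)

    # all_sum now checks whether its input `a` is ready.
    a_looks_ready = a_event.is_signaled(a_value)
    return a_looks_ready and not a_finished


print(all_sum_starts_early(load_uses_own_event=False))
print(all_sum_starts_early(load_uses_own_event=True))
```

The sketch runs the steps in a fixed order instead of using threads, so the outcome does not depend on timing. In the real system the ordering depends on when the CPU work completes relative to the GPU stream.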

@awni awni requested a review from angeloskath January 14, 2025 20:52
@awni awni changed the title from "Fix synchronization bug for in stream async works" to "Fix synchronization bug for GPU stream async CPU work" Jan 14, 2025
Member

@angeloskath angeloskath left a comment

Perfect! That would have gone unnoticed for a long time...

if (in.event().valid()) {
  encode_signal(in.event());
}
// TODO do we need an event wait as well?
Member

That is pretty tricky. Many comm backends have both a synchronous (blocking) and a non-blocking version, but I feel that this may be a bit much for now. So I think going without an event is better, as it will allow us to better pipeline multiple sends, or computation after a send in general.

@awni awni merged commit f288db8 into main Jan 15, 2025
5 checks passed
@awni awni deleted the fix_synch_bug branch January 15, 2025 14:07
2 participants