Best practice: GPU & RDMA Joint Allocation #192


Merged
merged 1 commit into koordinator-sh:main from rdma-end2end on Feb 18, 2025

Conversation

ferris-cx
Contributor

Ⅰ. Describe what this PR does
Since GPUs in AI scenarios require RDMA-capable NICs for high-speed NCCL communication, end-to-end support for RDMA devices must be added, including device discovery, device registration, node resource updates, scheduling, and allocation. A hypothetical usage sketch is shown below.
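
For context, here is a minimal sketch (not part of this PR) of what a joint GPU & RDMA request could look like from the workload side once this end-to-end path is in place. The resource names `koordinator.sh/gpu` and `koordinator.sh/rdma` and the scheduler name `koord-scheduler` are assumptions for illustration; check the repository docs for the names actually registered by the node agent. The point of joint rather than independent allocation is that the scheduler can take device topology into account, e.g. preferring a GPU and an RDMA NIC under the same PCIe switch, which is what high-speed NCCL traffic benefits from.

```go
// Hypothetical example only: the resource names and scheduler name below are
// assumptions for illustration, not taken from this PR.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Both device types are requested on the same container so that the
	// scheduler can allocate them jointly, ideally from the same PCIe/NUMA
	// topology, rather than picking a GPU and an RDMA NIC independently.
	devices := corev1.ResourceList{
		corev1.ResourceName("koordinator.sh/gpu"):  resource.MustParse("100"), // assumed GPU resource name
		corev1.ResourceName("koordinator.sh/rdma"): resource.MustParse("100"), // assumed RDMA resource name
	}

	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "nccl-worker"},
		Spec: corev1.PodSpec{
			SchedulerName: "koord-scheduler", // assumed scheduler name
			Containers: []corev1.Container{{
				Name:  "trainer",
				Image: "nccl-tests:latest",
				Resources: corev1.ResourceRequirements{
					Requests: devices,
					Limits:   devices,
				},
			}},
		},
	}

	fmt.Printf("requests: %v\n", pod.Spec.Containers[0].Resources.Requests)
}
```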
Ⅱ. Does this pull request fix one issue?
No
Ⅲ. Describe how to verify it
Ⅳ. Special notes for reviews
Ⅴ. Checklist
I have written necessary docs and comments
I have added necessary unit tests and integration tests
All checks passed in make test

@ZiMengSheng force-pushed the rdma-end2end branch 3 times, most recently from d68042b to 7da5817 on December 19, 2024 12:52
@saintube changed the title from Best practice to Best practice: GPU & RDMA Joint Allocation on Dec 20, 2024
@ZiMengSheng force-pushed the rdma-end2end branch 4 times, most recently from 2cc1a92 to 8bcbb07 on February 18, 2025 09:24
Signed-off-by: iostream2008@163.com <iostream2008@163.com>
Signed-off-by: wangjianyu <wangjianyu.wjy@alibaba-inc.com>
Member

@saintube left a comment

/lgtm

@saintube added the lgtm label on Feb 18, 2025
@songtao98
Contributor

/lgtm

@koordinator-bot merged commit 1632b6e into koordinator-sh:main on Feb 18, 2025
4 checks passed