-
Notifications
You must be signed in to change notification settings - Fork 58
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MR: Set minimal access permissions for each MR based on its usage #240
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, is this tested yet?
I have verified it runs successfully a PyTorch-FSDP workload and nccl-tests. Both with standard configuration and with configuration of ( I have also verified that when one of the MR permissions is wrong (e.g. removing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. We should probably also test SENDRECV protocol before committing.
Host buffer access permission was missing FI_REMOTE_READ to allow it to be used as a RMA read source buffer on GPU flush. CUDA buffer access permission was missing FI_READ to allow it to be used as a RMA read result buffer on eager local copy. Neuron buffer access permission was set to FI_REMOTE_READ rather than FI_READ to allow it to be used as a RMA read result buffer on eager local copy. In addition, Neuron buffer was unnecessary given FI_REMOTE_READ on send/recv protocol. Use this oppurtonity to also add documentation to the MR permissions setting code. It's worth mentioning that this issue was easily discovered by running with the following configuration: * FI_EFA_USE_DEVICE_RDMA=0 (libfabric disable EFA RDMA support) * OFI_NCCL_DISABLE_NATIVE_RDMA_CHECK=1 (Force plugin to use RMA even when RDMA write is not supported by HW) * FI_LOG_LEVEL=Warn (libfabric set log level to warning) Signed-off-by: Liran Alon <liran@amazon.com>
MR access permissions are curerntly set in the internal function that initialize MR attributes for all registered MRs (set_mr_req_attr() for RDMA protocol and register_mr_buffers() for SENDRECV protocol). Thus, this function is required to consider all the possible usages of registered MRs in the codebase to properly set the access permissions. First, this is prone to error because one may not consider some MR usage (See previous commit which fixes such a bug). Second, setting minimal access permissions for each MR based on its usage provides us some protection against mistakenly mixing MR handles. Resulting in such case in MR access permission violation rather than a memory corruption, which is easier to debug and root-cause. Signed-off-by: Liran Alon <liran@amazon.com>
Converting to draft until we sort out some internal testing failures. |
There have been sizable changes to the plugin since this PR was cut, so the PR would need to be reworked. It is still a good change to take in though. @liralon, can you refresh the PR? The CI issues have been largely resolved. |
Please merge master. |
This PR is composed of 2 commits.
The 1st commit fixes a bug in MR access permissions setting for RMA reads. It is a minimal change to the functions initializing the MR attributes that just fixes the issue.
The 2nd commit change the codebase such that the various MR users determines the specific MR access permissions they require rather than setting the MR permission overly permissive for all possible use cases. This is less error prone (As the bug fixed in 1st commit) and provides some protection on mistakenly confusing MR handles, as that would lead to MR access permission failure rather than potentially a memory corruption issue.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.