feature: support CXI in OFI for Slingshot-11 #3791
This supports the CXI interface for Cassini (AKA Slingshot-11) within the OFI machine layer. No application-level changes are required, but a variety of command-line options are provided to configure the memory pool and the selection of CXI interfaces.
Note: this requires the memory pool in order to efficiently support the FI_MR_ENDPOINT mode of memory registration, as the time cost of registering individual messages would otherwise be entirely too high.
Charmrun has been configured to wrap srun and currently assumes PMI2 with Cray extensions for launching.
The build system has been set up to autodetect CXI and enable support for it accordingly. For compatibility purposes, it also supports the use of cxi on the build line, but that should not be necessary on most HPE systems with proper LMOD environments.
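To illustrate why the memory pool matters for registration cost, here is a minimal sketch in plain C. It is not the code from this PR and does not call libfabric: `fake_register`, `msg_pool`, and `registrations_for_messages` are hypothetical names, and `fake_register` only counts calls as a stand-in for an expensive registration such as `fi_mr_reg`. The idea shown is that one large slab is registered once and messages are carved out of it, so the per-message cost becomes a pointer bump rather than a registration.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for an expensive memory registration call.
 * It performs no real registration; it only counts invocations. */
static int g_registrations = 0;

typedef struct {
    void    *base;
    size_t   len;
    uint64_t key;   /* placeholder for a registration key */
} fake_mr;

static fake_mr fake_register(void *base, size_t len) {
    g_registrations++;
    fake_mr mr = { base, len, (uint64_t)g_registrations };
    return mr;
}

/* A bump-allocated pool: the whole slab is registered once at init. */
typedef struct {
    char   *slab;
    size_t  capacity;
    size_t  used;
    fake_mr mr;
} msg_pool;

static int pool_init(msg_pool *p, size_t capacity) {
    p->slab = malloc(capacity);
    if (p->slab == NULL) return -1;
    p->capacity = capacity;
    p->used = 0;
    p->mr = fake_register(p->slab, capacity);  /* single registration */
    return 0;
}

/* Hand out a 64-byte-aligned chunk; no registration happens here. */
static void *pool_alloc(msg_pool *p, size_t n) {
    size_t aligned = (n + 63u) & ~(size_t)63u;
    if (aligned < n || aligned > p->capacity - p->used) return NULL;
    void *out = p->slab + p->used;
    p->used += aligned;
    return out;
}

static void pool_destroy(msg_pool *p) {
    free(p->slab);
    p->slab = NULL;
}

/* Returns how many registrations were needed to allocate n_messages
 * buffers of msg_size bytes from a fresh pool, or -1 on failure. */
int registrations_for_messages(int n_messages, size_t msg_size) {
    int before = g_registrations;
    size_t per_msg = (msg_size + 63u) & ~(size_t)63u;
    msg_pool pool;
    if (pool_init(&pool, per_msg * (size_t)n_messages) != 0) return -1;
    for (int i = 0; i < n_messages; i++) {
        if (pool_alloc(&pool, msg_size) == NULL) {
            pool_destroy(&pool);
            return -1;
        }
    }
    pool_destroy(&pool);
    return g_registrations - before;
}
```

Calling `registrations_for_messages(1000, 256)` from a small `main` returns 1 in this sketch, whereas registering each message individually would cost 1000 registrations. A real pool would also need to free and reuse chunks, which is omitted here.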
This allows xpmem to be supported in the build process for non-SMP targets. This has not yet proven beneficial for performance.
Looks good but obviously too big to be sure it's correct. Couple of small comments.
enable thread diagnostics in debug case
Any documentation updates needed?
Has non-CXI OFI been tested and benchmarked for performance with the changes to use the LRTS mempool and the ofi request cache?
The last machine that I had access to with regular OFI was turned off last
year. I could try it over TCP/IP somewhere, but not sure how useful the
resulting information would be.
On Fri, Apr 12, 2024 at 1:50 PM Sam White ***@***.***> wrote:
In src/arch/ofi/conv-common.h:
> /*
* Use Simple client-side implementation of PMI.
* Valid only for CMK_USE_PMI.
* Optional in an SLURM environment.
* See src/arch/util/proc_management/simple_pmi/
*/
-#define CMK_USE_SIMPLEPMI 1
+#define CMK_USE_SIMPLEPMI 0
Does this get set somewhere else now in the default OFI case?
LGTM
Finally got a chance to look at this. Everything looks all set to merge.
# assume HPC installation
include(CMakePrintHelpers)
find_package(EnvModules REQUIRED)
find_package(PkgConfig REQUIRED)
if(EnvModules_FOUND)
  #at least get libfabric loaded if it isn't already
  env_module(load libfabric)
endif()
I think this assumption is too restrictive. It does not seem to work when building Charm++ with Spack, since libfabric
is provided as a Spack package and not through modules.
It makes sense for the standard case of building on a DOE machine, but could present a problem for Spack. Is there a simple way to test for the Spack case and then extract the necessary information to accomplish the same smooth build this code accomplishes with LMOD and PkgConfig?
I think the simplest way is to set up Charm++ as a develop build and introduce changes to the build system and Spack package until it works. I started working on it but got stuck, and then other work took priority.