[develop] Build conda and environments in SRW. #938
This reverts commit d3329d4.
Co-authored-by: Michael Lueken <63728921+MichaelLueken@users.noreply.github.com>
@MichaelLueken After giving this AQM issue some thought, I've reached out to @chan-hoo via email to ask some questions about whether AQM can tolerate this sort of change in develop. I will come back to the solution once we've had a chance to work out those kinks. Sorry for the delay here, and I'm glad you caught this! |
@christinaholtNOAA - Given @chan-hoo's reply to your email, I was able to successfully build your build_conda branch on Hera with the
What you currently have should be fine, but we will need to make sure that nothing is done to build the |
@christinaholtNOAA - It should be noted that this test was run using the current settings in I have attempted to run the AQM WE2E test once again, replacing:
with:
This time, the test failed in the following tasks: The following traceback is in the
The following traceback is in the
So, continuing to use the current |
My apologies for the huge delay on getting back to the comments here. I have modified the code to install a I updated my branch to the top of develop, did a fresh build, and successfully ran the fundamental tests on Hera again. |
I'm actually going to re-try that AQM test now. |
I was able to more successfully run the AQM test and fixed the issue that @MichaelLueken reported above. My test is now failing in the |
Thank you very much for addressing my concerns regarding WCOSS2 and the AQM WE2E test! Unfortunately, it looks like the old WE2E staged data was overwritten with a much newer case, so the aqm_lbcs task will fail. As you have noted, however, following your latest updates, all of the rest of the AQM-specific tasks appear to be running smoothly now, so I'm fine with these changes. I have noted one update to the scripts/exregional_make_grid.sh script that appears to have been made for debugging purposes. If this is the case, then it would be a good idea to go ahead and remove this line from the script.
I will go ahead and approve this work now and I will submit the Jenkins tests in the morning.
Unfortunately, while working through the conversations and resolving those that I started, I noted the failure of the grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16 fundamental WE2E test on Orion. A rerun of this test on Orion this morning shows that the test is still failing. It should be noted that this is only an issue on Orion; the rest of the RDHPCS machines successfully run the fundamental WE2E tests without issue.
The test is failing in the run_MET_Pb2nc_obs task with the following error message in the log:
/work/noaa/epic/role-epic/spack-stack/spack-stack-1.4.1/envs/unified-env/install/intel/2022.0.2/met-10.1.1-eiz3e5i/bin/pb2nc: error while loading shared libraries: libpython3.9.so.1.0: cannot open shared object file: No such file or directory
This error message indicates that pb2nc cannot find libpython3.9.so.1.0, so a Python module needs to be loaded when running the verification tasks. Adding load("stack-python/3.9.7") back into build_orion_intel.lua allows the test to run through to completion without issue.
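A missing shared library like the one in the error above can usually be confirmed before rerunning the workflow by inspecting the binary with `ldd`. The following is a minimal sketch, not part of SRW or MET: `missing_libs` is a hypothetical helper name, and it assumes the usual `ldd` output format in which unresolved libraries are reported as `=> not found`.

```shell
#!/usr/bin/env bash
# Sketch: list shared libraries that the dynamic loader cannot resolve.
# missing_libs is a hypothetical helper, not part of SRW or MET.
# It reads ldd-style output on stdin and prints one library name per line.
missing_libs() {
  awk '/=> not found/ { print $1 }'
}

# Example usage (assumes pb2nc is on PATH in the task's environment):
#   ldd "$(command -v pb2nc)" | missing_libs
```

Running this once with and once without the stack-python module loaded should show whether libpython3.9.so.1.0 is the only unresolved dependency.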
The current develop branch fundamental WE2E tests pass on Orion, so I will not be able to merge this PR until the issue with this test is resolved. Once this correction has been made, I will re-approve this PR and submit the automated Jenkins tests.
@MichaelLueken Thanks for all the help with debugging other platforms. I've added back the stack-python that was removed for Orion, but did it in the run_vx local module file for just those tasks. I noticed that the stack python environment was interacting poorly with the conda environment on some tasks (when netcdf is expected, for example), so I want to limit the interaction between stack and conda environments as much as possible. This failure shows that stack-python is necessary for verification. If you don't mind re-running that test again, it would be super helpful. Thanks!
Thanks, @christinaholtNOAA! Rerunning the test now. |
Thank you very much for working with me to get these final changes in! The grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16 WE2E test is now successfully running on Orion:
----------------------------------------------------------------------------------------------------
Experiment name | Status | Core hours used
----------------------------------------------------------------------------------------------------
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta COMPLETE 10.41
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_ COMPLETE 11.26
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2 COMPLETE 8.58
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot COMPLETE 16.12
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR COMPLETE 26.35
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0 COMPLETE 13.75
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16 COMPLETE 21.56
----------------------------------------------------------------------------------------------------
Total COMPLETE 108.03
Once again approving this PR. I will go ahead and launch the Jenkins tests as well. I'd like for @gspetro-NOAA to provide one last look to ensure that she is happy, then I'll be able to merge this PR. Thanks again!
The Additionally, the |
The WE2E coverage tests were manually run on Derecho and all passed:
The only failures in the automated Jenkins tests came from Hera Intel:
The use of rocotorewind/rocotoboot allowed the three tests to pass:
I would like to check with @gspetro-NOAA to ensure that all of her concerns have been addressed, then I will move forward with merging this work. |
Looks good to me! I ran the fundamental tests on Jet for good measure, and they pass. :)
DESCRIPTION OF CHANGES:
Modifies devbuild.sh to add the option to install miniforge (a version of miniconda that manages channels more strictly) in a user-specified location, defaulting to inside the user's clone. It also installs two environments needed for SRW -- srw_app, which is similar to the old workflow_tools environment, and srw_graphics, which is sufficient to support the plotting scripts in SRW.
A few additional details:
conda target to build conda -- it doesn't build by default.
Type of change
TESTS CONDUCTED:
Test suite is still pending on Hera. Edit: The fundamental test suite has passed on Hera.
I also tested the conda installation bits on macOS, but did not carry out an entire build. I did confirm that the environments were installed and could support the readlink utility.
This needs tests on all platforms, most likely.
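For anyone repeating the environment check on another platform, a minimal sketch of how it might be scripted follows. `has_conda_env` is a hypothetical helper, not something this PR adds; it assumes the installed `conda` is on `PATH` and that `conda env list` prints the environment name in the first column.

```shell
#!/usr/bin/env bash
# Sketch: succeed if a named conda environment appears in `conda env list`
# output supplied on stdin. has_conda_env is a hypothetical helper.
# Comment lines in the listing start with "#", so they never match a name.
has_conda_env() {
  awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# Example usage (assumes conda has already been installed and activated):
#   for env in srw_app srw_graphics; do
#     conda env list | has_conda_env "$env" || echo "missing: $env"
#   done
```

The helper only confirms that the environments exist; it does not verify that the packages inside them are usable.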
DEPENDENCIES:
None.
DOCUMENTATION:
I updated the docs with this PR.
ISSUE:
N/A
CHECKLIST