[develop] Build conda and environments in SRW. #938

Merged
merged 50 commits into ufs-community:develop on
Nov 29, 2023

Conversation

christinaholtNOAA
Collaborator

@christinaholtNOAA christinaholtNOAA commented Oct 10, 2023

DESCRIPTION OF CHANGES:

Modifies devbuild.sh to add the option to install miniforge (a version of miniconda that manages channels more strictly) in a user-specified location, defaulting to a location inside the user's clone. It also installs two environments needed for SRW: srw_app, which is similar to the old workflow_tools environment, and srw_graphics, which is sufficient to support the plotting scripts in SRW.

A few additional details:

  • Runs the conda installation first so that macOS users have the bash utilities, such as readlink, that the devbuild.sh script needs.
  • Requires the user to pass the conda target in order to build conda; it is not built by default.
  • Adds a conda module file that points to the user-installed location of miniconda.
  • Modifies the GitHub Actions workflows to use the same environments that are built by devbuild.sh.
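
As a quick sanity check after running devbuild.sh with the conda target, the presence of the two environments can be confirmed from Python. This is a minimal sketch and not part of the PR: it assumes the default conda layout in which named environments live under <prefix>/envs, and the function name missing_envs and the example prefix ./conda are hypothetical.

```python
from pathlib import Path

# Environment names taken from the PR description.
EXPECTED_ENVS = ("srw_app", "srw_graphics")


def missing_envs(conda_prefix, expected=EXPECTED_ENVS):
    """Return the expected environment names that have no directory under <prefix>/envs."""
    envs_dir = Path(conda_prefix) / "envs"
    return [name for name in expected if not (envs_dir / name).is_dir()]


if __name__ == "__main__":
    # "./conda" is a placeholder; use the location passed to devbuild.sh.
    absent = missing_envs("./conda")
    print("all environments present" if not absent else f"missing: {absent}")
```

An empty list means both environments were created; anything else names the environment that still needs to be built.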

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

TESTS CONDUCTED:

  • hera.intel
  • orion.intel
  • hercules.intel
  • cheyenne.intel
  • cheyenne.gnu
  • derecho.intel
  • gaea.intel
  • gaeac5.intel
  • jet.intel
  • wcoss2.intel
  • NOAA Cloud (indicate which platform)
  • Jenkins
  • fundamental test suite
  • comprehensive tests (specify which if a subset was used)

Test suite is still pending on Hera. Edit: The fundamental test suite has passed on Hera.

I also tested the conda installation bits on MacOS, but did not carry out an entire build. I did confirm that the environments were installed and could support the readlink utility.

This needs tests on all platforms, most likely.

DEPENDENCIES:

None.

DOCUMENTATION:

I updated the docs with this PR.

ISSUE:

N/A

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes do not require updates to the documentation (explain).
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

Co-authored-by: Michael Lueken <63728921+MichaelLueken@users.noreply.github.com>
@christinaholtNOAA
Collaborator Author

@MichaelLueken After giving this AQM issue some thought, I've reached out to @chan-hoo via email to ask some questions about whether AQM can tolerate this sort of change in develop. I will come back to the solution once we've had a chance to work out those kinks. Sorry for the delay here, and I'm glad you caught this!

@MichaelLueken
Collaborator

@christinaholtNOAA - Given @chan-hoo's reply to your email, I was able to successfully build your build_conda branch on Hera with the -a=ATMAQ option and was able to run the aqm_grid_AQM_NA13km_suite_GFS_v16 test:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
aqm_grid_AQM_NA13km_suite_GFS_v16                                  COMPLETE            1083.12
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            1083.12

What you currently have should be fine, but we will need to make sure that nothing is done to build the srw_app conda environment on WCOSS2.

@christinaholtNOAA - It should be noted that this test was run using the current settings in modulefiles/tasks/hera/miniconda_regional_workflow_cmaq.lua.

I have attempted to run the AQM WE2E test once again, replacing:

prepend_path("MODULEPATH","/scratch1/NCEPDEV/nems/role.epic/miniconda3/modulefiles")
load(pathJoin("miniconda3", os.getenv("miniconda3_ver") or "4.12.0"))

setenv("SRW_ENV", "regional_workflow_cmaq")

with:

load("conda")
setenv("SRW_ENV", "srw_app")

This time, the test failed in the following tasks:
point_source, nexus_emission_00, nexus_emission_01, and nexus_emission_02

The following traceback is in the point_source log files:

Traceback (most recent call last):
  File "/scratch2/NAGAPE/epic/Michael.Lueken/ufs-srweather-app/sorc/AQM-utils/python_utils/stack-pt-merge.py", line 12, in <module>
    import netCDF4 as nc
  File "/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.4.1/envs/unified-env/install/intel/2021.5.0/py-netcdf4-1.5.3-ofq7pt3/lib/python3.9/site-packages/netCDF4/__init__.py", line 3, in <module>
    from ._netCDF4 import *
ModuleNotFoundError: No module named 'netCDF4._netCDF4'

The following traceback is in the nexus_emission_* log files:

Traceback (most recent call last):
  File "/scratch2/NAGAPE/epic/Michael.Lueken/ufs-srweather-app/sorc/arl_nexus/utils/python/make_nexus_output_pretty.py", line 144, in <module>
    raise SystemExit(main(**parse_args()))
                     ^^^^^^^^^^^^^^^^^^^^
  File "/scratch2/NAGAPE/epic/Michael.Lueken/ufs-srweather-app/sorc/arl_nexus/utils/python/make_nexus_output_pretty.py", line 50, in main
    import netCDF4 as nc
  File "/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.4.1/envs/unified-env/install/intel/2021.5.0/py-netcdf4-1.5.3-ofq7pt3/lib/python3.9/site-packages/netCDF4/__init__.py", line 3, in <module>
    from ._netCDF4 import *
ModuleNotFoundError: No module named 'netCDF4._netCDF4'

So, continuing to use the current regional_workflow_cmaq conda environment works, but updating to srw_app leads to failures due to missing netCDF4 in the conda environment.
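
Both tracebacks show netCDF4 being resolved from the spack-stack python3.9 site-packages rather than from the conda environment. A small diagnostic like the one below, run inside the task environment, can show where a module would be imported from. It is a sketch only: the helper names module_origin and is_inside_prefix are hypothetical, and it relies solely on the standard library.

```python
import importlib.util
import sys
from pathlib import Path


def module_origin(name):
    """Return the file a top-level module would be imported from, or None if not found."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).resolve()


def is_inside_prefix(path, prefix=None):
    """True when path lives under the given environment prefix (default: sys.prefix)."""
    prefix = Path(prefix if prefix is not None else sys.prefix).resolve()
    return prefix in Path(path).resolve().parents


if __name__ == "__main__":
    origin = module_origin("netCDF4")
    if origin is None:
        print(f"netCDF4 is not importable from {sys.executable}")
    elif is_inside_prefix(origin):
        print(f"netCDF4 resolves inside the active environment: {origin}")
    else:
        print(f"netCDF4 resolves outside {sys.prefix}: {origin}")
```

If the reported path lies outside the active environment, a stack-provided package is being picked up ahead of (or instead of) the conda one, which is worth checking against PYTHONPATH and the loaded modules.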

@christinaholtNOAA
Collaborator Author

My apologies for the huge delay in getting back to the comments here. I have modified the code to install an srw_aqm environment when the AQM application is being built. I tested that on Hera and found that the test no longer worked because data was unavailable. I'm now second-guessing that test, though, since it was an older version. I did confirm that the appropriate environment was being loaded for the tasks that were run.

I updated my branch to the top of develop, did a fresh build, and successfully ran the fundamental tests on Hera again.

@christinaholtNOAA
Collaborator Author

I'm actually going to re-try that AQM test now.

@christinaholtNOAA
Collaborator Author

I was able to run the AQM test more successfully and fixed the issue that @MichaelLueken reported above. My test is now failing in the aqm_lbcs task because it can't find data on Hera. Many other AQM tasks ran successfully and loaded the appropriate environment with netCDF4.

Collaborator

@MichaelLueken MichaelLueken left a comment

@christinaholtNOAA -

Thank you very much for addressing my concerns regarding WCOSS2 and the AQM WE2E test! Unfortunately, it looks like the old WE2E staged data was overwritten with a much newer case, so the aqm_lbcs task will fail. As you have noted, however, following your latest updates, all of the rest of the AQM-specific tasks appear to be running smoothly now, so I'm fine with these changes. I have noted one update to the scripts/exregional_make_grid.sh script that appears to have been made for debugging purposes. If this is the case, then it would be a good idea to go ahead and remove this line from the script.

I will go ahead and approve this work now and I will submit the Jenkins tests in the morning.

scripts/exregional_make_grid.sh Outdated
Collaborator

@MichaelLueken MichaelLueken left a comment

@christinaholtNOAA -

Unfortunately, while working through the conversations and resolving those that I started, I noted the failure of the grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16 fundamental WE2E test on Orion. A rerun of this test on Orion this morning shows that the test is still failing. This is only an issue on Orion; the rest of the RDHPCS machines run the fundamental WE2E tests without issue.

The test is failing in the run_MET_Pb2nc_obs task with the following error message in the log:

/work/noaa/epic/role-epic/spack-stack/spack-stack-1.4.1/envs/unified-env/install/intel/2022.0.2/met-10.1.1-eiz3e5i/bin/pb2nc: error while loading shared libraries: libpython3.9.so.1.0: cannot open shared object file: No such file or directory

This error message indicates that the Python shared library (libpython3.9.so.1.0) must be available when running the verification tasks. Adding load("stack-python/3.9.7") back into build_orion_intel.lua allows the test to run through to completion without issue.
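
A failure of this kind can be checked before a full WE2E run by asking the dynamic loader which libraries a binary cannot resolve. The sketch below wraps ldd (Linux only) and parses its "not found" lines; the function names are hypothetical, and the path passed to check_binary would be the pb2nc executable from the log.

```python
import subprocess


def missing_libraries(ldd_output):
    """Return the library names that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        if "=> not found" in line:
            # The library name is the text before the '=>' separator.
            missing.append(line.split("=>")[0].strip())
    return missing


def check_binary(path):
    """Run ldd on path and return the shared libraries it cannot resolve."""
    result = subprocess.run(
        ["ldd", path], capture_output=True, text=True, check=False
    )
    return missing_libraries(result.stdout)
```

Running check_binary in the task's environment with and without stack-python loaded should show whether libpython3.9.so.1.0 is the only unresolved dependency.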

The current develop branch fundamental WE2E tests pass on Orion, so I will not be able to merge this PR until the issue with this test is resolved. Once this correction has been made, I will re-approve this PR and submit the automated Jenkins tests.

@christinaholtNOAA
Collaborator Author

@MichaelLueken Thanks for all the help with debugging other platforms.

I've added back the stack-python that was removed for Orion, but did it in the run_vx local module file for just those tasks. I noticed that the stack python environment was interacting poorly with the conda environment on some tasks (when netcdf is expected, for example), so I want to limit the interaction between stack and conda environments as much as possible. Judging from this failure, it does look necessary for verification.

If you don't mind re-running that test again, it would be super helpful. Thanks!

@MichaelLueken
Collaborator

Thanks, @christinaholtNOAA! Rerunning the test now.

Collaborator

@MichaelLueken MichaelLueken left a comment

@christinaholtNOAA -

Thank you very much for working with me to get these final changes in! The grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16 WE2E test is now successfully running on Orion:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta    COMPLETE              10.41
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE              11.26
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2        COMPLETE               8.58
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              16.12
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR          COMPLETE              26.35
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0              COMPLETE              13.75
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16                COMPLETE              21.56
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             108.03

Once again approving this PR. I will go ahead and launch the Jenkins tests as well. I'd like for @gspetro-NOAA to provide one last look to ensure that she is happy, then I'll be able to merge this PR. Thanks again!

@MichaelLueken MichaelLueken added the run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests label Nov 27, 2023
@MichaelLueken
Collaborator

@christinaholtNOAA -

The Functional UnitTests have failed on Hera GNU, Hera Intel, and Jet. The failure appears to occur because the Functional UnitTests phase of the pipeline, which is run by the .cicd/scripts/srw_unittest.sh script, executes before the Build phase, so the srw_app conda environment does not exist yet. It looks like the Functional UnitTests phase will need to be moved to after the Build phase in .cicd/Jenkinsfile to ensure that the necessary conda environment is available before the unit tests run.

Additionally, the Functional WorkflowTaskTests phase in the pipeline, which is run using the .cicd/scripts/srw_ftest.sh script, failed because the conda activate is hardwired into this script:
conda activate workflow_tools
Please replace workflow_tools with srw_app in this script.
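
To catch other hardwired environment names after a rename like this, the scripts can be scanned for conda activate lines. The following is an illustrative sketch, not something from the PR: the function names are hypothetical, and the regular expression only matches activations that start a line.

```python
import re
from pathlib import Path

# Matches lines such as "conda activate workflow_tools" and captures the name.
ACTIVATE_RE = re.compile(r"^\s*conda\s+activate\s+(\S+)", re.MULTILINE)


def find_conda_activations(root, pattern="*.sh"):
    """Map each script under root to the environment names it activates."""
    hits = {}
    for script in sorted(Path(root).rglob(pattern)):
        names = ACTIVATE_RE.findall(script.read_text(errors="replace"))
        if names:
            hits[str(script)] = names
    return hits


def stale_activations(root, old_env="workflow_tools"):
    """Return only the scripts that still activate old_env."""
    return {
        path: names
        for path, names in find_conda_activations(root).items()
        if old_env in names
    }
```

For example, stale_activations(".cicd/scripts") would list any script that still activates workflow_tools.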

@MichaelLueken
Collaborator

The WE2E coverage tests were run manually on Derecho, and all passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
custom_ESGgrid_IndianOcean_6km                                     COMPLETE              24.18
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot     COMPLETE              38.53
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16                COMPLETE              46.45
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_HRRR           COMPLETE              44.70
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta    COMPLETE              18.82
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_HRRR_suite_HRRR                COMPLETE              41.38
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE              24.88
pregen_grid_orog_sfc_climo                                         COMPLETE              17.04
specify_template_filenames                                         COMPLETE              19.63
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             275.61

@MichaelLueken
Collaborator

The only failures in the automated Jenkins tests came from Hera Intel:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
custom_ESGgrid_Central_Asia_3km                                    COMPLETE              24.51
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_grib2_2019061200          COMPLETE               6.04
get_from_HPSS_ics_GDAS_lbcs_GDAS_fmt_netcdf_2022040400_ensemble_2  COMPLETE             760.78
get_from_HPSS_ics_HRRR_lbcs_RAP                                    COMPLETE              13.83
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2        DEAD                   4.02
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot     DEAD                   3.94
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP                 COMPLETE              10.43
grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2        COMPLETE               6.22
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2         DEAD                 112.91
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16           COMPLETE             305.82
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_HRRR            COMPLETE             324.28
pregen_grid_orog_sfc_climo                                         COMPLETE               7.04
----------------------------------------------------------------------------------------------------
Total                                                              DEAD                1579.82

After using rocotorewind/rocotoboot, the three DEAD tests passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
custom_ESGgrid_Central_Asia_3km                                    COMPLETE              24.51
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_grib2_2019061200          COMPLETE               6.04
get_from_HPSS_ics_GDAS_lbcs_GDAS_fmt_netcdf_2022040400_ensemble_2  COMPLETE             760.78
get_from_HPSS_ics_HRRR_lbcs_RAP                                    COMPLETE              13.83
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2        COMPLETE               9.03
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot     COMPLETE              14.72
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP                 COMPLETE              10.43
grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2        COMPLETE               6.22
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2         COMPLETE             236.20
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16           COMPLETE             305.82
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_HRRR            COMPLETE             324.28
pregen_grid_orog_sfc_climo                                         COMPLETE               7.04
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            1718.90
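
For reference, the rewind-and-boot recovery mentioned above can be scripted. The sketch below only builds the command lines and does not run them. It assumes the standard Rocoto options (-w workflow XML, -d database, -c cycle, -t task); the file names, cycle, and task name are placeholders, since the thread does not say which tasks were rerun.

```python
import shlex


def rocoto_retry_commands(workflow_xml, database, cycle, tasks):
    """Build rocotorewind commands for every task, followed by rocotoboot commands."""
    commands = []
    for tool in ("rocotorewind", "rocotoboot"):
        for task in tasks:
            commands.append(
                [tool, "-w", workflow_xml, "-d", database, "-c", cycle, "-t", task]
            )
    return commands


if __name__ == "__main__":
    # All four arguments below are hypothetical placeholders.
    for cmd in rocoto_retry_commands(
        "workflow.xml", "workflow.db", "202301010000", ["example_task"]
    ):
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to review them before running anything against the experiment's workflow database.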

I would like to check with @gspetro-NOAA to ensure that all of her concerns have been addressed, then I will move forward with merging this work.

Collaborator

@gspetro-NOAA gspetro-NOAA left a comment

Looks good to me! I ran the fundamental tests on Jet for good measure, and they pass. :)

@MichaelLueken MichaelLueken merged commit eecfbdd into ufs-community:develop Nov 29, 2023
1 check passed