Optional point weight file (pnt_wght.ww3.nc) for unstructured grid to speed up initialization #1333

Merged

JessicaMeixner-NOAA
Collaborator

@JessicaMeixner-NOAA JessicaMeixner-NOAA commented Dec 13, 2024

Pull Request Summary

This PR adds the ability to create, and then reuse, a point weight file that speeds up initialization when using an unstructured grid. This is particularly needed for large unstructured grids with a lot of point output.

Description

For unstructured grids, if a pnt_wght.ww3.nc file does not exist, the model writes one. (Note that ww3 is replaced with the grid name for multi-grid runs.) On a subsequent run, if this file exists, the model reads the stored point weights from it instead of searching for the points again. In a test with a global unstructured grid at 15 km resolution in the UFS weather model, this reduced initialization time from 297.8026 s to 18.3455 s, roughly a 16-fold speed-up.
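The write-once, read-afterwards behaviour described above can be sketched as follows. This is only an illustration of the pattern, not the WW3 implementation: the function names, the stored fields, and the use of JSON (chosen so the sketch needs nothing beyond the standard library) are all assumptions; the real file is NetCDF and is handled in the Fortran code.

```python
import json
import os
import tempfile


def weight_file_name(grid="ww3", ext="nc"):
    """File name pattern from the PR text: 'ww3' is replaced by the grid name for multi-grid."""
    return f"pnt_wght.{grid}.{ext}"


def expensive_point_search(points):
    """Hypothetical stand-in for the element search and weight computation per output point."""
    # Pretend every point lands in some element and gets three interpolation weights.
    return [{"element": i, "weights": [1.0, 0.0, 0.0]} for i, _ in enumerate(points)]


def load_or_build_weights(path, points):
    """Return (weights, source); source is 'file' if the stored file was reused, else 'search'."""
    if os.path.exists(path):
        # Subsequent run: skip the search and read the stored result.
        with open(path) as f:
            return json.load(f), "file"
    # First run: do the search once and save the result for later runs.
    weights = expensive_point_search(points)
    with open(path, "w") as f:
        json.dump(weights, f)
    return weights, "search"


if __name__ == "__main__":
    pts = [(0.0, 0.0), (1.0, 1.0)]
    with tempfile.TemporaryDirectory() as workdir:
        # JSON extension here only because this sketch does not write real NetCDF.
        path = os.path.join(workdir, weight_file_name("ww3", ext="json"))
        print(load_or_build_weights(path, pts)[1])  # first call performs the search
        print(load_or_build_weights(path, pts)[1])  # second call reuses the file
```

One consequence of this pattern is that a stale file is reused silently if the grid or the point list changes, so the file has to be removed or regenerated in that case.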

There are two dependencies for this PR, both of which are satisfied as of 12/17/24.

Issue(s) addressed

Fixes #1179

Commit Message

Optional point weight file (pnt_wght.ww3.nc) for unstructured grid to speed up initialization

Check list

Testing

  • How were these changes tested?
    First, I confirmed that no answers changed (/scratch1/NCEPDEV/climate/Jessica.Meixner/PR_WW3/pointssavelist02/regtests). Then I saved the weight files for all unstructured grid tests, copied them into the work directories (copyfiles.sh under /scratch1/NCEPDEV/climate/Jessica.Meixner/PR_WW3/pointssavelist04/regtests/), and confirmed that all answers were reproduced (/scratch1/NCEPDEV/climate/Jessica.Meixner/PR_WW3/pointssavelist04/regtests). Next, I added a weight file to the tar file for one regtest for future testing and tested against that; those are the answers shown below. I also merged these changes into dev/ufs-weather-model and confirmed that we get the desired initialization speed-up.
  • Are the changes covered by regression tests? (If not, why? Do new tests need to be added?) yes
  • Have the matrix regression tests been run (if yes, please note HPC and compiler)? hera intel
  • Please indicate the expected changes in the regression test output, (Note the list of known non-identical tests.)
  • Please provide the summary output of matrix.comp (matrix.Diff.txt, matrixCompFull.txt and matrixCompSummary.txt):
**********************************************************************
********************* non-identical cases ****************************
**********************************************************************
mww3_test_03/./work_PR1_MPI_e                     (1 files differ)
mww3_test_03/./work_PR3_UQ_MPI_e_c                     (1 files differ)
mww3_test_03/./work_PR3_UNO_MPI_e                     (1 files differ)
mww3_test_03/./work_PR2_UQ_MPI_e                     (1 files differ)
mww3_test_03/./work_PR2_UNO_MPI_e                     (1 files differ)
mww3_test_03/./work_PR2_UNO_MPI_d2                     (13 files differ)
mww3_test_03/./work_PR1_MPI_d2                     (11 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2_c                     (18 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2_c                     (15 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2                     (17 files differ)
mww3_test_03/./work_PR2_UQ_MPI_d2                     (16 files differ)
mww3_test_03/./work_PR3_UQ_MPI_e                     (1 files differ)
mww3_test_03/./work_PR3_UNO_MPI_e_c                     (1 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2                     (16 files differ)
mww3_test_09/./work_MPI_ASCII                     (0 files differ)
ww3_tp2.10/./work_MPI_OMPH                     (6 files differ)
ww3_tp2.16/./work_MPI_OMPH                     (4 files differ)
ww3_tp2.17/./work_a                     (0 files differ)
ww3_tp2.17/./work_c                     (0 files differ)
ww3_tp2.17/./work_b                     (0 files differ)
ww3_tp2.19/./work_1B_a                     (0 files differ)
ww3_tp2.19/./work_1A_a                     (0 files differ)
ww3_tp2.19/./work_1C_a                     (0 files differ)
ww3_tp2.21/./work_ma                     (3 files differ)
ww3_tp2.21/./work_b_metis                     (0 files differ)
ww3_tp2.21/./work_a                     (0 files differ)
ww3_tp2.21/./work_mb                     (3 files differ)
ww3_tp2.21/./work_b                     (0 files differ)
ww3_tp2.6/./work_ST0                     (0 files differ)
ww3_tp2.6/./work_ST4                     (0 files differ)
ww3_tp2.6/./work_pdlib                     (0 files differ)
ww3_tp2.6/./work_ST4_ASCII                     (0 files differ)
ww3_tp2.7/./work_ST0                     (0 files differ)
ww3_ts4/./work_ug_MPI                     (0 files differ)
ww3_ufs1.1/./work_unstr_b                     (4 files differ)
ww3_ufs1.1/./work_unstr_a                     (4 files differ)
ww3_ufs1.1/./work_unstr_c                     (4 files differ)
ww3_ufs1.3/./work_a                     (3 files differ)

These are the expected non-b4b (non-bit-for-bit) cases. The cases listed with "0 files differ" appear because of the new point weight output file, and the two ww3_tp2.21 cases will no longer be listed once #1325 is merged.
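To separate the "0 files differ" entries from the cases with real differences, the summary lines can be parsed with a few lines of Python. This is a minimal sketch that assumes the line format shown above (`case (N files differ)`); the function and variable names are hypothetical and it is not part of matrix.comp.

```python
import re

# One summary line looks like: "ww3_tp2.17/./work_a      (0 files differ)"
LINE_RE = re.compile(r"^(?P<case>\S+)\s+\((?P<n>\d+) files differ\)\s*$")


def parse_non_identical(text):
    """Map each listed case to its number of differing files; banner lines are ignored."""
    counts = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            counts[match.group("case")] = int(match.group("n"))
    return counts


sample = """\
**********************************************************************
********************* non-identical cases ****************************
**********************************************************************
mww3_test_03/./work_PR1_MPI_e                     (1 files differ)
ww3_tp2.17/./work_a                     (0 files differ)
ww3_ufs1.1/./work_unstr_b                     (4 files differ)
"""

counts = parse_non_identical(sample)
# Cases listed although none of the compared files differ.
zero_diff = sorted(case for case, n in counts.items() if n == 0)
# Cases where at least one compared file differs.
real_diffs = sorted(case for case, n in counts.items() if n > 0)
```

Running the same parser over two summaries (for example the author's and a reviewer's) and comparing the resulting dictionaries shows which cases differ between the two runs.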

matrixCompSummary.txt
matrixCompFull.txt

Collaborator

@MatthewMasarik-NOAA MatthewMasarik-NOAA left a comment

Code review
Pass

Testing
Pass

**********************************************************************          
********************* non-identical cases ****************************          
**********************************************************************          
mww3_test_03/./work_PR1_MPI_e                     (1 files differ)              
mww3_test_03/./work_PR3_UQ_MPI_e_c                     (1 files differ)         
mww3_test_03/./work_PR3_UNO_MPI_e                     (1 files differ)          
mww3_test_03/./work_PR2_UNO_MPI_e                     (1 files differ)          
mww3_test_03/./work_PR2_UNO_MPI_d2                     (12 files differ)        
mww3_test_03/./work_PR1_MPI_d2                     (12 files differ)            
mww3_test_03/./work_PR3_UNO_MPI_d2_c                     (15 files differ)      
mww3_test_03/./work_PR3_UQ_MPI_d2_c                     (16 files differ)       
mww3_test_03/./work_PR3_UNO_MPI_d2                     (17 files differ)        
mww3_test_03/./work_PR2_UQ_MPI_d2                     (16 files differ)         
mww3_test_03/./work_PR3_UQ_MPI_e                     (1 files differ)           
mww3_test_03/./work_PR3_UNO_MPI_e_c                     (1 files differ)        
mww3_test_03/./work_PR3_UQ_MPI_d2                     (15 files differ)         
mww3_test_09/./work_MPI_ASCII                     (0 files differ)              
ww3_tp2.10/./work_MPI_OMPH                     (6 files differ)                 
ww3_tp2.16/./work_MPI_OMPH                     (4 files differ)                 
ww3_tp2.17/./work_a                     (0 files differ)                        
ww3_tp2.17/./work_c                     (0 files differ)                        
ww3_tp2.17/./work_b                     (0 files differ)                        
ww3_tp2.19/./work_1B_a                     (0 files differ)                     
ww3_tp2.19/./work_1A_a                     (0 files differ)                     
ww3_tp2.19/./work_1C_a                     (0 files differ)                     
ww3_tp2.21/./work_b_metis                     (0 files differ)                  
ww3_tp2.21/./work_a                     (0 files differ)                        
ww3_tp2.21/./work_b                     (0 files differ)                        
ww3_tp2.6/./work_ST0                     (0 files differ)                       
ww3_tp2.6/./work_ST4                     (0 files differ)                       
ww3_tp2.6/./work_pdlib                     (0 files differ)                     
ww3_tp2.6/./work_ST4_ASCII                     (0 files differ)                 
ww3_tp2.7/./work_ST0                     (0 files differ)                       
ww3_ts4/./work_ug_MPI                     (0 files differ)                      
ww3_ufs1.1/./work_unstr_b                     (0 files differ)                  
ww3_ufs1.1/./work_unstr_a                     (0 files differ)                  
ww3_ufs1.1/./work_unstr_c                     (0 files differ)                  
ww3_ufs1.3/./work_a                     (3 files differ)                        
                                                                                
**********************************************************************          
************************ identical cases *****************************          
**********************************************************************

@MatthewMasarik-NOAA MatthewMasarik-NOAA merged commit e82df78 into NOAA-EMC:develop Dec 20, 2024
3 of 14 checks passed
@MatthewMasarik-NOAA
Collaborator

Thanks @JessicaMeixner-NOAA, I'm glad to see this fix had such a big impact on performance!

@JessicaMeixner-NOAA JessicaMeixner-NOAA deleted the feature/savepointsunst branch January 3, 2025 20:39
@thesser1
Collaborator

@JessicaMeixner-NOAA and @MatthewMasarik-NOAA, I was able to replicate my tests, and all commits worked until I got to this PR. It looks like something in this PR is causing the hang in tp2.6. Additionally, on my other HPC I am seeing an error that is new to me (below).

Rank 220 [Mon Jan 13 18:09:49 2025] [c1-0c1s12n1] Fatal error in MPIR_CRAY_Bcast_Tree: Other MPI error, error stack:
MPIR_CRAY_Bcast_Tree(405): message sizes do not match across processes in the collective routine: Received 1 but expected 18

I looked through the PR, and there seems to be a fair amount to unravel that could be causing these issues.

@sbanihash
Collaborator

@thesser1 @JessicaMeixner-NOAA I can also confirm that this PR is causing the stalling in tp2.6. Going back to commit 488e3c solves the issue. @JessicaMeixner-NOAA @MatthewMasarik-NOAA please let me know how to proceed.

@JessicaMeixner-NOAA
Collaborator Author

@sbanihash We need to create an issue and try to fix the problem. @thesser1 - does this work for you?

@thesser1
Collaborator

Yes, on our side, we will likely roll back a commit and continue working with our development until this is resolved. It is not ideal, but we have some boundary bugs that need testing before we lose momentum.

Successfully merging this pull request may close these issues.

Add an option to read from file which element a requested point output is in
4 participants