No condition to check a single node saves the model in multiGPU training: might lead to saving corrupt model #1064
Unanswered
SwetaMahajan
asked this question in
Q&A
Replies: 1 comment
-
It's only saved on the master process; see open_clip/src/open_clip_train/main.py, line 490 in 7260a46.
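For anyone unfamiliar with the pattern, here is a minimal sketch of a "save only on master" guard. It is not the open_clip code: the helper names `is_master_process` and `save_checkpoint_if_master` are hypothetical, `json.dump` stands in for `torch.save` so the snippet runs without PyTorch, and it assumes a launcher such as `torchrun` that exports a `RANK` environment variable for each worker.

```python
import json
import os
import tempfile


def is_master_process() -> bool:
    # Assumption: the launcher exports RANK for every worker (torchrun does).
    # A missing RANK is treated as a plain single-process run.
    return int(os.environ.get("RANK", "0")) == 0


def save_checkpoint_if_master(state: dict, path: str) -> bool:
    """Write `state` to `path` from the rank-0 process only.

    Returns True if this process wrote the file, False if it skipped.
    """
    if not is_master_process():
        return False

    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)

    # Write to a temporary file in the same directory, then rename it into
    # place, so a reader never sees a half-written checkpoint.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)  # stand-in for torch.save(state, f)
    os.replace(tmp_path, path)
    return True
```

With a guard like this, every worker can call the save function at the end of an epoch, but only one of them touches the file, so concurrent writes to the same path do not happen.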
0 replies
-
open_clip/src/open_clip_train/main.py
Lines 512 to 517 in 7260a46
There seems to be no condition ensuring that only a single node saves the model checkpoint during multi-GPU training. This might lead to a corrupt checkpoint if all of the nodes try to save the same model to the same path. Can someone please explain?