Conversation

LucasLLC
Contributor

@LucasLLC LucasLLC commented Jul 2, 2024

Summary:
Distributed State Dict (DSD) is PyTorch's currently recommended way to ensure that a parallelized model's state dict is compatible with save/load in single-process or re-sharding scenarios.

This diff updates dcp_saver to use DSD for DDP models. A good next step would be to wrap all models in TNT with DSD, as this could replace some of the wrapper logic for FSDP and would guarantee future compatibility.

N5551629 also contains a workaround for DDP models saved before this diff: manually removing the "module." prefix from keys in the checkpoint.
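The workaround can be sketched as a plain dict transform; the helper name below is hypothetical, not the actual code from N5551629.

```python
def strip_module_prefix(state_dict):
    """Drop the DDP "module." prefix from checkpoint keys so a checkpoint
    saved from a DDP-wrapped model loads into an unwrapped model."""
    prefix = "module."
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }

old_ckpt = {"module.weight": 1, "module.bias": 2}
print(strip_module_prefix(old_ckpt))  # {'weight': 1, 'bias': 2}
```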

Differential Revision: D59234083

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D59234083

Summary:
Pull Request resolved: #857

@LucasLLC LucasLLC force-pushed the export-D59234083 branch from 5818bb8 to 435b1cb Compare July 8, 2024 15:32
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D59234083
