Questions on chap05 10-llm-training-speed Multi gpu script #569

STEVENTAN100 · 2025-03-15T03:49:59Z

I ran 02_opt_multi_gpu_dpp.py on A100 GPUs with CUDA_VISIBLE_DEVICES=1,2, but found that only one GPU was used.
So I went back and ran the appendix-A code, both the DDP version and the version with torchrun, and both work fine on multiple GPUs.
Looking forward to your reply!

rasbt · 2025-03-15T14:57:42Z

rasbt
Mar 15, 2025
Maintainer

Hi there,

thanks for the command. And hm, that's weird!

I know you probably don't want to train on the first GPU, but could you try

CUDA_VISIBLE_DEVICES=0,1

STEVENTAN100 Mar 23, 2025
Author

Hi there, would you mind clarifying here? When you say "missing the code", do you mean you forgot it in your script? Because I just double-checked, and the
world_size = torch.cuda.device_count()
mp.spawn(main, args=(world_size, args.save_every, args.total_epochs, args.batch_size), nprocs=world_size)
code is both on GitHub and in the book. Or am I missing something here?

The code appears in the appendix-A DDP script, but not in https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/10_llm-training-speed/02_opt_multi_gpu_dpp.py, and I think that's why the multi-GPU training didn't work.

d-kleine Mar 24, 2025

Yeah, I agree that seems to be missing in the code for the multi-gpu training:

LLMs-from-scratch/ch05/10_llm-training-speed/02_opt_multi_gpu_dpp.py

Lines 548 to 561 in feb1e9a

    if __name__ == "__main__":
        # NEW: Extract rank and world size from environment variables
        if "WORLD_SIZE" in os.environ:
            world_size = int(os.environ["WORLD_SIZE"])
        else:
            world_size = 1

        if "LOCAL_RANK" in os.environ:
            rank = int(os.environ["LOCAL_RANK"])
        elif "RANK" in os.environ:
            rank = int(os.environ["RANK"])
        else:
            rank = 0

But I think this is resolved by launching with the torchrun (elastic launch) command: setting the --nproc_per_node param to a value > 1 launches one process per GPU.

@rasbt BTW the script should be renamed to 02_opt_multi_gpu_ddp.py (not dpp)

I also just ran the single-GPU training locally on my GPU. torch.compile(model) does not work on Windows because it uses triton, and unfortunately there is still no Windows wheel for that package. Windows users therefore need to switch to WSL or to triton-windows, an unofficial fork of triton that provides Windows wheels. This affects both the single-GPU and the multi-GPU training and might be worth adding to the README.
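One way to act on this would be to guard the compile step so the script falls back to eager mode when Triton is unavailable. The sketch below is only an illustration, not code from the repository: `triton_available` and `maybe_compile` are hypothetical helper names, the `compile_fn` and `has_triton` parameters exist only so the logic can be tried without a GPU, and it assumes that an importable `triton` module is a good enough proxy for whether `torch.compile` will work.

```python
import importlib.util
import platform


def triton_available():
    """Return True if an importable `triton` module is installed."""
    return importlib.util.find_spec("triton") is not None


def maybe_compile(model, compile_fn=None, has_triton=None):
    """Compile the model if Triton is present; otherwise return it unchanged.

    compile_fn and has_triton are hypothetical hooks added for this sketch.
    """
    if has_triton is None:
        has_triton = triton_available()
    if not has_triton:
        print(f"Triton not found on {platform.system()}; skipping torch.compile.")
        return model
    if compile_fn is None:
        import torch  # imported lazily so the check above runs without PyTorch
        compile_fn = torch.compile
    return compile_fn(model)
```

In a training script this would replace a bare `model = torch.compile(model)` with `model = maybe_compile(model)`, so Windows users without Triton still get an uncompiled but working model.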

d-kleine Mar 24, 2025

@STEVENTAN100 Did you run the script via python 02_opt_multi_gpu_dpp.py or torchrun --nproc_per_node=4 02_opt_multi_gpu_dpp.py? If you used the first command, that would explain why it's only using one GPU. Your screenshot above lists 8 devices, so you could try the torchrun command above with --nproc_per_node=8.
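To make the failure mode concrete: torchrun sets the WORLD_SIZE, RANK, and LOCAL_RANK environment variables for each process it launches, whereas a plain python call does not, so the script's fallback values apply. Here is a minimal sketch of that lookup, mirroring the snippet quoted above. The function name `read_dist_env` is made up for illustration, and the environment is passed in as a dict so it can be tried without GPUs.

```python
import os


def read_dist_env(env=None):
    """Return (world_size, rank) the way the script's __main__ block derives them."""
    if env is None:
        env = os.environ
    world_size = int(env.get("WORLD_SIZE", 1))
    if "LOCAL_RANK" in env:
        rank = int(env["LOCAL_RANK"])
    elif "RANK" in env:
        rank = int(env["RANK"])
    else:
        rank = 0
    return world_size, rank


# Launched with plain `python`: nothing is set, so a single process is assumed
print(read_dist_env({}))  # prints (1, 0)

# Launched by torchrun with two processes: this would be the second worker
print(read_dist_env({"WORLD_SIZE": "2", "RANK": "1", "LOCAL_RANK": "1"}))  # prints (2, 1)
```

With a world size of 1, only one training process exists, which matches the single busy GPU in the report.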

rasbt Mar 24, 2025
Maintainer

The code appears in the appendix-A DDP script, but not in https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/10_llm-training-speed/02_opt_multi_gpu_dpp.py, and I think that's why the multi-GPU training didn't work.

Oh I see. Thanks for clarifying. Yes, that's correct. I only focused on the torchrun code here to keep the code differences minimal. Since most people use torchrun, and it is also the "PyTorch officially recommended way", I was planning to recommend only that moving forward. (Also, I didn't want to mix & match mp.spawn code for people who use torchrun, and I think it's just easier to let torchrun handle it.)

The README (https://github.com/rasbt/LLMs-from-scratch/tree/main/ch05/10_llm-training-speed) already mentions torchrun, but maybe I can make it clearer by saying that it's the only way to run this script.

Answer selected by STEVENTAN100

STEVENTAN100 Mar 24, 2025
Author

@rasbt @d-kleine Thanks for your kind help! After checking, I see it was careless of me to overlook the torchrun command.

d-kleine Mar 24, 2025

@STEVENTAN100 No worries, this happens to all of us 🙂 It was super helpful including your screenshots in your report to narrow down the issue. And I'm sure other users will benefit from the improvements in #578

By the way, did the torchrun command work for you as expected to utilize all GPUs with the 02_opt_multi_gpu_ddp.py script?

@rasbt You might add to #578 that the value for --nproc_per_node should correspond to the number of GPUs available and may need to be changed depending on the individual setup. Currently, the command in the text uses --nproc_per_node=4, but this value would be problematic for a multi-GPU setup with only 2 or 3 GPUs.
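As a rough illustration of matching the value to the setup, the sketch below derives the process count from CUDA_VISIBLE_DEVICES and builds the corresponding command string. The helper names `count_visible_gpus` and `torchrun_command` are hypothetical, and the `fallback` argument is an assumption for the case where the variable is unset (in practice one would probably use torch.cuda.device_count() there).

```python
import os
import shlex


def count_visible_gpus(env=None, fallback=1):
    """Count the devices listed in CUDA_VISIBLE_DEVICES; use `fallback` if unset."""
    if env is None:
        env = os.environ
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return fallback
    return len([d for d in value.split(",") if d.strip()])


def torchrun_command(script, num_gpus):
    """Build a torchrun command that starts one process per GPU."""
    return shlex.join(["torchrun", f"--nproc_per_node={num_gpus}", script])


# Example for the setup in the original question (GPUs 1 and 2 visible)
n = count_visible_gpus({"CUDA_VISIBLE_DEVICES": "1,2"})
print(torchrun_command("02_opt_multi_gpu_dpp.py", n))
# prints: torchrun --nproc_per_node=2 02_opt_multi_gpu_dpp.py
```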

Additionally, it would be helpful to include a comment in the compile section about the triton issue for Windows users.

STEVENTAN100 Mar 25, 2025
Author

By the way, did the torchrun command work for you as expected to utilize all GPUs with the 02_opt_multi_gpu_ddp.py script?

@rasbt You might add to #578 that the value for --nproc_per_node should correspond the num of GPUs available and might be changed due to the individual setup. Currently, the command in the text uses --nproc_per_node=4, but this value would be problematic for a multi-gpu setup with only 2 or 3 GPUs.

Additionally, it would be helpful to include a comment in the compile section about the `triton issue for Windows users.

Yes, it worked!
