[BUG] Problems to run using a multiGPU setup and lightning #3360

Open
avelinoapheris opened this issue Apr 3, 2025 · 2 comments
Labels
bug Something isn't working

Comments

avelinoapheris commented Apr 3, 2025

Describe the bug

I am trying to run a job using the Lightning API on a multi-GPU setup. The setup works fine if I use only one GPU, but as soon as I set the number of devices to more than one, the program crashes while trying to initialise the client_api. I did some digging and the problem is in this line: in processes where rank > 0, the data_bus is empty.

I also tested the current version on main, since that code looks different, but the problem is the same; the relevant piece of code was just moved here instead.
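
For context, the failure mode can be illustrated roughly as follows. This is a conceptual sketch, not the actual NVFlare source; names are simplified.

# Conceptual sketch (not NVFlare source): why rank > 0 fails.
# The in-process client API object is registered on a process-local data bus
# by the executor before the training script runs. Lightning's DDP strategy
# re-launches the script in fresh processes for rank > 0, and their data bus
# was never populated, so the lookup returns None and .init() blows up.

_data_bus = {}  # stands in for the per-process DataBus / module-level singleton

def get_client_api():
    return _data_bus.get("client_api")  # None in freshly spawned DDP ranks

def init(rank=None):
    client_api = get_client_api()
    client_api.init(rank=rank)  # AttributeError: 'NoneType' object has no attribute 'init'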

To Reproduce

### job.py ###


from nvflare.app_opt.pt.job_config.fed_avg import FedAvgJob
from nvflare.job_config.script_runner import ScriptRunner
from nvflare.private.fed.app.simulator.simulator_runner import SimulatorRunner
from client import LitModel

if __name__ == "__main__":
    n_clients = 1
    num_rounds = 2
    train_script = "client.py"
    name = "test-multi-gpu"
    job = FedAvgJob(
        name=name, n_clients=n_clients, num_rounds=num_rounds, initial_model=LitModel()
    )

    # Add clients
    for i in range(n_clients):
        executor = ScriptRunner(
            script=train_script, script_args=""
        )
        job.to(executor, f"site-{i + 1}")

    workspace = "/tmp/nvflare/workspace"
    job_dir = "/tmp/nvflare/jobs/job_config"
    job.export_job(job_dir)
    simulator = SimulatorRunner(
                job_folder=f"{job_dir}/{name}",
                workspace=workspace,
                clients="1".join([f"site-{i + 1}" for i in range(n_clients)]),
                n_clients=n_clients,
                threads=n_clients,
                end_run_for_all=True,
            )
    run_status = simulator.run()

Just run python job.py in the same folder as client.py and job.py.
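
The Lightning training script client.py is not included in the report. Below is a minimal sketch of what it presumably looks like, assuming the standard NVFlare Lightning client API pattern (flare.patch, as seen in the traceback) with devices=2 to trigger the multi-GPU path; the LitModel and random data here are placeholders for the reporter's actual code.

### client.py (hedged sketch, not from the original report) ###
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

import nvflare.client.lightning as flare


class LitModel(pl.LightningModule):
    # Placeholder LightningModule; job.py imports LitModel from this file.
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


if __name__ == "__main__":
    # Random data just to make the sketch runnable.
    dataset = TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,)))
    train_loader = DataLoader(dataset, batch_size=8)

    model = LitModel()
    # devices=2 is what triggers the reported crash; devices=1 works.
    trainer = pl.Trainer(max_epochs=1, accelerator="gpu", devices=2, strategy="ddp")

    flare.patch(trainer)  # the call that raises AttributeError for rank > 0

    while flare.is_running():
        input_model = flare.receive()  # FLModel for the current round
        trainer.fit(model, train_loader)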

Expected behavior
The trainer should run the training. Instead, the job crashes with the following traceback:

Traceback (most recent call last):
  File "/tmp/nvflare/workspace/site-1/simulate_job/app_site-1/custom/client.py", line 63, in <module>
    flare.patch(trainer)
  File "/home/avelinojaver/miniconda3/envs/model-huggingface/lib/python3.10/site-packages/nvflare/app_opt/lightning/api.py", line 75, in patch
    fl_callback = FLCallback(rank=trainer.global_rank, load_state_dict_strict=load_state_dict_strict)
  File "/home/avelinojaver/miniconda3/envs/model-huggingface/lib/python3.10/site-packages/nvflare/app_opt/lightning/api.py", line 95, in __init__
    init(rank=str(rank))
  File "/home/avelinojaver/miniconda3/envs/model-huggingface/lib/python3.10/site-packages/nvflare/client/api.py", line 55, in init
    client_api.init(rank=rank)
AttributeError: 'NoneType' object has no attribute 'init'

Desktop (please complete the following information):

  • OS: Ubuntu 22.04
  • Python version: 3.10
  • NVFlare version: 2.5.2
  • Hardware: at least two GPUs are required to reproduce
avelinoapheris added the bug label on Apr 3, 2025
chesterxgchen (Collaborator) commented:

This isn't expected to work for multi-GPU. The ScriptRunner is a wrapper for InProcessClientAPI, which is designed for in-memory message exchange between the Executor and the training code.

For multi-GPU, a SubprocessLauncher-based executor should be used, where the executor and the training script run in separate processes and communicate via CellPipe.

You mentioned (in a separate thread) that this one doesn't work:
https://github.com/NVIDIA/NVFlare/tree/main/examples/hello-world/ml-to-fl/pt#transform-cifar10-pytorch-lightning--ddp-training-code-to-fl-with-nvflare-client-lightning-integration-api

Note that you are using the "main" branch (which is the dev branch); did you check the 2.5 branch for the same example?

avelinoapheris (Author) commented:

Thanks a lot for the clarification @chesterxgchen. That makes a lot of sense, but here it is mentioned that ScriptRunner can switch to ClientAPILauncherExecutor if launch_external_process=True.
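
For reference, the change being discussed would look roughly like this in job.py. This is a hedged sketch: the torch.distributed.run launch command is illustrative, and it assumes the launch_external_process and command parameters of ScriptRunner behave as described in the 2.5 documentation.

# Hedged sketch: run the training script in an external process so the
# ClientAPILauncherExecutor / CellPipe path is used instead of InProcessClientAPI.
executor = ScriptRunner(
    script=train_script,
    script_args="",
    launch_external_process=True,  # switch to the launcher-based executor
    command="python3 -m torch.distributed.run --nnodes=1 --nproc_per_node=2",  # illustrative DDP launch
)
job.to(executor, f"site-{i + 1}")  # inside the same per-client loop as before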

Regarding this script, I tested it on both the main branch and the 2.5 branch. The relevant error I get is:

2025-04-07 14:57:43,399 - SubprocessLauncher - INFO -     raise MisconfigurationException(
2025-04-07 14:57:43,399 - SubprocessLauncher - INFO - lightning_fabric.utilities.exceptions.MisconfigurationException: You requested gpu: [0, 1]
2025-04-07 14:57:43,399 - SubprocessLauncher - INFO -  But your machine only has: [0]

It seems to me that the issue is that we are requesting two devices here, but the client API is initialising only one device here.
However, if I change that line to job.simulator_run("/tmp/nvflare/jobs/workdir", gpu="0,1") and then call the script with python3 pt_client_api_job.py --script src/cifar10_lightning_ddp_fl.py --key_metric val_acc_epoch --launch_process --n_clients 1, I get the error:

2025-04-07 15:03:29,902 - SimulatorRunner - ERROR - The number of clients (1) must be larger than or equal to the number of GPU groups: (2)

Is there a hard constraint on using simulator mode with more than one device per client?
