[BUG] Problems to run using a multiGPU setup and lightning #3360

Open
avelinoapheris opened this issue Apr 3, 2025 · 2 comments
Labels
bug Something isn't working

Comments

avelinoapheris commented Apr 3, 2025

Describe the bug

I am trying to run a job using the Lightning API on a multi-GPU setup. The setup works fine if I use only one GPU, but as soon as I set the number of devices to more than one, the program crashes while trying to initialise the client_api. I did some digging and the problem is in this line: in processes where rank > 0, the data_bus is empty.

I also tested the current version on main, since that code looks different, but the problem is the same; the relevant piece of code was just moved here instead.
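
For context, the failure mode can be illustrated roughly as follows. This is a conceptual sketch, not the actual NVFlare source; names are simplified.

# Conceptual sketch (not NVFlare source): why rank > 0 fails.
# The in-process client API object is registered on a process-local data bus
# by the executor before the training script runs. Lightning's DDP strategy
# re-launches the script in fresh processes for rank > 0, and their data bus
# was never populated, so the lookup returns None and .init() blows up.

_data_bus = {}  # stands in for the per-process DataBus / module-level singleton

def get_client_api():
    return _data_bus.get("client_api")  # None in freshly spawned DDP ranks

def init(rank=None):
    client_api = get_client_api()
    client_api.init(rank=rank)  # AttributeError: 'NoneType' object has no attribute 'init'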

To Reproduce

### job.py ###


from nvflare.app_opt.pt.job_config.fed_avg import FedAvgJob
from nvflare.job_config.script_runner import ScriptRunner
from nvflare.private.fed.app.simulator.simulator_runner import SimulatorRunner
from client import LitModel

if __name__ == "__main__":
    n_clients = 1
    num_rounds = 2
    train_script = "client.py"
    name = "test-multi-gpu"
    job = FedAvgJob(
        name=name, n_clients=n_clients, num_rounds=num_rounds, initial_model=LitModel()
    )

    # Add clients
    for i in range(n_clients):
        executor = ScriptRunner(
            script=train_script, script_args=""
        )
        job.to(executor, f"site-{i + 1}")

    workspace = "/tmp/nvflare/workspace"
    job_dir = "/tmp/nvflare/jobs/job_config"
    job.export_job(job_dir)
    simulator = SimulatorRunner(
                job_folder=f"{job_dir}/{name}",
                workspace=workspace,
                clients="1".join([f"site-{i + 1}" for i in range(n_clients)]),
                n_clients=n_clients,
                threads=n_clients,
                end_run_for_all=True,
            )
    run_status = simulator.run()

Just run python job.py in the same folder as client.py and job.py.
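
The Lightning training script client.py is not included in the report. Below is a minimal sketch of what it presumably looks like, assuming the standard NVFlare Lightning client API pattern (flare.patch, as seen in the traceback) with devices=2 to trigger the multi-GPU path; the LitModel and random data here are placeholders for the reporter's actual code.

### client.py (hedged sketch, not from the original report) ###
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

import nvflare.client.lightning as flare


class LitModel(pl.LightningModule):
    # Placeholder LightningModule; job.py imports LitModel from this file.
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


if __name__ == "__main__":
    # Random data just to make the sketch runnable.
    dataset = TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,)))
    train_loader = DataLoader(dataset, batch_size=8)

    model = LitModel()
    # devices=2 is what triggers the reported crash; devices=1 works.
    trainer = pl.Trainer(max_epochs=1, accelerator="gpu", devices=2, strategy="ddp")

    flare.patch(trainer)  # the call that raises AttributeError for rank > 0

    while flare.is_running():
        input_model = flare.receive()  # FLModel for the current round
        trainer.fit(model, train_loader)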

Expected behavior
The trainer should run the training. Instead, the job crashes with the following traceback:

Traceback (most recent call last):
  File "/tmp/nvflare/workspace/site-1/simulate_job/app_site-1/custom/client.py", line 63, in <module>
    flare.patch(trainer)
  File "/home/avelinojaver/miniconda3/envs/model-huggingface/lib/python3.10/site-packages/nvflare/app_opt/lightning/api.py", line 75, in patch
    fl_callback = FLCallback(rank=trainer.global_rank, load_state_dict_strict=load_state_dict_strict)
  File "/home/avelinojaver/miniconda3/envs/model-huggingface/lib/python3.10/site-packages/nvflare/app_opt/lightning/api.py", line 95, in __init__
    init(rank=str(rank))
  File "/home/avelinojaver/miniconda3/envs/model-huggingface/lib/python3.10/site-packages/nvflare/client/api.py", line 55, in init
    client_api.init(rank=rank)
AttributeError: 'NoneType' object has no attribute 'init'

Desktop (please complete the following information):

  • OS: Ubuntu 22.04
  • Python version: 3.10
  • NVFlare version: 2.5.2
  • Hardware: at least two GPUs are required to reproduce
avelinoapheris added the bug label on Apr 3, 2025
chesterxgchen (Collaborator) commented:

This isn't expected to work for multi-GPU. The ScriptRunner is a wrapper for InProcessClientAPI, which is designed for in-memory message exchange between the Executor and the training code.

For multi-GPU, a SubprocessLauncher-based executor should be used, where the executor and the training script run in separate processes and communicate via CellPipe.

You mentioned (in a separate thread) that this one doesn't work:
https://github.com/NVIDIA/NVFlare/tree/main/examples/hello-world/ml-to-fl/pt#transform-cifar10-pytorch-lightning--ddp-training-code-to-fl-with-nvflare-client-lightning-integration-api

Note that you are using the "main" branch (which is the dev branch); did you check the 2.5 branch for the same example?

avelinoapheris (Author) commented:

Thanks a lot for the clarification @chesterxgchen. That makes a lot of sense, but here it is mentioned that ScriptRunner can switch to ClientAPILauncherExecutor if launch_external_process=True.
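
For reference, the change being discussed would look roughly like this in job.py. This is a hedged sketch: the torch.distributed.run launch command is illustrative, and it assumes the launch_external_process and command parameters of ScriptRunner behave as described in the 2.5 documentation.

# Hedged sketch: run the training script in an external process so the
# ClientAPILauncherExecutor / CellPipe path is used instead of InProcessClientAPI.
executor = ScriptRunner(
    script=train_script,
    script_args="",
    launch_external_process=True,  # switch to the launcher-based executor
    command="python3 -m torch.distributed.run --nnodes=1 --nproc_per_node=2",  # illustrative DDP launch
)
job.to(executor, f"site-{i + 1}")  # inside the same per-client loop as before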

Regarding this script, I tested it on both the main branch and the 2.5 branch. The relevant error I get is:

2025-04-07 14:57:43,399 - SubprocessLauncher - INFO -     raise MisconfigurationException(
2025-04-07 14:57:43,399 - SubprocessLauncher - INFO - lightning_fabric.utilities.exceptions.MisconfigurationException: You requested gpu: [0, 1]
2025-04-07 14:57:43,399 - SubprocessLauncher - INFO -  But your machine only has: [0]

It seems to me that the issue is that we are requesting two devices here, but the client API is initialising only one device here.
However, if I change that line to job.simulator_run("/tmp/nvflare/jobs/workdir", gpu="0,1") and then call the script with python3 pt_client_api_job.py --script src/cifar10_lightning_ddp_fl.py --key_metric val_acc_epoch --launch_process --n_clients 1, I get the error:

2025-04-07 15:03:29,902 - SimulatorRunner - ERROR - The number of clients (1) must be larger than or equal to the number of GPU groups: (2)

Is there a hard constraint on using simulator mode with more than one device per client?
