Describe the bug
I am trying to run a job using the Lightning API on a multi-GPU setup. The setup works fine if I use only one GPU, but as soon as I set the number of devices to more than one, the program crashes while trying to initialise the client API. I did some digging and I know the problem is in this line: the `data_bus` is empty in processes where rank > 0.
I also tested the current version on main. It looks different, but the problem is the same; the relevant piece of code was just moved here.
To Reproduce
### client.py ###
### job.py ###
```python
from nvflare.app_opt.pt.job_config.fed_avg import FedAvgJob
from nvflare.job_config.script_runner import ScriptRunner
from nvflare.private.fed.app.simulator.simulator_runner import SimulatorRunner

from client import LitModel

if __name__ == "__main__":
    n_clients = 1
    num_rounds = 2
    train_script = "client.py"
    name = "test-multi-gpu"

    job = FedAvgJob(
        name=name, n_clients=n_clients, num_rounds=num_rounds, initial_model=LitModel()
    )

    # Add clients
    for i in range(n_clients):
        executor = ScriptRunner(script=train_script, script_args="")
        job.to(executor, f"site-{i + 1}")

    workspace = "/tmp/nvflare/workspace"
    job_dir = "/tmp/nvflare/jobs/job_config"
    job.export_job(job_dir)

    simulator = SimulatorRunner(
        job_folder=f"{job_dir}/{name}",
        workspace=workspace,
        clients=",".join([f"site-{i + 1}" for i in range(n_clients)]),
        n_clients=n_clients,
        threads=n_clients,
        end_run_for_all=True,
    )
    run_status = simulator.run()
```
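One detail in the job script that is easy to get wrong: `SimulatorRunner`'s `clients` argument takes the client names as a comma-separated string, so the join separator matters. A quick sanity check of what different separators produce (pure Python, no NVFlare needed; `n_clients = 3` is an illustrative value, the report uses 1):

```python
n_clients = 3  # illustrative; the report above uses n_clients = 1
sites = [f"site-{i + 1}" for i in range(n_clients)]

# A comma separator yields the comma-separated form SimulatorRunner expects:
print(",".join(sites))  # site-1,site-2,site-3

# A stray "1" separator silently mangles the client list into one token:
print("1".join(sites))  # site-11site-21site-3
```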
Just run `python job.py` in the same folder as `client.py` and `job.py`.
Expected behavior
The trainer will run. Instead, I get a traceback:
```
Traceback (most recent call last):
  File "/tmp/nvflare/workspace/site-1/simulate_job/app_site-1/custom/client.py", line 63, in <module>
    flare.patch(trainer)
  File "/home/avelinojaver/miniconda3/envs/model-huggingface/lib/python3.10/site-packages/nvflare/app_opt/lightning/api.py", line 75, in patch
    fl_callback = FLCallback(rank=trainer.global_rank, load_state_dict_strict=load_state_dict_strict)
  File "/home/avelinojaver/miniconda3/envs/model-huggingface/lib/python3.10/site-packages/nvflare/app_opt/lightning/api.py", line 95, in __init__
    init(rank=str(rank))
  File "/home/avelinojaver/miniconda3/envs/model-huggingface/lib/python3.10/site-packages/nvflare/client/api.py", line 55, in init
    client_api.init(rank=rank)
AttributeError: 'NoneType' object has no attribute 'init'
```
Desktop (please complete the following information):
- OS: Ubuntu 22.04
- Python version: 3.10
- NVFlare version: 2.5.2
- Hardware: at least two GPUs are required to reproduce
This isn't expected to work for multi-GPU. The ScriptRunner is a wrapper for InProcessClientAPI, which is designed for in-memory message exchange between the Executor and the training code.
For multi-GPU, the SubProcessLauncherExecutor should be used, where the executor and training script run in different processes and communicate via CellPipe.
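As the reply below points out, `ScriptRunner` itself exposes a switch for this: with `launch_external_process=True` it configures a launcher-based executor instead of the in-process one. A minimal sketch, assuming the NVFlare 2.5 `ScriptRunner` signature; how this interacts with multiple DDP ranks per client is exactly what is in question in this thread:

```python
from nvflare.job_config.script_runner import ScriptRunner

# Sketch: launch the training script in a separate process instead of
# running it in-process (launch_external_process is the switch discussed
# in this thread; everything else mirrors the reproduction script).
executor = ScriptRunner(
    script="client.py",
    script_args="",
    launch_external_process=True,
)
```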
Thanks a lot for the clarification @chesterxgche. That makes a lot of sense, but here it is mentioned that the ScriptRunner can be switched to use ClientAPILauncherExecutor with launch_external_process=True.
Regarding this script, I tested it on both the main branch and the 2.5 branch. The relevant error I get is:
```
2025-04-07 14:57:43,399 - SubprocessLauncher - INFO - raise MisconfigurationException(
2025-04-07 14:57:43,399 - SubprocessLauncher - INFO - lightning_fabric.utilities.exceptions.MisconfigurationException: You requested gpu: [0, 1]
2025-04-07 14:57:43,399 - SubprocessLauncher - INFO - But your machine only has: [0]
```
It seems to me that the issue is that we are requesting two devices here, but the client API is initialising only one device here.
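The mismatch in the log is consistent with device masking: if the launcher pins each client process to a subset of GPUs via `CUDA_VISIBLE_DEVICES` (an assumption about this setup, though it is the standard CUDA mechanism), a process that only sees device 0 cannot satisfy a Lightning request for devices `[0, 1]`. A minimal illustration, no GPU required:

```python
import os

# Simulate a launcher pinning a client process to a single GPU:
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
visible = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
print(visible)  # ['0'] — only one device index is exposed to this process

requested = [0, 1]  # what the Lightning Trainer asks for in this report
# Index 1 falls outside the visible range, hence the MisconfigurationException:
assert max(requested) >= len(visible)
```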
However, if I change that line to `job.simulator_run("/tmp/nvflare/jobs/workdir", gpu="0,1")` and then call the script with `python3 pt_client_api_job.py --script src/cifar10_lightning_ddp_fl.py --key_metric val_acc_epoch --launch_process --n_clients 1`, I get the error:
```
2025-04-07 15:03:29,902 - SimulatorRunner - ERROR - The number of clients (1) must be larger than or equal to the number of GPU groups: (2)
```
Is there a hard constraint on using simulator mode with more than one device per client?