Single-node multi-GPU pre-training bug #7604
Labels
solved
This problem has already been solved
Comments
I'm hitting the same problem: LoRA fine-tuning works fine with Unsloth disabled, but enabling it raises the error below.
Please restructure your imports with 'import unsloth' at the top of your file.
from unsloth import FastLanguageModel # type: ignore
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
/root/Smile_L/LLaMA-Factory/src/llamafactory/model/model_utils/unsloth.py:51: UserWarning: WARNING: Unsloth should be imported before trl, transformers, peft to ensure all optimizations are applied. Your code may run slower or encounter memory issues without these optimizations.
Please restructure your imports with 'import unsloth' at the top of your file.
from unsloth import FastLanguageModel # type: ignore
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
🦥 Unsloth Zoo will now patch everything to make training faster!
[rank0]: Traceback (most recent call last):
[rank0]: File "/root/Smile_L/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module>
[rank0]: launch()
[rank0]: File "/root/Smile_L/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank0]: run_exp()
[rank0]: File "/root/Smile_L/LLaMA-Factory/src/llamafactory/train/tuner.py", line 107, in run_exp
[rank0]: _training_function(config={"args": args, "callbacks": callbacks})
[rank0]: File "/root/Smile_L/LLaMA-Factory/src/llamafactory/train/tuner.py", line 69, in _training_function
[rank0]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank0]: File "/root/Smile_L/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 52, in run_sft
[rank0]: model = load_model(tokenizer, model_args, finetuning_args, training_args.do_train)
[rank0]: File "/root/Smile_L/LLaMA-Factory/src/llamafactory/model/loader.py", line 136, in load_model
[rank0]: model = load_unsloth_pretrained_model(config, model_args)
[rank0]: File "/root/Smile_L/LLaMA-Factory/src/llamafactory/model/model_utils/unsloth.py", line 51, in load_unsloth_pretrained_model
[rank0]: from unsloth import FastLanguageModel # type: ignore
[rank0]: File "/root/anaconda3/envs/llama_factory/lib/python3.10/site-packages/unsloth/__init__.py", line 219, in <module>
[rank0]: from .models import *
[rank0]: File "/root/anaconda3/envs/llama_factory/lib/python3.10/site-packages/unsloth/models/__init__.py", line 15, in <module>
[rank0]: from .llama import FastLlamaModel
[rank0]: File "/root/anaconda3/envs/llama_factory/lib/python3.10/site-packages/unsloth/models/llama.py", line 2748, in <module>
[rank0]: PatchFastRL(FastLanguageModel = FastLlamaModel)
[rank0]: File "/root/anaconda3/envs/llama_factory/lib/python3.10/site-packages/unsloth/models/rl.py", line 742, in PatchFastRL
[rank0]: patch_trl_rl_trainers()
[rank0]: File "/root/anaconda3/envs/llama_factory/lib/python3.10/site-packages/unsloth/models/rl.py", line 735, in patch_trl_rl_trainers
[rank0]: _patch_trl_rl_trainers(trainer)
[rank0]: File "/root/anaconda3/envs/llama_factory/lib/python3.10/site-packages/unsloth/models/rl.py", line 555, in _patch_trl_rl_trainers
[rank0]: created_module = create_new_function(
[rank0]: File "/root/anaconda3/envs/llama_factory/lib/python3.10/site-packages/unsloth_zoo/compiler.py", line 329, in create_new_function
[rank0]: compile_folder, UNSLOTH_COMPILE_USE_TEMP = get_compile_folder(use_tempfile = False)
[rank0]: File "/root/anaconda3/envs/llama_factory/lib/python3.10/site-packages/unsloth_zoo/compiler.py", line 265, in get_compile_folder
[rank0]: location, UNSLOTH_COMPILE_USE_TEMP = distributed_function(2, _get_compile_folder, use_tempfile)
[rank0]: File "/root/anaconda3/envs/llama_factory/lib/python3.10/site-packages/unsloth_zoo/utils.py", line 82, in distributed_function
[rank0]: torch.distributed.broadcast_object_list(object_list, src = 0, device = "cpu")
[rank0]: File "/root/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3479, in broadcast_object_list
[rank0]: broadcast(object_sizes_tensor, src=global_src, group=group)
[rank0]: File "/root/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2726, in broadcast
[rank0]: work = group.broadcast([tensor], opts)
[rank0]: RuntimeError: No backend type associated with device type cpu
[rank1]: Traceback (most recent call last):
[rank1]: File "/root/Smile_L/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module>
[rank1]: launch()
[rank1]: File "/root/Smile_L/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank1]: run_exp()
[rank1]: File "/root/Smile_L/LLaMA-Factory/src/llamafactory/train/tuner.py", line 107, in run_exp
[rank1]: _training_function(config={"args": args, "callbacks": callbacks})
[rank1]: File "/root/Smile_L/LLaMA-Factory/src/llamafactory/train/tuner.py", line 69, in _training_function
[rank1]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank1]: File "/root/Smile_L/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 52, in run_sft
[rank1]: model = load_model(tokenizer, model_args, finetuning_args, training_args.do_train)
[rank1]: File "/root/Smile_L/LLaMA-Factory/src/llamafactory/model/loader.py", line 136, in load_model
[rank1]: model = load_unsloth_pretrained_model(config, model_args)
[rank1]: File "/root/Smile_L/LLaMA-Factory/src/llamafactory/model/model_utils/unsloth.py", line 51, in load_unsloth_pretrained_model
[rank1]: from unsloth import FastLanguageModel # type: ignore
[rank1]: File "/root/anaconda3/envs/llama_factory/lib/python3.10/site-packages/unsloth/__init__.py", line 219, in <module>
[rank1]: from .models import *
[rank1]: File "/root/anaconda3/envs/llama_factory/lib/python3.10/site-packages/unsloth/models/__init__.py", line 15, in <module>
[rank1]: from .llama import FastLlamaModel
[rank1]: File "/root/anaconda3/envs/llama_factory/lib/python3.10/site-packages/unsloth/models/llama.py", line 2748, in <module>
[rank1]: PatchFastRL(FastLanguageModel = FastLlamaModel)
[rank1]: File "/root/anaconda3/envs/llama_factory/lib/python3.10/site-packages/unsloth/models/rl.py", line 742, in PatchFastRL
[rank1]: patch_trl_rl_trainers()
[rank1]: File "/root/anaconda3/envs/llama_factory/lib/python3.10/site-packages/unsloth/models/rl.py", line 735, in patch_trl_rl_trainers
[rank1]: _patch_trl_rl_trainers(trainer)
[rank1]: File "/root/anaconda3/envs/llama_factory/lib/python3.10/site-packages/unsloth/models/rl.py", line 555, in _patch_trl_rl_trainers
[rank1]: created_module = create_new_function(
[rank1]: File "/root/anaconda3/envs/llama_factory/lib/python3.10/site-packages/unsloth_zoo/compiler.py", line 329, in create_new_function
[rank1]: compile_folder, UNSLOTH_COMPILE_USE_TEMP = get_compile_folder(use_tempfile = False)
[rank1]: File "/root/anaconda3/envs/llama_factory/lib/python3.10/site-packages/unsloth_zoo/compiler.py", line 265, in get_compile_folder
[rank1]: location, UNSLOTH_COMPILE_USE_TEMP = distributed_function(2, _get_compile_folder, use_tempfile)
[rank1]: File "/root/anaconda3/envs/llama_factory/lib/python3.10/site-packages/unsloth_zoo/utils.py", line 82, in distributed_function
[rank1]: torch.distributed.broadcast_object_list(object_list, src = 0, device = "cpu")
[rank1]: File "/root/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank1]: return func(*args, **kwargs)
[rank1]: File "/root/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3479, in broadcast_object_list
[rank1]: broadcast(object_sizes_tensor, src=global_src, group=group)
[rank1]: File "/root/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank1]: return func(*args, **kwargs)
[rank1]: File "/root/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2726, in broadcast
[rank1]: work = group.broadcast([tensor], opts)
[rank1]: RuntimeError: No backend type associated with device type cpu
[rank0]:[W407 03:46:19.275497536 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W0407 03:46:21.143000 384969 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 385068 closing signal SIGTERM
E0407 03:46:21.207000 384969 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 385067) of binary: /root/anaconda3/envs/llama_factory/bin/python
Traceback (most recent call last):
File "/root/anaconda3/envs/llama_factory/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/root/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
File "/root/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/run.py", line 918, in main
run(args)
File "/root/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
elastic_launch(
File "/root/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/root/Smile_L/LLaMA-Factory/src/llamafactory/launcher.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
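For context on the failure: the traceback shows that load_unsloth_pretrained_model imports unsloth, whose import chain calls unsloth_zoo's distributed_function, which in turn calls torch.distributed.broadcast_object_list(..., device="cpu"). That broadcast needs a CPU-capable backend (Gloo) on the process group; when the group was initialized with NCCL only, PyTorch raises "No backend type associated with device type cpu". The snippet below is a minimal sketch of the failing call and of one possible workaround (registering Gloo for CPU alongside NCCL for CUDA). It is illustrative only and is not the actual LLaMA-Factory or Unsloth initialization code; the cache-path payload is a made-up placeholder.

# Minimal sketch; assumes launch via `torchrun --nproc_per_node=2 repro.py`.
import os
import torch
import torch.distributed as dist

def main():
    # Failing pattern: an NCCL-only group has no backend for CPU tensors, so a
    # broadcast with device="cpu" raises
    # "RuntimeError: No backend type associated with device type cpu".
    # dist.init_process_group(backend="nccl")

    # Possible workaround: register Gloo for CPU and NCCL for CUDA so that
    # CPU-side object broadcasts have a backend to run on.
    dist.init_process_group(backend="cpu:gloo,cuda:nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    obj = [None]
    if dist.get_rank() == 0:
        obj[0] = "/tmp/unsloth_compiled_cache"  # hypothetical payload
    # Mirrors the call made in unsloth_zoo/utils.py::distributed_function.
    dist.broadcast_object_list(obj, src=0, device="cpu")
    print(f"rank {dist.get_rank()} got: {obj[0]}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Even if the broadcast is made to work this way, the open-source Unsloth release has mainly targeted single-GPU training, which is consistent with the maintainer's reply below.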
Disable Unsloth.
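Applied to the configuration in the original report below, that suggestion is a one-line change (sketch only; everything else stays as posted). Note that use_unsloth defaults to false in LLaMA-Factory, so removing the line has the same effect:

use_unsloth: false  # was: true; Unsloth is what triggers the CPU broadcast above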
Reminder
System Info
After enabling use_unsloth: [rank2]: RuntimeError: No backend type associated with device type cpu
A question from before this error: the optim: paged_adamw_8bit setting does not seem to take effect. (In principle, 8×H100 should not run out of GPU memory on a 7B model even at the maximum sequence length, but memory currently does blow up, so I tried switching on use_unsloth and hit the error above.)
1. Launch with FSDP (an illustrative Accelerate FSDP config is sketched after the configuration in step 2):
export NCCL_DEBUG=INFO
export USE_MODELSCOPE_HUB=1
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch \
    --config_file examples/accelerate/fsdp_config.yaml \
    src/train.py examples/qwen/qwen_cpt_full.yaml
2. The full configuration is as follows:
### model
model_name_or_path: Qwen/Qwen2.5-7B-Instruct
trust_remote_code: true
packing: False
### method
stage: pt
do_train: true
finetuning_type: full
flash_attn: fa2
use_unsloth: True
### dataset
dataset: grout_en_1
template: qwen
cutoff_len: 32768
max_samples: 50000
overwrite_cache: true
preprocessing_num_workers: 16
### output
output_dir: saves/QwQ-32B/pt
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
disable_gradient_checkpointing: False
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.01
bf16: true
ddp_timeout: 180000000
optim: paged_adamw_8bit
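For completeness: step 1 above launches with examples/accelerate/fsdp_config.yaml, but that file itself is not shown. The copy shipped with the repo may differ from this; the sketch below is only an illustration of what a single-node, 8-GPU Accelerate FSDP config typically contains, with assumed values rather than the poster's actual settings:

# Illustrative Accelerate FSDP config (assumed values, not the actual
# examples/accelerate/fsdp_config.yaml from the report).
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
mixed_precision: bf16
num_machines: 1
num_processes: 8  # one process per GPU on the single node
main_training_function: main
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_offload_params: false
  fsdp_use_orig_params: true
rdzv_backend: static
same_network: true
use_cpu: false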
Reproduction
Others
No response