modeling_t5 incompatible with multiprocessing #30280
Comments
T5 is a fairly old model, so this is probably expected. If you find a fix, feel free to open a PR! 🤗
Yes, but strangely enough, BART supports it. I would be happy to give it a try, but first I would like to know whether this issue can be reproduced on your side; that would help me narrow down the scope of the investigation.
To be honest, multiprocessing is outside the scope of transformers, and we usually recommend the usage of …
There is a real use case here: even when a single GPU has enough memory, we may want to distribute the data across many cards so that multiple GPUs run inference in parallel. This behavior is similar to DDP, but it does not involve partitioning parameters or optimizer states; multiprocessing is the part of DDP I have extracted as the minimal mechanism to achieve this. In 2023 I saw a 🤗 staff member on the forum mention supporting this, and since I have not seen the feature land yet, I tried to implement it myself.
The most difficult part for me is that debugging in a multi-process setting is very complex, and pdb cannot set breakpoints properly. 😟
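As a side note on the debugging pain mentioned above: `pdb.set_trace()` tends to hang in spawned child processes because they do not inherit a usable stdin. A common workaround (not from this thread; a hedged sketch, Unix-only) is a small `Pdb` subclass, often called `ForkedPdb`, that reopens `/dev/stdin` inside the child:

```python
# Hedged sketch: pdb.set_trace() stalls in multiprocessing children because
# the child's stdin is not connected to the terminal. This subclass reopens
# /dev/stdin around the interactive session (works on Unix-like systems).
import pdb
import sys

class ForkedPdb(pdb.Pdb):
    """A Pdb subclass usable from inside a multiprocessing child process."""

    def interaction(self, *args, **kwargs):
        _stdin = sys.stdin
        try:
            sys.stdin = open("/dev/stdin")
            super().interaction(*args, **kwargs)
        finally:
            sys.stdin = _stdin

# Usage inside a worker process, instead of pdb.set_trace():
# ForkedPdb().set_trace()
```

This only restores interactive input; breakpoints still fire per-process, so it is easiest to reproduce the bug with `nprocs=1` first.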
@ArthurZucker and @rangehow can I try it out?
Of course! Just do it 🎉
System Info
transformers version: 4.39.0.dev0
- distributed_type: DEEPSPEED
- mixed_precision: bf16
- use_cpu: False
- debug: False
- num_processes: 8
- machine_rank: 0
- num_machines: 1
- rdzv_backend: static
- same_network: True
- main_training_function: main
- deepspeed_config: {'gradient_accumulation_steps': 16, 'zero3_init_flag': False, 'zero_stage': 0}
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
Who can help?
Hi, @ArthurZucker and @younesbelkada. I'm trying to split a dataset automatically across multiple GPUs (a bit like data parallelism) for inference. Strange things happen when using a T5 model from HF, while other models (e.g., BART) work correctly, so I suspect there is a problem in the T5 implementation. Would you mind helping check it out? :)
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
The following code should be quite easy to reproduce. All you need to do is replace `model_dir` in the main function with a specific model, such as `google/t5-v1_1-large`, and make sure `CUDA_VISIBLE_DEVICES` exposes more than one GPU.
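The original reproduction script did not survive in this scrape, so the following is a hypothetical sketch of the described setup: each spawned process loads the model on its own GPU and runs `generate()` on its shard of the data (data parallelism without any parameter partitioning). The model id, prompts, and generation settings are assumptions.

```python
# Hypothetical reproduction sketch -- the issue's original script is not shown,
# so model id, prompts, and generation settings here are assumptions.

def shard(data, rank, world_size):
    """Give rank r the round-robin slice data[r::world_size]."""
    return data[rank::world_size]

def worker(rank, world_size, model_dir, data):
    # Heavy imports are kept inside the worker so the module can be imported
    # (and shard() exercised) without torch/transformers installed.
    import torch
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    device = f"cuda:{rank}"
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_dir).to(device).eval()

    batch = tokenizer(shard(data, rank, world_size),
                      return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        out = model.generate(**batch, max_new_tokens=32)
    print(f"[rank {rank}]", tokenizer.batch_decode(out, skip_special_tokens=True))

if __name__ == "__main__":
    import torch
    import torch.multiprocessing as mp

    model_dir = "google/t5-v1_1-large"  # assumption: any T5 checkpoint
    data = [f"translate English to German: sentence {i}" for i in range(8)]
    world_size = torch.cuda.device_count()  # needs CUDA_VISIBLE_DEVICES > 1 GPU
    mp.spawn(worker, args=(world_size, model_dir, data), nprocs=world_size)
```

The key design point is that each process owns a full model replica on its own device and only the input data is partitioned, matching the DDP-like behavior described above.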
Expected behavior
The T5 model can run inference under multiprocessing.