modeling_t5 incompatible with multiprocessing #30280

Open
2 of 4 tasks
rangehow opened this issue Apr 17, 2024 · 7 comments
Labels
DeepSpeed, Good Second Issue (issues that are more difficult to do than "Good First" issues - give it a try if you want!)

Comments

@rangehow

rangehow commented Apr 17, 2024

System Info

  • transformers version: 4.39.0.dev0
  • Platform: Linux-3.10.0-1160.71.1.el7.x86_64-x86_64-with-glibc2.17
  • Python version: 3.10.13
  • Huggingface_hub version: 0.21.4
  • Safetensors version: 0.4.2
  • Accelerate version: 0.27.2
  • Accelerate config: - compute_environment: LOCAL_MACHINE
    - distributed_type: DEEPSPEED
    - mixed_precision: bf16
    - use_cpu: False
    - debug: False
    - num_processes: 8
    - machine_rank: 0
    - num_machines: 1
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - deepspeed_config: {'gradient_accumulation_steps': 16, 'zero3_init_flag': False, 'zero_stage': 0}
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []
  • PyTorch version (GPU?): 2.2.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

Hi @ArthurZucker and @younesbelkada. I'm trying to split a dataset automatically across multiple GPUs (a bit like data parallelism) for inference. But strange things happen when using a T5 model from transformers, while other models (e.g. BART) work correctly, so I suspect there is a problem in the T5 implementation. Would you mind helping me check it out? :)

Although it has been suggested online that the error below may be related to OOM, I am certain that it is not. With the code below, only rank 0 produces normal output, while all other ranks report the following error.

Traceback (most recent call last):
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/data/ruanjh/NiuInference/NiuInference.py", line 97, in get_pred
    output = model.generate(
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/transformers/generation/utils.py", line 1388, in generate
    model_kwargs = self._prepare_encoder_decoder_kwargs_for_generation(
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/transformers/generation/utils.py", line 503, in _prepare_encoder_decoder_kwargs_for_generation
    model_kwargs["encoder_outputs"]: ModelOutput = encoder(**encoder_kwargs)
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py", line 1115, in forward
    layer_outputs = layer_module(
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py", line 695, in forward
    self_attention_outputs = self.layer[0](
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py", line 602, in forward
    attention_output = self.SelfAttention(
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py", line 521, in forward
    query_states = shape(self.q(hidden_states))  # (batch_size, n_heads, seq_length, dim_per_head)
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 116, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

The following code should be quite easy to reproduce. All you need to do is replace model_dir in the main function with a specific model, such as google/t5-v1_1-large, and make sure more than one GPU is visible (i.e. CUDA_VISIBLE_DEVICES lists at least two devices).

import torch
from torch import bfloat16
import torch.distributed as dist
import torch.multiprocessing as mp

from torch.utils.data import Dataset,DataLoader
import functools
from transformers import AutoTokenizer,DefaultDataCollator,GenerationConfig,PreTrainedModel,AutoModelForSeq2SeqLM,AutoModelForCausalLM,AutoConfig,DataCollatorWithPadding
import logging
from transformers.models.auto.modeling_auto import MODEL_FOR_CAUSAL_LM_MAPPING_NAMES
from tqdm import tqdm
# from accelerate import find_executable_batch_size

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)



class DefaultDataset(Dataset):
    def __init__(self,data,tokenizer):
        self.data=tokenizer(data,return_tensors='pt',padding=True)

    
    def __getitem__(self,idx):
        return {'input_ids':self.data['input_ids'][idx]}
    
    def __len__(self):
        return self.data['input_ids'].size(0)




class NiuInference:
    def __init__(self,model_dir,data,dtype=bfloat16,dataset=None,data_collator=None,output_path='niuinference.out',auto_batch_size=True,batch_size=1,generation_config=None):
        self.model_dir=model_dir
        self.dtype=dtype
        self.data=data
        self.dataset=dataset
        self.data_collator=data_collator
        self.output_path=output_path
        self.batch_size=batch_size
        self.auto_batch_size=auto_batch_size
        self.generation_config=generation_config
        
        
    def _load_model_and_tokenizer(self,device):
        print(self.dtype)
        config=AutoConfig.from_pretrained(self.model_dir)
        if config.model_type in MODEL_FOR_CAUSAL_LM_MAPPING_NAMES:
            model=AutoModelForCausalLM.from_pretrained(self.model_dir,torch_dtype=self.dtype)
        else:
            model=AutoModelForSeq2SeqLM.from_pretrained(self.model_dir,torch_dtype=self.dtype)
        model.to(device)
        tokenizer=AutoTokenizer.from_pretrained(self.model_dir)
        return model,tokenizer

    # @find_executable_batch_size(starting_batch_size=1)
    # def auto_get_pred(batch_size):
        

    def get_pred(self,rank,out_path,data,record_dict):
        batch_size=2
        
        try:
            device = torch.device(f'cuda:{rank}')
            model, tokenizer = self._load_model_and_tokenizer(device)
            if self.dataset is not None:
                dataset=self.dataset(data=data,tokenizer=tokenizer)
            else:
                dataset=DefaultDataset(data=data,tokenizer=tokenizer)

            if self.data_collator is not None:
                collator=self.data_collator(tokenizer,model=model,padding=True)
            else:
                collator= DataCollatorWithPadding(tokenizer)
            dataloader=DataLoader(dataset,batch_size,collate_fn=collator,pin_memory=True,num_workers=0)
            result=[]
            for batch in tqdm(dataloader):
                batch=batch.to(device)
                output = model.generate(
                            input_ids=batch['input_ids'],
                            attention_mask=batch['attention_mask'],
                            num_beams=5,
                            do_sample=False,
                            temperature=1.0,
                            max_new_tokens=512,
                        )
                pred = tokenizer.batch_decode(output,skip_special_tokens=True)
                print(pred)
                result+=pred
            record_dict[f'{rank}']=result
        except Exception as e:
            # `device` may be unbound if the failure happened before it was assigned,
            # so report the rank instead
            print(f'error on rank {rank}: {e}')
            raise
        
        
          
    
    def split_list(self,lst, n):
        avg = len(lst) / float(n)
        return [lst[int(avg * i):int(avg * (i + 1))] for i in range(n)]

    def run(self,):
    
        world_size = min(torch.cuda.device_count(),len(self.data)) # corner case, data<available GPU num
        
        data_subsets = self.split_list(self.data,world_size)
        print(data_subsets)
        processes = []
        manager = mp.Manager()
        record_dict = manager.dict()
        for rank in range(world_size):

            p = mp.Process(target=self.get_pred, args=(rank,self.output_path,data_subsets[rank],record_dict))
            p.start()
            processes.append(p)
        for p in processes:
            p.join()

        with open(self.output_path, "w", encoding="utf-8") as f:
            for rank in range(world_size):
                for r in record_dict[f'{rank}']:
                    f.write(r.replace('\n','\\n')+'\n')

  
if __name__=='__main__':
    mp.set_start_method('spawn')
    i=NiuInference(model_dir="**replace here with a T5 or BART checkpoint**",data=['hello,how is your day','my wish is that you happy','from scratch',])
    i.run()

Expected behavior

The T5 model can run inference under multiprocessing.

@rangehow rangehow changed the title modeling_t5 incompatiable with multiprocessing modeling_t5 Incompatiblewith multiprocessing Apr 17, 2024
@rangehow rangehow changed the title modeling_t5 Incompatiblewith multiprocessing modeling_t5 Incompatible with multiprocessing Apr 17, 2024
@rangehow rangehow changed the title modeling_t5 Incompatible with multiprocessing modeling_t5 incompatible with multiprocessing Apr 17, 2024
@ArthurZucker
Collaborator

T5 is a fairly old model, so this is probably expected. If you find a fix, feel free to open a PR! 🤗

@ArthurZucker added the Good Second Issue and DeepSpeed labels on Apr 18, 2024
@rangehow
Author

T5 is a fairly old model, so this is probably expected. If you find a fix, feel free to open a PR! 🤗

Yes, but strangely enough, BART supports it. I would be happy to give it a try, but before that I would like to know whether this issue can be reproduced on your side. That will help me narrow down the scope of the investigation.

@ArthurZucker
Collaborator

To be honest, multiprocessing is outside the scope of transformers, and we usually recommend using accelerate 😉. FSDP is also a possible solution, as is DeepSpeed. Maybe making the tutorials about that more discoverable would be the best solution.
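
A minimal sketch of that accelerate-based pattern, assuming PartialState.split_between_processes (available in recent accelerate releases) and a placeholder checkpoint; it is meant to be run with accelerate launch script.py:

import torch
from accelerate import PartialState
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "google/t5-v1_1-large"  # placeholder checkpoint
prompts = ["hello,how is your day", "my wish is that you happy", "from scratch"]

# PartialState picks up the process group created by `accelerate launch`
# and exposes the per-process device.
state = PartialState()
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16).to(state.device)

# Each process receives its own slice of the prompt list.
with state.split_between_processes(prompts) as shard:
    if shard:  # guard against launching more processes than there are prompts
        inputs = tokenizer(shard, return_tensors="pt", padding=True).to(state.device)
        outputs = model.generate(**inputs, num_beams=5, max_new_tokens=512)
        print(state.process_index, tokenizer.batch_decode(outputs, skip_special_tokens=True))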

@rangehow
Author

To be honest, multiprocessing is outside the scope of transformers, and we usually recommend using accelerate 😉. FSDP is also a possible solution, as is DeepSpeed. Maybe making the tutorials about that more discoverable would be the best solution.

I think there is a real need here: even when there is sufficient GPU memory, we may want to distribute data across many cards so that multiple GPUs can run inference in parallel. This behavior is somewhat similar to DDP, but does not involve partitioning parameters or optimizer states. Multiprocessing is part of DDP, and I have essentially extracted the smallest piece needed to achieve this. In 2023 I saw a 🤗 staff member on the forum mention plans to support this, and since I haven't seen any relevant feature yet, I tried to implement it myself.
Currently it runs correctly with many models on the Hugging Face Hub, with only T5 hitting this issue. At present it may be a bit beyond my technical depth; I hope others in the community can work together to improve this 😃
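
For reference, a minimal sketch of the data-parallel inference pattern described above, using torch.multiprocessing.spawn instead of hand-managed Process objects. The worker/run_data_parallel names are just for illustration, and the torch.cuda.set_device(rank) call is a common precaution when each worker owns exactly one GPU, not a confirmed fix for the T5 error:

import torch
import torch.multiprocessing as mp
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer


def worker(rank, model_dir, shards, return_dict):
    # Pin this process to its GPU before any CUDA work happens in it.
    torch.cuda.set_device(rank)
    device = torch.device(f"cuda:{rank}")

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_dir, torch_dtype=torch.bfloat16).to(device)

    inputs = tokenizer(shards[rank], return_tensors="pt", padding=True).to(device)
    outputs = model.generate(**inputs, num_beams=5, max_new_tokens=512)
    return_dict[rank] = tokenizer.batch_decode(outputs, skip_special_tokens=True)


def run_data_parallel(model_dir, data):
    world_size = min(torch.cuda.device_count(), len(data))
    shards = [data[i::world_size] for i in range(world_size)]  # round-robin split

    manager = mp.Manager()
    return_dict = manager.dict()
    # spawn() starts world_size fresh interpreters; the rank is passed as the first argument.
    mp.spawn(worker, args=(model_dir, shards, return_dict), nprocs=world_size, join=True)
    return [pred for rank in range(world_size) for pred in return_dict[rank]]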

@rangehow
Author

The most difficult thing for me may be that debugging in a multi-process setting is very complex, and pdb cannot set breakpoints properly. 😟
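
One pragmatic workaround is to skip interactive breakpoints and give every worker its own log file plus a hard-crash dump; running with CUDA_LAUNCH_BLOCKING=1 can also make asynchronous CUDA errors surface closer to the failing call. A minimal sketch (the setup_rank_logger helper is just for illustration), meant to be called at the top of each worker such as get_pred:

import faulthandler
import logging
import os

_CRASH_FILES = {}  # keep references so the dump files are not garbage-collected


def setup_rank_logger(rank, log_dir="rank_logs"):
    # One log file per worker, so output from different ranks never interleaves.
    os.makedirs(log_dir, exist_ok=True)
    logger = logging.getLogger(f"rank{rank}")
    logger.setLevel(logging.DEBUG)
    handler = logging.FileHandler(os.path.join(log_dir, f"rank{rank}.log"), mode="w")
    handler.setFormatter(logging.Formatter("%(asctime)s - %(levelname)s - %(message)s"))
    logger.addHandler(handler)

    # If the worker dies hard (e.g. on a CUDA error), dump the Python traceback
    # of every thread to rank<N>.crash instead of losing it with the process.
    crash_file = open(os.path.join(log_dir, f"rank{rank}.crash"), "w")
    faulthandler.enable(file=crash_file)
    _CRASH_FILES[rank] = crash_file
    return logger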

@hackpk
Contributor

hackpk commented May 3, 2024

@ArthurZucker and @rangehow can I try it out?

@rangehow
Author

rangehow commented May 3, 2024

@ArthurZucker and @rangehow can I try it out?

Of course! Just do it 🎉
