modeling_t5 incompatible with multiprocessing #30280

Open
2 of 4 tasks
rangehow opened this issue Apr 17, 2024 · 7 comments
Labels
DeepSpeed, Good Second Issue (issues that are more difficult to do than "Good First" issues - give it a try if you want!)

Comments

@rangehow

rangehow commented Apr 17, 2024

System Info

  • transformers version: 4.39.0.dev0
  • Platform: Linux-3.10.0-1160.71.1.el7.x86_64-x86_64-with-glibc2.17
  • Python version: 3.10.13
  • Huggingface_hub version: 0.21.4
  • Safetensors version: 0.4.2
  • Accelerate version: 0.27.2
  • Accelerate config: - compute_environment: LOCAL_MACHINE
    - distributed_type: DEEPSPEED
    - mixed_precision: bf16
    - use_cpu: False
    - debug: False
    - num_processes: 8
    - machine_rank: 0
    - num_machines: 1
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - deepspeed_config: {'gradient_accumulation_steps': 16, 'zero3_init_flag': False, 'zero_stage': 0}
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []
  • PyTorch version (GPU?): 2.2.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

Hi @ArthurZucker and @younesbelkada. I'm trying to split a dataset automatically across multiple GPUs (a bit like data parallelism) for inference. But strange things happen when using a T5 model from transformers, while other models (e.g. BART) work correctly, so I suspect there is a problem in the T5 implementation. Would you mind helping me check it out? :)

Although it has been suggested online that the error below may be related to OOM, I am certain that it is not. With the code below, only rank 0 produces normal output, while all other ranks report the following error.

Traceback (most recent call last):
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/data/ruanjh/NiuInference/NiuInference.py", line 97, in get_pred
    output = model.generate(
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/transformers/generation/utils.py", line 1388, in generate
    model_kwargs = self._prepare_encoder_decoder_kwargs_for_generation(
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/transformers/generation/utils.py", line 503, in _prepare_encoder_decoder_kwargs_for_generation
    model_kwargs["encoder_outputs"]: ModelOutput = encoder(**encoder_kwargs)
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py", line 1115, in forward
    layer_outputs = layer_module(
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py", line 695, in forward
    self_attention_outputs = self.layer[0](
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py", line 602, in forward
    attention_output = self.SelfAttention(
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py", line 521, in forward
    query_states = shape(self.q(hidden_states))  # (batch_size, n_heads, seq_length, dim_per_head)
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 116, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

The following code should be quite easy to reproduce. All you need to do is replace model_dir in the main function with a specific model, such as google/t5-v1_1-large, and make sure more than one GPU is visible (i.e. CUDA_VISIBLE_DEVICES lists at least two devices).

import torch
from torch import bfloat16
import torch.distributed as dist
import torch.multiprocessing as mp

from torch.utils.data import Dataset,DataLoader
import functools
from transformers import AutoTokenizer,DefaultDataCollator,GenerationConfig,PreTrainedModel,AutoModelForSeq2SeqLM,AutoModelForCausalLM,AutoConfig,DataCollatorWithPadding
import logging
from transformers.models.auto.modeling_auto import MODEL_FOR_CAUSAL_LM_MAPPING_NAMES
from tqdm import tqdm
# from accelerate import find_executable_batch_size

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)



class DefaultDataset(Dataset):
    def __init__(self,data,tokenizer):
        self.data=tokenizer(data,return_tensors='pt',padding=True)

    
    def __getitem__(self,idx):
        return {'input_ids':self.data['input_ids'][idx]}
    
    def __len__(self):
        return self.data['input_ids'].size(0)




class NiuInference:
    def __init__(self,model_dir,data,dtype=bfloat16,dataset=None,data_collator=None,output_path='niuinference.out',auto_batch_size=True,batch_size=1,generation_config=None):
        self.model_dir=model_dir
        self.dtype=dtype
        self.data=data
        self.dataset=dataset
        self.data_collator=data_collator
        self.output_path=output_path
        self.batch_size=batch_size
        self.auto_batch_size=auto_batch_size
        self.generation_config=generation_config
        
        
    def _load_model_and_tokenizer(self,device):
        print(self.dtype)
        config=AutoConfig.from_pretrained(self.model_dir)
        if config.model_type in MODEL_FOR_CAUSAL_LM_MAPPING_NAMES:
            model=AutoModelForCausalLM.from_pretrained(self.model_dir,torch_dtype=self.dtype)
        else:
            model=AutoModelForSeq2SeqLM.from_pretrained(self.model_dir,torch_dtype=self.dtype)
        model.to(device)
        tokenizer=AutoTokenizer.from_pretrained(self.model_dir)
        return model,tokenizer

    # @find_executable_batch_size(starting_batch_size=1)
    # def auto_get_pred(batch_size):
        

    def get_pred(self,rank,out_path,data,record_dict):
        batch_size=2
        
        try:
            device = torch.device(f'cuda:{rank}')
            model, tokenizer = self._load_model_and_tokenizer(device)
            if self.dataset is not None:
                dataset=self.dataset(data=data,tokenizer=tokenizer)
            else:
                dataset=DefaultDataset(data=data,tokenizer=tokenizer)

            if self.data_collator is not None:
                collator=self.data_collator(tokenizer,model=model,padding=True)
            else:
                collator= DataCollatorWithPadding(tokenizer)
            dataloader=DataLoader(dataset,batch_size,collate_fn=collator,pin_memory=True,num_workers=0)
            result=[]
            for batch in tqdm(dataloader):
                batch=batch.to(device)
                output = model.generate(
                            input_ids=batch['input_ids'],
                            attention_mask=batch['attention_mask'],
                            num_beams=5,
                            do_sample=False,
                            temperature=1.0,
                            max_new_tokens=512,
                        )
                pred = tokenizer.batch_decode(output,skip_special_tokens=True)
                print(pred)
                result+=pred
            record_dict[f'{rank}']=result
        except Exception as e:
            # `device` may be unbound if the failure happened before it was assigned,
            # so report the rank instead
            print(f'error on rank {rank}: {e}')
            raise
        
        
          
    
    def split_list(self,lst, n):
        avg = len(lst) / float(n)
        return [lst[int(avg * i):int(avg * (i + 1))] for i in range(n)]

    def run(self,):
    
        world_size = min(torch.cuda.device_count(),len(self.data)) # corner case, data<available GPU num
        
        data_subsets = self.split_list(self.data,world_size)
        print(data_subsets)
        processes = []
        manager = mp.Manager()
        record_dict = manager.dict()
        for rank in range(world_size):

            p = mp.Process(target=self.get_pred, args=(rank,self.output_path,data_subsets[rank],record_dict))
            p.start()
            processes.append(p)
        for p in processes:
            p.join()

        with open(self.output_path, "w", encoding="utf-8") as f:
            for rank in range(world_size):
                for r in record_dict[f'{rank}']:
                    f.write(r.replace('\n','\\n')+'\n')

  
if __name__=='__main__':
    mp.set_start_method('spawn')
    i=NiuInference(model_dir="**replace here with a T5 or BART checkpoint**",data=['hello,how is your day','my wish is that you happy','from scratch',])
    i.run()

Expected behavior

The T5 model can run inference under multiprocessing.

@rangehow rangehow changed the title modeling_t5 incompatiable with multiprocessing modeling_t5 Incompatiblewith multiprocessing Apr 17, 2024
@rangehow rangehow changed the title modeling_t5 Incompatiblewith multiprocessing modeling_t5 Incompatible with multiprocessing Apr 17, 2024
@rangehow rangehow changed the title modeling_t5 Incompatible with multiprocessing modeling_t5 incompatible with multiprocessing Apr 17, 2024
@ArthurZucker
Collaborator

T5 is a fairly old model, so this is probably expected. If you find a fix, feel free to open a PR! 🤗

@ArthurZucker added the Good Second Issue and DeepSpeed labels on Apr 18, 2024
@rangehow
Author

T5 is a fairly old model, so this is probably expected. If you find a fix, feel free to open a PR! 🤗

Yes, but strangely enough, BART supports it. I would be happy to give it a try, but before that I would like to know whether this issue can be reproduced on your side. That will help me narrow down the scope of the investigation.

@ArthurZucker
Collaborator

To be honest, multiprocessing is outside the scope of transformers, and we usually recommend using accelerate 😉. FSDP is also a possible solution, as is DeepSpeed. Maybe making the tutorials about that more discoverable would be the best solution.
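
A minimal sketch of that accelerate-based pattern, assuming PartialState.split_between_processes (available in recent accelerate releases) and a placeholder checkpoint; it is meant to be run with accelerate launch script.py:

import torch
from accelerate import PartialState
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "google/t5-v1_1-large"  # placeholder checkpoint
prompts = ["hello,how is your day", "my wish is that you happy", "from scratch"]

# PartialState picks up the process group created by `accelerate launch`
# and exposes the per-process device.
state = PartialState()
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16).to(state.device)

# Each process receives its own slice of the prompt list.
with state.split_between_processes(prompts) as shard:
    if shard:  # guard against launching more processes than there are prompts
        inputs = tokenizer(shard, return_tensors="pt", padding=True).to(state.device)
        outputs = model.generate(**inputs, num_beams=5, max_new_tokens=512)
        print(state.process_index, tokenizer.batch_decode(outputs, skip_special_tokens=True))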

@rangehow
Author

To be honest, multiprocessing is outside the scope of transformers, and we usually recommend using accelerate 😉. FSDP is also a possible solution, as is DeepSpeed. Maybe making the tutorials about that more discoverable would be the best solution.

I think there is a real need here: even when there is sufficient GPU memory, we may want to distribute data across many cards so that multiple GPUs can run inference in parallel. This behavior is somewhat similar to DDP, but does not involve partitioning parameters or optimizer states. Multiprocessing is part of DDP, and I have essentially extracted the smallest piece needed to achieve this. In 2023 I saw a 🤗 staff member on the forum mention plans to support this, and since I haven't seen any relevant feature yet, I tried to implement it myself.
Currently it runs correctly with many models on the Hugging Face Hub, with only T5 hitting this issue. At present it may be a bit beyond my technical depth; I hope others in the community can work together to improve this 😃
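
For reference, a minimal sketch of the data-parallel inference pattern described above, using torch.multiprocessing.spawn instead of hand-managed Process objects. The worker/run_data_parallel names are just for illustration, and the torch.cuda.set_device(rank) call is a common precaution when each worker owns exactly one GPU, not a confirmed fix for the T5 error:

import torch
import torch.multiprocessing as mp
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer


def worker(rank, model_dir, shards, return_dict):
    # Pin this process to its GPU before any CUDA work happens in it.
    torch.cuda.set_device(rank)
    device = torch.device(f"cuda:{rank}")

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_dir, torch_dtype=torch.bfloat16).to(device)

    inputs = tokenizer(shards[rank], return_tensors="pt", padding=True).to(device)
    outputs = model.generate(**inputs, num_beams=5, max_new_tokens=512)
    return_dict[rank] = tokenizer.batch_decode(outputs, skip_special_tokens=True)


def run_data_parallel(model_dir, data):
    world_size = min(torch.cuda.device_count(), len(data))
    shards = [data[i::world_size] for i in range(world_size)]  # round-robin split

    manager = mp.Manager()
    return_dict = manager.dict()
    # spawn() starts world_size fresh interpreters; the rank is passed as the first argument.
    mp.spawn(worker, args=(model_dir, shards, return_dict), nprocs=world_size, join=True)
    return [pred for rank in range(world_size) for pred in return_dict[rank]]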

@rangehow
Author

The most difficult thing for me may be that debugging in a multi-process setting is very complex, and pdb cannot set breakpoints properly. 😟
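
One pragmatic workaround is to skip interactive breakpoints and give every worker its own log file plus a hard-crash dump; running with CUDA_LAUNCH_BLOCKING=1 can also make asynchronous CUDA errors surface closer to the failing call. A minimal sketch (the setup_rank_logger helper is just for illustration), meant to be called at the top of each worker such as get_pred:

import faulthandler
import logging
import os

_CRASH_FILES = {}  # keep references so the dump files are not garbage-collected


def setup_rank_logger(rank, log_dir="rank_logs"):
    # One log file per worker, so output from different ranks never interleaves.
    os.makedirs(log_dir, exist_ok=True)
    logger = logging.getLogger(f"rank{rank}")
    logger.setLevel(logging.DEBUG)
    handler = logging.FileHandler(os.path.join(log_dir, f"rank{rank}.log"), mode="w")
    handler.setFormatter(logging.Formatter("%(asctime)s - %(levelname)s - %(message)s"))
    logger.addHandler(handler)

    # If the worker dies hard (e.g. on a CUDA error), dump the Python traceback
    # of every thread to rank<N>.crash instead of losing it with the process.
    crash_file = open(os.path.join(log_dir, f"rank{rank}.crash"), "w")
    faulthandler.enable(file=crash_file)
    _CRASH_FILES[rank] = crash_file
    return logger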

@hackpk
Contributor

hackpk commented May 3, 2024

@ArthurZucker and @rangehow can I try it out?

@rangehow
Author

rangehow commented May 3, 2024

@ArthurZucker and @rangehow can I try it out?

Of course! Just do it 🎉
