A TypeError in modeling_utils.caching_allocator_warmup function #37074

Closed
ZeroMakesAll opened this issue Mar 28, 2025 · 4 comments · Fixed by #37144

@ZeroMakesAll

System Info

  • transformers version: 4.50.2
  • Platform: Linux-5.15.0-1040-nvidia-x86_64-with-glibc2.35
  • Python version: 3.12.9
  • Huggingface_hub version: 0.29.3
  • Safetensors version: 0.5.3
  • Accelerate version: 1.5.2
  • Accelerate config: not found
  • DeepSpeed version: not installed
  • PyTorch version (GPU?): 2.6.0+cu124 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: NVIDIA H800

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Save bool values in the model parameters
  2. Load the model with device_map="auto" (a minimal sketch follows below)
  3. An error occurs in modeling_utils.caching_allocator_warmup (line 5854), because each bool value is counted as 1/8 byte, which makes byte_count a float
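
A minimal reproduction sketch (the checkpoint name is hypothetical; any checkpoint whose state dict contains torch.bool tensors triggers the same path):

from transformers import AutoModel

# hypothetical checkpoint that stores torch.bool parameters/buffers
model = AutoModel.from_pretrained(
    "some-org/model-with-bool-buffers",  # illustrative name only
    device_map="auto",                   # routes through caching_allocator_warmup
)
# -> TypeError inside modeling_utils.caching_allocator_warmup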

Expected behavior

Before allocating GPU memory, byte_count should be type-checked (or rounded to an integer).
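
One possible guard is sketched below; byte_count and device stand for the variables inside caching_allocator_warmup, and this is only an illustration of the expected behavior, not necessarily what the eventual fix does:

import math

# byte_count can be fractional when bool params are counted at 1/8 byte each;
# rounding up to whole bytes before the warmup allocation avoids the TypeError
byte_count = int(math.ceil(byte_count))
_ = torch.empty(byte_count // 2, dtype=torch.float16, device=device, requires_grad=False)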

@Rocketknight1
Member

Rocketknight1 commented Mar 28, 2025

Hi @ZeroMakesAll, I'm not sure I understand. torch.bool actually uses a full byte (8 bits) per entry, so entries are byte-aligned.

>>> x = torch.ones((32768, 32768), dtype=torch.bool, device="cuda")
>>> x.untyped_storage().nbytes()
1073741824  # 32768 * 32768, 1 byte per entry

@ZeroMakesAll
Author

@Rocketknight1 Thanks, the information you provided was very helpful. However, I found that transformers defines the size of bool here (modeling_utils.dtype_byte_size):

if dtype == torch.bool:
    return 1 / 8

It seems that Hugging Face uses this function to estimate the memory allocation. It returns a float, which causes the TypeError at modeling_utils line 5854:

_ = torch.empty(byte_count // 2, dtype=torch.float16, device=device, requires_grad=False)

Here, byte_count can't be a float, since torch.empty() only accepts integer sizes.
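
For illustration (the element count is arbitrary):

>>> import torch
>>> from transformers.modeling_utils import dtype_byte_size
>>> dtype_byte_size(torch.bool)
0.125
>>> byte_count = 1000 * dtype_byte_size(torch.bool)
>>> byte_count  # a float, not an int
125.0
>>> torch.empty(byte_count // 2, dtype=torch.float16)  # raises TypeError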

@Rocketknight1
Member

Hi @ZeroMakesAll, thanks for that! This is definitely a bug in dtype_byte_size. I'll make a PR to fix it.

@Rocketknight1
Member

Fix open at #37144
