
Commit 4115ba0
Merge remote-tracking branch 'upstream/main' into improve-support-nested-fast-image-proc
2 parents: 8c4f74f + 0f77ca7

109 files changed: +1525 −775 lines


.github/workflows/pr-style-bot.yml (+19)

@@ -0,0 +1,19 @@
+# To run this bot, comment "@bot /style" on a PR
+name: Style Bot
+
+on:
+  issue_comment:
+    types: [created]
+
+permissions:
+  contents: write
+  pull-requests: write
+
+jobs:
+  style:
+    uses: huggingface/huggingface_hub/.github/workflows/style-bot-action.yml@main
+    with:
+      python_quality_dependencies: "[quality]"
+      style_command_type: "default"
+    secrets:
+      bot_token: ${{ secrets.GITHUB_TOKEN }}

.github/workflows/self-comment-ci.yml (+1 −1)

@@ -29,7 +29,7 @@ jobs:
     runs-on: ubuntu-22.04
     name: Get PR number
     # For security: only allow team members to run
-    if: ${{ github.event.issue.state == 'open' && contains(fromJSON('["ydshieh", "ArthurZucker", "zucchini-nlp", "qubvel", "molbap", "gante", "LysandreJik", "Cyrilvallez", "Rocketknight1", "SunMarc", "muellerzr", "eustlb", "MekkCyber"]'), github.actor) && (startsWith(github.event.comment.body, 'run-slow') || startsWith(github.event.comment.body, 'run slow') || startsWith(github.event.comment.body, 'run_slow')) }}
+    if: ${{ github.event.issue.state == 'open' && contains(fromJSON('["ydshieh", "ArthurZucker", "zucchini-nlp", "qubvel", "molbap", "gante", "LysandreJik", "Cyrilvallez", "Rocketknight1", "SunMarc", "muellerzr", "eustlb", "MekkCyber", "manueldeprada"]'), github.actor) && (startsWith(github.event.comment.body, 'run-slow') || startsWith(github.event.comment.body, 'run slow') || startsWith(github.event.comment.body, 'run_slow')) }}
     outputs:
       PR_NUMBER: ${{ steps.set_pr_number.outputs.PR_NUMBER }}
     steps:

README.md (+5)

@@ -98,7 +98,12 @@ Install Transformers from source if you want the latest changes in the library o
 ```shell
 git clone https://github.com/huggingface/transformers.git
 cd transformers
+
+# pip
 pip install .[torch]
+
+# uv
+uv pip install .[torch]
 ```
 
 ## Quickstart

docs/source/en/generation_strategies.md (+244 −71)

Large diffs are not rendered by default.

docs/source/en/model_doc/albert.md (+1)

@@ -57,6 +57,7 @@ This model was contributed by [lysandre](https://huggingface.co/lysandre). This
 - Embedding size E is different from hidden size H justified because the embeddings are context independent (one embedding vector represents one token), whereas hidden states are context dependent (one hidden state represents a sequence of tokens) so it's more logical to have H >> E. Also, the embedding matrix is large since it's V x E (V being the vocab size). If E < H, it has less parameters.
 - Layers are split in groups that share parameters (to save memory).
 Next sentence prediction is replaced by a sentence ordering prediction: in the inputs, we have two sentences A and B (that are consecutive) and we either feed A followed by B or B followed by A. The model must predict if they have been swapped or not.
+- The `head_mask` argument is ignored when using all attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`
 
 ### Using Scaled Dot Product Attention (SDPA)
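For reference, the note added here (and repeated across the model docs below) can be exercised with a short snippet. A minimal sketch, assuming the `albert/albert-base-v2` checkpoint and using `AutoModel` in place of the `XXXModel` placeholder:

```python
# Minimal sketch: head_mask only takes effect with the "eager" attention implementation.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "albert/albert-base-v2"  # assumed checkpoint, for illustration only
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, attn_implementation="eager")

inputs = tokenizer("Hello world", return_tensors="pt")
# One row per layer, one column per head: 1.0 keeps a head, 0.0 masks it.
head_mask = torch.ones(model.config.num_hidden_layers, model.config.num_attention_heads)
head_mask[:, 0] = 0.0  # mask the first head in every layer
outputs = model(**inputs, head_mask=head_mask)
print(outputs.last_hidden_state.shape)
```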

docs/source/en/model_doc/bart.md (+1)

@@ -55,6 +55,7 @@ This model was contributed by [sshleifer](https://huggingface.co/sshleifer). The
 * mask a span of k tokens with a single mask token (a span of 0 tokens is an insertion of a mask token)
 * permute sentences
 * rotate the document to make it start at a specific token
+- The `head_mask` argument is ignored when using all attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`
 
 ## Implementation Notes

docs/source/en/model_doc/biogpt.md (+1)

@@ -36,6 +36,7 @@ This model was contributed by [kamalkraj](https://huggingface.co/kamalkraj). The
 - BioGPT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than the left.
 - BioGPT was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next token in a sequence. Leveraging this feature allows BioGPT to generate syntactically coherent text as it can be observed in the run_generation.py example script.
 - The model can take the `past_key_values` (for PyTorch) as input, which is the previously computed key/value attention pairs. Using this (past_key_values or past) value prevents the model from re-computing pre-computed values in the context of text generation. For PyTorch, see past_key_values argument of the BioGptForCausalLM.forward() method for more information on its usage.
+- The `head_mask` argument is ignored when using all attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`
 
 ### Using Scaled Dot Product Attention (SDPA)

docs/source/en/model_doc/data2vec.md (+1)

@@ -53,6 +53,7 @@ The original code for vision can be found [here](https://github.com/facebookrese
 - For Data2VecAudio, preprocessing is identical to [`Wav2Vec2Model`], including feature extraction
 - For Data2VecText, preprocessing is identical to [`RobertaModel`], including tokenization.
 - For Data2VecVision, preprocessing is identical to [`BeitModel`], including feature extraction.
+- The `head_mask` argument is ignored when using all attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`
 
 ### Using Scaled Dot Product Attention (SDPA)

docs/source/en/model_doc/gpt_bigcode.md (+4)

@@ -46,8 +46,12 @@ The main differences compared to GPT2.
 - Merge the key and value caches into one (this changes the format of layer_past/ present, does it risk creating problems?)
 - Use the memory layout (self.num_heads, 3, self.head_dim) instead of `(3, self.num_heads, self.head_dim)` for the QKV tensor with MHA. (prevents an overhead with the merged key and values, but makes the checkpoints incompatible with the original openai-community/gpt2 model).
 
+
 You can read more about the optimizations in the [original pull request](https://github.com/huggingface/transformers/pull/22575)
 
+> [!NOTE]
+> The `head_mask` argument is ignored when using all attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`
+
 ## Combining Starcoder and Flash Attention 2
 
 First, make sure to install the latest version of Flash Attention 2 to include the sliding window attention feature.

docs/source/en/model_doc/hubert.md (+1 −1)

@@ -50,7 +50,7 @@ This model was contributed by [patrickvonplaten](https://huggingface.co/patrickv
 - Hubert is a speech model that accepts a float array corresponding to the raw waveform of the speech signal.
 - Hubert model was fine-tuned using connectionist temporal classification (CTC) so the model output has to be decoded
 using [`Wav2Vec2CTCTokenizer`].
-
+- The `head_mask` argument is ignored when using all attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`
 
 ## Using Flash Attention 2

docs/source/en/model_doc/m2m_100.md (+3)

@@ -51,6 +51,9 @@ multilingual it expects the sequences in a certain format: A special language id
 source and target text. The source text format is `[lang_code] X [eos]`, where `lang_code` is source language
 id for source text and target language id for target text, with `X` being the source or target text.
 
+> [!NOTE]
+> The `head_mask` argument is ignored when using all attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`
+
 The [`M2M100Tokenizer`] depends on `sentencepiece` so be sure to install it before running the
 examples. To install `sentencepiece` run `pip install sentencepiece`.
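For reference, the `[lang_code] X [eos]` format described in this context is handled by the tokenizer's `src_lang`/`tgt_lang` settings. A minimal sketch, assuming the `facebook/m2m100_418M` checkpoint:

```python
# Minimal sketch: the tokenizer inserts the language ids described above.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M", src_lang="en", tgt_lang="fr")
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")

# Source text is encoded as "[lang_code] X [eos]" automatically.
inputs = tokenizer("Life is like a box of chocolates.", return_tensors="pt")
# Force the first generated token to be the target language id.
generated = model.generate(**inputs, forced_bos_token_id=tokenizer.get_lang_id("fr"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```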

docs/source/en/model_doc/mbart.md (+3)

@@ -35,6 +35,9 @@ You can find all the original mBART checkpoints under the [AI at Meta](https://h
 > [!TIP]
 > Click on the mBART models in the right sidebar for more examples of applying mBART to different language tasks.
 
+> [!NOTE]
+> The `head_mask` argument is ignored when using all attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`
+
 The example below demonstrates how to translate text with [`Pipeline`] or the [`AutoModel`] class.
 
 <hfoptions id="usage">

docs/source/en/model_doc/musicgen.md (+3)

@@ -62,6 +62,9 @@ python src/transformers/models/musicgen/convert_musicgen_transformers.py \
     --checkpoint small --pytorch_dump_folder /output/path --safe_serialization
 ```
 
+> [!NOTE]
+> The `head_mask` argument is ignored when using all attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`
+
 ## Generation
 
 MusicGen is compatible with two generation modes: greedy and sampling. In practice, sampling leads to significantly

docs/source/en/model_doc/musicgen_melody.md (+3)

@@ -44,6 +44,9 @@ There are two key differences with MusicGen:
 1. The audio prompt is used here as a conditional signal for the generated audio sample, whereas it's used for audio continuation in [MusicGen](https://huggingface.co/docs/transformers/main/en/model_doc/musicgen).
 2. Conditional text and audio signals are concatenated to the decoder's hidden states instead of being used as a cross-attention signal, as in MusicGen.
 
+> [!NOTE]
+> The `head_mask` argument is ignored when using all attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`
+
 ## Generation
 
 MusicGen Melody is compatible with two generation modes: greedy and sampling. In practice, sampling leads to significantly better results than greedy, thus we encourage sampling mode to be used where possible. Sampling is enabled by default, and can be explicitly specified by setting `do_sample=True` in the call to [`MusicgenMelodyForConditionalGeneration.generate`], or by overriding the model's generation config (see below).

docs/source/en/model_doc/opt.md (+3)

@@ -41,6 +41,9 @@ Tips:
 - OPT has the same architecture as [`BartDecoder`].
 - Contrary to GPT2, OPT adds the EOS token `</s>` to the beginning of every prompt.
 
+> [!NOTE]
+> The `head_mask` argument is ignored when using all attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`
+
 ## Resources
 
 A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with OPT. If you're

docs/source/en/model_doc/qwen2_audio.md (+3)

@@ -40,6 +40,9 @@ The abstract from the paper is the following:
 
 `Qwen2-Audio-7B` and `Qwen2-Audio-7B-Instruct` can be found on the [Huggingface Hub](https://huggingface.co/Qwen)
 
+> [!NOTE]
+> The `head_mask` argument is ignored when using all attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`
+
 ### Inference
 
 ```python

docs/source/en/model_doc/sew.md (+3)

@@ -46,6 +46,9 @@ This model was contributed by [anton-l](https://huggingface.co/anton-l).
 - SEWForCTC is fine-tuned using connectionist temporal classification (CTC) so the model output has to be decoded using
 [`Wav2Vec2CTCTokenizer`].
 
+> [!NOTE]
+> The `head_mask` argument is ignored when using all attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`
+
 ## Resources
 
 - [Audio classification task guide](../tasks/audio_classification)

docs/source/en/model_doc/unispeech-sat.md (+3)

@@ -54,6 +54,9 @@ found [here](https://github.com/microsoft/UniSpeech/tree/main/UniSpeech-SAT).
 decoded using [`Wav2Vec2CTCTokenizer`].
 - UniSpeechSat performs especially well on speaker verification, speaker identification, and speaker diarization tasks.
 
+> [!NOTE]
+> The `head_mask` argument is ignored when using all attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`
+
 ## Resources
 
 - [Audio classification task guide](../tasks/audio_classification)

docs/source/en/model_doc/unispeech.md (+3)

@@ -49,6 +49,9 @@ found [here](https://github.com/microsoft/UniSpeech/tree/main/UniSpeech).
 - UniSpeech model can be fine-tuned using connectionist temporal classification (CTC) so the model output has to be
 decoded using [`Wav2Vec2CTCTokenizer`].
 
+> [!NOTE]
+> The `head_mask` argument is ignored when using all attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`
+
 ## Resources
 
 - [Audio classification task guide](../tasks/audio_classification)

docs/source/en/model_doc/vilt.md (+5)

@@ -72,6 +72,11 @@ This model was contributed by [nielsr](https://huggingface.co/nielsr). The origi
 [[autodoc]] ViltImageProcessor
     - preprocess
 
+## ViltImageProcessorFast
+
+[[autodoc]] ViltImageProcessorFast
+    - preprocess
+
 ## ViltProcessor
 
 [[autodoc]] ViltProcessor
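Since this diff documents the new fast image processor, a minimal usage sketch may help; it assumes the `dandelin/vilt-b32-finetuned-vqa` checkpoint and an installed version that already ships `ViltImageProcessorFast`:

```python
# Minimal sketch: request the fast ViLT image processor (assumes it exists in your installed version).
import numpy as np
from PIL import Image
from transformers import AutoImageProcessor

processor = AutoImageProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa", use_fast=True)

image = Image.fromarray(np.zeros((384, 384, 3), dtype=np.uint8))  # dummy image
batch = processor(images=image, return_tensors="pt")
print(type(processor).__name__, list(batch.keys()))
```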

docs/source/en/model_doc/wav2vec2.md (+3)

@@ -50,6 +50,9 @@ Note: Meta (FAIR) released a new version of [Wav2Vec2-BERT 2.0](https://huggingf
 - Wav2Vec2 model was trained using connectionist temporal classification (CTC) so the model output has to be decoded
 using [`Wav2Vec2CTCTokenizer`].
 
+> [!NOTE]
+> The `head_mask` argument is ignored when using all attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`
+
 ## Using Flash Attention 2
 
 Flash Attention 2 is an faster, optimized version of the model.
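For reference, the CTC decoding step mentioned in this context looks roughly like the following; a minimal sketch, assuming the `facebook/wav2vec2-base-960h` checkpoint and a 16 kHz mono waveform:

```python
# Minimal sketch: decode Wav2Vec2 CTC output with the processor-wrapped Wav2Vec2CTCTokenizer.
import numpy as np
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_id = "facebook/wav2vec2-base-960h"  # assumed checkpoint, for illustration
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

waveform = np.zeros(16000, dtype=np.float32)  # stand-in for one second of real 16 kHz audio
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predicted_ids = torch.argmax(logits, dim=-1)
# The tokenizer collapses repeated ids and removes CTC blanks during decoding.
print(processor.batch_decode(predicted_ids))
```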

docs/source/en/model_doc/whisper.md (+3)

@@ -32,6 +32,9 @@ rendered properly in your Markdown viewer.
 
 You can find all the original Whisper checkpoints under the [Whisper](https://huggingface.co/collections/openai/whisper-release-6501bba2cf999715fd953013) collection.
 
+> [!NOTE]
+> The `head_mask` argument is ignored when using all attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`
+
 > [!TIP]
 > Click on the Whisper models in the right sidebar for more examples of how to apply Whisper to different audio tasks.

docs/source/en/trainer.md (+3 −3)

@@ -372,14 +372,14 @@ accelerate launch \
 
 ### torch.compile
 
-[torch.compile](./perf_torch_compile) can significantly speed up training and reduce computational overhead. Configure your torch.compile settings in [`TrainingArguments`]. Set `torch.compile` to `True`, and select a backend and compile mode.
+[torch.compile](./perf_torch_compile) can significantly speed up training and reduce computational overhead. Configure your torch.compile settings in [`TrainingArguments`]. Set `torch_compile` to `True`, and select a backend and compile mode.
 
 ```py
 from transformers import TrainingArguments
 
 training_args = TrainingArguments(
-    torch.compile=True,
-    torch.compile_backend="inductor",
+    torch_compile=True,
+    torch_compile_backend="inductor",
     torch_compile_mode="default",
     ...,
 )
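The corrected argument names above can be checked with a short snippet; a minimal sketch, with `output_dir` as an assumed placeholder:

```python
# Minimal sketch: enable torch.compile through TrainingArguments using the corrected names.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="compiled-run",  # assumed placeholder directory
    torch_compile=True,
    torch_compile_backend="inductor",
    torch_compile_mode="default",
)
print(training_args.torch_compile, training_args.torch_compile_backend, training_args.torch_compile_mode)
```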

examples/pytorch/speech-pretraining/run_wav2vec2_pretraining_no_trainer.py (+2)

@@ -21,6 +21,7 @@
 from pathlib import Path
 from typing import Optional, Union
 
+import aiohttp
 import datasets
 import torch
 from accelerate import Accelerator
@@ -454,6 +455,7 @@ def main():
             split=train_split_name,
             cache_dir=args.cache_dir,
             trust_remote_code=args.trust_remote_code,
+            storage_options={"client_kwargs": {"timeout": aiohttp.ClientTimeout(total=60 * 60)}},
         )
         datasets_splits.append(dataset_split)
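For reference, the new `storage_options` argument can be used on its own to raise the download/streaming timeout; a minimal sketch, with the dataset name and split as assumed placeholders:

```python
# Minimal sketch: pass an aiohttp client timeout to load_dataset via storage_options.
import aiohttp
from datasets import load_dataset

dataset = load_dataset(
    "librispeech_asr",  # assumed dataset name, for illustration only
    "clean",
    split="validation",
    storage_options={"client_kwargs": {"timeout": aiohttp.ClientTimeout(total=60 * 60)}},  # 1 hour
)
print(dataset)
```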

src/transformers/cache_utils.py (+6 −4)

@@ -464,7 +464,7 @@ def get_max_cache_shape(self) -> Optional[int]:
         """Returns the maximum sequence length of the cache object. DynamicCache does not have a maximum length."""
         return None
 
-    def to_legacy_cache(self) -> Tuple[Tuple[torch.Tensor], Tuple[torch.Tensor]]:
+    def to_legacy_cache(self) -> Tuple[Tuple[torch.Tensor, torch.Tensor]]:
         """Converts the `DynamicCache` instance into the its equivalent in the legacy cache format. Used for
         backward compatibility."""
         legacy_cache = ()
@@ -473,7 +473,9 @@ def to_legacy_cache(self) -> Tuple[Tuple[torch.Tensor], Tuple[torch.Tensor]]:
         return legacy_cache
 
     @classmethod
-    def from_legacy_cache(cls, past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None) -> "DynamicCache":
+    def from_legacy_cache(
+        cls, past_key_values: Optional[Tuple[Tuple[torch.FloatTensor, torch.FloatTensor]]] = None
+    ) -> "DynamicCache":
         """Converts a cache in the legacy cache format into an equivalent `DynamicCache`. Used for
         backward compatibility."""
         cache = cls()
@@ -1505,8 +1507,8 @@ def __len__(self):
         """
         return len(self.self_attention_cache)
 
-    def to_legacy_cache(self) -> Tuple[Tuple[torch.Tensor], Tuple[torch.Tensor]]:
-        """Converts the `EncoderDecoderCache` instance into its equivalent in the legacy cache format."""
+    def to_legacy_cache(self) -> Tuple[Tuple[torch.Tensor]]:
+        """Converts the `EncoderDecoderCache` instance into its equivalent in the legacy cache format."""
         legacy_cache = ()
         if len(self.cross_attention_cache) > 0:
             for self_attn, cross_attn in zip(
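The annotations above refer to the legacy tuple-of-(key, value) format; a minimal round-trip sketch, assuming a transformers version where `DynamicCache` still exposes these helpers:

```python
# Minimal sketch: round-trip between the legacy tuple format and DynamicCache.
import torch
from transformers.cache_utils import DynamicCache

batch, heads, seq_len, head_dim = 1, 2, 4, 8
legacy = tuple(
    (torch.zeros(batch, heads, seq_len, head_dim), torch.zeros(batch, heads, seq_len, head_dim))
    for _ in range(3)  # one (key, value) pair per layer
)

cache = DynamicCache.from_legacy_cache(legacy)
print(len(cache))  # number of layers holding cached states
roundtrip = cache.to_legacy_cache()
print(len(roundtrip), roundtrip[0][0].shape)
```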

0 commit comments