
Conversation

shenxiangzhuang
Contributor

No description provided.

Collaborator

@ArthurZucker left a comment


This file is automatically generated ("# Generated content DO NOT EDIT"). What you can do is update the WordLevelTrainer's signature in the bindings!

@shenxiangzhuang changed the title from "chore(trainers): add __init__ to fix python type check errors in WordLevelTrainer" to "chore(trainer): add and improve trainer signature" on Aug 29, 2025
@shenxiangzhuang
Contributor Author

Hi @ArthurZucker, thanks for the reminder, I didn't see that warning before! I have restored the changes and added the signatures for BpeTrainer and WordLevelTrainer. Besides, I made a tiny spacing improvement for WordPieceTrainer. Please review it again~

Collaborator

@ArthurZucker left a comment


Make sure lists and dicts are not defaulted to [] or {}, and then run make in bindings/python to generate the init file.

-    #[pyo3(signature = (**kwargs), text_signature = None)]
+    #[pyo3(
+        signature = (**kwargs),
+        text_signature = "(self, vocab_size=30000, min_frequency=0, show_progress=True, special_tokens=[], limit_alphabet=None, initial_alphabet=[], continuing_subword_prefix=None, end_of_word_suffix=None, max_token_length=None, words={})"
+    )]
Collaborator


Suggested change
-        text_signature = "(self, vocab_size=30000, min_frequency=0, show_progress=True, special_tokens=[], limit_alphabet=None, initial_alphabet=[], continuing_subword_prefix=None, end_of_word_suffix=None, max_token_length=None, words={})"
+        text_signature = "(self, vocab_size=30000, min_frequency=0, show_progress=True, special_tokens=None, limit_alphabet=None, initial_alphabet=None, continuing_subword_prefix=None, end_of_word_suffix=None, max_token_length=None, words=None)"

Defaulting to mutables is a mess in Python.
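
(For context, a minimal sketch of the Python pitfall being referenced, using a hypothetical helper rather than the tokenizers API: a mutable default is evaluated once, at definition time, and then shared across calls.)

# Hypothetical illustration of the mutable-default pitfall; not tokenizers code.
def add_special_token(token, special_tokens=[]):  # the list is created once and reused
    special_tokens.append(token)
    return special_tokens

print(add_special_token("[CLS]"))  # ['[CLS]']
print(add_special_token("[SEP]"))  # ['[CLS]', '[SEP]'] -- leftover state from the first call

# The conventional fix: default to None and build the list inside the function.
def add_special_token_safe(token, special_tokens=None):
    if special_tokens is None:
        special_tokens = []
    special_tokens.append(token)
    return special_tokens

(Note that the [] and {} in the trainers' text_signature are documentation strings rendered by pyo3, not real Python-level defaults, so no state is actually shared; the concern is about the convention the documented signature advertises.)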

Contributor Author


Well, I also noticed this. But there is a question: I find that the text_signature in WordPieceTrainer already uses []:

#[new]
    #[pyo3(
        signature = (** kwargs),
        text_signature = "(self, vocab_size=30000, min_frequency=0, show_progress=True, special_tokens=[], limit_alphabet=None, initial_alphabet= [],continuing_subword_prefix=\"##\", end_of_word_suffix=None)"
    )]
    pub fn new(kwargs: Option<&Bound<'_, PyDict>>) -> PyResult<(Self, PyTrainer)> {

And I tested it in Python and found that the default values are just the same as in the text_signature:

In [5]: import tokenizers; tokenizers.trainers.WordPieceTrainer()
Out[5]: WordPieceTrainer(WordPieceTrainer(bpe_trainer=BpeTrainer(min_frequency=0, vocab_size=30000, show_progress=True, special_tokens=[], limit_alphabet=None, initial_alphabet=[], continuing_subword_prefix="##", end_of_word_suffix=None, max_token_length=None, words={})))

And the doc for WordPieceTrainer is consistent with the real default values:

Init signature:
tokenizers.trainers.WordPieceTrainer(
    self,
    vocab_size=30000,
    min_frequency=0,
    show_progress=True,
    special_tokens=[],
    limit_alphabet=None,
    initial_alphabet=[],
    continuing_subword_prefix='##',
    end_of_word_suffix=None,
)
Docstring:     
Trainer capable of training a WordPiece model

...(ignored)

So maybe we should keep the text_signature the same as the real default values, to avoid introducing breaking changes and confusing users (due to the inconsistency).
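
(A quick way to check that the documented signature and the effective defaults agree — a minimal sketch; it assumes inspect.signature picks up the __text_signature__ that pyo3 embeds, which is the same mechanism behind the IPython "Init signature" output shown above.)

import inspect
import tokenizers

# The documented defaults, as parsed from the embedded text_signature:
sig = inspect.signature(tokenizers.trainers.WordPieceTrainer)
for name, param in sig.parameters.items():
    print(f"{name} = {param.default!r}")

# The effective defaults, as reported by a trainer built with no arguments:
print(tokenizers.trainers.WordPieceTrainer())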

Collaborator


Ok no worries then

@shenxiangzhuang marked this pull request as draft on August 29, 2025 09:21
@shenxiangzhuang force-pushed the chore/fix_trainer_types branch from 74ac182 to 0547793 on August 29, 2025 09:31
@shenxiangzhuang
Contributor Author

shenxiangzhuang commented Aug 29, 2025

To ease the review process, I have copied the current default values of the different trainers here:

In [1]: import tokenizers

In [2]: tokenizers.trainers.BpeTrainer()
Out[2]: BpeTrainer(BpeTrainer(min_frequency=0, vocab_size=30000, show_progress=True, special_tokens=[], limit_alphabet=None, initial_alphabet=[], continuing_subword_prefix=None, end_of_word_suffix=None, max_token_length=None, words={}))

In [3]: tokenizers.trainers.UnigramTrainer()
Out[3]: UnigramTrainer(UnigramTrainer(show_progress=True, vocab_size=8000, n_sub_iterations=2, shrinking_factor=0.75, special_tokens=[], initial_alphabet=[], unk_token=None, max_piece_length=16, seed_size=1000000, words={}))

In [4]: tokenizers.trainers.WordLevelTrainer()
Out[4]: WordLevelTrainer(WordLevelTrainer(min_frequency=0, vocab_size=30000, show_progress=True, special_tokens=[], words={}))

In [5]: tokenizers.trainers.WordPieceTrainer()
Out[5]: WordPieceTrainer(WordPieceTrainer(bpe_trainer=BpeTrainer(min_frequency=0, vocab_size=30000, show_progress=True, special_tokens=[], limit_alphabet=None, initial_alphabet=[], continuing_subword_prefix="##", end_of_word_suffix=None, max_token_length=None, words={})))

In [6]: tokenizers.__version__
Out[6]: '0.21.4-dev.0'

@shenxiangzhuang marked this pull request as ready for review on August 29, 2025 09:48
Collaborator

@ArthurZucker left a comment


Thanks 😉

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker merged commit c0d3697 into huggingface:main on Sep 16, 2025
28 checks passed