chore(trainer): add and improve trainer signature #1838
Conversation
This file is automatically generated! (`# Generated content DO NOT EDIT`)
What you can do is update the `WordLevelTrainer`'s signature in the bindings!
Hi @ArthurZucker, thanks for the reminder, I didn't see that warning before! I have restored the changes and added the signatures for `__init__` to fix python type check errors in `WordLevelTrainer`.
Make sure lists and dicts are not defaulted to `[]` or `{}`, and then run `make` in `bindings/python` to generate the init file.
```diff
-#[pyo3(signature = (**kwargs), text_signature = None)]
+#[pyo3(
+    signature = (**kwargs),
+    text_signature = "(self, vocab_size=30000, min_frequency=0, show_progress=True, special_tokens=[], limit_alphabet=None, initial_alphabet=[], continuing_subword_prefix=None, end_of_word_suffix=None, max_token_length=None, words={})"
```
text_signature = "(self, vocab_size=30000, min_frequency=0, show_progress=True, special_tokens=[], limit_alphabet=None, initial_alphabet=[], continuing_subword_prefix=None, end_of_word_suffix=None, max_token_length=None, words={})" | |
text_signature = "(self, vocab_size=30000, min_frequency=0, show_progress=True, special_tokens=None, limit_alphabet=None, initial_alphabet=None, continuing_subword_prefix=None, end_of_word_suffix=None, max_token_length=None, words=None)" |
Defaulting to mutables is a mess in Python.
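For context on why `[]`/`{}` defaults get flagged: in plain Python a mutable default is created once, at function definition time, and then shared across calls. A minimal sketch of the pitfall (the `add_token` helper below is hypothetical, not from this codebase):

```python
def add_token(token, special_tokens=[]):
    # the [] above is evaluated once, so every call that omits the argument
    # mutates the same shared list
    special_tokens.append(token)
    return special_tokens

print(add_token("[CLS]"))  # ['[CLS]']
print(add_token("[SEP]"))  # ['[CLS]', '[SEP]']  <- state leaked from the first call

def add_token_safe(token, special_tokens=None):
    # idiomatic fix: default to None and build the list inside the function
    if special_tokens is None:
        special_tokens = []
    special_tokens.append(token)
    return special_tokens

print(add_token_safe("[CLS]"))  # ['[CLS]']
print(add_token_safe("[SEP]"))  # ['[SEP]']
```

Since the trainers accept `**kwargs` and the real defaults are applied on the Rust side, the `text_signature` string here is documentation metadata rather than executed Python, so the pitfall does not literally bite; the suggestion is mainly about following the usual Python convention.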
Well, I also noticed this. There is a question, though: I find that the `text_signature` in `WordPieceTrainer` already uses `[]`:

```rust
#[new]
#[pyo3(
    signature = (**kwargs),
    text_signature = "(self, vocab_size=30000, min_frequency=0, show_progress=True, special_tokens=[], limit_alphabet=None, initial_alphabet=[], continuing_subword_prefix=\"##\", end_of_word_suffix=None)"
)]
pub fn new(kwargs: Option<&Bound<'_, PyDict>>) -> PyResult<(Self, PyTrainer)> {
```

And I tested it in Python and found that the default values are the same as the `text_signature`:
```python
In [5]: import tokenizers; tokenizers.trainers.WordPieceTrainer()
Out[5]: WordPieceTrainer(WordPieceTrainer(bpe_trainer=BpeTrainer(min_frequency=0, vocab_size=30000, show_progress=True, special_tokens=[], limit_alphabet=None, initial_alphabet=[], continuing_subword_prefix="##", end_of_word_suffix=None, max_token_length=None, words={})))
```
And the doc for `WordPieceTrainer` is consistent with the real default values:
```
Init signature:
tokenizers.trainers.WordPieceTrainer(
    self,
    vocab_size=30000,
    min_frequency=0,
    show_progress=True,
    special_tokens=[],
    limit_alphabet=None,
    initial_alphabet=[],
    continuing_subword_prefix='##',
    end_of_word_suffix=None,
)
Docstring:
Trainer capable of training a WordPiece model
...(ignored)
```
So maybe we should keep the `text_signature` the same as the real default values, to avoid introducing breaking changes and confusing users (due to the inconsistency).
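A quick way to see why the consistency matters (an illustrative check, not part of the PR; it assumes the usual CPython behaviour where `inspect.signature` falls back to the embedded `__text_signature__` for extension types):

```python
import inspect
import tokenizers

# What help()/IPython show to users comes from the __text_signature__ string
# that pyo3 embeds, i.e. exactly the text_signature set in the bindings.
print(inspect.signature(tokenizers.trainers.WordPieceTrainer))

# The defaults actually applied at runtime come from the Rust side via **kwargs.
print(tokenizers.trainers.WordPieceTrainer())
```

If the two disagree, the rendered signature documents defaults that the constructor never actually uses, which is exactly the confusion described above.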
Ok no worries then
To ease the review process, I copied the current default values of the different trainers here:

```python
In [1]: import tokenizers

In [2]: tokenizers.trainers.BpeTrainer()
Out[2]: BpeTrainer(BpeTrainer(min_frequency=0, vocab_size=30000, show_progress=True, special_tokens=[], limit_alphabet=None, initial_alphabet=[], continuing_subword_prefix=None, end_of_word_suffix=None, max_token_length=None, words={}))

In [3]: tokenizers.trainers.UnigramTrainer()
Out[3]: UnigramTrainer(UnigramTrainer(show_progress=True, vocab_size=8000, n_sub_iterations=2, shrinking_factor=0.75, special_tokens=[], initial_alphabet=[], unk_token=None, max_piece_length=16, seed_size=1000000, words={}))

In [4]: tokenizers.trainers.WordLevelTrainer()
Out[4]: WordLevelTrainer(WordLevelTrainer(min_frequency=0, vocab_size=30000, show_progress=True, special_tokens=[], words={}))

In [5]: tokenizers.trainers.WordPieceTrainer()
Out[5]: WordPieceTrainer(WordPieceTrainer(bpe_trainer=BpeTrainer(min_frequency=0, vocab_size=30000, show_progress=True, special_tokens=[], limit_alphabet=None, initial_alphabet=[], continuing_subword_prefix="##", end_of_word_suffix=None, max_token_length=None, words={})))

In [6]: tokenizers.__version__
Out[6]: '0.21.4-dev.0'
```
Thanks 😉
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.