-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
📦 [SFT] Deprecate batched formatting_func
#3147
Conversation
Applying the function to the entire dataset just to check if it is batched doesn’t seem reasonable. In fact, That said, determining whether the function is batched is trickier than it seems. Also, I realize that we never document or test support for a non-batched function. The simplest approach would be to assume that the function is always batched, right? |
if formatting_func is not None and not is_processed:
if isinstance(dataset, Dataset): # `IterableDataset.map` does not support `desc`
map_kwargs["desc"] = f"Applying formatting function to {dataset_name} dataset"
def _func(example):
return {"text": formatting_func(example)}
dataset = dataset.map(_func, batched=True, **map_kwargs) |
In fact, there is an instance of trl/docs/source/sft_trainer.md Lines 291 to 293 in a0a5317
If only the batched function is supported, we should update the documentation. |
Indeed! Let's update the doc then |
@qgallouedec Please review the code. |
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
The test script is still using a non-batched |
I revisited this issue and explored some open codebases. Unfortunately, both forms are widely used. Here’s my suggestion: try:
dataset = dataset.map(_func, batched=False, **map_kwargs)
except Exception as e:
warnings.warn(
f"Failed to apply the formatting function due to the following error: {e}. This may be "
"because the function is designed for batched input. Please update it to process one example "
"at a time (i.e., accept and return a single example). For now, we will attempt to apply the "
"function in batched mode, but note that batched formatting is deprecated and will be removed "
"in version 0.21.",
DeprecationWarning,
)
dataset = dataset.map(_func, batched=True, **map_kwargs) Apologies for the change of direction and any extra work this may cause. |
formatting_func
returning a listformatting_func
Thank you for updating the code and submitting the PR. I really appreciate your help during my break! |
What does this PR do?
When I defined
formatting_func
to return a list of processed strings following the documentation of SFTTrainer, anIndexError
was raised.I initially thought it was a documentation error and submitted a PR (#3141).
However, after reviewing the code, I found that it was a bug in the batched judgement. The issue was located in the following line:
trl/trl/trainer/sft_trainer.py
Line 413 in dee3734
After fixing the bug, the code is now compatible with
formatting_func
returning either a processed string or a list.Before submitting
Pull Request section?
to it if that's the case.
documentation guidelines.
Who can review?
@qgallouedec