feat: support indicating prefix token of chat template #28473
base: main
Conversation
Hi @congchan - firstly, apologies for taking so long to get to this one - it slipped past me the first time I was pinged! This seems like a clean PR, but I'm not sure we can accept it as-is: the list of special tokens that we have specific code for is very short, and I think this would make more sense as an added token in the models that support it, since most models will not. However, you're not the only user who wants a clean way to separate user and assistant messages in the tokens.
Hi, thanks for your feedback. Indeed, it is better to keep the list of special tokens short. Besides, I suggest this because, in a production environment with curated multi-turn datasets or bad-case hotfixing, we can revise specific turns to become high quality without changing the rest of the turns. Users can choose to train their model to learn only the specific turns they believe to be high quality, and ignore the others.
I have already been using this in-out pipeline in my local training (but not yet making use of the
What do you think? I can also help with it.
What does this PR do?
In chat language model training, we sometimes need to mask the input from real users and train the model solely on the assistant's outputs.
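As a sketch of what that masking looks like in practice (the helper name and token ids below are illustrative, not part of this PR):

```python
# Loss masking for chat fine-tuning: user tokens get label -100, which
# torch.nn.CrossEntropyLoss ignores by default, while assistant tokens
# keep their ids and contribute to the loss.
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def mask_user_tokens(input_ids, assistant_mask):
    """Return labels where only assistant tokens contribute to the loss."""
    return [tok if is_assistant else IGNORE_INDEX
            for tok, is_assistant in zip(input_ids, assistant_mask)]

input_ids      = [101, 7592, 102, 2023, 2003, 102]   # illustrative ids
assistant_mask = [False, False, False, True, True, True]
print(mask_user_tokens(input_ids, assistant_mask))
# [-100, -100, -100, 2023, 2003, 102]
```

The open question this PR addresses is how to compute `assistant_mask` reliably, which is where a dedicated prefix token comes in.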
This PR adds a special prefix token, which can be applied in the `chat_template`, so that we can make use of this `prefix_token` to dynamically separate dialog turns from `user` and `assistant`.

For example, the `prefix_token` could be `<|im_start|>assistant\n`, and we can make use of this token in the `chat_template`, for example `{% if add_generation_prompt %}{{ prefix_token }}{% endif %}`. Each assistant turn is then delimited by `tokenizer.prefix_token` and `tokenizer.eos_token`.
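To illustrate how such delimiters could be used downstream, here is a hedged sketch that recovers assistant spans from a rendered chat string with plain regex matching. The marker strings mirror the proposed `tokenizer.prefix_token` / `tokenizer.eos_token` values, and the rendered string is illustrative; this is not the actual transformers API.

```python
import re

# Proposed markers: prefix marks where an assistant turn begins,
# eos marks where it ends (values assumed for illustration).
prefix = "<|im_start|>assistant\n"
eos = "<|im_end|>"

# A chat rendered by a ChatML-style template (illustrative content).
rendered = (
    "<|im_start|>user\nHi there<|im_end|>"
    "<|im_start|>assistant\nHello!<|im_end|>"
    "<|im_start|>user\nBye<|im_end|>"
    "<|im_start|>assistant\nGoodbye!<|im_end|>"
)

# Everything between prefix and the next eos is an assistant span,
# i.e. the text whose tokens should receive loss during training.
pattern = re.escape(prefix) + r"(.*?)" + re.escape(eos)
assistant_spans = re.findall(pattern, rendered, flags=re.DOTALL)
print(assistant_spans)  # ['Hello!', 'Goodbye!']
```

Doing this on token ids rather than strings works the same way once the prefix token is a single special token, which is one reason to expose it on the tokenizer.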
Before submitting
- [ ] Did you read the contributor guideline, Pull Request section?
- [ ] Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.