Add option to completion API to truncate prompt tokens #3144
Conversation
Thanks @tdoublep! I'm wondering whether, to align with existing vLLM parlance, it might be better to rename this to truncate_prompt_tokens?
tokenizer_kwargs: more efficient allocation Co-authored-by: Nick Hill <nickhill@us.ibm.com>
vllm-project#2879 added support for using ray to offload tokenization from the asyncio event loop. This PR extends that to support using a thread pool instead of ray, and makes that the default, with the default pool size determined based on the number of available CPU cores and the tensor parallel size. The main thing to note is that separate tokenizer instances are used per thread. This is because officially the HF tokenizers are not thread-safe. In practice I think they are unless you're making use of padding/truncation, which we aren't currently but may want to soon (see for example vllm-project#3144). Also includes some type hint additions to related parts of the code. This replaces the original PR vllm-project#3206 from before vllm-project#2879 was reworked and merged.
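For illustration only, here is a minimal sketch of the per-thread-tokenizer idea described above (not the actual vLLM code; names such as make_tokenizer and NUM_WORKERS are placeholders, and the pool sizing heuristic is assumed):

```python
# Offload tokenization to a thread pool with one tokenizer per worker thread,
# since HF tokenizers are not officially thread-safe.
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import List

from transformers import AutoTokenizer

NUM_WORKERS = 4  # e.g. derived from CPU count and tensor parallel size
_local = threading.local()


def make_tokenizer():
    return AutoTokenizer.from_pretrained("gpt2")


def _encode(prompt: str) -> List[int]:
    # Lazily create a dedicated tokenizer for the current worker thread.
    if not hasattr(_local, "tokenizer"):
        _local.tokenizer = make_tokenizer()
    return _local.tokenizer(prompt).input_ids


_pool = ThreadPoolExecutor(max_workers=NUM_WORKERS)


async def encode_async(prompt: str) -> List[int]:
    # Keep the asyncio event loop free while tokenization runs in the pool.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(_pool, _encode, prompt)
```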
lgtm! I believe there are merge conflicts since documentation was added to protocol.py
vllm/entrypoints/openai/protocol.py (outdated)
@@ -183,6 +183,7 @@ class CompletionRequest(BaseModel):
    guided_json: Optional[Union[str, dict, BaseModel]] = None
    guided_regex: Optional[str] = None
    guided_choice: Optional[List[str]] = None
    truncate_prompt_tokens: Optional[int] = None
nit: can you make this a constrained integer? https://docs.pydantic.dev/2.3/api/types/#pydantic.types.conint
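A minimal sketch of what the suggested constrained field might look like (assuming pydantic's conint as linked above; surrounding fields are elided and the exact pydantic version used by vLLM may differ):

```python
from typing import Optional

from pydantic import BaseModel, conint


class CompletionRequest(BaseModel):
    # ...other request fields elided...
    # Constrained integer: require keeping at least one prompt token.
    truncate_prompt_tokens: Optional[conint(ge=1)] = None
```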
done
CI errors do not look related to these changes
Apologies @tdoublep - it seems I submitted a simple comment as a review. That was not my intention. I think this is great and would be very excited to have this merged, as we have also had to implement tokenization client-side to prevent vLLM rejecting requests that are too long. My previous comment is about the truncation side: for various reasons/formats we'd want to be able to trim from either the left or the right, and since this is already a parameter that has to get set, it may make sense to add both together. Thanks!
If this is a valid use case, maybe let's support it in this PR as well?
For reference, this is the HF tokenizer docs around truncation_side. This PR hard-codes that to "left".
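To make the semantics concrete, left-side truncation here just means keeping the last N prompt tokens; a rough sketch of the idea (a hypothetical helper, not the code from this PR):

```python
from typing import List, Optional


def truncate_left(input_ids: List[int],
                  truncate_prompt_tokens: Optional[int]) -> List[int]:
    """Keep only the last `truncate_prompt_tokens` tokens (left truncation)."""
    if truncate_prompt_tokens is None or len(input_ids) <= truncate_prompt_tokens:
        return input_ids
    # Drop tokens from the left so the most recent context is preserved.
    return input_ids[-truncate_prompt_tokens:]
```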
@diego898 @simon-mo could you give an example of a case where it would make sense to truncate on the right for autoregressive text generation? I can't think of one, but that could be a failure of imagination. I don't think the fact that HF tokenizers support this is in itself a good reason to support it here; they are much more general and are used in many other contexts, such as data prep and training, and with different kinds of models.
@njhill - great point. I guess I was thinking it would depend on how someone structures their context window. E.g., for RAG:
You may want to only take from the left. But if instead you did:
You may want to trim from the right? Or
again from the right? I'm not really suggesting any of these layouts are better/worse. Several wouldn't even make sense... Just relaying what I was thinking when suggesting to make it configurable. But if vLLM decides left-side truncation is the norm/default, that would also be fine and end-users can structure their context windows accordingly!
Thanks @diego898. TBH I don't think it would make sense to use this truncate option at all in conjunction with a system prompt. Some other form of truncation would need to be used, i.e. when applying the chat template, to retain the system prompt and then exclude the beginning of the user prompt or conversation. And I don't think that right-truncation would be sensible either for any of the examples you gave. In a chat the last thing you would want to exclude is the most recent messages, and if you truncated in the middle of some arbitrary context docs, the LM would just attempt to continue writing that particular doc. I would still vote for keeping this left-truncated only. If a concrete need arose in future it would still be easy to add the option to configure the side.
I apologize for the confusion my request caused! I 100% agree with you @njhill - that is in fact what we do - truncate the chat history and not either side. I stretched my brain to try and describe a situation where right truncation may make sense, but didn't convince even myself! I confess, my initial request was based solely on the fact that HF has it configurable. I apologize @tdoublep for the extra work this caused you!
I trust @njhill to decide and merge.
Force-pushed from 073b208 to c88be4f
@njhill I've reverted back to the version with truncation side fixed to left, and resolved some minor conflicts with changes on the main branch. I think it is ready to merge.
IDK why ruff is failing - the formatting checks are passing for me locally:
Thanks for the contribution @tdoublep!
Why not also add it for the Chat Completion API?
@tdoublep Thanks for your earlier work. Just to check, am I seeing correctly that this option is still not supported in the Chat Completion API?
@m-harmonic There was a PR that tried to address this: #4598. Maybe we can check with @yecohn regarding the status. I'm happy to help finish it if required.
Hey vLLM team - thanks for the awesome project.
This PR adds an additional option to the OpenAI completion API that allows one to truncate the number of input tokens for an individual request. We find this feature extremely useful for benchmarking and performance evaluation.
Without this option, if we want precise control of the number of input tokens, we need to implement tokenization on the client-side (e.g., in our load test environment) which introduces a bunch of dependencies. In this sense, we can live without this feature but it is super convenient to be able to do truncation on the server-side.
I have tried to keep the changes to a minimum, but if there is interest I have also implemented this for the AsyncLLMEngine and LLMEngine.
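For context, a request using this option might look roughly like the following sketch (the base URL, model name, and values are placeholders; truncate_prompt_tokens is passed via extra_body since it is a vLLM extension to the OpenAI schema):

```python
# Assumes a vLLM OpenAI-compatible server running locally; URL/model are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="my-model",
    prompt="some very long prompt " * 1000,
    max_tokens=16,
    # vLLM-specific extension: keep only the last 128 prompt tokens.
    extra_body={"truncate_prompt_tokens": 128},
)
print(completion.choices[0].text)
```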