fix: Prevent going past token limit in OpenAI calls in PromptNode #4179
Conversation
… PromptNode and OpenAIAnswerGenerator
…automatically determine correct tokenizer for the requested model
Hey @sjrl - I just tried this branch out with @julian-risch and by calling the following:
We get the following response from OpenAI:
Hey @TuanaCelik, thanks for checking this and catching this bug! I didn't take the answer length into account in the token limit. I'll do that now.
…ncation amount. Moved truncation higher up to PromptNode level.
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
@sjrl just a question: wouldn't we need the actual vocabulary of GPT-3.5 to check the number of tokens reliably? GPT-2 might tokenize differently, am I wrong? Or is this more of an approximation and we are fine with some divergence?
That is a great point @mathislucka, which is why we also support the tiktoken library, the official OpenAI library for tokenizing GPT-3.5 models. However, we keep the GPT-2 tokenizer as a fallback because tiktoken does not have wheels built for ARM64 Linux (issue here), so we aren't able to provide it in the Haystack Docker image.
if "davinci" in model_name: | ||
max_tokens_limit = 4000 | ||
if USE_TIKTOKEN: | ||
tokenizer_name = MODEL_TO_ENCODING.get(model_name, "p50k_base") |
@mathislucka This is where we automatically load the correct GPT-3.5 tokenizer if the tiktoken library is available.
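For illustration, here is a minimal sketch of that fallback logic, assuming `tiktoken>=0.2.0` and `transformers` are the libraries in play; the helper name `count_tokens` and the default model name are illustrative, not the exact Haystack code:

```python
# Sketch of the tokenizer fallback discussed above (illustrative, not the exact Haystack code).
try:
    import tiktoken
    from tiktoken.model import MODEL_TO_ENCODING

    USE_TIKTOKEN = True
except ImportError:
    # tiktoken has no wheels for ARM64 Linux, so fall back to the GPT-2 tokenizer
    from transformers import GPT2TokenizerFast

    USE_TIKTOKEN = False


def count_tokens(text: str, model_name: str = "text-davinci-003") -> int:
    if USE_TIKTOKEN:
        encoding_name = MODEL_TO_ENCODING.get(model_name, "p50k_base")
        return len(tiktoken.get_encoding(encoding_name).encode(text))
    # GPT-2 tokenization only approximates the GPT-3.5 vocabularies
    return len(GPT2TokenizerFast.from_pretrained("gpt2").tokenize(text))
```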
Almost ready to go. 👍 The test case for the token limit is failing because the log message is different from the one we check for in the test; that needs to be changed. There are also two lines with `tt = PromptTemplate` that you could refactor to `prompt_template = PromptTemplate` to increase readability. Other than that, the changes look good to me.
Looks good to me! 🚀
I believe this PR has broken the use of flan-t5 models, which do not have a token limit. (Well, there is one in the tokenizer, see https://huggingface.co/google/flan-t5-xl/blob/main/tokenizer_config.json#L106, but it does NOT limit what the model can actually handle, so automatically using the limit from the tokenizer is wrong for this type of model.) T5 models use a relative attention mechanism and can handle sequences of any length, the only constraint being GPU memory (see google-research/text-to-text-transfer-transformer#273). Now the 512-token limit from the tokenizer is automatically applied to the prompt supplied to the flan-t5 models. Shouldn't there be a way to override this new token limit coming from the tokenizer for scenarios like this?
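To illustrate the concern, a quick check of the tokenizer config referenced above (using `transformers`; the checkpoint name comes from the link):

```python
# The flan-t5 tokenizer reports a 512-token model_max_length, but T5's relative
# position attention means the model itself is not hard-limited to 512 tokens;
# GPU memory is the practical constraint.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")
print(tokenizer.model_max_length)  # 512, per the tokenizer_config.json linked above
```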
Related Issues
Proposed Changes:
- `retry_with_exponential_backoff` has been moved to one location. It wraps the new util function `openai_request`, which makes the request to the OpenAI API and handles raising appropriate errors and retries. Tested locally to make sure the retry mechanism still works.
- `MODEL_TO_ENCODING`, added in `tiktoken==0.2.0`, is used so we can automatically look up the correct tokenizer for the requested model.
- Prompts are truncated so they stay within the model's `max_token_limit` (see the sketch after this list). I considered using the solution in the `OpenAIAnswerGenerator`, since it removes documents from the context until it fits within the max token length, which I think is a better solution for Retrieval Augmented QA. However, for the `PromptNode` it is not easily possible to determine the use case at model invocation time, so I opted to truncate the end of the prompt instead and log a warning to the user.
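A minimal sketch of the truncate-and-warn behaviour described in the last bullet; the function name `ensure_token_limit`, its signature, and the exact log wording are illustrative assumptions rather than the actual `PromptNode` implementation:

```python
# Illustrative sketch: truncate the prompt so prompt + expected answer fit
# within the model's token limit, and warn the user when truncation happens.
import logging
from typing import List

logger = logging.getLogger(__name__)


def ensure_token_limit(prompt_tokens: List[int], max_tokens_limit: int, answer_length: int) -> List[int]:
    allowed_length = max_tokens_limit - answer_length
    if len(prompt_tokens) <= allowed_length:
        return prompt_tokens
    logger.warning(
        "The prompt has been truncated from %s tokens to %s tokens so that the prompt plus the "
        "expected answer (%s tokens) fits within the model's limit of %s tokens.",
        len(prompt_tokens), allowed_length, answer_length, max_tokens_limit,
    )
    return prompt_tokens[:allowed_length]
```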
How did you test it?
Notes for the reviewer
Checklist
- The PR title starts with one of the conventional commit prefixes: `fix:`, `feat:`, `build:`, `chore:`, `ci:`, `docs:`, `style:`, `refactor:`, `perf:`, `test:`.