It seems that when the prompt truncation PR for OpenAI (#4179) was merged, it became possible for truncation to cut into the prompt itself: if the maximum length specified in the PromptNode definition is equal to or larger than the model's own max sequence length, there is no token budget left for the prompt. This hits models with a comparatively low token limit, such as the flan-T5 models, particularly hard: the prompt ends up empty and the documents are truncated as well.
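Roughly, the budget arithmetic looks like this (the names below are illustrative, not the actual implementation):

model_max_length = 512       # flan-t5-xl's max sequence length, as reported by its tokenizer
answer_max_length = 512      # max_length passed to the PromptNode
prompt_budget = model_max_length - answer_max_length
print(prompt_budget)         # 0 -- no tokens left for the prompt template, documents and query

With max_length equal to or above the model limit, everything before the answer gets truncated away.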
To Reproduce
from haystack.nodes import EmbeddingRetriever, PromptNode, PromptTemplate
from haystack.document_stores import WeaviateDocumentStore
from haystack.pipelines import Pipeline
import sys

document_store = WeaviateDocumentStore(similarity="cosine", embedding_dim=768)

lfqa_prompt = PromptTemplate(
    name="lfqa",
    prompt_text="Generate a comprehensive, summarized answer to the given question using the provided paragraphs and reply as if you are Yoda. \n\n Paragraphs: $documents \n\n Question: $query \n\n Answer:",
)

retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="flax-sentence-embeddings/all_datasets_v3_mpnet-base",
    model_format="sentence_transformers",
    top_k=20,
)

prompt_node = PromptNode(
    model_name_or_path="google/flan-t5-xl",
    default_prompt_template=lfqa_prompt,
    top_k=9,
    max_length=512,  # equal to flan-t5-xl's max sequence length, which triggers the truncation
)

pipe = Pipeline()
pipe.add_node(component=retriever, name="retriever", inputs=["Query"])
pipe.add_node(component=prompt_node, name="prompt_node", inputs=["retriever"])

output = pipe.run(query=sys.argv[1])

full_sentence = list(filter(lambda x: x.endswith("."), output["results"]))
longest_output = max(full_sentence, key=len)
print("**Answer:** " + longest_output)
I think we should not truncate the prompt to preserve the answer space, but do the opposite, and warn the user:
"Your prompt uses Z tokens, this LLM has a max seq. length of X tokens, and you requested an answer space of Y tokens, which exceeds the model limit by I tokens. You may get no answer. Please reduce the size or number of your documents, and/or your requested max answer size."
Truncating in the current direction means reserving space for an answer while disregarding the prompt, the documents, and the query itself. An LLM without a prompt is unusable, and the consequence is either no answer at all or a randomly generated answer (noise).
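A minimal sketch of the check I have in mind (the function and variable names are hypothetical, not the actual Haystack API):

def answer_token_budget(prompt_tokens: int, requested_answer_tokens: int, model_max_tokens: int) -> int:
    """Keep the prompt intact and shrink the answer budget instead, warning the user."""
    available = model_max_tokens - prompt_tokens
    overflow = requested_answer_tokens - available
    if overflow > 0:
        print(
            f"Your prompt uses {prompt_tokens} tokens, this LLM has a max seq. length of {model_max_tokens} tokens, "
            f"and you requested an answer space of {requested_answer_tokens} tokens, which exceeds the model limit "
            f"by {overflow} tokens. You may get no answer. Please reduce the size or number of your documents, "
            f"and/or your requested max answer size."
        )
    return max(available, 0)

In other words: warn, reserve whatever is left for the answer, and never empty the prompt silently.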
Hey everyone, @zoltan-fedor also brought up a really good point in this issue #4388. For at least the flan-T5 models it doesn't seem to make sense to have this token limit enforced. Check out this comment from the linked issue for more info.
Yes. The FLAN limit set by the tokenizer is not really a hard limit, because of the attention mechanism. There is a post about this on the HF forum.
It is just the common practical limit, since memory usage grows steeply as sequences get longer.
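For reference, the reported limit can be inspected directly; this is just a quick check with transformers (assuming the value comes from the tokenizer config):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/flan-t5-xl")
print(tok.model_max_length)  # typically 512 -- a tokenizer default, not a hard architectural cap (T5 uses relative position embeddings)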
@sjrl But anyway, even if we don't enforce the recommended value (which fits most scenarios), there are other models and use cases where this matters, e.g. even for OpenAI if we set the answer length to a value like 4000 tokens. So it may be advisable to change the way we warn users, or the warning message itself.
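For example, assuming a model context of roughly 4096 tokens, requesting an answer length of 4000 leaves fewer than 100 tokens for the prompt template, documents and query combined, so the prompt would be silently cut down there as well.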
The bug was discovered by @recrudesce, @rolandtannous and @danielbichuetti.