Error when importing larger PDF files #50

Closed
duob-ai opened this issue Nov 22, 2023 · 9 comments
Labels
bug Something isn't working

Comments

@duob-ai

duob-ai commented Nov 22, 2023

When trying to import larger PDF files (tested with upwards of 20 pages) using ADAEmbedder, I'm seeing the following error in the frontend and in the console.

However, the embeddings somehow seem to be generated, since asking questions about that context works. But the uploaded document isn't showing in the frontend under the Documents section.

[screenshots: frontend error message and console output]
@thomashacker
Collaborator

Interesting! Thanks for the issue. It looks like the chunks are too big for OpenAI's ADA model (max 8192 tokens, but somehow one chunk seems to be 23941 tokens). What are your chunking settings?
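
For reference, here's one way to check whether any chunk exceeds that 8192-token limit before embedding. This is a minimal illustrative sketch using OpenAI's tiktoken tokenizer, not Verba's actual code:

```python
# Illustrative sketch (not Verba's code): count tokens per chunk with
# the tokenizer used by text-embedding-ada-002 and flag oversized ones.
import tiktoken

enc = tiktoken.encoding_for_model("text-embedding-ada-002")

def oversized_chunks(chunks: list[str], limit: int = 8192) -> list[tuple[int, int]]:
    """Return (chunk_index, token_count) for every chunk above the limit."""
    counts = [len(enc.encode(chunk)) for chunk in chunks]
    return [(i, n) for i, n in enumerate(counts) if n > limit]
```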

thomashacker added the bug (Something isn't working) label Nov 24, 2023
@duob-ai
Author

duob-ai commented Nov 24, 2023

@thomashacker Chunking settings were 250 tokens with 50 overlap. The error should be easy to replicate by just uploading long PDF files.
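
For context, a 250/50 setting should never produce a chunk anywhere near 8192 tokens. A minimal token-based sliding-window chunker with those settings would look roughly like this (an illustrative sketch, not Verba's implementation):

```python
# Illustrative sketch: sliding-window chunking with 250-token chunks
# and 50-token overlap, so each window advances 200 tokens.
import tiktoken

enc = tiktoken.encoding_for_model("text-embedding-ada-002")

def chunk_text(text: str, size: int = 250, overlap: int = 50) -> list[str]:
    tokens = enc.encode(text)
    step = size - overlap  # 200-token stride between windows
    return [enc.decode(tokens[i:i + size]) for i in range(0, len(tokens), step)]
```

With these settings no chunk can exceed 250 tokens, which supports the point that the oversized payload must come from somewhere other than the chunker.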

@cam-barts
Contributor

@thomashacker I opened #55, which ended up being a duplicate of the issue here. I added some of my investigation info in that issue, but TL;DR: the issue isn't in the chunks, it's in the document upload itself, before the chunking can occur.
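
In other words, the fix this finding points at is stopping Weaviate from auto-vectorizing the Document class. Hypothetically, that could look like the following with the weaviate-client v3 API (the class and property names here are illustrative, not Verba's exact schema):

```python
# Hypothetical sketch: create the Document class with vectorizer "none"
# so the raw document text is stored but never sent to OpenAI for embedding.
import weaviate

client = weaviate.Client("http://localhost:8080")
client.schema.create_class({
    "class": "Document",
    "vectorizer": "none",  # document store only; embeddings live on chunks
    "properties": [{"name": "text", "dataType": ["text"]}],
})
```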

@thomashacker
Collaborator

This is interesting, @cam-barts! Thank you so much for your investigation, this really helped. The Document schema should not be vectorized in the first place, since it only acts as a document store for Verba. This is definitely a bug; I'll look into it and fix it for the next release.

@r0mdau

r0mdau commented Nov 30, 2023

I encounter the same problem when starting Verba in a virtualenv with verba start and connecting to a Weaviate instance running in Docker.

No problem when using embedded Weaviate.

I had exactly the same problem with Verba 0.2.

@micuentadecasa

I have the same issue

@thomashacker
Collaborator

Thanks everyone, I found the issue! The Docker configuration was set to automatically apply the openai vectorizer module to every class, including Documents, which was not intended. I committed the fix to the main branch 🚀 It should now work with larger documents.
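
For anyone curious what that misconfiguration looks like: Weaviate's DEFAULT_VECTORIZER_MODULE environment variable makes every class without an explicit vectorizer use that module, including a class that is only meant as a document store. A hedged docker-compose sketch (the image tag and service layout are illustrative, not Verba's exact file):

```yaml
services:
  weaviate:
    image: semitechnologies/weaviate:1.22.4  # illustrative tag
    environment:
      ENABLE_MODULES: "text2vec-openai"
      # This default applies the vectorizer to every class that doesn't
      # override it, which is how the Document class ended up embedded:
      DEFAULT_VECTORIZER_MODULE: "text2vec-openai"
```

Setting the class vectorizer to "none" in the schema (as sketched above) overrides this default for that class.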

@micuentadecasa

Now it works for me. I had to clean all volumes, images, etc., and then, with a fresh environment, the new code did the trick.
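
For anyone else needing the same fresh start, one way to do it (note this destroys local Weaviate data):

```shell
docker compose down --volumes   # stop containers and delete named volumes
docker system prune -a          # optionally remove unused images too
```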

@thomashacker
Collaborator

That's great to hear!
