Error when importing larger PDF files #50

Closed
duob-ai opened this issue Nov 22, 2023 · 9 comments
Labels
bug Something isn't working

Comments

@duob-ai

duob-ai commented Nov 22, 2023

When trying to import larger PDF files (tested with upwards of 20 pages) using ADAEmbedder, I'm seeing the following error in the frontend and in the console.

However, the embeddings somehow seem to be generated, since asking questions about that context works. But the uploaded document isn't showing in the frontend under the Documents section.

[screenshots: frontend error message and console output]
@thomashacker
Collaborator

Interesting! Thanks for the issue. It looks like the chunks are too big for OpenAI's ADA model (max 8192 tokens, but somehow one chunk seems to be 23941 tokens). What are your chunking settings?
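
For reference, here's one way to check whether any chunk exceeds that 8192-token limit before embedding. This is a minimal illustrative sketch using OpenAI's tiktoken tokenizer, not Verba's actual code:

```python
# Illustrative sketch (not Verba's code): count tokens per chunk with
# the tokenizer used by text-embedding-ada-002 and flag oversized ones.
import tiktoken

enc = tiktoken.encoding_for_model("text-embedding-ada-002")

def oversized_chunks(chunks: list[str], limit: int = 8192) -> list[tuple[int, int]]:
    """Return (chunk_index, token_count) for every chunk above the limit."""
    counts = [len(enc.encode(chunk)) for chunk in chunks]
    return [(i, n) for i, n in enumerate(counts) if n > limit]
```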

thomashacker added the bug (Something isn't working) label Nov 24, 2023
@duob-ai
Author

duob-ai commented Nov 24, 2023

@thomashacker Chunking settings were 250 tokens with 50 overlap. The error should be easy to replicate by just uploading long PDF files.
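
For context, a 250/50 setting should never produce a chunk anywhere near 8192 tokens. A minimal token-based sliding-window chunker with those settings would look roughly like this (an illustrative sketch, not Verba's implementation):

```python
# Illustrative sketch: sliding-window chunking with 250-token chunks
# and 50-token overlap, so each window advances 200 tokens.
import tiktoken

enc = tiktoken.encoding_for_model("text-embedding-ada-002")

def chunk_text(text: str, size: int = 250, overlap: int = 50) -> list[str]:
    tokens = enc.encode(text)
    step = size - overlap  # 200-token stride between windows
    return [enc.decode(tokens[i:i + size]) for i in range(0, len(tokens), step)]
```

With these settings no chunk can exceed 250 tokens, which supports the point that the oversized payload must come from somewhere other than the chunker.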

@cam-barts
Contributor

@thomashacker I opened #55, which ended up being a duplicate of the issue here. I added some of my investigation info in that issue, but TL;DR: the issue isn't in the chunks, it's in the document upload itself, before the chunking can occur.
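
In other words, the fix this finding points at is stopping Weaviate from auto-vectorizing the Document class. Hypothetically, that could look like the following with the weaviate-client v3 API (the class and property names here are illustrative, not Verba's exact schema):

```python
# Hypothetical sketch: create the Document class with vectorizer "none"
# so the raw document text is stored but never sent to OpenAI for embedding.
import weaviate

client = weaviate.Client("http://localhost:8080")
client.schema.create_class({
    "class": "Document",
    "vectorizer": "none",  # document store only; embeddings live on chunks
    "properties": [{"name": "text", "dataType": ["text"]}],
})
```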

@thomashacker
Collaborator

This is interesting, @cam-barts! Thank you so much for your investigation, this really helped. The Document schema should not be vectorized in the first place, since it only acts as a document store for Verba. This is definitely a bug; I'll look into it and fix it for the next release.

@r0mdau

r0mdau commented Nov 30, 2023

I encounter the same problem when starting Verba in a virtualenv with verba start and connecting to a Weaviate instance running in Docker.

No problem when using embedded Weaviate.

I had exactly the same problem with Verba 0.2.

@micuentadecasa

I have the same issue

@thomashacker
Collaborator

Thanks everyone, I found the issue! The Docker configuration was set to automatically apply the openai vectorizer module to every class, including Documents, which was not intended. I committed the fix to the main branch 🚀 It should now work with larger documents.
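
For anyone curious what that misconfiguration looks like: Weaviate's DEFAULT_VECTORIZER_MODULE environment variable makes every class without an explicit vectorizer use that module, including a class that is only meant as a document store. A hedged docker-compose sketch (the image tag and service layout are illustrative, not Verba's exact file):

```yaml
services:
  weaviate:
    image: semitechnologies/weaviate:1.22.4  # illustrative tag
    environment:
      ENABLE_MODULES: "text2vec-openai"
      # This default applies the vectorizer to every class that doesn't
      # override it, which is how the Document class ended up embedded:
      DEFAULT_VECTORIZER_MODULE: "text2vec-openai"
```

Setting the class vectorizer to "none" in the schema (as sketched above) overrides this default for that class.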

@micuentadecasa

Now it works for me. I had to clean all volumes, images, etc., and then, with a fresh environment, the new code did the trick.
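
For anyone else needing the same fresh start, one way to do it (note this destroys local Weaviate data):

```shell
docker compose down --volumes   # stop containers and delete named volumes
docker system prune -a          # optionally remove unused images too
```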

@thomashacker
Collaborator

That's great to hear!
