Pydantic Validation Error when loading an AmazonTextractPDF from an S3 #26781

ElReyZero · 2024-09-23T17:18:20Z

Checked other resources

I added a very descriptive title to this issue.
I searched the LangChain documentation with the integrated search.
I used the GitHub search to find a similar question and didn't find it.
I am sure that this is a bug in LangChain rather than my code.
The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

The following code causes the exception:

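import boto3
from langchain_community.document_loaders import AmazonTextractPDFLoader

# Placeholders: replace with your own bucket name and object key.
bucket = "my-bucket"
file_name = "my-document.pdf"

session = boto3.Session()
client = session.client('textract', region_name='us-east-1')
s3_path = f"s3://{bucket}/{file_name}"
loader = AmazonTextractPDFLoader(s3_path, client=client)
pdf_pages = loader.load()  # raises the ValidationError shown below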

Error Message and Stack Trace (if applicable)

1 validation error for Blob
Traceback (most recent call last):
  File "/home/lumu/Documents/Projects/trv/Document-Processing-Backend/modules/process_document.py", line 96, in process_document
    doc_list, doc_texts = load_files_ensemble(document.doc_name)
  File "/home/lumu/Documents/Projects/trv/Document-Processing-Backend/modules/ocr/textract/ensemble.py", line 83, in load_files_ensemble
    return load_and_split(file_name)
  File "/home/lumu/Documents/Projects/trv/Document-Processing-Backend/modules/ocr/textract/ensemble.py", line 32, in load_and_split
    pdf_pages = loader.load()
  File "/home/lumu/.local/lib/python3.10/site-packages/langchain_community/document_loaders/pdf.py", line 747, in load
    return list(self.lazy_load())
  File "/home/lumu/.local/lib/python3.10/site-packages/langchain_community/document_loaders/pdf.py", line 758, in lazy_load
    blob = Blob(path=self.web_path) # type: ignore[call-arg] # type: ignore[misc]
  File "/home/lumu/.local/lib/python3.10/site-packages/langchain_core/load/serializable.py", line 112, in __init__
    super().__init__(*args, **kwargs)
  File "/home/lumu/.local/lib/python3.10/site-packages/pydantic/main.py", line 212, in __init__
    validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)
pydantic_core._pydantic_core.ValidationError: 1 validation error for Blob
data
  Field required [type=missing, input_value={'path': 's3://realviewde...77_Walter_Hardwick.pdf'}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.9/v/missing

Description

I'm trying to load a document from S3 with the Textract PDF loader, but loader.load() fails before the document is read. I have confirmed that the document exists in the S3 bucket, so this is not a "not found" error: the exception is a Pydantic validation error raised when the loader constructs a Blob with only a path, because the Blob's data field is treated as required.
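
For readers who want to see the mechanism in isolation, here is a minimal sketch that reproduces the same "Field required [type=missing]" error without LangChain or AWS. It assumes Pydantic 2.x, where a field annotated as a union including None but given no default is still required. The two model classes, the helper `try_build`, and the bucket and file names are hypothetical stand-ins for illustration, not the actual langchain_core source; the second class only mirrors what the title of the fix ("set default on Blob") suggests.

```python
from typing import Optional, Union

from pydantic import BaseModel, ValidationError


class BlobWithoutDefault(BaseModel):
    """Hypothetical stand-in: `data` allows None but has no default value."""

    data: Union[bytes, str, None]
    path: Optional[str] = None


class BlobWithDefault(BaseModel):
    """Hypothetical stand-in: same fields, with `data` defaulting to None."""

    data: Union[bytes, str, None] = None
    path: Optional[str] = None


def try_build(model_cls, **kwargs):
    """Return (instance, None) on success, or (None, error summary) on failure."""
    try:
        return model_cls(**kwargs), None
    except ValidationError as exc:
        # Keep only the field location and the Pydantic error type.
        return None, [(err["loc"], err["type"]) for err in exc.errors()]


# Placeholder path; any string triggers the same behaviour.
s3_path = "s3://example-bucket/example.pdf"

# Passing only `path` fails when `data` has no default ...
failed, errors = try_build(BlobWithoutDefault, path=s3_path)

# ... and succeeds once `data` defaults to None.
ok, no_errors = try_build(BlobWithDefault, path=s3_path)

print(errors)
print(ok)
```

Under these assumptions, the first call reports a missing `data` field, matching the error in the stack trace, while the second builds an object whose `data` is None and whose `path` is the S3 URI.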

System Info

System Information

OS: Linux
OS Version: #41~22.04.2-Ubuntu SMP PREEMPT_DYNAMIC Mon Jun 3 11:32:55 UTC 2
Python Version: 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0]

Package Information

langchain_core: 0.3.0
langchain: 0.3.0
langchain_community: 0.3.0
langsmith: 0.1.125
langchain_cohere: 0.3.0
langchain_experimental: 0.3.0
langchain_google_genai: 1.0.8
langchain_google_vertexai: 2.0.1
langchain_openai: 0.2.0
langchain_postgres: 0.0.9
langchain_text_splitters: 0.3.0

Optional packages not installed

langgraph
langserve

Other Dependencies

aiohttp: 3.9.5
anthropic[vertexai]: Installed. No version info available.
async-timeout: 4.0.3
cohere: 5.6.2
dataclasses-json: 0.6.3
google-cloud-aiplatform: 1.67.1
google-cloud-storage: 2.18.2
google-generativeai: 0.7.2
httpx: 0.27.0
httpx-sse: 0.4.0
jsonpatch: 1.33
langchain-mistralai: Installed. No version info available.
numpy: 1.26.4
openai: 1.47.1
orjson: 3.10.6
packaging: 23.2
pandas: 2.1.4
pgvector: 0.2.5
pillow: 10.2.0
psycopg: 3.2.1
psycopg-pool: 3.2.2
pydantic: 2.9.2
pydantic-settings: 2.5.2
PyYAML: 5.4.1
requests: 2.31.0
sqlalchemy: 2.0.12
SQLAlchemy: 2.0.12
tabulate: 0.9.0
tenacity: 8.2.3
tiktoken: 0.7.0
typing-extensions: 4.12.2

langcarl bot added the investigate label Sep 23, 2024

dosubot bot added the 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature label Sep 23, 2024

ccurme mentioned this issue Sep 23, 2024

core[patch]: set default on Blob #26787

Merged

ccurme closed this as completed in #26787 Sep 23, 2024

ccurme added a commit that referenced this issue Sep 23, 2024

core[patch]: set default on Blob (#26787)

bba7af9

Resolves #26781

Sheepsta300 pushed a commit to Sheepsta300/langchain that referenced this issue Oct 1, 2024

core[patch]: set default on Blob (langchain-ai#26787)

883ff61

Resolves langchain-ai#26781
