Pydantic Validation Error when loading an AmazonTextractPDF from an S3 #26781

Closed
5 tasks done
ElReyZero opened this issue Sep 23, 2024 · 0 comments · Fixed by #26787
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature investigate

ElReyZero commented Sep 23, 2024

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

The following code causes the exception:

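import boto3
from langchain_community.document_loaders import AmazonTextractPDFLoader

# Placeholder values; substitute a real bucket name and object key.
bucket = "my-bucket"
file_name = "document.pdf"

session = boto3.Session()
client = session.client('textract', region_name='us-east-1')
s3_path = f"s3://{bucket}/{file_name}"
loader = AmazonTextractPDFLoader(s3_path, client=client)
pdf_pages = loader.load()  # raises the ValidationError shown below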

Error Message and Stack Trace (if applicable)

1 validation error for Blob
Traceback (most recent call last):
  File "/home/lumu/Documents/Projects/trv/Document-Processing-Backend/modules/process_document.py", line 96, in process_document
    doc_list, doc_texts = load_files_ensemble(document.doc_name)
  File "/home/lumu/Documents/Projects/trv/Document-Processing-Backend/modules/ocr/textract/ensemble.py", line 83, in load_files_ensemble
    return load_and_split(file_name)
  File "/home/lumu/Documents/Projects/trv/Document-Processing-Backend/modules/ocr/textract/ensemble.py", line 32, in load_and_split
    pdf_pages = loader.load()
  File "/home/lumu/.local/lib/python3.10/site-packages/langchain_community/document_loaders/pdf.py", line 747, in load
    return list(self.lazy_load())
  File "/home/lumu/.local/lib/python3.10/site-packages/langchain_community/document_loaders/pdf.py", line 758, in lazy_load
    blob = Blob(path=self.web_path) # type: ignore[call-arg] # type: ignore[misc]
  File "/home/lumu/.local/lib/python3.10/site-packages/langchain_core/load/serializable.py", line 112, in __init__
    super().__init__(*args, **kwargs)
  File "/home/lumu/.local/lib/python3.10/site-packages/pydantic/main.py", line 212, in __init__
    validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)
pydantic_core._pydantic_core.ValidationError: 1 validation error for Blob
data
  Field required [type=missing, input_value={'path': 's3://realviewde...77_Walter_Hardwick.pdf'}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.9/v/missing

Description

I'm trying to load a document with the Textract PDF loader, but a bug prevents it from loading. I've confirmed that the document exists in the S3 bucket; the error comes from a Pydantic validation failure, not from a missing document.
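
To illustrate the kind of failure in the stack trace, here is a minimal sketch that reproduces a "Field required [type=missing]" error in isolation. It assumes Pydantic v2. `BlobLike` and `missing_fields` are hypothetical names made up for this example; `BlobLike` is not the LangChain `Blob` class, and its field definitions are an assumption about how such an error can arise, not a copy of the library's code.

```python
from typing import Optional, Union

from pydantic import BaseModel, ValidationError


class BlobLike(BaseModel):
    """Hypothetical stand-in for illustration; not the LangChain Blob class."""

    # Under Pydantic v2, a field whose type allows None but that has no
    # default value is still required.
    data: Union[bytes, str, None]
    path: Optional[str] = None


def missing_fields(**kwargs):
    """Return the names of fields reported as missing, or [] on success."""
    try:
        BlobLike(**kwargs)
    except ValidationError as exc:
        return [err["loc"][0] for err in exc.errors() if err["type"] == "missing"]
    return []


# Passing only a path mirrors the Blob(path=...) call seen in the traceback.
only_path = missing_fields(path="s3://example-bucket/example.pdf")
# Supplying data explicitly (even as None) satisfies the required field.
with_data = missing_fields(data=None, path="s3://example-bucket/example.pdf")
print(only_path, with_data)
```

Under these assumptions, the first call should report `data` as missing while the second validates. Whether the same mechanism is responsible for the failure in `AmazonTextractPDFLoader` is not confirmed by this sketch.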

System Info

System Information

OS: Linux
OS Version: #41~22.04.2-Ubuntu SMP PREEMPT_DYNAMIC Mon Jun 3 11:32:55 UTC 2
Python Version: 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0]

Package Information

langchain_core: 0.3.0
langchain: 0.3.0
langchain_community: 0.3.0
langsmith: 0.1.125
langchain_cohere: 0.3.0
langchain_experimental: 0.3.0
langchain_google_genai: 1.0.8
langchain_google_vertexai: 2.0.1
langchain_openai: 0.2.0
langchain_postgres: 0.0.9
langchain_text_splitters: 0.3.0

Optional packages not installed

langgraph
langserve

Other Dependencies

aiohttp: 3.9.5
anthropic[vertexai]: Installed. No version info available.
async-timeout: 4.0.3
cohere: 5.6.2
dataclasses-json: 0.6.3
google-cloud-aiplatform: 1.67.1
google-cloud-storage: 2.18.2
google-generativeai: 0.7.2
httpx: 0.27.0
httpx-sse: 0.4.0
jsonpatch: 1.33
langchain-mistralai: Installed. No version info available.
numpy: 1.26.4
openai: 1.47.1
orjson: 3.10.6
packaging: 23.2
pandas: 2.1.4
pgvector: 0.2.5
pillow: 10.2.0
psycopg: 3.2.1
psycopg-pool: 3.2.2
pydantic: 2.9.2
pydantic-settings: 2.5.2
PyYAML: 5.4.1
requests: 2.31.0
sqlalchemy: 2.0.12
SQLAlchemy: 2.0.12
tabulate: 0.9.0
tenacity: 8.2.3
tiktoken: 0.7.0
typing-extensions: 4.12.2

@langcarl langcarl bot added the investigate label Sep 23, 2024
@dosubot dosubot bot added the 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature label Sep 23, 2024
ccurme added a commit that referenced this issue Sep 23, 2024
Sheepsta300 pushed a commit to Sheepsta300/langchain that referenced this issue Oct 1, 2024