
Abstractions for document processing #2833

Closed
wants to merge 17 commits

Conversation

eyurtsev (Collaborator)

WIP

@hwchase17 (Contributor) left a comment

some high-level questions:

  1. Why is the difference between blob and document important? Is it because there is common logic to be factored out for processing? If so, why can't those just be helper functions that transform str -> str or Document -> Document?

  2. If blobs are so great, should lazy_load just always return a generator of blobs?

@eyurtsev (Collaborator, Author) commented Apr 14, 2023

This PR is experimental -- trying to see if introducing another data type (blob) will help reduce complexity and make code more reusable.

is it because there is common logic to be factored out for processing?

See the section below for more details. There's code duplication between the loaders. Some loaders hard-code parsers that should be configurable. And a lot of loaders hard-code the assumption that the file resides on the local file system.

why is the difference between blob and document important?

The current definition of a document is parsed (non-raw) content.

Options:

  1. Leave as is
  2. Extend the definition to allow a document to represent raw data by either reference or value (but this can be ugly for downstream users)
  3. Introduce a blob to represent raw data by either reference or value

If so, why cant those just be helper functions to transform str -> str or Document -> Document?

Raw data isn't a str, but this is doable with option (2)

With option (2), the loaders listed below (s3, azure, directory loader) would yield blobs, allowing one to specify which parser should be used to convert the raw data into a document, and avoiding re-implementation of file handling code.

If blobs are so great, should lazy_load just always return a generator of blobs?

  • If blobs are great, and starting from scratch, then yes
  • For backwards compatibility, may have to allow yielding docs as well
  • Worried that blobs don't make sense all the time; e.g., when loading structured data from a database, it may make sense to sink it directly into a document rather than go through a blob that contains a JSON representation of the data

@hwchase17 Zooming out for context; the 2 main issues I see:

  1. Eager loading of data into memory. If it's 1 file it doesn't matter, but some of the loaders may load a large (and unknown) number of files.

  2. A lot of the generic loaders are coupled to the unstructured parser. If someone needs to change the parser for whatever reason, there's no way to do it except by creating their own parser.

Here are a few loaders that should be generic but hard-code the unstructured parser (and also load content eagerly):

Directory loader
https://github.com/hwchase17/langchain/blob/3c7204d604fe3700f37d406e1f112da710a35864/langchain/document_loaders/directory.py#L26

Azure:
https://github.com/hwchase17/langchain/blob/3c7204d604fe3700f37d406e1f112da710a35864/langchain/document_loaders/azure_blob_storage_container.py#L36-L36

https://github.com/hwchase17/langchain/blob/3c7204d604fe3700f37d406e1f112da710a35864/langchain/document_loaders/azure_blob_storage_file.py#L40-L40

S3:
https://github.com/hwchase17/langchain/blob/3c7204d604fe3700f37d406e1f112da710a35864/langchain/document_loaders/s3_directory.py#L29-L29

But langchain already supports different strategies for loading PDFs -- one just can't use them with the s3 loader or the azure loader, etc.

https://github.com/hwchase17/langchain/blob/3c7204d604fe3700f37d406e1f112da710a35864/langchain/document_loaders/pdf.py#L167-L167

@eyurtsev (Collaborator, Author) commented Apr 14, 2023

Here's an example of the PDF loader implementing utility code to resolve file paths vs. URLs when loading content. But this logic should be common to all file types (not just PDF):

https://github.com/hwchase17/langchain/blob/main/langchain/document_loaders/pdf.py#L37-L37
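A hedged sketch of what that shared helper could look like (the function name and placement are hypothetical, not the actual langchain implementation):

import os
import tempfile
from urllib.parse import urlparse
from urllib.request import urlretrieve


def resolve_to_local_path(file_path: str) -> str:
    """Return a local file path, downloading the content first if given a URL."""
    parsed = urlparse(file_path)
    if parsed.scheme in ("http", "https"):
        # Remote content: download to a temp file and return its path.
        fd, temp_path = tempfile.mkstemp()
        os.close(fd)
        urlretrieve(file_path, temp_path)
        return temp_path
    if not os.path.isfile(file_path):
        raise ValueError(f"File path {file_path} is not a valid file or URL")
    return file_path

Every file-based loader could call something like this once instead of re-implementing the check.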

@eyurtsev (Collaborator, Author) commented Apr 14, 2023

Proposal (Version 1)

  1. Create Blob abstraction with implementation (sketched after this list)
  2. Create BlobLoader abstract interface
  3. Provide built-in implementations of blob loaders for popular file storage mechanisms (e.g., s3, file system, gcs, azure, etc.)
  4. Refactor existing document loaders to use blob loaders when applicable (e.g., the notion / roam loaders can re-use the file system loader)
  5. Introduce lazy loading interface for document loaders
  6. Create abstract interface for parsers
  7. Create common parser implementations (e.g., PDFMiner, PyMuPDF)
  8. Create "routing" parsers; e.g., a parser that decides parsing strategy based on mime-type
  9. Make it easy to compose a BlobLoader with a parser so that folks can re-use code blocks.
  10. Potentially work on document processing abstractions
  11. Create async variations of the above ^ where relevant
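For concreteness, a minimal sketch of the Blob / BlobLoader / parser interfaces from steps 1, 2, and 6 (names follow the proposal; field and method details are assumptions, not final signatures):

import abc
from pathlib import PurePath
from typing import Iterable, Iterator, Optional, Union

from pydantic import BaseModel

from langchain.schema import Document


class Blob(BaseModel):
    """Raw content by value (data) or by reference (path)."""

    data: Optional[bytes] = None                  # in-memory content, if loaded
    mimetype: Optional[str] = None                # e.g., "application/pdf"
    encoding: str = "utf-8"
    path: Optional[Union[str, PurePath]] = None   # location on disk, if any


class BlobLoader(abc.ABC):
    """Fetch raw content from a storage system (file system, s3, gcs, ...)."""

    @abc.abstractmethod
    def yield_blobs(self) -> Iterable[Blob]:
        """Yield blobs one at a time, without materializing everything."""


class BlobParser(abc.ABC):
    """Turn a blob into one or more documents."""

    @abc.abstractmethod
    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        """Parse a blob lazily into documents."""

And a sketch of the "routing" parser from step 8, assuming the interfaces above:

from typing import Mapping


class MimeTypeBasedParser(BlobParser):
    """Route a blob to a registered parser based on its mime-type."""

    def __init__(self, handlers: Mapping[str, BlobParser]) -> None:
        self.handlers = dict(handlers)

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        if blob.mimetype is None or blob.mimetype not in self.handlers:
            raise ValueError(f"No parser registered for mimetype {blob.mimetype}")
        yield from self.handlers[blob.mimetype].lazy_parse(blob)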

Option #2

Instead of introducing a new Blob data type, extend the Document model so it can represent raw data, by either reference or by value.

Option #3

Minimal changes for now. Perhaps only add a lazy_load method that yields documents to the standard Loader interface.

TODO

The current Loader interface also includes a load_and_split method, which may need a lazy variant as well (I would prefer not to split by default on the loader class, and instead pipeline document processors).

@dev2049 (Contributor) left a comment

very cool, have a few noob questions

items = p.rglob(self.glob) if self.recursive else p.glob(self.glob)
for item in items:
    if item.is_file():
        if _is_visible(item.relative_to(p)) or self.load_hidden:
Contributor:

can we flip the order, since checking load_hidden is (ever so slightly) faster?

Collaborator Author:

yeah can update -- this was copied from an existing implementation
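A minimal sketch of the suggested flip, within the loop quoted above (the two conditions are commutative here, so only evaluation cost changes):

# Cheap attribute check first; the relative-path computation only runs
# when load_hidden is False.
if self.load_hidden or _is_visible(item.relative_to(p)):
    ...  # same body as before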


from langchain.docstore.document import Document
from langchain.document_loaders.blob_loaders import Blob
Contributor:

is it weird that base imports from something more nested?

Contributor:

perhaps BlobParser should live in blob_loaders?

Collaborator Author:

Yeah, we can definitely move the BlobParser; we'll need to decide on the most logical place.

One thing I wasn't so sure about: the document loaders right now are basically content fetchers + parsers, and the majority of their work should be focused on parsing, so it felt like a parser abstraction could be accommodated at this level of the hierarchy. I'll start diffing smaller, more careful changes now, and we can iron out all the namespaces to make sure everything is in the most logical place.

base aka typedefs ideally don't import much at all (except for other typedefs), so there's that...

IMO it's generally OK to import more nested code, since the nested code is owned by the importing code. The dangerous imports are those that reach into parent or sibling paths, since that means we're reaching into nested code that doesn't belong to the given module/package.

self,
path: str,
glob: str = "**/[!.]*",
*,
Contributor:

what're pros of including this?

Collaborator Author:

this being * or the default values?

Contributor:

*
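For context on what the bare * buys: every parameter after it becomes keyword-only, so call sites must name the flags and new options can be added without breaking positional callers. A hypothetical illustration:

class FileSystemLoader:
    def __init__(self, path: str, glob: str = "**/[!.]*", *, load_hidden: bool = False) -> None:
        ...

FileSystemLoader("/data", load_hidden=True)   # OK: flag passed by name
# FileSystemLoader("/data", "**/*.md", True)  # TypeError: too many positional arguments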

class FileSystemLoader(BlobLoader):
    """Loading logic for loading documents from a directory."""

    def __init__(
Contributor:

@hwchase17 how do we think about when to use pydantic (and let it handle constructors) vs not

Collaborator Author:

Good catch -- this probably should've been pydantic; I was moving code around carelessly
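A hypothetical pydantic rewrite of the constructor above (field names are taken from the diff; whether BlobLoader can mix cleanly with BaseModel is an assumption about the class hierarchy):

from pydantic import BaseModel


class FileSystemLoader(BlobLoader, BaseModel):
    """Pydantic variant: the constructor and validation come for free."""

    path: str
    glob: str = "**/[!.]*"
    load_hidden: bool = False
    recursive: bool = False

One side effect: pydantic (v1) constructor arguments are keyword-only anyway, which would make the earlier question about the bare * moot.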


def load(self) -> List[Document]:
    """Load all documents eagerly."""
    return list(self.lazy_load())
Contributor:

could this be implemented on BaseLoader?

Contributor:

maybe not super clean, since if users don't want to implement lazy_load they need to override load

Collaborator Author:

Could do either way -- one possibility is another base class that provides a default implementation
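A sketch of that possibility (the class name is hypothetical): subclasses implement only lazy_load, and load gets a default eager implementation.

from abc import ABC, abstractmethod
from typing import Iterator, List

from langchain.schema import Document


class LazyLoaderBase(ABC):
    @abstractmethod
    def lazy_load(self) -> Iterator[Document]:
        """Yield documents one at a time."""

    def load(self) -> List[Document]:
        """Load all documents eagerly by draining the lazy iterator."""
        return list(self.lazy_load())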

* A blob parser provides a way to parse a blob into one or more documents
"""

@abc.abstractmethod
Contributor:

@abstractmethod


import abc
Contributor:

del

loader_kwargs: Keyword arguments to pass to loader_cls.
recursive: If True, will recursively load files.
"""
self.loader = FileSystemLoader(
Contributor:

nit: would something like blob_loader or file_loader be a clearer name?

from langchain.schema import Document


class BaseDocumentProcessor(ABC):
Contributor:

we now have a DocumentTransformer abstraction that seems very similar; should we try combining them?

Collaborator Author:

probably yes -- will need to take a look at those

these abstractions will go in last in the sequencing order

"""Initialize with path."""
self.file_path = path
self.loader = FileSystemLoader(path, glob="**/*.md")
Contributor:

what're your thoughts on having BlobLoaders as instance vars versus inheriting from them? e.g. an alternative implementation could be

class RoamLoader(BaseLoader, FileSystemLoader):
  def __init__(self, path):
    super().__init__(path, glob="**/*.md")
    
  def lazy_load(self) -> ...:
    for blob in self.yield_blobs():
      yield ...

i don't have a good way of deciding b/n the two approaches, curious if you do

Collaborator Author:

I would prefer encouraging a compositional pattern, so the loader doesn't end up getting tied to a particular storage system (e.g., the file system).

i.e., if the md files are stored on s3 rather than the local file system, we'd want to swap out the loader instead of implementing another RoamLoader class that's specialized for s3
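A compositional sketch of the same loader, assuming the Blob interfaces sketched earlier (MarkdownParser is a hypothetical stand-in for whatever parser applies): storage and parsing are injected, so moving the files to s3 means swapping only the blob loader argument.

class RoamLoader(BaseLoader):
    def __init__(self, blob_loader: BlobLoader, parser: BlobParser) -> None:
        self.blob_loader = blob_loader
        self.parser = parser

    def lazy_load(self) -> Iterator[Document]:
        for blob in self.blob_loader.yield_blobs():
            yield from self.parser.lazy_parse(blob)


# Local file system today, s3 tomorrow -- only the first argument changes:
# RoamLoader(FileSystemLoader(path, glob="**/*.md"), MarkdownParser())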

# Represent location on the local file system
# Useful for situations where downstream code assumes it must work with file paths
# rather than in-memory content.
path: Optional[PathLike] = None
Contributor:

So if this was pulled from e.g., a URL this would be None? A temp dir?

Collaborator Author:

Undecided -- we could set it to None, set it to the source URL if known, or set it to a temp location on disk if the content was downloaded to the file system rather than stored in memory.

Any opinions?

We could extend it to support URLs, but supporting a driver for loading content feels a bit complex...

e.g., there's more than one way to fetch an HTML file (one with requests and another with something like playwright to execute the js)

Contributor:

I don't have opinions here. You're right, that sounds complex

Returns:
Blob instance
"""
mimetype = mimetypes.guess_type(path)[0] if guess_type else None
Contributor:

ooc: would it make sense to make the mimetype required and use a pattern like

@classmethod
def from_path(
    cls,
    path: Union[str, PurePath],
    *,
    encoding: str = "utf-8",
    mimetype: Optional[str] = None,
) -> "Blob":
    ...
    if mimetype is None:
        mimetype = mimetypes.guess_type(path)[0]
    ...
?

Collaborator Author:

Good idea

@eyurtsev (Collaborator, Author) left a comment

@vowelparrot @dev2049 Thank you for the feedback!



    yield from sub_docs
except Exception as e:
    if self.silent_errors:
        logger.warning(e)
Collaborator Author:

yeah, we should probably update to logger.error, or allow controlling the level
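One hedged sketch of making the level controllable (the error_log_level parameter and helper name are hypothetical):

import logging

logger = logging.getLogger(__name__)


def _handle_load_error(e: Exception, silent_errors: bool,
                       error_log_level: int = logging.WARNING) -> None:
    # The caller picks the severity instead of the loader hard-coding warning.
    if silent_errors:
        logger.log(error_log_level, e)
    else:
        raise e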



eyurtsev added a commit that referenced this pull request Apr 27, 2023
This PR introduces a Blob data type and a Blob loader interface.

This is the first of a sequence of PRs that follows this proposal: 

#2833

The primary goals of these abstractions are:

* Decouple content loading from content parsing code.
* Help deduplicate content loading code across document loaders.
* Make lazy loading a default for langchain.
eyurtsev added a commit that referenced this pull request Apr 27, 2023
Adding a lazy iteration for document loaders.

Following the plan here:
#2833

Keeping the `load` method as is for backwards compatibility. `load`
returns a materialized list of documents, and downstream users may rely on
that fact.

A new method that returns an iterable is introduced for handling lazy
loading.

---------

Co-authored-by: Zander Chase <130414180+vowelparrot@users.noreply.github.com>
vowelparrot pushed a commit that referenced this pull request Apr 28, 2023
vowelparrot added a commit that referenced this pull request Apr 28, 2023
samching pushed a commit to samching/langchain that referenced this pull request May 1, 2023
samching pushed a commit to samching/langchain that referenced this pull request May 1, 2023
eyurtsev added a commit that referenced this pull request May 6, 2023
This PR adds the BlobParser abstraction.

It follows the proposal described here:
#2833 (comment)
@eyurtsev eyurtsev mentioned this pull request May 9, 2023
@eyurtsev (Collaborator, Author):

Closing PR as most of the stuff has already been committed!

@eyurtsev eyurtsev closed this May 23, 2023