
Abstractions for document processing #2833

Closed
wants to merge 17 commits

Conversation

eyurtsev (Collaborator)

WIP

@hwchase17 (Contributor) left a comment

some high-level questions:

  1. Why is the difference between blob and document important? Is it because there is common logic to be factored out for processing? If so, why can't those just be helper functions that transform str -> str or Document -> Document?

  2. If blobs are so great, should lazy_load just always return a generator of blobs?

@eyurtsev (Collaborator, Author) commented Apr 14, 2023

This PR is experimental -- trying to see if introducing another data type (blob) will help reduce complexity and make code more reusable.

is it because there is common logic to be factored out for processing?

See the section below for more details. There's code duplication between the loaders. Some loaders hard-code parsers that should be configurable. And a lot of loaders hard-code the assumption that the file resides on the local file system.

why is the difference between blob and document important?

The current definition of a document is parsed (non-raw) content.

Options:

  1. Leave as is
  2. Extend the definition to allow a document to represent raw data by either reference or value (but this can be ugly for downstream users)
  3. Introduce a blob to represent raw data by either reference or value

If so, why cant those just be helper functions to transform str -> str or Document -> Document?

Raw data isn't a str, but this is doable with option (2)

With option (2), the loaders listed below (s3, azure, directory loader) would yield blobs, allowing one to specify which parser should be used to convert the raw data into a document, and avoiding re-implementation of file handling code.

If blobs are so great, should lazy_load just always return a generator of blobs?

  • If blobs are great, and starting from scratch, then yes
  • For backwards compatibility, may have to allow yielding docs as well
  • Worried that blobs don't make sense all the time; e.g., when loading structured data from a database, it may make sense to sink it directly into a document rather than go through a blob that contains a JSON representation of the data

@hwchase17 Zooming out for context; the 2 main issues I see:

  1. Eager loading of data into memory. If it's 1 file it doesn't matter, but some of the loaders may load a large (and unknown) number of files.

  2. A lot of the generic loaders are coupled to the unstructured parser. If someone needs to change the parser for whatever reason, there's no way to do it except by creating their own parser.

Here are a few loaders that should be generic but hard-code the unstructured parser (and also load content eagerly):

Directory loader
https://github.com/hwchase17/langchain/blob/3c7204d604fe3700f37d406e1f112da710a35864/langchain/document_loaders/directory.py#L26

Azure:
https://github.com/hwchase17/langchain/blob/3c7204d604fe3700f37d406e1f112da710a35864/langchain/document_loaders/azure_blob_storage_container.py#L36-L36

https://github.com/hwchase17/langchain/blob/3c7204d604fe3700f37d406e1f112da710a35864/langchain/document_loaders/azure_blob_storage_file.py#L40-L40

S3:
https://github.com/hwchase17/langchain/blob/3c7204d604fe3700f37d406e1f112da710a35864/langchain/document_loaders/s3_directory.py#L29-L29

But langchain already supports different strategies for loading PDFs -- one just can't use them with the s3 loader or the azure loader, etc.

https://github.com/hwchase17/langchain/blob/3c7204d604fe3700f37d406e1f112da710a35864/langchain/document_loaders/pdf.py#L167-L167

@eyurtsev (Collaborator, Author) commented Apr 14, 2023

Here's an example of the PDF loader implementing utility code to resolve file paths vs. URLs when loading content. But this logic should be common to all file types (not just PDF):

https://github.com/hwchase17/langchain/blob/main/langchain/document_loaders/pdf.py#L37-L37
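A hedged sketch of what that shared helper could look like (the function name and placement are hypothetical, not the actual langchain implementation):

import os
import tempfile
from urllib.parse import urlparse
from urllib.request import urlretrieve


def resolve_to_local_path(file_path: str) -> str:
    """Return a local file path, downloading the content first if given a URL."""
    parsed = urlparse(file_path)
    if parsed.scheme in ("http", "https"):
        # Remote content: download to a temp file and return its path.
        fd, temp_path = tempfile.mkstemp()
        os.close(fd)
        urlretrieve(file_path, temp_path)
        return temp_path
    if not os.path.isfile(file_path):
        raise ValueError(f"File path {file_path} is not a valid file or URL")
    return file_path

Every file-based loader could call something like this once instead of re-implementing the check.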

@eyurtsev (Collaborator, Author) commented Apr 14, 2023

Proposal (Version 1)

  1. Create Blob abstraction with implementation (sketched after this list)
  2. Create BlobLoader abstract interface
  3. Provide built-in implementations of blob loaders for popular file storage mechanisms (e.g., s3, file system, gcs, azure, etc.)
  4. Refactor existing document loaders to use blob loaders when applicable (e.g., the notion / roam loaders can re-use the file system loader)
  5. Introduce lazy loading interface for document loaders
  6. Create abstract interface for parsers
  7. Create common parser implementations (e.g., PDFMiner, PyMuPDF)
  8. Create "routing" parsers; e.g., a parser that decides parsing strategy based on mime-type
  9. Make it easy to compose a BlobLoader with a parser so that folks can re-use code blocks.
  10. Potentially work on document processing abstractions
  11. Create async variations of the above ^ where relevant
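For concreteness, a minimal sketch of the Blob / BlobLoader / parser interfaces from steps 1, 2, and 6 (names follow the proposal; field and method details are assumptions, not final signatures):

import abc
from pathlib import PurePath
from typing import Iterable, Iterator, Optional, Union

from pydantic import BaseModel

from langchain.schema import Document


class Blob(BaseModel):
    """Raw content by value (data) or by reference (path)."""

    data: Optional[bytes] = None                  # in-memory content, if loaded
    mimetype: Optional[str] = None                # e.g., "application/pdf"
    encoding: str = "utf-8"
    path: Optional[Union[str, PurePath]] = None   # location on disk, if any


class BlobLoader(abc.ABC):
    """Fetch raw content from a storage system (file system, s3, gcs, ...)."""

    @abc.abstractmethod
    def yield_blobs(self) -> Iterable[Blob]:
        """Yield blobs one at a time, without materializing everything."""


class BlobParser(abc.ABC):
    """Turn a blob into one or more documents."""

    @abc.abstractmethod
    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        """Parse a blob lazily into documents."""

And a sketch of the "routing" parser from step 8, assuming the interfaces above:

from typing import Mapping


class MimeTypeBasedParser(BlobParser):
    """Route a blob to a registered parser based on its mime-type."""

    def __init__(self, handlers: Mapping[str, BlobParser]) -> None:
        self.handlers = dict(handlers)

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        if blob.mimetype is None or blob.mimetype not in self.handlers:
            raise ValueError(f"No parser registered for mimetype {blob.mimetype}")
        yield from self.handlers[blob.mimetype].lazy_parse(blob)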

Option #2

Instead of introducing a new Blob data type, extend the Document model so it can represent raw data, by either reference or by value.

Option #3

Minimal changes for now. Perhaps only add a lazy_load method that yields documents to the standard Loader interface.

TODO

The current Loader interface also includes a load_and_split method, which may need a lazy variant as well (I would prefer not to split by default on the loader class, and instead pipeline document processors).

@dev2049 (Contributor) left a comment

very cool, have a few noob questions

items = p.rglob(self.glob) if self.recursive else p.glob(self.glob)
for item in items:
    if item.is_file():
        if _is_visible(item.relative_to(p)) or self.load_hidden:
Contributor:

can we flip the order, since checking load_hidden is (ever so slightly) faster?

Collaborator Author:

yeah can update -- this was copied from an existing implementation
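A minimal sketch of the suggested flip, within the loop quoted above (the two conditions are commutative here, so only evaluation cost changes):

# Cheap attribute check first; the relative-path computation only runs
# when load_hidden is False.
if self.load_hidden or _is_visible(item.relative_to(p)):
    ...  # same body as before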


from langchain.docstore.document import Document
from langchain.document_loaders.blob_loaders import Blob
Contributor:

is it weird that base imports from something more nested?

Contributor:

perhaps BlobParser should live in blob_loaders?

Collaborator Author:

Yeah, we can definitely move the BlobParser; we'll need to decide on the most logical place.

One thing I wasn't so sure about: the document loaders right now are basically content fetchers + parsers, and the majority of their work should be focused on parsing, so it felt like a parser abstraction could be accommodated at this level of the hierarchy. I'll start diffing smaller, more careful changes now, and we can iron out all the namespaces to make sure everything is in the most logical place.

base aka typedefs ideally don't import much at all (except for other typedefs), so there's that...

IMO it's generally OK to import more nested code, since the nested code is owned by the importing code. The dangerous imports are those that reach into parent or sibling paths, since that means we're reaching into nested code that doesn't belong to the given module/package.

self,
path: str,
glob: str = "**/[!.]*",
*,
Contributor:

what're pros of including this?

Collaborator Author:

this being * or the default values?

Contributor:

*
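For context on what the bare * buys: every parameter after it becomes keyword-only, so call sites must name the flags and new options can be added without breaking positional callers. A hypothetical illustration:

class FileSystemLoader:
    def __init__(self, path: str, glob: str = "**/[!.]*", *, load_hidden: bool = False) -> None:
        ...

FileSystemLoader("/data", load_hidden=True)   # OK: flag passed by name
# FileSystemLoader("/data", "**/*.md", True)  # TypeError: too many positional arguments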

class FileSystemLoader(BlobLoader):
    """Loading logic for loading documents from a directory."""

    def __init__(
Contributor:

@hwchase17 how do we think about when to use pydantic (and let it handle constructors) vs not

Collaborator Author:

Good catch -- this probably should've been pydantic; I was moving code around carelessly
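A hypothetical pydantic rewrite of the constructor above (field names are taken from the diff; whether BlobLoader can mix cleanly with BaseModel is an assumption about the class hierarchy):

from pydantic import BaseModel


class FileSystemLoader(BlobLoader, BaseModel):
    """Pydantic variant: the constructor and validation come for free."""

    path: str
    glob: str = "**/[!.]*"
    load_hidden: bool = False
    recursive: bool = False

One side effect: pydantic (v1) constructor arguments are keyword-only anyway, which would make the earlier question about the bare * moot.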


def load(self) -> List[Document]:
    """Load all documents eagerly."""
    return list(self.lazy_load())
Contributor:

could this be implemented on BaseLoader?

Contributor:

maybe not super clean, since if users don't want to implement lazy_load they need to override load

Collaborator Author:

Could do either way -- one possibility is another base class that provides a default implementation
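A sketch of that possibility (the class name is hypothetical): subclasses implement only lazy_load, and load gets a default eager implementation.

from abc import ABC, abstractmethod
from typing import Iterator, List

from langchain.schema import Document


class LazyLoaderBase(ABC):
    @abstractmethod
    def lazy_load(self) -> Iterator[Document]:
        """Yield documents one at a time."""

    def load(self) -> List[Document]:
        """Load all documents eagerly by draining the lazy iterator."""
        return list(self.lazy_load())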

* A blob parser provides a way to parse a blob into one or more documents
"""

@abc.abstractmethod
Contributor:

@abstractmethod


import abc
Contributor:

del

loader_kwargs: Keyword arguments to pass to loader_cls.
recursive: If True, will recursively load files.
"""
self.loader = FileSystemLoader(
Contributor:

nit: would something like blob_loader or file_loader be a clearer name?

from langchain.schema import Document


class BaseDocumentProcessor(ABC):
Contributor:

we now have a DocumentTransformer abstraction that seems very similar; should we try combining them?

Collaborator Author:

probably yes -- will need to take a look at those

these abstractions will go in last in the sequencing order

"""Initialize with path."""
self.file_path = path
self.loader = FileSystemLoader(path, glob="**/*.md")
Contributor:

what're your thoughts on having BlobLoaders as instance vars versus inheriting from them? e.g. an alternative implementation could be

class RoamLoader(BaseLoader, FileSystemLoader):
  def __init__(self, path):
    super().__init__(path, glob="**/*.md")
    
  def lazy_load(self) -> ...:
    for blob in self.yield_blobs():
      yield ...

i don't have a good way of deciding b/n the two approaches, curious if you do

Collaborator Author:

I would prefer encouraging a compositional pattern, so the loader doesn't end up getting tied to a particular storage system (e.g., the file system).

i.e., if the md files are stored on s3 rather than the local file system, we'd want to swap out the loader instead of implementing another RoamLoader class that's specialized for s3
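A compositional sketch of the same loader, assuming the Blob interfaces sketched earlier (MarkdownParser is a hypothetical stand-in for whatever parser applies): storage and parsing are injected, so moving the files to s3 means swapping only the blob loader argument.

class RoamLoader(BaseLoader):
    def __init__(self, blob_loader: BlobLoader, parser: BlobParser) -> None:
        self.blob_loader = blob_loader
        self.parser = parser

    def lazy_load(self) -> Iterator[Document]:
        for blob in self.blob_loader.yield_blobs():
            yield from self.parser.lazy_parse(blob)


# Local file system today, s3 tomorrow -- only the first argument changes:
# RoamLoader(FileSystemLoader(path, glob="**/*.md"), MarkdownParser())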

# Represent location on the local file system
# Useful for situations where downstream code assumes it must work with file paths
# rather than in-memory content.
path: Optional[PathLike] = None
Contributor:

So if this was pulled from e.g., a URL this would be None? A temp dir?

Collaborator Author:

Undecided -- we could set it to None, set it to the source URL if known, or set it to a temp location on disk if the content was downloaded to the file system rather than stored in memory.

Any opinions?

We could extend it to support URLs, but supporting a driver for loading content feels a bit complex...

e.g., there's more than one way to fetch an HTML file (one with requests and another with something like playwright to execute the js)

Contributor:

I don't have opinions here. You're right, that sounds complex

Returns:
Blob instance
"""
mimetype = mimetypes.guess_type(path)[0] if guess_type else None
Contributor:

ooc: would it make sense to make the mimetype required and use a pattern like

@classmethod
def from_path(
    cls,
    path: Union[str, PurePath],
    *,
    encoding: str = "utf-8",
    mimetype: Optional[str] = None,
) -> "Blob":
    ...
    if mimetype is None:
        mimetype = mimetypes.guess_type(path)[0]
    ...
?

Collaborator Author:

Good idea

@eyurtsev (Collaborator, Author) left a comment

@vowelparrot @dev2049 Thank you for the feedback!



    yield from sub_docs
except Exception as e:
    if self.silent_errors:
        logger.warning(e)
Collaborator Author:

yeah, we should probably update to logger.error, or allow controlling the level
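One hedged sketch of making the level controllable (the error_log_level parameter and helper name are hypothetical):

import logging

logger = logging.getLogger(__name__)


def _handle_load_error(e: Exception, silent_errors: bool,
                       error_log_level: int = logging.WARNING) -> None:
    # The caller picks the severity instead of the loader hard-coding warning.
    if silent_errors:
        logger.log(error_log_level, e)
    else:
        raise e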



eyurtsev added a commit that referenced this pull request Apr 27, 2023
This PR introduces a Blob data type and a Blob loader interface.

This is the first of a sequence of PRs that follows this proposal: 

#2833

The primary goals of these abstractions are:

* Decouple content loading from content parsing code.
* Help deduplicate content loading code across document loaders.
* Make lazy loading a default for langchain.
eyurtsev added a commit that referenced this pull request Apr 27, 2023
Adding a lazy iteration for document loaders.

Following the plan here:
#2833

Keeping the `load` method as is for backwards compatibility. `load`
returns a materialized list of documents, and downstream users may rely on
that fact.

A new method that returns an iterable is introduced for handling lazy
loading.

---------

Co-authored-by: Zander Chase <130414180+vowelparrot@users.noreply.github.com>
vowelparrot pushed a commit that referenced this pull request Apr 28, 2023
vowelparrot added a commit that referenced this pull request Apr 28, 2023
samching pushed a commit to samching/langchain that referenced this pull request May 1, 2023
samching pushed a commit to samching/langchain that referenced this pull request May 1, 2023
eyurtsev added a commit that referenced this pull request May 6, 2023
This PR adds the BlobParser abstraction.

It follows the proposal described here:
#2833 (comment)
@eyurtsev eyurtsev mentioned this pull request May 9, 2023
@eyurtsev (Collaborator, Author):

Closing PR as most of the stuff has already been committed!

@eyurtsev eyurtsev closed this May 23, 2023