Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: document_loaders classification #4069

Merged
merged 6 commits into from
May 14, 2023
Merged

docs: document_loaders classification #4069

merged 6 commits into from
May 14, 2023

Conversation

leo-gan
Copy link
Collaborator

@leo-gan leo-gan commented May 3, 2023

Problem statement: the document_loaders section is too long and hard to comprehend.
Proposal: group document_loaders by 3 classes: (see Files changed tab)

UPDATE: I've completely reworked the document_loader classification.
Now this PR changes only one file!

FYI @eyurtsev @hwchase17

@leo-gan leo-gan marked this pull request as ready for review May 3, 2023 21:54
Copy link
Contributor

@hwchase17 hwchase17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm in theory i like this, but the distinction seems a bit blurred. for example, why is Google Drive a formatter and not a knowledge document loader

@@ -37,7 +37,7 @@
"# This is from https://langchain.readthedocs.io/en/latest/modules/document_loaders/examples/csv.html\n",
"\n",
"from langchain.document_loaders.csv_loader import CSVLoader\n",
"loader = CSVLoader(file_path='../../document_loaders/examples/example_data/mlb_teams_2012.csv')\n",
"loader = CSVLoader(file_path='../../document_loaders/examples/../example_data/mlb_teams_2012.csv')\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

weird pathing

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OOps. My bad.

@@ -25,7 +25,7 @@
"metadata": {},
"outputs": [],
"source": [
"loader = NotebookLoader(\"example_data/notebook.ipynb\")"
"loader = NotebookLoader(\"../../../example_data/notebook.ipynb\")"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this shouldnt really need to change?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. We don't need to change test data. Thanks.

hmmm in theory i like this, but the distinction seems a bit blurred. for example, why is Google Drive a formatter and not a knowledge document loader

I'll add more description to this.
The idea is "knowledge loader" works with storage that we do not control. Something that can be used as a "tool" (in terms of LangChain). That can be accessed with queries. Something that can be considered as a source of "external" knowledge. We can allow LLM to make queries and get information or we can download documents and use them in a more controllable way.
"Formatters" can be as easy as transformers for CSV, SQL, etc. But they also can be cloud services or app stores. They can be hosted out of our control but the information inside is under our control.

@leo-gan leo-gan marked this pull request as draft May 4, 2023 03:19
@leo-gan leo-gan marked this pull request as ready for review May 4, 2023 03:36
@hwchase17
Copy link
Contributor

The idea is "knowledge loader" works with storage that we do not control. Something that can be used as a "tool" (in terms of LangChain). That can be accessed with queries. Something that can be considered as a source of "external" knowledge. We can allow LLM to make queries and get information or we can download documents and use them in a more controllable way.
"Formatters" can be as easy as transformers for CSV, SQL, etc. But they also can be cloud services or app stores. They can be hosted out of our control but the information inside is under our control

hmm i think formatter and i think csv or word... but not like google drive. like google drive could have csv files in it

i would be down to split out the ones which related to a certain file type. eg csv/pdf/ppt/etc. and then other ones could load in from various locations (eg from drive or website etc) and use formatters under the hood

this may be related to some of the stuff @eyurtsev is working on?

@hwchase17
Copy link
Contributor

How about splitting it into 3 classes?

  • formatters: CSV, PDF, ...
  • controllable sources: Google Drive, Microsoft Word, Facebook Chat, ...
  • external sources: Guttenberg, iFixit, ...
    I still don't like the class names. That means the mental picture is not good

what is definition of those categories? eg why is microsoft word (.docx) not a format?

@eyurtsev
Copy link
Collaborator

eyurtsev commented May 5, 2023

Hello @leo-gan 👋

Thanks for helping with the docs!

I am slowly making changes to implement the plan that's outlined here: #2833 (comment)

The high level is to decouple the code that loads raw data (bytes) from the code that parses the raw data to generate documents.

It'll still be possible to define arbitrary document loaders, but it'll also become easier to re-use existing parsers in a document loader (or even existing blob loaders). Not sure that this would change the documentation much.

@leo-gan
Copy link
Collaborator Author

leo-gan commented May 8, 2023

How about splitting it into 3 classes?

  • formatters: CSV, PDF, ...
  • controllable sources: Google Drive, Microsoft Word, Facebook Chat, ...
  • external sources: Guttenberg, iFixit, ...
    I still don't like the class names. That means the mental picture is not good

what is definition of those categories? eg why is microsoft word (.docx) not a format?

@hwchase17 I've completely reworked the document_loader classification. Please, check it out.
One good side effect: Now this PR changes only one file.

@leo-gan leo-gan requested a review from hwchase17 May 9, 2023 03:59
@leo-gan leo-gan changed the title Docs: split document_loaders docs: split document_loaders May 11, 2023
@leo-gan leo-gan changed the title docs: split document_loaders docs: document_loaders classification May 12, 2023
@leo-gan
Copy link
Collaborator Author

leo-gan commented May 12, 2023

@hwchase17 any comments? If you are busy, maybe @dev2049 can help? TNX

Copy link
Contributor

@hwchase17 hwchase17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome - thanks!

@hwchase17 hwchase17 merged commit 3ce78ef into langchain-ai:master May 14, 2023
@leo-gan leo-gan deleted the docs_split_doc_loaders branch May 15, 2023 04:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants