docs: document_loaders classification #4069

leo-gan · 2023-05-03T21:09:17Z

Problem statement: the document_loaders section is too long and hard to comprehend.
Proposal: group document_loaders into 3 classes (see the Files changed tab).

UPDATE: I've completely reworked the document_loader classification.
Now this PR changes only one file!

FYI @eyurtsev @hwchase17

hwchase17

hmmm in theory i like this, but the distinction seems a bit blurred. for example, why is Google Drive a formatter and not a knowledge document loader

hwchase17 · 2023-05-03T23:01:30Z

docs/modules/indexes/retrievers/examples/chatgpt-plugin-retriever.ipynb

@@ -37,7 +37,7 @@
    "# This is from https://langchain.readthedocs.io/en/latest/modules/document_loaders/examples/csv.html\n",
    "\n",
    "from langchain.document_loaders.csv_loader import CSVLoader\n",
-    "loader = CSVLoader(file_path='../../document_loaders/examples/example_data/mlb_teams_2012.csv')\n",
+    "loader = CSVLoader(file_path='../../document_loaders/examples/../example_data/mlb_teams_2012.csv')\n",


weird pathing

Oops. My bad.
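
For readers wondering why the changed path reads as "weird": the `examples/../` segment cancels itself out. A minimal sketch using only the standard library (the path string is copied from the diff above; nothing here touches the filesystem or LangChain itself):

```python
import posixpath

# Path exactly as it appears on the "+" line of the diff above.
changed = "../../document_loaders/examples/../example_data/mlb_teams_2012.csv"

# normpath collapses "examples/.." lexically, without touching the filesystem.
normalized = posixpath.normpath(changed)
print(normalized)
# ../../document_loaders/example_data/mlb_teams_2012.csv
```

The collapse is purely textual, so it can differ from what the filesystem would resolve if `examples` were a symlink.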

hwchase17 · 2023-05-03T23:02:20Z

docs/modules/indexes/document_loaders/examples/example_data/notebook.ipynb

@@ -25,7 +25,7 @@
   "metadata": {},
   "outputs": [],
   "source": [
-    "loader = NotebookLoader(\"example_data/notebook.ipynb\")"
+    "loader = NotebookLoader(\"../../../example_data/notebook.ipynb\")"


this shouldn't really need to change?

No. We don't need to change test data. Thanks.

> hmmm in theory i like this, but the distinction seems a bit blurred. for example, why is Google Drive a formatter and not a knowledge document loader

I'll add more description to this.
The idea is that a "knowledge loader" works with storage that we do not control: something that can be accessed with queries, used as a "tool" (in LangChain terms), and treated as a source of "external" knowledge. We can let the LLM make queries and get information, or we can download the documents and use them in a more controlled way.
"Formatters" can be as simple as transformers for CSV, SQL, etc., but they can also be cloud services or app stores. They may be hosted outside our control, but the information inside them is under our control.

hwchase17 · 2023-05-04T05:32:14Z

> The idea is that a "knowledge loader" works with storage that we do not control: something that can be accessed with queries, used as a "tool" (in LangChain terms), and treated as a source of "external" knowledge. We can let the LLM make queries and get information, or we can download the documents and use them in a more controlled way.
> "Formatters" can be as simple as transformers for CSV, SQL, etc., but they can also be cloud services or app stores. They may be hosted outside our control, but the information inside them is under our control.

hmm, when i think formatter i think csv or word... but not google drive. google drive could have csv files in it

i would be down to split out the ones which relate to a certain file type, eg csv/pdf/ppt/etc. and then the other ones could load from various locations (eg from drive or a website etc) and use formatters under the hood

this may be related to some of the stuff @eyurtsev is working on?
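
As an illustration of the split suggested above, here is a minimal sketch in which a location-based loader picks a per-file-type formatter by extension. All names (`Formatter`, `FORMATTERS`, `load_from_source`, `fake_drive`) are hypothetical and are not LangChain APIs; the "drive" is just an in-memory dict.

```python
import csv
import io
from pathlib import PurePosixPath
from typing import Callable, Dict, List

# Hypothetical formatter signature: raw bytes in, list of text chunks out.
Formatter = Callable[[bytes], List[str]]


def format_csv(raw: bytes) -> List[str]:
    """Turn each CSV row into one 'column: value' text chunk."""
    reader = csv.DictReader(io.StringIO(raw.decode("utf-8")))
    return ["\n".join(f"{k}: {v}" for k, v in row.items()) for row in reader]


def format_txt(raw: bytes) -> List[str]:
    """Treat a plain-text file as a single chunk."""
    return [raw.decode("utf-8")]


# File-type formatters, keyed by extension.
FORMATTERS: Dict[str, Formatter] = {".csv": format_csv, ".txt": format_txt}


def load_from_source(files: Dict[str, bytes]) -> List[str]:
    """Stand-in for a location loader (a drive, a website, ...)."""
    chunks: List[str] = []
    for name, raw in files.items():
        formatter = FORMATTERS.get(PurePosixPath(name).suffix.lower())
        if formatter is None:
            continue  # unknown file type: skipped in this sketch
        chunks.extend(formatter(raw))
    return chunks


# An in-memory "drive" that happens to contain a CSV file and a text file.
fake_drive = {"teams.csv": b"Team,Wins\nNationals,98\n", "notes.txt": b"hello"}
docs = load_from_source(fake_drive)
```

With this shape, the location only decides where bytes come from, and the file type alone decides how they are turned into text.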

hwchase17 · 2023-05-05T04:48:38Z

> How about splitting it into 3 classes?
>
> formatters: CSV, PDF, ...
>
> controllable sources: Google Drive, Microsoft Word, Facebook Chat, ...
>
> external sources: Gutenberg, iFixit, ...
>
> I still don't like the class names. That means the mental picture is not good

what is the definition of those categories? eg why is microsoft word (.docx) not a format?

eyurtsev · 2023-05-05T16:22:26Z

Hello @leo-gan 👋

Thanks for helping with the docs!

I am slowly making changes to implement the plan that's outlined here: #2833 (comment)

The high-level idea is to decouple the code that loads raw data (bytes) from the code that parses the raw data to generate documents.

It'll still be possible to define arbitrary document loaders, but it'll also become easier to re-use existing parsers in a document loader (or even existing blob loaders). Not sure that this would change the documentation much.
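
To make the decoupling concrete, here is a rough sketch, not the actual implementation planned in #2833. The names `RawBlob`, `Doc`, `blobs_from_memory`, `blobs_from_directory`, `parse_lines`, and `load` are all hypothetical. The point is only that one parser can be reused with any source of raw bytes.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict, Iterable, Iterator, List


@dataclass
class RawBlob:
    """Raw bytes plus where they came from (hypothetical type)."""
    source: str
    data: bytes


@dataclass
class Doc:
    """A parsed document (hypothetical type)."""
    text: str
    source: str


def blobs_from_memory(items: Dict[str, bytes]) -> Iterator[RawBlob]:
    """Blob loader #1: yields blobs from an in-memory mapping."""
    for source, data in items.items():
        yield RawBlob(source=source, data=data)


def blobs_from_directory(root: Path, pattern: str = "*.txt") -> Iterator[RawBlob]:
    """Blob loader #2: yields blobs from files under a directory."""
    for path in sorted(root.glob(pattern)):
        yield RawBlob(source=str(path), data=path.read_bytes())


def parse_lines(blob: RawBlob) -> Iterator[Doc]:
    """Parser: one document per non-empty line, independent of the blob's origin."""
    for line in blob.data.decode("utf-8").splitlines():
        if line.strip():
            yield Doc(text=line.strip(), source=blob.source)


def load(blobs: Iterable[RawBlob], parser: Callable[[RawBlob], Iterator[Doc]]) -> List[Doc]:
    """A 'document loader' is then just a blob source composed with a parser."""
    return [doc for blob in blobs for doc in parser(blob)]


docs = load(blobs_from_memory({"memo": b"first\n\nsecond\n"}), parse_lines)
```

Swapping `blobs_from_memory(...)` for `blobs_from_directory(Path("some_dir"))` would reuse `parse_lines` unchanged, which is the kind of reuse described above.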

leo-gan · 2023-05-08T15:15:58Z

> > How about splitting it into 3 classes?
> >
> > formatters: CSV, PDF, ...
> >
> > controllable sources: Google Drive, Microsoft Word, Facebook Chat, ...
> >
> > external sources: Gutenberg, iFixit, ...
> >
> > I still don't like the class names. That means the mental picture is not good
>
> what is the definition of those categories? eg why is microsoft word (.docx) not a format?

@hwchase17 I've completely reworked the document_loader classification. Please check it out.
One good side effect: now this PR changes only one file.

leo-gan · 2023-05-12T17:19:42Z

@hwchase17 any comments? If you are busy, maybe @dev2049 can help? TNX

hwchase17

awesome - thanks!

leo-gan marked this pull request as ready for review May 3, 2023 21:54

hwchase17 reviewed May 3, 2023


leo-gan marked this pull request as draft May 4, 2023 03:19

leo-gan marked this pull request as ready for review May 4, 2023 03:36

leo-gan added 6 commits May 5, 2023 19:34

updated document_loaders.rst

d6775c7

fixed indent

c728b7a

fixed duplicated folders

958c218

improved text

67c9d21

small text fixes

a3078c8

renamed files

3bd7b41

leo-gan requested a review from hwchase17 May 9, 2023 03:59

leo-gan changed the title ~~Docs: split document_loaders~~ docs: split document_loaders May 11, 2023

leo-gan changed the title ~~docs: split document_loaders~~ docs: document_loaders classification May 12, 2023

hwchase17 approved these changes May 14, 2023


hwchase17 merged commit 3ce78ef into langchain-ai:master May 14, 2023

leo-gan deleted the docs_split_doc_loaders branch May 15, 2023 04:34
