
Add SharePoint Loader #4284

Merged
merged 14 commits into langchain-ai:master on Aug 21, 2023
Conversation

netoferraz
Contributor

  • Added a loader (SharePointLoader) that can pull documents (pdf, docx, doc) from a SharePoint Document Library (a usage sketch follows below).
  • Added a base loader (O365BaseLoader) to be used by all loaders that rely on the O365 package.
  • Refactored OneDriveLoader to use the new O365BaseLoader.
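For anyone trying this out, a minimal usage sketch. The module path and parameter names follow this PR's code, but check the merged version; the library ID and folder path are placeholders:

from langchain.document_loaders.sharepoint import SharePointLoader

# Assumes O365_CLIENT_ID and O365_CLIENT_SECRET are set in the environment
# and the Azure app registration has been granted the Files.Read.All scope.
loader = SharePointLoader(
    document_library_id="YOUR_LIBRARY_ID",  # GUID of the document library
    folder_path="/some/folder",             # optional: restrict to one folder
)
docs = loader.load()  # list of Document objects built from pdf/docx/doc files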


def _get_folder_from_path(self, drive: Type[Drive]) -> Union[Folder, Drive]:
Contributor Author

This method receives an instance of the class rather than the class (constructor) itself.

@netoferraz netoferraz left a comment

If you guys need additional context or explanation, just let me know.

from O365.drive import Drive, Folder

SCOPES = ["offline_access", "Files.Read.All"]
logger = logging.getLogger(__name__)


class _OneDriveSettings(BaseSettings):
Contributor Author

Moved to O365BaseLoader

Comment on lines 69 to 113
def _auth(self) -> Type[Account]:
    """
    Authenticates the OneDrive API client using the specified
    authentication method and returns the Account object.

    Returns:
        Type[Account]: The authenticated Account object.
    """
    try:
        from O365 import Account, FileSystemTokenBackend
    except ImportError:
        raise ImportError(
            "O365 package not found, please install it with `pip install o365`"
        )
    if self.auth_with_token:
        token_storage = _OneDriveTokenStorage()
        token_path = token_storage.token_path
        token_backend = FileSystemTokenBackend(
            token_path=token_path.parent, token_filename=token_path.name
        )
    else:
        token_backend = FileSystemTokenBackend(
            token_path=Path.home() / ".credentials"
        )
    account = Account(
        credentials=(
            self.settings.client_id,
            self.settings.client_secret.get_secret_value(),
        ),
        scopes=SCOPES,
        token_backend=token_backend,
        raise_http_errors=False,
    )
    # Trigger the OAuth flow (or reuse the cached token) before returning.
    account.authenticate()
    return account
Contributor Author

Moved to O365BaseLoader

Comment on lines 148 to 209
def _load_from_folder(self, folder: Type[Folder]) -> List[Document]:
    """
    Loads all supported document files from the specified folder
    and returns a list of Document objects.

    Args:
        folder (Type[Folder]): The folder object to load the documents from.

    Returns:
        List[Document]: A list of Document objects representing
            the loaded documents.
    """
    docs = []
    file_types = _SupportedFileTypes(file_types=["doc", "docx", "pdf"])
    file_mime_types = file_types.fetch_mime_types()
    items = folder.get_items()
    with tempfile.TemporaryDirectory() as temp_dir:
        os.makedirs(os.path.dirname(temp_dir), exist_ok=True)
        for file in items:
            if file.is_file:
                if file.mime_type in list(file_mime_types.values()):
                    loader = OneDriveFileLoader(file=file)
                    docs.extend(loader.load())
    return docs

def _load_from_object_ids(self, drive: Type[Drive]) -> List[Document]:
    """
    Loads all supported document files from the specified OneDrive
    drive based on their object IDs and returns a list
    of Document objects.

    Args:
        drive (Type[Drive]): The OneDrive drive object
            to load the documents from.

    Returns:
        List[Document]: A list of Document objects representing
            the loaded documents.
    """
    docs = []
    file_types = _SupportedFileTypes(file_types=["doc", "docx", "pdf"])
    file_mime_types = file_types.fetch_mime_types()
    with tempfile.TemporaryDirectory() as temp_dir:
        os.makedirs(os.path.dirname(temp_dir), exist_ok=True)
        for object_id in self.object_ids if self.object_ids else [""]:
            file = drive.get_item(object_id)
            if not file:
                logging.warning(
                    "There isn't a file with "
                    f"object_id {object_id} in drive {drive}."
                )
                continue
            if file.is_file:
                if file.mime_type in list(file_mime_types.values()):
                    loader = OneDriveFileLoader(file=file)
                    docs.extend(loader.load())
    return docs

Contributor Author

Moved to O365BaseLoader

langchain/document_loaders/onedrive_file.py (outdated, resolved)
langchain/document_loaders/sharepoint_file.py (outdated, resolved)
@eyurtsev eyurtsev self-requested a review May 7, 2023 15:34
Contributor

@hwchase17 hwchase17 left a comment

signature of load should not be changed

@eyurtsev
Collaborator

eyurtsev commented May 9, 2023

@netoferraz Thank you for the contribution! As a heads up I'm in the process of adding a few abstractions to the document flow to decouple loading from parsing code.

General strategy is here:
#2833 (comment)

TLDR; If you're able to implement a BlobLoader (interface and file system implementation shown below), it'll make it easier for users to add arbitrary parsers on top of the loading interface.

class BlobLoader(ABC):
    """Abstract interface for blob loaders implementation.

    Implementer should be able to load raw content from a storage system according
    to some criteria and return the raw content lazily as a stream of blobs.
    """

    @abstractmethod
    def yield_blobs(
        self,
    ) -> Iterable[Blob]:
        """A lazy loader for raw data represented by LangChain's Blob object.

        Returns:
            A generator over blobs
        """

Implementation for local file system:

https://github.com/hwchase17/langchain/blob/master/langchain/document_loaders/blob_loaders/file_system.py#L39
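For reference, the file-system implementation linked above can already be used this way; a minimal sketch of yielding blobs lazily (the path and suffixes are placeholders):

from langchain.document_loaders.blob_loaders import FileSystemBlobLoader

# Stream raw file contents lazily as Blob objects; no parsing involved yet.
blob_loader = FileSystemBlobLoader(path="./downloads", suffixes=[".pdf", ".docx"])
for blob in blob_loader.yield_blobs():
    print(blob.path, blob.mimetype)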

@netoferraz
Contributor Author

Hi @eyurtsev! I'll try to bring those concepts to this loader and implement it.

@hwchase17 hwchase17 changed the base branch from master to harrison/sharepoint May 15, 2023 01:09
@hwchase17 hwchase17 added the needs work PRs that need more work label May 15, 2023
@netoferraz netoferraz changed the base branch from harrison/sharepoint to master June 5, 2023 01:01
@netoferraz
Contributor Author

Hey @eyurtsev and @hwchase17. Finally, I had a chance to work on this PR again. I decoupled the loading and parsing process using FileSystemBlobLoader and BaseBlobParser. Could you guys review it?

@HoiDam

HoiDam commented Jun 9, 2023

Are you able to solve the issue? =]

@netoferraz
Contributor Author

Hi @HoiDam. Are you talking about that comment (#4284 (review))? It's outdated because that file doesn't even exist anymore.

@laveshnk-crypto

Hi @hwchase17 and @eyurtsev can this be reviewed and merged soon?

@willemmulder

Did anyone have a chance to look at this? Would love to have this merged :-)

@guidorietbroek

@netoferraz great initiative!

Is this loader adding metadata from SPO? It's possible in the native O365 module.

@vicondoa

Any thoughts on resurrecting this so it can be merged? I'd like to send a PR to improve some of the auth handling and didn't want to step on this because I'd also like Sharepoint support. Thanks!

@baskaryan
Collaborator

yes! working to revive

@netoferraz
Contributor Author

Yeah, @baskaryan! Are you guys willing to accept this PR? Please just clarify what needs to be done, ok? By the way, thank you @guidorietbroek!

@netoferraz
Contributor Author

It would be great if we could move forward with this PR! I'm not sure if the maintainers have any intention of accepting this work.


@baskaryan
Collaborator

updated, would love one more review @netoferraz and @eyurtsev!

PDF = "pdf"


def fetch_mime_types(file_types: Sequence[_FileType]) -> Dict[str, str]:
Collaborator

Could we replace this function with this dict?

EXTENSION_TO_MIMETYPE = {
    "doc": "application/msword",
    "docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "pdf": "application/pdf",
}
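If the function needs to stay for filtering, a sketch of how it could just project out of that dict (assuming the PR's _FileType str enum and the EXTENSION_TO_MIMETYPE mapping above):

from typing import Dict, Sequence

def fetch_mime_types(file_types: Sequence[_FileType]) -> Dict[str, str]:
    # Project just the requested extensions out of the static mapping.
    return {ft.value: EXTENSION_TO_MIMETYPE[ft.value] for ft in file_types}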

return mime_types_mapping


class O365BaseLoader(BaseLoader, BaseModel):
Collaborator
Haven't worked with O365 before. What type of stuff is accessible except for sharepoint? (Trying to understand why inheritance is needed)

""" The IDs of the objects to load data from."""

@property
def _file_types(self) -> Sequence[_FileType]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new blob loaders abstraction helps to prevent hard-coding knowledge of parsing into content fetching. This makes the loading code a lot easier to reuse.

The loader should take 2 attributes that should be part of the initializer.

  1. Blob parser
  2. extensions

return _FileType.DOC, _FileType.DOCX, _FileType.PDF

@property
def _scopes(self) -> List[str]:
Collaborator

@eyurtsev eyurtsev Aug 19, 2023

Could we expose scopes as an attribute and then handle it via the root validator to assign it with a reasonable default?
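A rough sketch of that suggestion with pydantic (the field name and default are illustrative; the default scopes mirror the SCOPES constant earlier in this PR):

from typing import List, Optional

from pydantic import BaseModel, root_validator


class O365BaseLoader(BaseModel):
    scopes: Optional[List[str]] = None
    """OAuth scopes; a default is filled in by the validator below."""

    @root_validator()
    def _set_default_scopes(cls, values: dict) -> dict:
        # Assign a reasonable default instead of forcing subclasses
        # to override a property.
        if not values.get("scopes"):
            values["scopes"] = ["offline_access", "Files.Read.All"]
        return values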

from langchain.docstore.document import Document
from langchain.document_loaders.base_o365 import (
    O365BaseLoader,
    _FileType,
Collaborator

@eyurtsev eyurtsev Aug 19, 2023

(nit) importing a private attribute, maybe make it public?

    raise ImportError(
        "O365 package not found, please install it with `pip install o365`"
    )
drive = self._auth().storage().get_drive(self.document_library_id)
Collaborator

could move data fetching to base o365, and replace the share point loader with the generic loader which will allow users to swap out parsing strategies

        continue
    if file.is_file:
        if file.mime_type in list(file_mime_types.values()):
            file.download(to_path=temp_dir, chunk_size=self.chunk_size)
Collaborator
can the file be fetched to memory and then this can yield a Blob directly?
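Something like the following, assuming the O365 File object exposes a way to read raw bytes; the load_file_content() accessor here is hypothetical, the real O365 API may differ:

from langchain.document_loaders.blob_loaders import Blob

def _yield_blob(self, file) -> Blob:
    # Hypothetical: read the file's bytes directly instead of
    # round-tripping through a temporary directory on disk.
    data: bytes = file.load_file_content()  # assumed O365 accessor
    return Blob.from_data(
        data,
        mime_type=file.mime_type,
        path=file.name,  # keep the original name for downstream parsers
    )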


@property
@abstractmethod
def _scopes(self) -> List[str]:
Collaborator

Could we convert the scopes into attributes rather than properties? They can be set using root validators if using pydantic or via the init if using vanilla classes
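And the vanilla-class flavor of the same idea (illustrative only):

from typing import List, Optional

class O365BaseLoader:
    """Illustrative only: same default-scopes idea without pydantic."""

    def __init__(self, scopes: Optional[List[str]] = None) -> None:
        # Fall back to a sensible default instead of an abstract property.
        self.scopes = scopes or ["offline_access", "Files.Read.All"]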


@property
@abstractmethod
def _file_types(self) -> Sequence[_FileType]:
Collaborator

@baskaryan this could be a good candidate for a class to rewrite using the blob generators.

A single blob generator takes in its interface:

  • auth or settings for auth
  • scopes

filters:

  • drive
  • object ids and/or folder
  • file extensions
  • chunk_size -- for large file downloads
  • max file size -- limit to file size
  • show progress (bool) -- progress indicator

it yields Blobs


Then can be combined with Generic Loader which will allow decoupling the hard-coding between parsers and content fetching
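Putting the bullet points above into a sketch (every name and default here is illustrative, not part of the PR):

from typing import Iterable, List, Optional, Sequence

from langchain.document_loaders.blob_loaders import Blob, BlobLoader


class O365BlobLoader(BlobLoader):
    """Illustrative sketch of the blob generator proposed in this comment."""

    def __init__(
        self,
        *,
        auth_settings: dict,                     # auth or settings for auth
        scopes: List[str],                       # OAuth scopes
        drive_id: str,                           # filter: which drive
        object_ids: Optional[List[str]] = None,  # filter: object ids
        folder_path: Optional[str] = None,       # ...and/or folder
        extensions: Sequence[str] = ("doc", "docx", "pdf"),
        chunk_size: int = 1024 * 1024,           # for large file downloads
        max_file_size: Optional[int] = None,     # limit on file size
        show_progress: bool = False,             # progress indicator
    ) -> None:
        self.auth_settings = auth_settings
        self.scopes = scopes
        self.drive_id = drive_id
        self.object_ids = object_ids
        self.folder_path = folder_path
        self.extensions = extensions
        self.chunk_size = chunk_size
        self.max_file_size = max_file_size
        self.show_progress = show_progress

    def yield_blobs(self) -> Iterable[Blob]:
        # Would authenticate, traverse the drive/folder/object ids, filter by
        # extension and size, and yield each matching file as a Blob.
        raise NotImplementedError("sketch only")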

@eyurtsev
Collaborator

@netoferraz 👋 thank you for the contribution! I left a few comments on the PR, overall looks good to me, so okay merging as is. (cc @baskaryan )

There's a BlobLoader abstraction in the codebase that would fit the requirements here pretty well with an implementation for the file system called FileSystemBlobLoader that can be replicated here. The way it would look would be to declare something like

O365BlobLoader; it would take a bunch of attributes in the init, like auth and filters, and yield blobs.

Then one could compose it with GenericLoader to apply any sort of parser to content that can be fetched from O365

Not a requirement for merging this PR as we can re-use the existing code at a later point. :)
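As a concrete stand-in (since O365BlobLoader doesn't exist yet, this uses the existing FileSystemBlobLoader), the composition would look roughly like:

from langchain.document_loaders.blob_loaders import FileSystemBlobLoader
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import PyPDFParser

# Swap FileSystemBlobLoader for an O365-backed loader once it exists;
# the parser stays decoupled from where the bytes come from.
loader = GenericLoader(
    blob_loader=FileSystemBlobLoader(path="./downloads", glob="**/*.pdf"),
    blob_parser=PyPDFParser(),
)
docs = loader.load()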

@netoferraz
Contributor Author


Thank you, @eyurtsev! @baskaryan, if you think we need to do some additional work based on @eyurtsev's review, let me know, ok? Otherwise, it seems we can move ahead and approve this work.

@baskaryan baskaryan merged commit f116e10 into langchain-ai:master Aug 21, 2023
22 checks passed