
Add SharePoint Loader #4284

Merged
merged 14 commits into langchain-ai:master on Aug 21, 2023
Conversation

netoferraz
Contributor

  • Added a loader (SharePointLoader) that can pull documents (pdf, docx, doc) from a SharePoint Document Library (a usage sketch follows below).
  • Added a base loader (O365BaseLoader) to be used by all loaders that rely on the O365 package.
  • Refactored OneDriveLoader to use the new O365BaseLoader.
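For anyone trying this out, a minimal usage sketch. The module path and parameter names follow this PR's code, but check the merged version; the library ID and folder path are placeholders:

from langchain.document_loaders.sharepoint import SharePointLoader

# Assumes O365_CLIENT_ID and O365_CLIENT_SECRET are set in the environment
# and the Azure app registration has been granted the Files.Read.All scope.
loader = SharePointLoader(
    document_library_id="YOUR_LIBRARY_ID",  # GUID of the document library
    folder_path="/some/folder",             # optional: restrict to one folder
)
docs = loader.load()  # list of Document objects built from pdf/docx/doc files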


def _get_folder_from_path(self, drive: Type[Drive]) -> Union[Folder, Drive]:
Contributor Author

This method receives an instance of the class rather than the class (constructor) itself.

@netoferraz netoferraz left a comment

If you guys need additional context or explanation, just let me know.

from O365.drive import Drive, Folder

SCOPES = ["offline_access", "Files.Read.All"]
logger = logging.getLogger(__name__)


class _OneDriveSettings(BaseSettings):
Contributor Author

Moved to O365BaseLoader

Comment on lines 69 to 113
def _auth(self) -> Type[Account]:
    """
    Authenticates the OneDrive API client using the specified
    authentication method and returns the Account object.

    Returns:
        Type[Account]: The authenticated Account object.
    """
    try:
        from O365 import Account, FileSystemTokenBackend
    except ImportError:
        raise ImportError(
            "O365 package not found, please install it with `pip install o365`"
        )
    if self.auth_with_token:
        token_storage = _OneDriveTokenStorage()
        token_path = token_storage.token_path
        token_backend = FileSystemTokenBackend(
            token_path=token_path.parent, token_filename=token_path.name
        )
    else:
        token_backend = FileSystemTokenBackend(
            token_path=Path.home() / ".credentials"
        )
    account = Account(
        credentials=(
            self.settings.client_id,
            self.settings.client_secret.get_secret_value(),
        ),
        scopes=SCOPES,
        token_backend=token_backend,
        raise_http_errors=False,
    )
    # Trigger the OAuth flow (or reuse the cached token) before returning.
    account.authenticate()
    return account
Contributor Author

Moved to O365BaseLoader

Comment on lines 148 to 209
def _load_from_folder(self, folder: Type[Folder]) -> List[Document]:
    """
    Loads all supported document files from the specified folder
    and returns a list of Document objects.

    Args:
        folder (Type[Folder]): The folder object to load the documents from.

    Returns:
        List[Document]: A list of Document objects representing
            the loaded documents.
    """
    docs = []
    file_types = _SupportedFileTypes(file_types=["doc", "docx", "pdf"])
    file_mime_types = file_types.fetch_mime_types()
    items = folder.get_items()
    with tempfile.TemporaryDirectory() as temp_dir:
        os.makedirs(os.path.dirname(temp_dir), exist_ok=True)
        for file in items:
            if file.is_file:
                if file.mime_type in list(file_mime_types.values()):
                    loader = OneDriveFileLoader(file=file)
                    docs.extend(loader.load())
    return docs

def _load_from_object_ids(self, drive: Type[Drive]) -> List[Document]:
    """
    Loads all supported document files from the specified OneDrive
    drive based on their object IDs and returns a list
    of Document objects.

    Args:
        drive (Type[Drive]): The OneDrive drive object
            to load the documents from.

    Returns:
        List[Document]: A list of Document objects representing
            the loaded documents.
    """
    docs = []
    file_types = _SupportedFileTypes(file_types=["doc", "docx", "pdf"])
    file_mime_types = file_types.fetch_mime_types()
    with tempfile.TemporaryDirectory() as temp_dir:
        os.makedirs(os.path.dirname(temp_dir), exist_ok=True)
        for object_id in self.object_ids if self.object_ids else [""]:
            file = drive.get_item(object_id)
            if not file:
                logging.warning(
                    "There isn't a file with "
                    f"object_id {object_id} in drive {drive}."
                )
                continue
            if file.is_file:
                if file.mime_type in list(file_mime_types.values()):
                    loader = OneDriveFileLoader(file=file)
                    docs.extend(loader.load())
    return docs

Contributor Author

Moved to O365BaseLoader

langchain/document_loaders/onedrive_file.py (outdated, resolved)
langchain/document_loaders/sharepoint_file.py (outdated, resolved)
@eyurtsev eyurtsev self-requested a review May 7, 2023 15:34
Contributor

@hwchase17 hwchase17 left a comment

signature of load should not be changed

@eyurtsev
Collaborator

eyurtsev commented May 9, 2023

@netoferraz Thank you for the contribution! As a heads up I'm in the process of adding a few abstractions to the document flow to decouple loading from parsing code.

General strategy is here:
#2833 (comment)

TLDR; If you're able to implement a BlobLoader (interface and file system implementation shown below), it'll make it easier for users to add arbitrary parsers on top of the loading interface.

class BlobLoader(ABC):
    """Abstract interface for blob loaders implementation.

    Implementer should be able to load raw content from a storage system according
    to some criteria and return the raw content lazily as a stream of blobs.
    """

    @abstractmethod
    def yield_blobs(
        self,
    ) -> Iterable[Blob]:
        """A lazy loader for raw data represented by LangChain's Blob object.

        Returns:
            A generator over blobs
        """

Implementation for local file system:

https://github.com/hwchase17/langchain/blob/master/langchain/document_loaders/blob_loaders/file_system.py#L39
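For reference, the file-system implementation linked above can already be used this way; a minimal sketch of yielding blobs lazily (the path and suffixes are placeholders):

from langchain.document_loaders.blob_loaders import FileSystemBlobLoader

# Stream raw file contents lazily as Blob objects; no parsing involved yet.
blob_loader = FileSystemBlobLoader(path="./downloads", suffixes=[".pdf", ".docx"])
for blob in blob_loader.yield_blobs():
    print(blob.path, blob.mimetype)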

@netoferraz
Contributor Author

Hi @eyurtsev! I'll try to bring those concepts to this loader and implement it.

@hwchase17 hwchase17 changed the base branch from master to harrison/sharepoint May 15, 2023 01:09
@hwchase17 hwchase17 added the needs work PRs that need more work label May 15, 2023
@netoferraz netoferraz changed the base branch from harrison/sharepoint to master June 5, 2023 01:01
@netoferraz
Contributor Author

Hey @eyurtsev and @hwchase17. Finally, I had a chance to work on this PR again. I decoupled the loading and parsing process using FileSystemBlobLoader and BaseBlobParser. Could you guys review it?

@HoiDam

HoiDam commented Jun 9, 2023

Are you able to solve the issue? =]

@netoferraz
Contributor Author

Hi @HoiDam. Are you talking about that comment (#4284 (review))? It's outdated because that file doesn't even exist anymore.

@laveshnk-crypto

Hi @hwchase17 and @eyurtsev can this be reviewed and merged soon?

@willemmulder

Did anyone have a chance to look at this? Would love to have this merged :-)

@guidorietbroek

@netoferraz great initiative!

Is this loader adding metadata from SPO? It's possible in the native O365 module.

@vicondoa

Any thoughts on resurrecting this so it can be merged? I'd like to send a PR to improve some of the auth handling and didn't want to step on this because I'd also like Sharepoint support. Thanks!

@baskaryan
Collaborator

yes! working to revive

@netoferraz
Contributor Author

Yeah, @baskaryan! Are you guys willing to accept this PR? Please just clarify what needs to be done, ok? By the way, thank you @guidorietbroek!

@netoferraz
Contributor Author

It would be great if we could move forward with this PR! I'm not sure if the maintainers have any intention of accepting this work.


@baskaryan
Collaborator

updated, would love one more review @netoferraz and @eyurtsev!

PDF = "pdf"


def fetch_mime_types(file_types: Sequence[_FileType]) -> Dict[str, str]:
Collaborator

Could we replace this function with this dict?

EXTENSION_TO_MIMETYPE = {
    "doc": "application/msword",
    "docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "pdf": "application/pdf",
}
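If the function needs to stay for filtering, a sketch of how it could just project out of that dict (assuming the PR's _FileType str enum and the EXTENSION_TO_MIMETYPE mapping above):

from typing import Dict, Sequence

def fetch_mime_types(file_types: Sequence[_FileType]) -> Dict[str, str]:
    # Project just the requested extensions out of the static mapping.
    return {ft.value: EXTENSION_TO_MIMETYPE[ft.value] for ft in file_types}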

return mime_types_mapping


class O365BaseLoader(BaseLoader, BaseModel):
Collaborator
Haven't worked with O365 before. What type of stuff is accessible except for sharepoint? (Trying to understand why inheritance is needed)

""" The IDs of the objects to load data from."""

@property
def _file_types(self) -> Sequence[_FileType]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new blob loaders abstraction helps to prevent hard-coding knowledge of parsing into content fetching. This makes the loading code a lot easier to reuse.

The loader should take 2 attributes that should be part of the initializer.

  1. Blob parser
  2. extensions

return _FileType.DOC, _FileType.DOCX, _FileType.PDF

@property
def _scopes(self) -> List[str]:
Collaborator

@eyurtsev eyurtsev Aug 19, 2023

Could we expose scopes as an attribute and then handle it via the root validator to assign it with a reasonable default?
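A rough sketch of that suggestion with pydantic (the field name and default are illustrative; the default scopes mirror the SCOPES constant earlier in this PR):

from typing import List, Optional

from pydantic import BaseModel, root_validator


class O365BaseLoader(BaseModel):
    scopes: Optional[List[str]] = None
    """OAuth scopes; a default is filled in by the validator below."""

    @root_validator()
    def _set_default_scopes(cls, values: dict) -> dict:
        # Assign a reasonable default instead of forcing subclasses
        # to override a property.
        if not values.get("scopes"):
            values["scopes"] = ["offline_access", "Files.Read.All"]
        return values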

from langchain.docstore.document import Document
from langchain.document_loaders.base_o365 import (
    O365BaseLoader,
    _FileType,
Collaborator

@eyurtsev eyurtsev Aug 19, 2023

(nit) importing a private attribute, maybe make it public?

    raise ImportError(
        "O365 package not found, please install it with `pip install o365`"
    )
drive = self._auth().storage().get_drive(self.document_library_id)
Collaborator

could move data fetching to base o365, and replace the share point loader with the generic loader which will allow users to swap out parsing strategies

        continue
    if file.is_file:
        if file.mime_type in list(file_mime_types.values()):
            file.download(to_path=temp_dir, chunk_size=self.chunk_size)
Collaborator
can the file be fetched to memory and then this can yield a Blob directly?
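Something like the following, assuming the O365 File object exposes a way to read raw bytes; the load_file_content() accessor here is hypothetical, the real O365 API may differ:

from langchain.document_loaders.blob_loaders import Blob

def _yield_blob(self, file) -> Blob:
    # Hypothetical: read the file's bytes directly instead of
    # round-tripping through a temporary directory on disk.
    data: bytes = file.load_file_content()  # assumed O365 accessor
    return Blob.from_data(
        data,
        mime_type=file.mime_type,
        path=file.name,  # keep the original name for downstream parsers
    )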


@property
@abstractmethod
def _scopes(self) -> List[str]:
Collaborator

Could we convert the scopes into attributes rather than properties? They can be set using root validators if using pydantic or via the init if using vanilla classes
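And the vanilla-class flavor of the same idea (illustrative only):

from typing import List, Optional

class O365BaseLoader:
    """Illustrative only: same default-scopes idea without pydantic."""

    def __init__(self, scopes: Optional[List[str]] = None) -> None:
        # Fall back to a sensible default instead of an abstract property.
        self.scopes = scopes or ["offline_access", "Files.Read.All"]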


@property
@abstractmethod
def _file_types(self) -> Sequence[_FileType]:
Collaborator

@baskaryan this could be a good candidate for a class to rewrite using the blob generators.

A single blob generator takes in its interface:

  • auth or settings for auth
  • scopes

filters:

  • drive
  • object ids and/or folder
  • file extensions
  • chunk_size -- for large file downloads
  • max file size -- limit to file size
  • show progress (bool) -- progress indicator

it yields Blobs


Then can be combined with Generic Loader which will allow decoupling the hard-coding between parsers and content fetching
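Putting the bullet points above into a sketch (every name and default here is illustrative, not part of the PR):

from typing import Iterable, List, Optional, Sequence

from langchain.document_loaders.blob_loaders import Blob, BlobLoader


class O365BlobLoader(BlobLoader):
    """Illustrative sketch of the blob generator proposed in this comment."""

    def __init__(
        self,
        *,
        auth_settings: dict,                     # auth or settings for auth
        scopes: List[str],                       # OAuth scopes
        drive_id: str,                           # filter: which drive
        object_ids: Optional[List[str]] = None,  # filter: object ids
        folder_path: Optional[str] = None,       # ...and/or folder
        extensions: Sequence[str] = ("doc", "docx", "pdf"),
        chunk_size: int = 1024 * 1024,           # for large file downloads
        max_file_size: Optional[int] = None,     # limit on file size
        show_progress: bool = False,             # progress indicator
    ) -> None:
        self.auth_settings = auth_settings
        self.scopes = scopes
        self.drive_id = drive_id
        self.object_ids = object_ids
        self.folder_path = folder_path
        self.extensions = extensions
        self.chunk_size = chunk_size
        self.max_file_size = max_file_size
        self.show_progress = show_progress

    def yield_blobs(self) -> Iterable[Blob]:
        # Would authenticate, traverse the drive/folder/object ids, filter by
        # extension and size, and yield each matching file as a Blob.
        raise NotImplementedError("sketch only")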

@eyurtsev
Collaborator

@netoferraz 👋 thank you for the contribution! I left a few comments on the PR, overall looks good to me, so okay merging as is. (cc @baskaryan )

There's a BlobLoader abstraction in the codebase that would fit the requirements here pretty well with an implementation for the file system called FileSystemBlobLoader that can be replicated here. The way it would look would be to declare something like

O365BlobLoader; it would take a bunch of attributes in the init, like auth and filters, and yield blobs.

Then one could compose it with GenericLoader to apply any sort of parser to content that can be fetched from O365

Not a requirement for merging this PR as we can re-use the existing code at a later point. :)
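As a concrete stand-in (since O365BlobLoader doesn't exist yet, this uses the existing FileSystemBlobLoader), the composition would look roughly like:

from langchain.document_loaders.blob_loaders import FileSystemBlobLoader
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import PyPDFParser

# Swap FileSystemBlobLoader for an O365-backed loader once it exists;
# the parser stays decoupled from where the bytes come from.
loader = GenericLoader(
    blob_loader=FileSystemBlobLoader(path="./downloads", glob="**/*.pdf"),
    blob_parser=PyPDFParser(),
)
docs = loader.load()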

@netoferraz
Contributor Author


Thank you, @eyurtsev! @baskaryan, if you think we need to do some additional work based on @eyurtsev's review, let me know, ok? Otherwise, it seems we can move ahead and approve this work.

@baskaryan baskaryan merged commit f116e10 into langchain-ai:master Aug 21, 2023
22 checks passed