-
Notifications
You must be signed in to change notification settings - Fork 15.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add SharePoint Loader #4284
Add SharePoint Loader #4284
Conversation
|
||
def _get_folder_from_path(self, drive: Type[Drive]) -> Union[Folder, Drive]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method receives an instance of the class instead of the constructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you guys need additional context or explanation, just le me know.
from O365.drive import Drive, Folder | ||
|
||
SCOPES = ["offline_access", "Files.Read.All"] | ||
logger = logging.getLogger(__name__) | ||
|
||
|
||
class _OneDriveSettings(BaseSettings): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to O365BaseLoader
def _auth(self) -> Type[Account]: | ||
""" | ||
Authenticates the OneDrive API client using the specified | ||
authentication method and returns the Account object. | ||
|
||
Returns: | ||
Type[Account]: The authenticated Account object. | ||
""" | ||
try: | ||
from O365 import FileSystemTokenBackend | ||
except ImportError: | ||
raise ValueError( | ||
"O365 package not found, please install it with `pip install o365`" | ||
) | ||
if self.auth_with_token: | ||
token_storage = _OneDriveTokenStorage() | ||
token_path = token_storage.token_path | ||
token_backend = FileSystemTokenBackend( | ||
token_path=token_path.parent, token_filename=token_path.name | ||
) | ||
account = Account( | ||
credentials=( | ||
self.settings.client_id, | ||
self.settings.client_secret.get_secret_value(), | ||
), | ||
scopes=SCOPES, | ||
token_backend=token_backend, | ||
**{"raise_http_errors": False}, | ||
) | ||
else: | ||
token_backend = FileSystemTokenBackend( | ||
token_path=Path.home() / ".credentials" | ||
) | ||
account = Account( | ||
credentials=( | ||
self.settings.client_id, | ||
self.settings.client_secret.get_secret_value(), | ||
), | ||
scopes=SCOPES, | ||
token_backend=token_backend, | ||
**{"raise_http_errors": False}, | ||
) | ||
# make the auth | ||
account.authenticate() | ||
return account |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to O365BaseLoader
def _load_from_folder(self, folder: Type[Folder]) -> List[Document]: | ||
""" | ||
Loads all supported document files from the specified folder | ||
and returns a list of Document objects. | ||
|
||
Args: | ||
folder (Type[Folder]): The folder object to load the documents from. | ||
|
||
Returns: | ||
List[Document]: A list of Document objects representing | ||
the loaded documents. | ||
|
||
""" | ||
docs = [] | ||
file_types = _SupportedFileTypes(file_types=["doc", "docx", "pdf"]) | ||
file_mime_types = file_types.fetch_mime_types() | ||
items = folder.get_items() | ||
with tempfile.TemporaryDirectory() as temp_dir: | ||
file_path = f"{temp_dir}" | ||
os.makedirs(os.path.dirname(file_path), exist_ok=True) | ||
for file in items: | ||
if file.is_file: | ||
if file.mime_type in list(file_mime_types.values()): | ||
loader = OneDriveFileLoader(file=file) | ||
docs.extend(loader.load()) | ||
return docs | ||
|
||
def _load_from_object_ids(self, drive: Type[Drive]) -> List[Document]: | ||
""" | ||
Loads all supported document files from the specified OneDrive | ||
drive based on their object IDs and returns a list | ||
of Document objects. | ||
|
||
Args: | ||
drive (Type[Drive]): The OneDrive drive object | ||
to load the documents from. | ||
|
||
Returns: | ||
List[Document]: A list of Document objects representing | ||
the loaded documents. | ||
""" | ||
docs = [] | ||
file_types = _SupportedFileTypes(file_types=["doc", "docx", "pdf"]) | ||
file_mime_types = file_types.fetch_mime_types() | ||
with tempfile.TemporaryDirectory() as temp_dir: | ||
file_path = f"{temp_dir}" | ||
os.makedirs(os.path.dirname(file_path), exist_ok=True) | ||
|
||
for object_id in self.object_ids if self.object_ids else [""]: | ||
file = drive.get_item(object_id) | ||
if not file: | ||
logging.warning( | ||
"There isn't a file with" | ||
f"object_id {object_id} in drive {drive}." | ||
) | ||
continue | ||
if file.is_file: | ||
if file.mime_type in list(file_mime_types.values()): | ||
loader = OneDriveFileLoader(file=file) | ||
docs.extend(loader.load()) | ||
return docs | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to O365BaseLoader
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
signature of load should not be changed
@netoferraz Thank you for the contribution! As a heads up I'm in the process of adding a few abstractions to the document flow to decouple loading from parsing code. General strategy is here: TLDR; If you're able to implement a class BlobLoader(ABC):
"""Abstract interface for blob loaders implementation.
Implementer should be able to load raw content from a storage system according
to some criteria and return the raw content lazily as a stream of blobs.
"""
@abstractmethod
def yield_blobs(
self,
) -> Iterable[Blob]:
"""A lazy loader for raw data represented by LangChain's Blob object.
Returns:
A generator over blobs
""" Implementation for local file system: |
Hi @eyurtsev ! I'll try to bring those concepts to this loader and implement it. |
Hey @eyurtsev and @hwchase17. Finally, I had a chance to work again on this PR. I decoupled the loading and parser process using |
Are you able to solve the issue? =] |
Hi @HoiDam. Are you talking about that comment (#4284 (review)) ? This comment is outdated because that file does not even exists anymore. |
Hi @hwchase17 and @eyurtsev can this be reviewed and merged soon? |
Did anyone have a chance to look at this? Would love to have this merged :-) |
@netoferraz great initiative! Is this loader adding metadata from SPO, it’s possible in the native module O365 . |
Any thoughts on resurrecting this so it can be merged? I'd like to send a PR to improve some of the auth handling and didn't want to step on this because I'd also like Sharepoint support. Thanks! |
yes! working to revive |
Yeah! @baskaryan ! Are you guys willing to accept this PR? Please, just clarify to me what needs to be done, ok? By the way, thank you @guidorietbroek |
It would be great if we can move forward with this PR! I'm not sure if the maintainers have any intentions to accept this work. |
The latest updates on your projects. Learn more about Vercel for Git ↗︎
|
updated, would love one more review @netoferraz and @eyurtsev! |
PDF = "pdf" | ||
|
||
|
||
def fetch_mime_types(file_types: Sequence[_FileType]) -> Dict[str, str]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we replace this function with this dict?
EXTENSION_TO_MIMETYPE = {
"doc": "application/msword",
"docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
"pdf": "application/pdf",
}
return mime_types_mapping | ||
|
||
|
||
class O365BaseLoader(BaseLoader, BaseModel): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haven't worked with O365 before. What type of stuff is accessible except for sharepoint? (Trying to understand why inheritance is needed)
""" The IDs of the objects to load data from.""" | ||
|
||
@property | ||
def _file_types(self) -> Sequence[_FileType]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new blob loaders abstraction helps to prevent hard-coding knowledge of parsing into content fetching. This makes the loading code a lot easier to reuse.
The loader should take 2 attributes that should be part of the initializer.
- Blob parser
- extensions
return _FileType.DOC, _FileType.DOCX, _FileType.PDF | ||
|
||
@property | ||
def _scopes(self) -> List[str]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we expose scopes as an attribute and then handle it via the root validator to assign it with a reasonable default?
from langchain.docstore.document import Document | ||
from langchain.document_loaders.base_o365 import ( | ||
O365BaseLoader, | ||
_FileType, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit) importing a private attribute, maybe make it public?
raise ImportError( | ||
"O365 package not found, please install it with `pip install o365`" | ||
) | ||
drive = self._auth().storage().get_drive(self.document_library_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could move data fetching to base o365, and replace the share point loader with the generic loader which will allow users to swap out parsing strategies
continue | ||
if file.is_file: | ||
if file.mime_type in list(file_mime_types.values()): | ||
file.download(to_path=temp_dir, chunk_size=self.chunk_size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can the file be fetched to memory and then this can yield a Blob directly?
|
||
@property | ||
@abstractmethod | ||
def _scopes(self) -> List[str]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we convert the scopes into attributes rather than properties? They can be set using root validators if using pydantic or via the init if using vanilla classes
|
||
@property | ||
@abstractmethod | ||
def _file_types(self) -> Sequence[_FileType]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@baskaryan this could be a good candidate of a class to rewrite using the blob generators
Single blob generator takes in its interface:
- auth or settings for auth
- scopes
filters:
- drive
- object ids and/or folder
- file extensions
- chunk_size -- for large file downloads
- max file size -- limit to file size
- show progress (bool) -- progress indicator
it yields Blobs
Then can be combined with Generic Loader which will allow decoupling the hard-coding between parsers and content fetching
@netoferraz 👋 thank you for the contribution! I left a few comments on the PR, overall looks good to me, so okay merging as is. (cc @baskaryan ) There's a BlobLoader abstraction in the codebase that would fit the requirements here pretty well with an implementation for the file system called O365BlobLoader, it will take a bunch of attribtues in the init like auth, and filters, and yield blobs. Then one could compose it with Not a requirement for merging this PR as we can re-use the existing code at a later point. :) |
Thank you, @eyurtsev ! @baskaryan If you understand that we need to do some additional work based on the @eyurtsev review, let me know, ok? Otherwise, seems we could move to approve this work. |
SharePointLoader
) that can pull documents (pdf
,docx
,doc
) from the SharePoint Document Library.O365BaseLoader
) to be used for all Loaders that use O365 PackageOneDriveLoader
to use the newO365BaseLoader
.