Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement a FSspec based ArtifactLoader #100

Merged
merged 8 commits into from
Mar 11, 2024
Merged

Implement a FSspec based ArtifactLoader #100

merged 8 commits into from
Mar 11, 2024

Conversation

maxmynter
Copy link
Collaborator

@maxmynter maxmynter commented Mar 7, 2024

Closes #94

Questions

  • Should fsspec become a first class dependecy such as tabulate? Or do we import it conditionally in the FilepathArtifactLoader?

self, path: str | os.PathLike[str], to_localdir: str | os.PathLike[str] | None = None
) -> None:
self.rpath = str(path)
self.lpath = Path(to_localdir or mkdtemp())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an acquired/created resource that needs proper cleanup, e.g. through a weakref finalizer (see e.g. https://github.com/python/cpython/blob/2647afeab694517933c53028d4ad427460b9e1fa/Lib/tempfile.py#L885-L888)

Path
The path to the artifact on the local filesystem.
"""
fs = spec.AbstractFileSystem()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The filesystem should be constructed directly from the URI, see e.g. https://github.com/nicholasjng/shelf/blob/4a628c17e678524a49cb327238c3fb7924e0cb9d/src/shelf/util.py#L12.

@maxmynter maxmynter marked this pull request as ready for review March 8, 2024 15:58
Copy link
Collaborator

@nicholasjng nicholasjng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the story after I downloaded the remote files to my machine? I mean, what do I have to use to instantiate my model from the files?

pyproject.toml Outdated
@@ -31,7 +31,7 @@ classifiers = [
"Typing :: Typed",
]

dependencies = ["tabulate"]
dependencies = ["tabulate", "fsspec"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, we should make this optional (i.e. by importing fsspec only locally in the artifact load method), and facilitate local loading with builtin methods.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree

from typing import Any, Callable, Generic, Iterable, Iterator, Literal, TypeVar

from fsspec import core, filesystem, utils
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please import the specific APIs from these submodules, not the whole modules themselves.


def test_load_local_file(local_file: Path, tmp_path: Path) -> None:
test_dir = tmp_path / "test_load_dir"
loader = FilePathArtifactLoader(str(local_file), str(test_dir))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should work without the string casts.

def __init__(
self,
path: str | os.PathLike[str],
to_localdir: str | os.PathLike[str] | None = None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should have a different name, it currently sounds like a bool. Maybe destination / basedir ?

Comment on lines 175 to 177
fs, _, _ = core.get_fs_token_paths(self.source_path)
if isinstance(fs, LocalFileSystem):
return Path(self.source_path).resolve()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this caching a la "check if it exists, only download if it doesn't"? (Otherwise it's probably currently missing)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More or less.

My understanding is, that this class, as a general filesystem class, should also be able to use local filepaths.
In that case, we don't want to download the data.

If it is a remote path, we download and put it either (a) a specified local dir, or (b) a tmp dir.

If a) the files are persisted on disk and the filepath can be used (by the user) in other places.

If b) the cleanup task deletes the tmp dir and the model to not clutter the filesystem.

That is how it is currently implemented. Do you agree with this?

@@ -58,6 +58,7 @@ jobs:
- name: Run tests on oldest supported Python
run: |
python -m pip install ".[dev]"
pip install fsspec
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, just add it to the dev extra.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove this line, then.

src/nnbench/types.py Outdated Show resolved Hide resolved
Comment on lines 175 to 177
raise ImportError(
"The FilePathArtifactLoader depends on fsspec. Do you have it installed?"
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not helpful, you should instead give an install instruction:

Suggested change
raise ImportError(
"The FilePathArtifactLoader depends on fsspec. Do you have it installed?"
)
raise ImportError(
"class {self.__class__.__name__} requires `fsspec` to be installed. You can install it by running `python -m pip install --upgrade fsspec`"
)

Path
The path to the artifact on the local filesystem.
"""
fs, _, _ = self._get_fs_token_paths(self.source_path)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you're ignoring the other two outputs, you could also use fs = fsspec.filesystem(self.source_path, **self.storage_options) (or with the options in another set of parentheses, not sure about the calling convention).

"""
fs, _, _ = self._get_fs_token_paths(self.source_path)
if isinstance(fs, self._LocalFileSystem):
return Path(self.source_path).resolve()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might want to copy files into your target folder even if they are local? Seems like you could just omit the branch here.

Comment on lines 175 to 184
def _get_protocol(self, path: str) -> str:
"""Lazy loading of fsspec.utils.get_protocol"""
try:
from fsspec.utils import get_protocol

return get_protocol(path)
except ImportError:
raise ImportError(
"class {self.__class__.__name__} requires `fsspec` to be installed. You can install it by running `python -m pip install --upgrade fsspec`"
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two things here:

  1. The single use in load() below does not warrant extracting these into their own APIs - please just import the fsspec versions both inline in load(), and scrap the private class methods.
  2. You can factor out the import error with a simple import guard at the top of the file, e.g. like so:
# this towards the top, after the imports.

try:
    import fsspec
    HAS_FSSPEC = True
except ImportError:
    HAS_FSSPEC = False

...
# later, in FilePathArtifactLoader.__init__():
...
if not HAS_FSSPEC:
    raise ModuleNotFoundError("class blah requires `fsspec`...")

@@ -58,6 +58,7 @@ jobs:
- name: Run tests on oldest supported Python
run: |
python -m pip install ".[dev]"
pip install fsspec
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove this line, then.

Comment on lines 31 to 32
if TYPE_CHECKING:
import fsspec
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this mypy complaining? You could also add fsspec to the list of modules for which to ignore missing imports (in the pyproject.toml) if you like, since most of its interfaces are not great typing-wise.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, was mypy. Changed to ignore in the .toml.

@maxmynter maxmynter merged commit 972bff8 into main Mar 11, 2024
5 checks passed
@maxmynter maxmynter deleted the fsspec-loader branch March 11, 2024 17:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement fsspec artifact loader for nnbench
2 participants