Add support for caching, exporting, importing bucket list #549

Open
tchaton opened this issue Oct 11, 2023 · 4 comments
Labels
enhancement New feature or request

Comments

@tchaton

tchaton commented Oct 11, 2023

Tell us more about this new feature.

Hey there,

Assuming a dataset isn't going to change, I would like to configure mountpoint-s3 to keep the bucket listing in RAM or store it in a file (preferred, since the file can be saved and reused on a later re-mount).

Right now, I need to store the listing in a file myself to avoid incurring the cost of listing again (up to 20 min on a reasonably sized dataset).

    import hashlib
    import os
    from typing import List

    def _cached_list_filepaths(self, root: str) -> List[str]:
        # Key the cache file on a hash of the root so each dataset gets its own listing.
        root_hash = hashlib.sha256(root.encode("utf-8")).hexdigest()
        cache_path = f"/cache/{root_hash}/filepaths.txt"

        # Reuse the stored listing if a previous run already walked this root.
        if os.path.exists(cache_path):
            with open(cache_path) as f:
                return [line.rstrip("\n") for line in f]

        # Otherwise walk the mount once; this is the expensive step.
        filepaths = []
        for dirpath, _, filenames in os.walk(root):
            for filename in filenames:
                filepaths.append(os.path.join(dirpath, filename))

        os.makedirs(os.path.dirname(cache_path), exist_ok=True)

        # Persist one path per line for later runs.
        with open(cache_path, "w") as f:
            for path in filepaths:
                f.write(f"{path}\n")

        return filepaths

Additionally, I would love to pay the cost of listing only once, ever. Ideally, I would like the ability to export and import the filesystem index.
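
To make the export/import idea concrete, here is a rough user-side sketch of what I have in mind. It is not an existing mountpoint-s3 feature: `export_index`, `import_index`, and the gzip-compressed JSON layout are all hypothetical names and choices made up for illustration. Paths are stored relative to the root so the same index could be reused when the bucket is mounted at a different location.

```python
import gzip
import json
import os
import tempfile
from typing import List


def export_index(root: str, index_path: str) -> List[str]:
    """Walk `root` once and write the relative file paths to a gzip-compressed JSON file."""
    relpaths = sorted(
        os.path.relpath(os.path.join(dirpath, name), root)
        for dirpath, _, filenames in os.walk(root)
        for name in filenames
    )
    index_dir = os.path.dirname(os.path.abspath(index_path))
    os.makedirs(index_dir, exist_ok=True)
    # Write to a temporary file first, then rename it into place,
    # so an interrupted export never leaves a partial index behind.
    fd, tmp_path = tempfile.mkstemp(dir=index_dir)
    os.close(fd)
    with gzip.open(tmp_path, "wt", encoding="utf-8") as f:
        json.dump({"version": 1, "files": relpaths}, f)
    os.replace(tmp_path, index_path)
    return relpaths


def import_index(root: str, index_path: str) -> List[str]:
    """Load a previously exported index and rebase its paths onto `root`."""
    with gzip.open(index_path, "rt", encoding="utf-8") as f:
        payload = json.load(f)
    return [os.path.join(root, rel) for rel in payload["files"]]
```

The index file should live outside the mounted directory, and this only stays valid under the assumption above that the dataset does not change.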

tchaton added the enhancement (New feature or request) label Oct 11, 2023
tchaton changed the title from "Add support for caching bucket listing" to "Add support for caching, exporting, importing bucket listing" Oct 11, 2023
tchaton changed the title from "Add support for caching, exporting, importing bucket listing" to "Add support for caching, exporting, importing bucket list" Oct 11, 2023
@dannycjones
Contributor

Interesting feature request! Serving directory listing from cache is something we've been considering as part of #255, but ultimately won't be part of the initial solution. However, the entries we discover as part of a directory listing will be included in subsequent lookup, open, etc. requests.

Leaving this open to track as a further enhancement in the future.
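
As a purely conceptual illustration of the behaviour described above (listing itself is not served from cache, but the entries it discovers can serve later lookups), a toy model might look like the following. This is not mountpoint-s3's implementation or API: `ListingBackedCache` and the `list_remote` callable are hypothetical names invented for the sketch.

```python
from typing import Callable, Dict, List, Optional


class ListingBackedCache:
    """Toy metadata cache: entries seen while listing a prefix serve later lookups."""

    def __init__(self, list_remote: Callable[[str], Dict[str, int]]) -> None:
        # `list_remote(prefix)` is a hypothetical callable returning {name: size}.
        self._list_remote = list_remote
        self._entries: Dict[str, int] = {}
        self.remote_calls = 0

    def listdir(self, prefix: str) -> List[str]:
        # Listing always goes to the remote; it is never served from the cache.
        self.remote_calls += 1
        found = self._list_remote(prefix)
        # Every entry discovered by the listing is remembered for later lookups.
        for name, size in found.items():
            self._entries[prefix + name] = size
        return sorted(found)

    def lookup(self, key: str) -> Optional[int]:
        # Answered from entries recorded by an earlier listing; None if unknown.
        return self._entries.get(key)
```

In this model, calling `listdir` twice on the same prefix still costs two remote calls, which is the gap this issue is asking to close.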

@tchaton
Author

tchaton commented Oct 13, 2023

Keep me updated on this @dannycjones

This is one of the major practical uses of mountpoint-s3 for any serious deep learning training. Here is another related issue: #554

@tchaton
Author

tchaton commented Oct 25, 2023

@dannycjones Any updates?

@dannycjones
Contributor

> @dannycjones Any updates ?

Not on this issue in particular. Work is ongoing on #255, with metadata caching (for open, etc., but not for list) available under a build-time feature flag. It isn't ready for production use yet, but it can be built from source for non-production testing.
