
Display Python code for loading objects #100

Closed
magland opened this issue Aug 7, 2023 · 9 comments

magland (Collaborator) commented Aug 7, 2023

No description provided.

magland (Collaborator, Author) commented Aug 7, 2023

@bendichter @CodyCBakerPhD

I think this will be pretty useful for exploring the data in more detail. When you have a neurodata object open, you can click to view a Python code snippet to access that object remotely in a Python environment.

[screenshot: the Python code snippet view for a selected neurodata object]

I have found that the recommended fsspec method for loading the remote file is very inefficient compared with direct HTTP requests using range headers (I don't know the inner workings of that library). This prompted me to create a simple package called remfile. See that project's README for a more detailed discussion. The implementation is lightweight (a single Python file).
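For reference, the direct range-request approach mentioned above can be sketched with only the standard library (the function names and chunk layout here are illustrative, not remfile internals):

```python
import urllib.request


def range_header(offset, length):
    """Build an HTTP Range header value for `length` bytes starting at `offset`."""
    # HTTP byte ranges are inclusive on both ends, hence the -1
    return f"bytes={offset}-{offset + length - 1}"


def read_byte_range(url, offset, length):
    """Fetch a single byte range from a remote file with one HTTP request."""
    req = urllib.request.Request(url, headers={"Range": range_header(offset, length)})
    with urllib.request.urlopen(req) as resp:
        # a range-capable server replies 206 Partial Content with only these bytes
        return resp.read()
```

A server that honors the `Range` header returns only the requested bytes, which is what makes lazy access to large remote HDF5 files practical.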

Also, I have found pynwb to be inefficient for quick access to remote files; I think it loads a lot of metadata up front. For that reason, I am using h5py directly for now. Happy to discuss further.

CodyCBakerPhD (Contributor):

There are still two big advantages to using fsspec, as I understand it (correct me if I'm wrong):

a) fsspec automatically retries failed requests. This was especially convenient, as we discovered in the NWB Inspector when it was used to generate quality reports for all datasets on DANDI via h5py ROS3 streaming, which does not retry automatically. Roughly 3% of requests, even for simple metadata fields, failed, and the AWS recommendation was to retry with exponential backoff.

See https://github.com/NeurodataWithoutBorders/nwbinspector/blob/6e4771f3008233a3a9e79ac919b2d4a0ae3d2f6c/src/nwbinspector/utils.py#L160-L174 for our implementation at the time; we've since intended to switch to fsspec simply for the convenience of not having to handle retries ourselves.
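The exponential-backoff pattern described above can be sketched in plain Python (the `fetch` callable, delays, and retry count here are illustrative stand-ins, not the NWB Inspector's actual code):

```python
import random
import time


def with_exponential_backoff(fetch, max_retries=5, base_delay=0.1):
    """Call `fetch()`, retrying transient failures with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            # wait base_delay * 2^attempt, plus random jitter, before the next try
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

Jitter is commonly added so that many clients retrying at once don't hammer the server in lockstep.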

b) The CachingFileSystem from fsspec is persistent on disk: I can stream contents from one file, close the file, restart a process, go to another process, etc., and the cache doesn't reset, so I still benefit from not having to redownload the desired bytes.

As I understand it, the RemFile cache is in-memory only and specific to the object instance; it resets on close or kernel restart, rather like Python's native LRU cache. Maybe the term "cache" is overloaded here.

I admit this only helps when revisiting data or metadata is common, not when quickly scanning through an entire dataset once and only once, but still.

bendichter (Contributor):

@magland would it be possible to add caching to disk and retries?

magland (Collaborator, Author) commented Aug 8, 2023

Thanks @CodyCBakerPhD

@bendichter yes, I have already added retries and included a test for them (see below). Caching to disk is trickier... I will take a crack at it and you can tell me what you think.

https://github.com/magland/remfile/blob/28a907d6e4cd1785ea12491ab12fe23724fae6ae/remfile/RemFile.py#L196-L224

https://github.com/magland/remfile/blob/28a907d6e4cd1785ea12491ab12fe23724fae6ae/tests/test_main.py#L53-L62

bendichter (Contributor):

You can try joblib.Memory for caching

CodyCBakerPhD (Contributor):

It looks like a downside to joblib.Memory is that it behaves more like a classic LRU cache: it records a mapping keyed on the exact input arguments rather than smartly figuring out which remaining bytes are still needed.

That is, if I request a slice [0:10] from a dataset and then follow with a request for [0:20], I don't think it would reuse the data already downloaded for the first slice, since the second request equates to a different combination of range arguments. I'm not 100% sure how this interacts with a paginated file, or how the chunking equates to byte ranges under the hood.
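The argument-keyed behavior described above can be demonstrated with the stdlib `lru_cache`, which keys on exact arguments the same way (`fetch_slice` is a hypothetical stand-in for a remote read, not joblib or remfile code):

```python
from functools import lru_cache

FETCHES = {"count": 0}  # track how many "downloads" actually happen


@lru_cache(maxsize=None)
def fetch_slice(start, stop):
    """Pretend to download bytes [start:stop); cached only by the exact (start, stop) pair."""
    FETCHES["count"] += 1
    return bytes(range(start, stop))


fetch_slice(0, 10)  # first request: a real fetch
fetch_slice(0, 20)  # fully overlaps the first 10 bytes, but different args -> a second full fetch
```

A byte-range-aware cache (like fsspec's) would instead notice the overlap and download only bytes 10 through 19 for the second request.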

I'd reference the fsspec implementation of the memory map cache: https://github.com/fsspec/filesystem_spec/blob/master/fsspec/caching.py#L95

But yes, it does seem more than a bit tricky to get working comparably well. I'd be fine with simply recommending remfile for fast, non-cached direct access to contents and recommending fsspec for caching use cases, if this isn't something you want to spend too much time perfecting.

magland (Collaborator, Author) commented Aug 8, 2023

Yeah, I think this is not straightforward to solve. For now let's say remfile does not have disk caching capability.

magland (Collaborator, Author) commented Aug 8, 2023

@CodyCBakerPhD Looking at this some more, the difficult part of an LRU disk cache is the LRU part. So I decided to implement a non-LRU disk cache, which turned out to be pretty straightforward. See:

https://github.com/magland/remfile#disk-caching

@bendichter
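A non-LRU disk cache of the kind described is easy to sketch: chunks are written to files keyed by a hash of (url, chunk index), survive process restarts, and are never evicted. This is an illustration of the idea under those assumptions, not remfile's actual implementation:

```python
import hashlib
import os


class DiskChunkCache:
    """Persist downloaded chunks as files on disk; entries outlive the process and are never evicted."""

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, url, chunk_index):
        # hash the (url, chunk index) pair into a stable filename
        key = hashlib.sha1(f"{url}:{chunk_index}".encode()).hexdigest()
        return os.path.join(self.cache_dir, key)

    def get(self, url, chunk_index):
        """Return the cached chunk bytes, or None on a cache miss."""
        path = self._path(url, chunk_index)
        if os.path.exists(path):
            with open(path, "rb") as f:
                return f.read()
        return None

    def put(self, url, chunk_index, data):
        """Store a downloaded chunk; it stays on disk until manually deleted."""
        with open(self._path(url, chunk_index), "wb") as f:
            f.write(data)
```

Skipping eviction sidesteps the hard part of an LRU design (tracking recency across processes) at the cost of unbounded disk usage.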

CodyCBakerPhD (Contributor):

@magland I think this discussion can be closed for now; we're working on higher-level benchmarking and enhanced documentation/instructions for streaming recommendations, and we'll open a new issue if/when we reach new recommendations for code snippets to follow.

@magland magland closed this as completed Feb 8, 2024