
Display Python code for loading objects #100

Closed
magland opened this issue Aug 7, 2023 · 9 comments

magland (Collaborator) commented Aug 7, 2023

No description provided.

magland (Collaborator, Author) commented Aug 7, 2023

@bendichter @CodyCBakerPhD

I think this will be pretty useful for exploring the data in more detail. When you have a neurodata object open, you can click to view a Python code snippet to access that object remotely in a Python environment.

[screenshot: the Python code snippet view for a selected neurodata object]

I have found that the recommended fsspec method for loading the remote file is very inefficient compared with direct HTTP requests using range headers (I don't know the inner workings of that library). This prompted me to create a simple package called remfile. See that project's README for a more detailed discussion. The implementation is lightweight (a single Python file).
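For reference, the direct range-request approach mentioned above can be sketched with only the standard library (the function names and chunk layout here are illustrative, not remfile internals):

```python
import urllib.request


def range_header(offset, length):
    """Build an HTTP Range header value for `length` bytes starting at `offset`."""
    # HTTP byte ranges are inclusive on both ends, hence the -1
    return f"bytes={offset}-{offset + length - 1}"


def read_byte_range(url, offset, length):
    """Fetch a single byte range from a remote file with one HTTP request."""
    req = urllib.request.Request(url, headers={"Range": range_header(offset, length)})
    with urllib.request.urlopen(req) as resp:
        # a range-capable server replies 206 Partial Content with only these bytes
        return resp.read()
```

A server that honors the `Range` header returns only the requested bytes, which is what makes lazy access to large remote HDF5 files practical.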

Also, I have found pynwb to be inefficient for quick access to remote files; I think it loads a lot of metadata up front. For that reason, I am using h5py directly for now. Happy to discuss further.

CodyCBakerPhD (Contributor):

There are still two big advantages to using fsspec, as I understand it (correct me if I'm wrong):

a) fsspec automatically retries failed requests. This was especially convenient, as we discovered in the NWB Inspector when it was used to generate quality reports for all datasets on DANDI via h5py ROS3 streaming, which does not retry automatically. Roughly 3% of requests, even for simple metadata fields, failed, and the AWS recommendation was to retry with exponential backoff.

See https://github.com/NeurodataWithoutBorders/nwbinspector/blob/6e4771f3008233a3a9e79ac919b2d4a0ae3d2f6c/src/nwbinspector/utils.py#L160-L174 for our implementation at the time; we've since intended to switch to fsspec simply for the convenience of not having to handle retries ourselves.
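The exponential-backoff pattern described above can be sketched in plain Python (the `fetch` callable, delays, and retry count here are illustrative stand-ins, not the NWB Inspector's actual code):

```python
import random
import time


def with_exponential_backoff(fetch, max_retries=5, base_delay=0.1):
    """Call `fetch()`, retrying transient failures with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            # wait base_delay * 2^attempt, plus random jitter, before the next try
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

Jitter is commonly added so that many clients retrying at once don't hammer the server in lockstep.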

b) The CachingFileSystem from fsspec is persistent on disk: I can stream contents from one file, close the file, restart a process, go to another process, etc., and the cache doesn't reset, so I still benefit from not having to redownload the desired bytes.

As I understand it, the RemFile cache is in-memory only and specific to the object instance; it resets on close or kernel restart, rather like Python's native LRU cache. Maybe the term "cache" is overloaded here.

I admit this only helps when revisiting data or metadata is common, not when quickly scanning through an entire dataset once and only once, but still.

bendichter (Contributor):

@magland would it be possible to add caching to disk and retries?

magland (Collaborator, Author) commented Aug 8, 2023

Thanks @CodyCBakerPhD

@bendichter yes, I have already added retries and included a test for them (see below). Caching to disk is trickier... I will take a crack at it and you can tell me what you think.

https://github.com/magland/remfile/blob/28a907d6e4cd1785ea12491ab12fe23724fae6ae/remfile/RemFile.py#L196-L224

https://github.com/magland/remfile/blob/28a907d6e4cd1785ea12491ab12fe23724fae6ae/tests/test_main.py#L53-L62

bendichter (Contributor):

You can try joblib.Memory for caching

CodyCBakerPhD (Contributor):

It looks like a downside to joblib.Memory is that it behaves more like a classic LRU cache: it records a mapping keyed on the exact input arguments rather than smartly figuring out which remaining bytes are still needed.

That is, if I request a slice [0:10] from a dataset and then follow with a request for [0:20], I don't think it would reuse the data already downloaded for the first slice, since the second request equates to a different combination of range arguments. I'm not 100% sure how this interacts with a paginated file, or how the chunking equates to byte ranges under the hood.
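The argument-keyed behavior described above can be demonstrated with the stdlib `lru_cache`, which keys on exact arguments the same way (`fetch_slice` is a hypothetical stand-in for a remote read, not joblib or remfile code):

```python
from functools import lru_cache

FETCHES = {"count": 0}  # track how many "downloads" actually happen


@lru_cache(maxsize=None)
def fetch_slice(start, stop):
    """Pretend to download bytes [start:stop); cached only by the exact (start, stop) pair."""
    FETCHES["count"] += 1
    return bytes(range(start, stop))


fetch_slice(0, 10)  # first request: a real fetch
fetch_slice(0, 20)  # fully overlaps the first 10 bytes, but different args -> a second full fetch
```

A byte-range-aware cache (like fsspec's) would instead notice the overlap and download only bytes 10 through 19 for the second request.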

I'd reference the fsspec implementation of the memory map cache: https://github.com/fsspec/filesystem_spec/blob/master/fsspec/caching.py#L95

But yes, it does seem more than a bit tricky to get working comparably well. I'd be fine with simply recommending remfile for fast, non-cached direct access to contents and recommending fsspec for caching use cases, if this isn't something you want to spend too much time perfecting.

magland (Collaborator, Author) commented Aug 8, 2023

Yeah, I think this is not straightforward to solve. For now let's say remfile does not have disk caching capability.

magland (Collaborator, Author) commented Aug 8, 2023

@CodyCBakerPhD Looking at this some more, the difficult part of an LRU disk cache is the LRU part. So I decided to implement a non-LRU disk cache, which turned out to be pretty straightforward. See:

https://github.com/magland/remfile#disk-caching

@bendichter
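A non-LRU disk cache of the kind described is easy to sketch: chunks are written to files keyed by a hash of (url, chunk index), survive process restarts, and are never evicted. This is an illustration of the idea under those assumptions, not remfile's actual implementation:

```python
import hashlib
import os


class DiskChunkCache:
    """Persist downloaded chunks as files on disk; entries outlive the process and are never evicted."""

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, url, chunk_index):
        # hash the (url, chunk index) pair into a stable filename
        key = hashlib.sha1(f"{url}:{chunk_index}".encode()).hexdigest()
        return os.path.join(self.cache_dir, key)

    def get(self, url, chunk_index):
        """Return the cached chunk bytes, or None on a cache miss."""
        path = self._path(url, chunk_index)
        if os.path.exists(path):
            with open(path, "rb") as f:
                return f.read()
        return None

    def put(self, url, chunk_index, data):
        """Store a downloaded chunk; it stays on disk until manually deleted."""
        with open(self._path(url, chunk_index), "wb") as f:
            f.write(data)
```

Skipping eviction sidesteps the hard part of an LRU design (tracking recency across processes) at the cost of unbounded disk usage.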

CodyCBakerPhD (Contributor):

@magland I think this discussion can be closed for now; we're working on higher-level benchmarking and enhanced documentation/instructions for streaming recommendations, and we'll open a new issue if/when we reach new recommendations for code snippets to follow.

@magland magland closed this as completed Feb 8, 2024