
Slow creation of ND2File object from large ND2 files #50

Closed
KaSaBe opened this issue May 17, 2022 · 9 comments · Fixed by #51
Comments

@KaSaBe

KaSaBe commented May 17, 2022

  • nd2 version: 0.2.2
  • Python version: 3.8.13
  • Operating System: Windows 10

Description

Thank you for creating this useful package. However, in my use case, creating an ND2File object from a ~30 GB ND2 file is quite slow, taking about 30 s, whereas creating an ND2Reader object with nd2reader takes ~1 s. Reading subsets of the data once the object exists takes a similar amount of time with either library. I would much prefer ND2File for its slicing interface and dask compatibility, but the initial overhead of creating a series of ND2File/dask objects is (surprisingly?) large. nd2.imread(filepath, dask=True) has similar overhead, while calling to_dask() on an already created ND2File takes less than a second. File details are below. Both CPU and drive usage are very moderate while the object is being created, so there does not seem to be a hardware bottleneck.

What I Did

import nd2
arr = nd2.ND2File(filepath)
arr
--> <ND2File at ...: 'Time00010_Channel555 nm,635 nm_Seq0010.nd2' uint16: {'P': 30, 'Z': 61, 'C': 2, 'Y': 2304, 'X': 2304}>
@shenker
Contributor

shenker commented May 17, 2022

I have also run into this issue.

@tlambert03 I think the issue is that there needs to be a way to specify fixup=False (and search_window) in top-level init methods like ND2File() and have it passed down to _chunkmap.read_chunkmap. Because it is unexpected for opening a large file to hang (it partly defeats the purpose of lazy loading with dask), I'd also suggest changing fixup to default to False (currently it defaults to True in both _chunkmap.read_chunkmap and _chunkmap.read_new_chunkmap).

@tlambert03
Owner

Yep, thanks both; I ran into this recently as well. We can definitely speed this up by not double-checking the chunkmap.

@tlambert03
Owner

(Actually, @jni... this is the reason for the big delay we observed last week: it was greedily performing the chunkmap validation that I originally added to "rescue" corrupt data.)

@shenker
Contributor

shenker commented May 17, 2022

Awesome, thanks! (Also, happy to submit a quick PR if you're busy.)

(@tlambert03: for context, this week I'm hoping to finally migrate the paulsson lab codebase over to this reader from my hacked-together pickle/memmap-enabled nd2reader fork that I've been using for the last many years...)

@tlambert03
Owner

I'll take any chance I can get to hook you into the "contributors" column here 😂

I agree, we should make fixup default to False, but also create a path for ND2File.__init__ to pass fixup (or we could name it something else, like validate?) down through ._util.get_reader, into _sdk/latest.ND2Reader.__init__, and finally back to _chunkmap.read_new_chunkmap.
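The plumbing described above could look roughly like the following self-contained sketch. This is NOT the real nd2 source; the functions and classes are stand-ins for the pieces named in the comment (_util.get_reader, ND2Reader, _chunkmap.read_new_chunkmap), just to show the flag flowing down:

```python
# Hypothetical sketch (not the actual nd2 internals): thread a
# `validate_frames` flag from the top-level ND2File down to the
# chunkmap reader, defaulting to the fast, unvalidated path.

def read_new_chunkmap(path, validate_frames=False):
    # Stand-in for _chunkmap.read_new_chunkmap: parse the chunkmap,
    # optionally double-checking every frame offset.
    return {"path": path, "validated": validate_frames}

class ND2Reader:
    # Stand-in for _sdk/latest.ND2Reader
    def __init__(self, path, validate_frames=False):
        self.chunkmap = read_new_chunkmap(path, validate_frames=validate_frames)

def get_reader(path, validate_frames=False):
    # Stand-in for _util.get_reader
    return ND2Reader(path, validate_frames=validate_frames)

class ND2File:
    def __init__(self, path, validate_frames=False):
        # The flag flows: ND2File -> get_reader -> ND2Reader -> read_new_chunkmap
        self._rdr = get_reader(path, validate_frames=validate_frames)

fast = ND2File("example.nd2")                        # default: no validation
safe = ND2File("example.nd2", validate_frames=True)  # opt in to the slow check
print(fast._rdr.chunkmap["validated"], safe._rdr.chunkmap["validated"])  # False True
```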

@tlambert03
Owner

@KaSaBe, @shenker's fix is on pypi now (nd2 v0.2.3). Can you try when you get a chance? If you need the conda-forge version, it will be coming a little later

@KaSaBe
Author

KaSaBe commented May 19, 2022

Thank you for the rapid responses and fixes!

In my hands it does and does not work in v0.2.3.

With validate_frames=False, loading is indeed very fast, but in my case this unfortunately results in every single frame being "corrupted", i.e. frames being offset/overlapped incorrectly. So I now appreciate why you implemented the validation in the first place. The frames are read correctly with validate_frames=True. I also discovered that, depending on the source (different network drives or local storage), the time it takes to validate the frames varies by an order of magnitude (even though very little bandwidth is actually used; perhaps a latency issue), so a workaround for me is to keep the files on a fast connection, in which case the overhead is not too severe.

P.S.: I realize now that the change in default behavior also means that nd2.imread(file, dask=True) returns only corrupted frames (with my data) and offers no option to validate them, which could confuse users who run into this.

@tlambert03
Owner

Ohhh, you're also reading over a network. That helps explain why I haven't seen this. I have some local 30 GB files (some of them partially corrupted) and I have never seen long delays.

Yeah, it's quite unfortunate that nd2 uses these massive single files, it makes it extremely easy for a tiny byte offset error to propagate crap frames throughout the dataset.

Would you be willing to share that big dataset via Dropbox or something, so I can have a look?
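As an aside, the failure mode described above (a single wrong byte offset corrupting every subsequent frame) can be shown with a toy example. This uses plain struct-packed integers as stand-in "frames", not the actual ND2 layout:

```python
# Toy illustration (not the ND2 format): when fixed-size frames are read
# back-to-back from one big file, a single wrong starting offset shifts
# every subsequent frame boundary, corrupting the whole dataset.
import struct

FRAME_SIZE = 4  # pretend each "frame" is a 4-byte big-endian integer
data = b"".join(struct.pack(">I", i) for i in range(1, 6))  # frames 1..5

def read_frames(buf, start):
    # Read consecutive FRAME_SIZE-byte frames beginning at byte `start`.
    return [
        struct.unpack(">I", buf[o : o + FRAME_SIZE])[0]
        for o in range(start, len(buf) - FRAME_SIZE + 1, FRAME_SIZE)
    ]

print(read_frames(data, 0))  # correct offset -> [1, 2, 3, 4, 5]
print(read_frames(data, 1))  # off by one byte -> [256, 512, 768, 1024]
```

This is why validating the chunkmap against the actual frame offsets can rescue a partially corrupted file, at the cost of the extra reads discussed in this thread.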

@tlambert03
Owner

Thanks for the files @KaSaBe! In the end, your files were totally fine (they didn't need the validate_frames fix at all)... This was a latent bug, which is fixed in #54.
