
Slow creation of ND2File object from large ND2 files #50

Closed
KaSaBe opened this issue May 17, 2022 · 9 comments · Fixed by #51
Comments

@KaSaBe

KaSaBe commented May 17, 2022

  • nd2 version: 0.2.2
  • Python version: 3.8.13
  • Operating System: Windows 10

Description

Thank you for creating this useful package. However, in my use case, creating an ND2File object from a ~30 GB ND2 file is quite slow, taking about 30 s, whereas creating an ND2Reader object with nd2reader takes ~1 s. Reading subsets of the data once the object exists takes a similar amount of time with either library. I would much prefer ND2File for its slicing interface and dask compatibility, but the initial overhead of creating a series of ND2File/dask objects is (surprisingly?) large. nd2.imread(filepath, dask=True) has similar overhead, while calling to_dask() on an already created ND2File takes less than a second. File details are below. Both CPU and drive usage are very moderate while the object is being created, so there does not seem to be a hardware bottleneck.

What I Did

import nd2
arr = nd2.ND2File(filepath)
arr
--> <ND2File at ...: 'Time00010_Channel555 nm,635 nm_Seq0010.nd2' uint16: {'P': 30, 'Z': 61, 'C': 2, 'Y': 2304, 'X': 2304}>
@shenker
Contributor

shenker commented May 17, 2022

I have also run into this issue.

@tlambert03 I think the issue is that there needs to be a way to specify fixup=False (and search_window) in top-level init methods like ND2File() and have it passed down to _chunkmap.read_chunkmap. Because it is unexpected for opening a large file to hang (it partly defeats the purpose of lazy loading with dask), I'd also suggest changing fixup to default to False (currently it defaults to True in both _chunkmap.read_chunkmap and _chunkmap.read_new_chunkmap).

@tlambert03
Owner

Yep, thanks both; I ran into this recently as well. We can definitely speed this up by not double-checking the chunkmap.

@tlambert03
Owner

(Actually, @jni... this is the reason for the big delay we observed last week: it was greedily performing the chunkmap validation that I originally added to "rescue" corrupt data.)

@shenker
Contributor

shenker commented May 17, 2022

Awesome, thanks! (Also, happy to submit a quick PR if you're busy.)

(@tlambert03: for context, this week I'm hoping to finally migrate the paulsson lab codebase over to this reader from my hacked-together pickle/memmap-enabled nd2reader fork that I've been using for the last many years...)

@tlambert03
Owner

I'll take any chance I can get to hook you into the "contributors" column here 😂

I agree, we should make fixup default to False, but also create a path for ND2File.__init__ to pass fixup (or we could name it something else, like validate?) down through ._util.get_reader, into _sdk/latest.ND2Reader.__init__, and finally back to _chunkmap.read_new_chunkmap.
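The plumbing described above could look roughly like the following self-contained sketch. This is NOT the real nd2 source; the functions and classes are stand-ins for the pieces named in the comment (_util.get_reader, ND2Reader, _chunkmap.read_new_chunkmap), just to show the flag flowing down:

```python
# Hypothetical sketch (not the actual nd2 internals): thread a
# `validate_frames` flag from the top-level ND2File down to the
# chunkmap reader, defaulting to the fast, unvalidated path.

def read_new_chunkmap(path, validate_frames=False):
    # Stand-in for _chunkmap.read_new_chunkmap: parse the chunkmap,
    # optionally double-checking every frame offset.
    return {"path": path, "validated": validate_frames}

class ND2Reader:
    # Stand-in for _sdk/latest.ND2Reader
    def __init__(self, path, validate_frames=False):
        self.chunkmap = read_new_chunkmap(path, validate_frames=validate_frames)

def get_reader(path, validate_frames=False):
    # Stand-in for _util.get_reader
    return ND2Reader(path, validate_frames=validate_frames)

class ND2File:
    def __init__(self, path, validate_frames=False):
        # The flag flows: ND2File -> get_reader -> ND2Reader -> read_new_chunkmap
        self._rdr = get_reader(path, validate_frames=validate_frames)

fast = ND2File("example.nd2")                        # default: no validation
safe = ND2File("example.nd2", validate_frames=True)  # opt in to the slow check
print(fast._rdr.chunkmap["validated"], safe._rdr.chunkmap["validated"])  # False True
```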

@tlambert03
Owner

@KaSaBe, @shenker's fix is on pypi now (nd2 v0.2.3). Can you try when you get a chance? If you need the conda-forge version, it will be coming a little later

@KaSaBe
Author

KaSaBe commented May 19, 2022

Thank you for the rapid responses and fixes!

In my hands it does and does not work in v0.2.3.

With validate_frames=False, loading is indeed very fast, but in my case this unfortunately results in every single frame being "corrupted", i.e. frames being offset/overlapped incorrectly. So I now appreciate why you implemented the validation in the first place. The frames are read correctly with validate_frames=True. I also discovered that, depending on the source (different network drives or local storage), the time it takes to validate the frames varies by an order of magnitude (even though very little bandwidth is actually used; perhaps a latency issue), so a workaround for me is to keep the files on a fast connection, in which case the overhead is not too severe.

P.S.: I realize now that the change in default behavior also means that nd2.imread(file, dask=True) returns only corrupted frames (with my data) and offers no option to validate them, which could confuse users who run into this.

@tlambert03
Owner

Ohhh, you're also reading over a network. That helps explain why I haven't seen this. I have some local 30 GB files (some of them partially corrupted) and I have never seen long delays.

Yeah, it's quite unfortunate that nd2 uses these massive single files, it makes it extremely easy for a tiny byte offset error to propagate crap frames throughout the dataset.

Would you be willing to share that big dataset via Dropbox or something, so I can have a look?
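As an aside, the failure mode described above (a single wrong byte offset corrupting every subsequent frame) can be shown with a toy example. This uses plain struct-packed integers as stand-in "frames", not the actual ND2 layout:

```python
# Toy illustration (not the ND2 format): when fixed-size frames are read
# back-to-back from one big file, a single wrong starting offset shifts
# every subsequent frame boundary, corrupting the whole dataset.
import struct

FRAME_SIZE = 4  # pretend each "frame" is a 4-byte big-endian integer
data = b"".join(struct.pack(">I", i) for i in range(1, 6))  # frames 1..5

def read_frames(buf, start):
    # Read consecutive FRAME_SIZE-byte frames beginning at byte `start`.
    return [
        struct.unpack(">I", buf[o : o + FRAME_SIZE])[0]
        for o in range(start, len(buf) - FRAME_SIZE + 1, FRAME_SIZE)
    ]

print(read_frames(data, 0))  # correct offset -> [1, 2, 3, 4, 5]
print(read_frames(data, 1))  # off by one byte -> [256, 512, 768, 1024]
```

This is why validating the chunkmap against the actual frame offsets can rescue a partially corrupted file, at the cost of the extra reads discussed in this thread.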

@tlambert03
Owner

Thanks for the files @KaSaBe! In the end, your files were totally fine (they didn't need the validate_frames fix at all)... This was a latent bug, which is fixed in #54.
