
Memory spike when loading .mib data lazily #266

Closed
emichr opened this issue May 28, 2024 · 19 comments
Labels: type: bug (Something isn't working)

Comments

emichr commented May 28, 2024

Describe the bug

When loading large .mib data with hs.load("data.mib", lazy=True), the RSS memory spikes to more than the size of the dataset. For instance, when loading a 31.3 GB dataset without specifying any chunking, the RSS memory spikes at about 95 GB. With different chunking the spike changes, but it is still a problem (e.g. RSS spikes at 73 GB with (64, 64, 64, 64) chunking of a (800, 320|256, 256) uint16 dataset). This figure shows the RSS memory usage as a function of time when running hs.load(..., lazy=True, chunks=(64, 64, 64, 64)):
[Figure: RSS memory usage vs. time while loading with chunks=(64, 64, 64, 64)]

To Reproduce

Steps to reproduce the behavior:

signal = hs.load("data.mib", lazy=True)
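A minimal sketch of how the RSS usage can be measured around the load call, assuming psutil is available (the file name and chunking are simply the values quoted above):

import psutil
import hyperspy.api as hs

process = psutil.Process()  # current process
print(f"RSS before load: {process.memory_info().rss / 1e9:.1f} GB")
signal = hs.load("data.mib", lazy=True, chunks=(64, 64, 64, 64))
print(f"RSS after load:  {process.memory_info().rss / 1e9:.1f} GB")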

Expected behavior

The RSS memory requirement shouldn't exceed the dataset size on disk, and should be much lower when loaded lazily.

Python environment:

  • RosettaSciIO version: 0.4
  • Python version: 3.11.9
  • HyperSpy version: 2.1.0
  • Pyxem version: 0.18.0

Additional context

The problem was encountered on an HPC cluster, but I assume the problem will persist on other machines.

emichr added the type: bug label on May 28, 2024
@CSSFrancis (Member)

@emichr can you try the same thing with chunks of (16,16,256,256)?
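For clarity, that would be passing the chunking through the same chunks keyword as in the report above, i.e. something like:

signal = hs.load("data.mib", lazy=True, chunks=(16, 16, 256, 256))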

emichr (Author) commented May 28, 2024

@CSSFrancis The peak got a little lower, but it is still significant (the loading time is also significant):
[Figure: RSS memory usage vs. time with chunks=(16, 16, 256, 256)]

I just tested on a laptop and there are no issues, so it might be an issue with the cluster (or the environment, of course). The environment .yml file from the HPC looks like this:
pyxem_hpc.txt

It might be an issue better discussed with the particular HPC infrastructure, but I guess other users might run into this issue down the line.

ericpre (Member) commented May 28, 2024

Could it be that memmap isn't playing well with the distributed scheduler and the mib reader would need the same as #162?

@emichr, just to check: is this using the distributed scheduler?

@CSSFrancis (Member)

@emichr One thing you might try is to first load the data not using a distributed scheduler, and then save it as a zarr ZipStore file. That can be read using a distributed scheduler but not written using one. I tend to store my data as a zip store when I can (that's how hdf5 stores its data); even if it is a little bit slower, it might be more convenient in this case.

import zarr
import hyperspy.api as hs

# load lazily with the default (non-distributed) scheduler, then write to a zip-backed zarr store
s = hs.load("../../Downloads/FeAl_stripes.hspy", lazy=True)
s_zip2 = zarr.ZipStore("FeAl_stripes.zspy")
s.save(s_zip2)

Or you can always just zip the data after as per the note in the documentation: https://zarr.readthedocs.io/en/stable/api/storage.html#zarr.storage.ZipStore
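As a rough sketch (assuming the FeAl_stripes.zspy written above), reading the zip-backed store back lazily should then simply be:

import hyperspy.api as hs

# zarr handles the file access here, so this should also work with the distributed scheduler
s = hs.load("FeAl_stripes.zspy", lazy=True)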

@magnunor (Contributor)

Testing this on my own computer, I see exactly the same memory issue as @emichr.

Doing this uses a lot of memory:

from rsciio.quantumdetector._api import file_reader
data = file_reader("005_test.mib", lazy=True) # data is a dask array

Doing this is fast and uses almost no memory:

from rsciio.quantumdetector._api import load_mib_data
data = load_mib_data("005_test.mib", lazy=True) # data is a numpy memmap

Trying to make it into a dask array is slow and uses a lot of memory:

import dask.array as da

dask_array = da.from_array(data)

I tested this on dask version 2024.5.0.

@CSSFrancis (Member)

Huh... That shouldn't be the case unless somehow the da.from_array function is converting the memmap array into a numpy array.

I wonder if this is an upstream bug in dask, @magnunor do you know if this worked previously?

@magnunor (Contributor)

Seems to be due to a change in dask:

  • mamba install dask=2024.1.0 does NOT have this issue
  • mamba install dask=2024.1.1 does NOT have this issue
  • mamba install dask=2024.2.0 does have this issue
  • mamba install dask=2024.3.0 does have this issue
  • mamba install dask=2024.4.0 does have this issue

Ergo, it seems like this was introduced in dask version 2024.2.0.

emichr (Author) commented May 29, 2024

Could it be that memmap isn't playing well with the distributed scheduler and the mib reader would need the same as #162?

@emichr, just to check: is this using the distributed scheduler?

@ericpre I assume that I am, as I haven't changed any of the defaults in hyperspy or pyxem and just use those packages "out of the box", so to speak. I'm not too sure how I could check this more thoroughly though; I'm not too much into dask, to be frank.

@magnunor (Contributor)

I'm making a minimal working example to post on the dask GitHub, as they might know a bit more about how to resolve this.

@magnunor (Contributor)

I made an issue about this on the dask GitHub: dask/dask#11152

@magnunor (Contributor)

I tested this on a Windows and a Linux computer: same issue on both.

@magnunor (Contributor)

@sivborg found the dask commit which introduced the issue: dask/dask#11152 (comment)

So for now, a temporary fix is to downgrade dask to 2024.1.1. For Anaconda:

conda install dask=2024.1.1 -c conda-forge
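For pip-based environments, the equivalent pin would presumably be:

pip install "dask==2024.1.1"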

ericpre (Member) commented May 29, 2024

Could it be that memmap isn't playing well with the distributed scheduler and the mib reader would need the same as #162?
@emichr, just to check: is this using the distributed scheduler?

@ericpre I assume that I am, as I haven't changed any of the defaults in hyperspy or pyxem and just use those packages "out of the box", so to speak. I'm not too sure how I could check this more thoroughly though; I'm not too much into dask, to be frank.

Even if this seems to be irrelevant here, for completeness, I will elaborate on my previous comment: as you mentioned that this was happening on a cluster, I assumed that you were using the distributed scheduler - otherwise, it will not scale! This is briefly mentioned in the user guide, but there are more details in the dask documentation.
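As a sketch, one way to check which scheduler is configured (dask arrays fall back to the threaded scheduler unless something else has been set up) is:

import dask

# returns e.g. "threads", "processes" or "dask.distributed" if one has been configured
print(dask.config.get("scheduler", "not set (threaded by default for dask arrays)"))

# a distributed scheduler is normally set up explicitly, e.g.:
# from dask.distributed import Client
# client = Client()  # registers itself as the default scheduler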

@CSSFrancis (Member)

Just adding on to that: if you wanted to load .mib files using memmap and the distributed backend, you would have to adjust how the data is loaded. This part of the dask documentation describes how to do that: https://docs.dask.org/en/latest/array-creation.html?highlight=memmap#memory-mapping
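For illustration, the pattern on that page boils down to creating the memmap inside a delayed function, so that each worker maps the file itself instead of having a memmap pickled and shipped from the client. A rough sketch only, ignoring the .mib frame headers and assuming a plain raw uint16 layout split along the first navigation axis (file name, shapes and block count are placeholders):

import numpy as np
import dask
import dask.array as da

def mmap_block(filename, shape, dtype, offset):
    # the memmap is created on the worker, not on the client
    return np.memmap(filename, mode="r", shape=shape, dtype=dtype, offset=offset)

block_shape = (100, 320, 256, 256)  # hypothetical chunk along the first axis
block_bytes = int(np.prod(block_shape)) * np.dtype("uint16").itemsize
blocks = [
    da.from_delayed(
        dask.delayed(mmap_block)("data.raw", block_shape, "uint16", i * block_bytes),
        shape=block_shape,
        dtype="uint16",
    )
    for i in range(8)  # 8 blocks -> (800, 320, 256, 256)
]
dask_array = da.concatenate(blocks, axis=0)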

ericpre (Member) commented May 29, 2024

Yesterday, I had a quick go at implementing #162 for the mib reader - it was in the back of my mind that this is something that needs to be done. Anyway, I got stuck with the structured dtype used in the mib reader, because it messed up the chunks and shape...
@CSSFrancis, if I make a PR of my WIP on this, are you happy to continue it? Considering that you have a better understanding of this than me, you may be able to finish it more easily! 😄

@CSSFrancis (Member)

@ericpre Yeah, I can do that (maybe not for a couple of weeks). I've done something similar in #11 with the segmented detector, but that file format isn't really used anymore so I kind of moved on from it.

magnunor (Contributor) commented Jun 4, 2024

This seems to have been fixed in the current development version of dask: dask/dask#11161

Not sure when the next dask release will be.

ericpre (Member) commented Jun 5, 2024

Dask usually releases on a Friday every 2 weeks. The next one should be on the 14th of June.

ericpre (Member) commented Jun 5, 2024

I checked quickly and this is fixed using the main branch of dask.
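For anyone who wants to verify before the next release, installing dask from its main branch with pip is roughly:

pip install git+https://github.com/dask/dask.git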

ericpre closed this as completed on Jun 5, 2024