subset command uses a lot of RAM when downloading large subsets #111
Comments
You might want to include
Yes, I know that I may, but I wonder: should I? I tried to download the entire dataset in one single file because I need to compute some statistical indicators on the overall time series, and it is easier to write the code for a single file than to jump among several ones. Is the "subset" command intended to be used only for smaller scenarios, such as situations where "open_dataset" would be used? In my case, is it better to split the file?
I am just another user and I do not know the official take on this, but smaller requests are often also faster. What we do is retrieve the data per period (days, months, or anything that is convenient) and then read the data back afterwards (a sketch of this approach is given below).
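A minimal sketch of the per-period approach described above, assuming the copernicusmarine Python API (v1.x) together with xarray for reading the files back; the dataset ID, variable, bounds and dates are placeholders rather than values from this issue, and the exact parameter names should be checked against the toolbox documentation for your version.

```python
import copernicusmarine
import pandas as pd
import xarray as xr

# Download one file per month instead of one huge request.
months = pd.date_range("2020-01-01", "2020-12-01", freq="MS")
for start in months:
    end = start + pd.offsets.MonthEnd(1)
    copernicusmarine.subset(
        dataset_id="cmems_mod_glo_phy_my_0.083deg_P1D-m",  # placeholder dataset
        variables=["thetao"],                               # placeholder variable
        minimum_longitude=-10, maximum_longitude=10,        # placeholder bounds
        minimum_latitude=30, maximum_latitude=50,
        start_datetime=start.strftime("%Y-%m-%dT%H:%M:%S"),
        end_datetime=end.strftime("%Y-%m-%dT%H:%M:%S"),
        output_filename=f"subset_{start:%Y_%m}.nc",
        output_directory="data",
        force_download=True,  # assumed v1.x option to skip the confirmation prompt
    )

# Read all monthly files back lazily as a single dask-backed dataset.
ds = xr.open_mfdataset("data/subset_*.nc", combine="by_coords")
```

Statistics over the whole time series can then be computed lazily on `ds` without ever holding the full subset in memory.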
Thank you for your reply! I think I will use the Zarr format, or follow your advice and use NCO to concatenate several files. I usually avoid using
Interesting, was this reported to dask as well? I can imagine this happens since the chunks are inconsistent. Instead of downloading 1000 separate days, it might also work to download per month and then read the files such that the times are chunked consistently (e.g. per day, as in the sketch below). However, this might also be more cumbersome in your case than using nco, and it might not be best for performance either; this would depend a bit on the other dimensions of your dataset.
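A short sketch of what "chunked consistently (e.g. per day)" could look like when reading the monthly files back; the file pattern is the placeholder from the previous sketch, and one time step per day is assumed.

```python
import xarray as xr

# Request a uniform time chunking (one time step per chunk) instead of
# inheriting whatever uneven chunks the individual files happen to have.
ds = xr.open_mfdataset("data/subset_*.nc", combine="by_coords", chunks={"time": 1})
print(ds.chunks["time"])  # every chunk should now have the same size
```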
Hi @spiani, thanks for reporting this issue! And @veenstrajelmer, thanks for replying! I haven't seen this bug before; this is definitely an issue. The idea is that the toolbox should allow you to download any amount of data with the subset command. The difference between the zarr and netcdf behaviour is very interesting.
It could be an issue with
@spiani I tried to reproduce the bug, unfortunately without success. I tried the same request in a notebook, and the memory usage seems to be stable and not particularly high. I have a MacBook Pro with 8 GB of RAM. Could it be that the problem is OS specific?
I encountered significant memory differences in
Hello, when I run the same script I mentioned in my initial comment (netcdf format), the result is as follows:
The download takes 9 hours. During execution, the process gradually increases its RAM usage. At one point, I checked when it was around 50% complete, and it was using approximately 30 GB of RAM.
What's particularly interesting to me is what happens when I limit the process to a maximum of 16 GB of RAM using cgroups:
The surprising part is that this process was terminated after 15 hours. So, roughly speaking, even if it hadn't run out of RAM, the download would have taken more than 3 days to complete. I'm not sure if this is because more RAM allows the kernel to use a larger cache, but this is definitely different behavior. Now, I will try opening the Zarr file and converting it to netcdf (along the lines of the sketch below).
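A minimal sketch of that conversion step, assuming the subset was saved as a Zarr store and xarray is used for the rewrite; the file names are placeholders.

```python
import xarray as xr

# Open the Zarr store lazily (dask-backed), then write it out as a single
# netcdf file; the actual data transfer happens during to_netcdf().
ds = xr.open_zarr("large_subset.zarr")
ds.to_netcdf("large_subset.nc")
```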
Thanks for the investigation!
Hello, in any case I think that copernicusmarine uses the default engine, and therefore xarray is not the root of the problem. Can anybody try to reproduce the problem on a Linux machine? Thank you!
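For reference, a small way to check which IO backends xarray has available in a given environment, assuming a reasonably recent xarray; when no engine is passed explicitly, xarray picks a suitable one automatically (netcdf4 if installed).

```python
import xarray as xr

# Show the backends xarray can see in this environment.
print(xr.backends.list_engines())
```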
@spiani Could you indicate which versions of the packages and dependencies you are using? Thanks in advance!
Hi @spiani and @veenstrajelmer, I was doing some tests related to this issue. I used the same Python command as the one in this issue on an Ubuntu computer with 128 GB of RAM and 16 cores, and a download speed of 9267.84 Mbps. First, I could indeed reproduce some time difference between saving to zarr and saving to netcdf.
There was no memory usage increase (around 30 GB used by the Python process the whole time). Now I also wanted to test the netcdf output:
The process gets killed due to a memory problem. FYI @veenstrajelmer, I tried both engines ("netcdf4" and "h5netcdf") and I obtained the same result: memory usage keeps increasing until it reaches 128 GB and the process crashes. So @spiani, when you say:
I obtained a different result: I couldn't convert the zarr file to netcdf. So from what I just saw here, I see two hypotheses:
What the toolbox could do is find a workaround for those problems, but I am not sure how to do this 🤔 For example, doing some multiprocessing (i.e. computing on several cores) might not be suited to everybody's infrastructure, or it should be optional, and I don't really know how it could be put in place (dask.distributed? see the sketch below).
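One possible shape of the dask.distributed idea, purely as an illustration: run the conversion under a local cluster with an explicit per-worker memory limit, so workers spill to disk rather than growing without bound. Whether this actually helps here is exactly the open question raised above; worker counts, limits and file names are placeholders.

```python
import xarray as xr
from dask.distributed import Client

# Local cluster with a hard memory budget per worker.
client = Client(n_workers=4, threads_per_worker=1, memory_limit="4GB")

ds = xr.open_zarr("large_subset.zarr")
ds.to_netcdf("large_subset.nc")  # computed through the memory-limited workers

client.close()
```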
After reading this Stack Overflow issue, I am going to see whether we can at least avoid the memory issues by setting some smarter chunk sizes (sketched below). It seems, though, that the time overhead induced by converting the file from zarr to netcdf is rather difficult to reduce.
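A sketch of the "smart chunk size" idea: open the Zarr store with explicit dask chunks and mirror them in the netcdf encoding, so only a bounded number of chunks has to be materialised during the write. The chunk size and file names are placeholders, not tested values.

```python
import xarray as xr

# Explicit dask chunking along time (placeholder size).
ds = xr.open_zarr("large_subset.zarr", chunks={"time": 100})

# Write each variable with on-disk chunk sizes matching its dask chunks.
encoding = {
    name: {"chunksizes": tuple(chunks[0] for chunks in da.chunks)}
    for name, da in ds.data_vars.items()
    if da.chunks is not None
}
ds.to_netcdf("large_subset.nc", engine="netcdf4", encoding=encoding)
```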
I was reading it, and dask itself also recommends using xarray specifically (see their documentation). Whether the issue is specific to this problem or to the 'build', I am not sure we can do better than xarray, no?
I continuously get an "Out of RAM" error when I run the Copernicus Marine Toolbox with the subset command. I was unable to find any minimal hardware requirements for the Copernicus Marine Toolbox, but after a few tests, I have the impression that it uses an amount of RAM roughly equal to the size of the requested subset.
For example, the following script goes "Out of RAM" when used on a machine with 32GB of RAM:
but it works if I reduce the domain by a factor of 2. If I use the "zarr" format instead of netcdf, I don't have any problem (see the hypothetical sketch below).
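The original script is not reproduced here; the following is only a hypothetical illustration of the kind of request described, with placeholder dataset ID, variable and bounds, showing the zarr variant that avoided the RAM problem (parameter names as I understand the v1.x Python API).

```python
import copernicusmarine

copernicusmarine.subset(
    dataset_id="cmems_mod_glo_phy_my_0.083deg_P1D-m",  # placeholder
    variables=["thetao"],                               # placeholder
    minimum_longitude=-10, maximum_longitude=10,        # placeholder bounds
    minimum_latitude=30, maximum_latitude=50,
    start_datetime="1993-01-01T00:00:00",
    end_datetime="2020-12-31T00:00:00",
    output_filename="large_subset.zarr",
    file_format="zarr",  # the zarr output did not show the memory problem
)
```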
Can you confirm this? Is it possible to reduce the amount of required RAM?
I am using Copernicus Marine Toolbox version 1.3.2 on a CentOS 8 machine (kernel 4.18.0).
Thank you!