
subset command uses a lot of RAM when downloading large subsets #111

Open
spiani opened this issue Aug 15, 2024 · 15 comments

@spiani

spiani commented Aug 15, 2024

I keep getting an "Out of RAM" error when I run the Copernicus Marine Toolbox with the subset command. I was unable to find any minimum hardware requirements for the toolbox, but after a few tests I have the impression that it uses an amount of RAM roughly equal to the size of the requested subset.

For example, the following script runs out of RAM on a machine with 32 GB of RAM:

import copernicusmarine

MAX_DEPTH = 200

LATITUDE_RANGE = (38.894, 46.0)
LONGITUDE_RANGE = (11.346, 21.206)

my_dataset = copernicusmarine.subset(
    dataset_id = 'med-cmcc-tem-rean-d',
    minimum_longitude = LONGITUDE_RANGE[0],
    maximum_longitude = LONGITUDE_RANGE[1],
    minimum_latitude = LATITUDE_RANGE[0],
    maximum_latitude = LATITUDE_RANGE[1],
    minimum_depth = 0,
    maximum_depth = MAX_DEPTH,
    variables = ['thetao'],
    output_directory=SCRATCH_DIR,
    force_download=True
)

but it works if I reduce the domain by a factor of 2. If I use the "zarr" format instead of netcdf, I don't have any problem.

Can you confirm this? Is it possible to reduce the amount of required RAM?

I am using Copernicus Marine Toolbox version 1.3.2 on a CentOS 8 machine (kernel 4.18.0).

Thank you!

@veenstrajelmer

You might want to include start_datetime and end_datetime in your query to avoid retrieving the entire time range at once.
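
For example, something along these lines (just a sketch of the same request with time bounds added; the dates and output directory are only placeholders):

import copernicusmarine

# Same request as above, but limited to a single year
# (adjust start/end to whatever period you need).
copernicusmarine.subset(
    dataset_id="med-cmcc-tem-rean-d",
    variables=["thetao"],
    minimum_longitude=11.346,
    maximum_longitude=21.206,
    minimum_latitude=38.894,
    maximum_latitude=46.0,
    minimum_depth=0,
    maximum_depth=200,
    start_datetime="2000-01-01",
    end_datetime="2000-12-31",
    output_directory="./scratch",
    force_download=True,
)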

@spiani
Author

spiani commented Aug 15, 2024

Yes, I know that I could, but I wonder: should I? I tried to download the entire dataset as one single file because I need to compute some statistical indicators over the whole time series, and it is easier to write the code for a single file than to jump among several files.

Is the "subset" command intended only for smaller requests, i.e. situations where "open_dataset" could be used instead? In my case, is it better to split the download into several files?

@veenstrajelmer

I am just another user and I do not know the official take on this. But smaller requests are often also faster. What we do is retrieve the data per period (days, months, or whatever is convenient), then read the files with xarray.open_mfdataset() and either write them to a single file or use the xarray dataset directly. This can probably also be done with nco.
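
Roughly something like this (an untested sketch; the yearly split and paths are just examples):

import copernicusmarine
import xarray as xr

# Download one file per year to keep each request small.
for year in range(1987, 2023):
    copernicusmarine.subset(
        dataset_id="med-cmcc-tem-rean-d",
        variables=["thetao"],
        minimum_longitude=11.346,
        maximum_longitude=21.206,
        minimum_latitude=38.894,
        maximum_latitude=46.0,
        minimum_depth=0,
        maximum_depth=200,
        start_datetime=f"{year}-01-01",
        end_datetime=f"{year}-12-31",
        output_directory="./yearly",
        force_download=True,
    )

# Combine the yearly files lazily; either write a single file
# or keep working with the lazy dataset directly.
ds = xr.open_mfdataset("./yearly/*.nc", combine="by_coords")
ds.to_netcdf("merged.nc")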

@spiani
Author

spiani commented Aug 19, 2024

Thank you for your reply! I think I will use the Zarr format or follow your advice and use NCO to concatenate several files. I usually avoid open_mfdataset because I have noticed that Dask's performance degrades noticeably when the number of timesteps per file is not consistent (for example, months with 30 or 31 days). In that case, it would be better to download NetCDF files covering a fixed number of days (for example, 1,000), but this approach starts to become somewhat cumbersome.

@veenstrajelmer

Interesting, was this reported to dask as well? I can imagine this happens because the chunks are inconsistent, but instead of downloading 1,000 separate days it might also work to download per month and then read the files so that time is chunked consistently (e.g. per day). However, this might be more cumbersome in your case than using nco, and it might not be best for performance either; that would depend a bit on the other dimensions of your dataset.
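
For instance (only a sketch; the chunk size would need tuning to your data):

import xarray as xr

# Read the monthly files with a uniform chunking along time
# (one timestep per chunk), independent of how many days each month has.
ds = xr.open_mfdataset("./monthly/*.nc", combine="by_coords", chunks={"time": 1})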

@renaudjester
Collaborator

renaudjester commented Aug 19, 2024

Hi @spiani thanks for reporting this issue! And @veenstrajelmer thanks for replying!

I haven't seen this bug before; this is definitely an issue. The idea is that the toolbox should allow you to download any amount of data with subset. The difference between the zarr and netcdf formats is very interesting:

If I use the "zarr" format instead of the netcdf, I don't have any problem.

It could be an issue with the xarray library, and a limitation we need to take into account in the toolbox.
By any chance, have you tried applying xarray.Dataset.to_netcdf yourself directly after downloading the data in zarr format?

@renaudjester
Collaborator

@spiani I tried to reproduce the bug, unfortunately without success.

I tried the same request in a notebook:

[Screenshot: the subset request run in a notebook]

And the memory seems to be stable and doesn't use that much:

[Screenshot: memory usage staying stable during the run]

I have a MacBook Pro with 8 GB of RAM. Could it be that the problem is OS-specific?

@veenstrajelmer

I encountered significant memory differences in xarray.open_dataset() with different backends: engine="netcdf4" sometimes consumes much more memory (but much less time) than engine="h5netcdf", though this might happen only in very specific cases. Could it be that a different engine is used because of your environment? I cannot tell exactly what subset() does and whether this is relevant, but I thought it might be useful to add this suggestion either way.

@spiani
Author

spiani commented Aug 23, 2024

Hello,
Apologies for the delayed response; it's been a very busy week. I've conducted some additional experiments on my workstation (Ubuntu 22.04.4, kernel 6.8.0-40-generic, x86, 128 GB of RAM, Copernicus Marine 1.3.1 installed via Conda).

When I run the same script I mentioned in my initial comment (netcdf format), the result is as follows:

INFO - 2024-08-21T09:10:32Z - Dataset version was not specified, the latest one was selected: "202012"
INFO - 2024-08-21T09:10:32Z - Dataset part was not specified, the first one was selected: "default"
INFO - 2024-08-21T09:10:33Z - Service was not specified, the default one was selected: "arco-time-series"
WARNING - 2024-08-21T09:10:34Z - Some or all of your subset selection [38.894, 46.0] for the latitude dimension  exceed the dataset coordinates [30.1875, 45.97916793823242]
INFO - 2024-08-21T09:10:34Z - Downloading using service arco-time-series...
INFO - 2024-08-21T09:10:40Z - Estimated size of the dataset file is 70062.329 MB.
INFO - 2024-08-21T09:10:40Z - Writing to local storage. Please wait...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 300722/300722 [8:51:50<00:00,  9.42it/s]
INFO - 2024-08-21T18:03:01Z - Successfully downloaded to /data/temp/delete_me/med-cmcc-tem-rean-d_thetao_11.38E-21.17E_38.90N-45.98N_1.02-192.48m_1987-01-01-2022-07-31.nc

The download takes 9 hours. During execution, the process gradually increases its RAM usage; at one point, when it was around 50% complete, it was using approximately 30 GB of RAM.
If I use zarr instead, the download takes 9 minutes!

INFO - 2024-08-22T09:22:03Z - Dataset version was not specified, the latest one was selected: "202012"
INFO - 2024-08-22T09:22:03Z - Dataset part was not specified, the first one was selected: "default"
INFO - 2024-08-22T09:22:05Z - Service was not specified, the default one was selected: "arco-time-series"
WARNING - 2024-08-22T09:22:06Z - Some or all of your subset selection [38.894, 46.0] for the latitude dimension  exceed the dataset coordinates [30.1875, 45.97916793823242]
INFO - 2024-08-22T09:22:06Z - Downloading using service arco-time-series...
INFO - 2024-08-22T09:22:11Z - Estimated size of the dataset file is 70062.329 MB.
INFO - 2024-08-22T09:22:11Z - Writing to local storage. Please wait...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 300722/300722 [08:31<00:00, 587.68it/s]
INFO - 2024-08-22T09:31:09Z - Successfully downloaded to /data/temp/delete_me2/med-cmcc-tem-rean-d_thetao_11.38E-21.17E_38.90N-45.98N_1.02-192.48m_1987-01-01-2022-07-31.zarr

What’s particularly interesting to me is what happens when I limit the process to a maximum of 16 GB of RAM using cgroups:

INFO - 2024-08-22T09:37:24Z - Dataset version was not specified, the latest one was selected: "202012"
INFO - 2024-08-22T09:37:24Z - Dataset part was not specified, the first one was selected: "default"
INFO - 2024-08-22T09:37:25Z - Service was not specified, the default one was selected: "arco-time-series"
WARNING - 2024-08-22T09:37:26Z - Some or all of your subset selection [38.894, 46.0] for the latitude dimension  exceed the dataset coordinates [30.1875, 45.97916793823242]
INFO - 2024-08-22T09:37:26Z - Downloading using service arco-time-series...
INFO - 2024-08-22T09:37:29Z - Estimated size of the dataset file is 70062.329 MB.
INFO - 2024-08-22T09:37:29Z - Writing to local storage. Please wait...
 22%|██████████████████████████████████████████████████████████                                                                                                                                                                                                                 | 65444/300722 [14:16:29<55:39:35,  1.17it/s]
Killed

The surprising part is that this process was terminated after 15 hours. So, roughly speaking, even if it hadn't run out of RAM, the download would have taken more than 3 days to complete. I'm not sure if this is because more RAM allows the kernel to use a larger cache, but it is definitely different behavior.

Now, I will try opening the Zarr file and using xarray.Dataset.to_netcdf to see if the behavior is related to the Xarray library. If it is, we can consider opening a ticket with them.

@renaudjester
Collaborator

Thanks for the investigation!

@spiani
Author

spiani commented Sep 2, 2024

Hello,
I tried to save the content of the zarr file using xarray.Dataset.to_netcdf (and the default engine, netcdf4). In this case, I don't see any problem. If I change the engine and use scipy instead, the script goes out of RAM.
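
The test was essentially this (file names are placeholders):

import xarray as xr

ds = xr.open_dataset("subset.zarr", engine="zarr")

# Default engine (netcdf4): no problem on my machine.
ds.to_netcdf("subset_netcdf4.nc", engine="netcdf4")

# scipy engine: this one goes out of RAM.
ds.to_netcdf("subset_scipy.nc", engine="scipy")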

In any case, I think that copernicusmarine uses the default engine and therefore xarray is not the root of the problem.

Can anybody try to reproduce the problem on a Linux machine? Thank you!

@renaudjester
Collaborator

renaudjester commented Sep 24, 2024

@spiani Could you indicate which versions of the packages and dependencies you are using? (with a pip freeze or something similar, depending on your setup)

Thanks in advance!

@renaudjester
Collaborator

Hi @spiani and @veenstrajelmer,

I was doing some tests related to this issue, using the same Python command as the one in this issue on an Ubuntu computer with 128 GB of RAM, 16 cores, and a 9267.84 Mbps download connection.

First, I could indeed reproduce the time difference between saving to zarr format and to netcdf format using the toolbox:

  • zarr format: 3 min 50 s
  • netcdf format: 29 min 15 s

There was no memory usage increase (around 30 GB used by the Python process the whole time).

I also wanted to test xarray's to_netcdf using the zarr file I just downloaded (20 GB in total), and I get this:

>>> import xarray
>>> dataset = xarray.open_dataset("todelete/med-cmcc-tem-rean-d_thetao_11.38E-21.17E_38.90N-45.98N_1.02-192.48m_1987-01-01-2022-07-31.zarr", engine="zarr")
>>> dataset.to_netcdf("from_zarr.nc")
Killed

The process gets killed due to a memory problem:

[Screenshot: memory usage climbing until the process is killed]

FYI @veenstrajelmer, I tried both engines ("netcdf4" and "h5netcdf") and I obtain the same result: memory usage keeps increasing until it reaches 128 GB and the process crashes.

So @spiani when you say:

I tried to save the content of the zarr file using xarray.Dataset.to_netcdf (and the default engine netcdf4). In this case, I don't see any problem.

I obtain a different result: I couldn't convert the zarr file to netcdf.

So from what I just saw, I have two hypotheses:

  1. It seems to me that it's a problem between xarray and netcdf rather than a specific configuration of the toolbox.
  2. It looks like converting to netcdf is computationally intensive, which would explain the difference in time between zarr and netcdf. Is that an acceptable difference? Not sure...

What the toolbox could do is find a workaround for these problems, but I am not sure how 🤔 For example, doing some multiprocessing (i.e. computing on several cores) might not suit everybody's infrastructure, or it would have to be optional, and I don't really know how it could be put in place (dask.distributed?).
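
If we were to experiment with that, a minimal sketch (assuming a local dask.distributed cluster, not something the toolbox does today) could look like:

import xarray as xr
from dask.distributed import Client

client = Client()  # local cluster; workers can spill to disk when memory fills up

# Keep the zarr store lazy and chunked instead of loading it into memory.
ds = xr.open_dataset("subset.zarr", engine="zarr", chunks={})

# Build the netcdf write lazily, then let the cluster execute it chunk by chunk.
delayed_write = ds.to_netcdf("subset.nc", compute=False)
delayed_write.compute()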

@renaudjester
Collaborator

After reading this stackoverflow issue, I am going to see if we can at least avoid the memory issues by setting some smart chunk sizes.

Though it seems that the time overhead induced by converting the file from zarr to netcdf is rather difficult to reduce.
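
The idea would be something like this (a sketch only; the chunk sizes and encoding are purely illustrative):

import xarray as xr

# Open the zarr store with small dask chunks along time...
ds = xr.open_dataset("subset.zarr", engine="zarr", chunks={"time": 100})

# ...and ask the netCDF4 backend to write with matching on-disk chunks,
# so that only a few chunks need to be in memory at any given time.
encoding = {"thetao": {"chunksizes": (100, 1, 128, 128), "zlib": True}}
ds.to_netcdf("subset.nc", encoding=encoding)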

@uriii3
Collaborator

uriii3 commented Oct 2, 2024

I was reading it, and dask itself also recommends using xarray specifically for this... (their documentation). Whether the issue is specific to this problem or to the 'build', I'm not sure we can do better than xarray, no?
