
Failure to combine multiple JSON reference files via MultiZarrToZarr() #388

Closed
NikosAlexandris opened this issue Nov 3, 2023 · 23 comments

@NikosAlexandris

NikosAlexandris commented Nov 3, 2023

I have the following JSON reference files:

13G Nov  2 10:07 sarah3_sid_reference_1999.json
13G Nov  2 09:58 sarah3_sid_reference_2000.json
13G Nov  2 11:00 sarah3_sid_reference_2001.json
13G Nov  2 11:08 sarah3_sid_reference_2002.json
13G Nov  2 12:04 sarah3_sid_reference_2003.json
13G Nov  2 12:12 sarah3_sid_reference_2004.json
13G Nov  2 13:07 sarah3_sid_reference_2005.json
13G Nov  2 14:29 sarah3_sid_reference_2006.json
13G Nov  2 15:27 sarah3_sid_reference_2007.json
13G Nov  2 16:45 sarah3_sid_reference_2008.json
13G Nov  2 17:43 sarah3_sid_reference_2009.json
13G Nov  2 19:02 sarah3_sid_reference_2010.json
13G Nov  2 19:58 sarah3_sid_reference_2011.json
13G Nov  2 21:25 sarah3_sid_reference_2012.json
13G Nov  2 22:13 sarah3_sid_reference_2013.json
13G Nov  2 23:43 sarah3_sid_reference_2014.json
13G Nov  3 00:36 sarah3_sid_reference_2015.json
13G Nov  3 02:03 sarah3_sid_reference_2016.json
13G Nov  3 02:58 sarah3_sid_reference_2017.json
13G Nov  3 04:24 sarah3_sid_reference_2018.json
13G Nov  3 05:21 sarah3_sid_reference_2019.json
13G Nov  3 06:48 sarah3_sid_reference_2020.json
13G Nov  3 07:41 sarah3_sid_reference_2021.json

I am trying to combine them, essentially via:

        from pathlib import Path

        import fsspec
        import ujson
        from kerchunk.combine import MultiZarrToZarr

        # reference_file_paths : the list of yearly JSON reference files above
        mzz = MultiZarrToZarr(
            reference_file_paths,
            concat_dims=['time'],
            identical_dims=['lat', 'lon'],
        )
        multifile_kerchunk = mzz.translate()

        # combined_reference : the output filename for the combined reference set
        combined_reference_filename = Path(combined_reference)
        local_fs = fsspec.filesystem('file')
        with local_fs.open(combined_reference_filename, 'wb') as f:
            f.write(ujson.dumps(multifile_kerchunk).encode())

(replace the self-explanatory variables with the actual file paths and output filename)
on an HPC system with

❯ free -hm
              total        used        free      shared  buff/cache   available
Mem:          503Gi       4.7Gi       495Gi       2.8Gi       3.1Gi       494Gi
Swap:            0B          0B          0B

and it fails, raising the following error:

│ ╭─────────────────────────────────────────── locals ───────────────────────────────────────────╮ │
│ │         cache_size = 128                                                                     │ │
│ │                dic = {'protocol': None}                                                      │ │
│ │                  f = <fsspec.implementations.local.LocalFileOpener object at 0x148d7e531270> │ │
│ │                 fo = '/project/home/p200206/data/sarah3_sid_reference_1999.json'             │ │
│ │                fo2 = '/project/home/p200206/data/sarah3_sid_reference_1999.json'             │ │
│ │                 fs = None                                                                    │ │
│ │             kwargs = {}                                                                      │ │
│ │          max_block = 256000000                                                               │ │
│ │            max_gap = 64000                                                                   │ │
│ │             ref_fs = <fsspec.implementations.local.LocalFileSystem object at 0x14f59fd1db10> │ │
│ │   ref_storage_args = None                                                                    │ │
│ │     remote_options = {}                                                                      │ │
│ │    remote_protocol = None                                                                    │ │
│ │               self = <fsspec.implementations.reference.ReferenceFileSystem object at         │ │
│ │                      0x148d7e531120>                                                         │ │
│ │   simple_templates = True                                                                    │ │
│ │             target = None                                                                    │ │
│ │     target_options = None                                                                    │ │
│ │    target_protocol = None                                                                    │ │
│ │ template_overrides = None                                                                    │ │
│ ╰──────────────────────────────────────────────────────────────────────────────────────────────╯ │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
JSONDecodeError: Could not reserve memory block

Any hints?

@martindurant
Member

Each reference file is 13GB?

@NikosAlexandris
Author

Each reference file is 13GB?

Yes, indeed. These stem from the following:

  • Daily NetCDF files from 1999 to 2021 (actually, one year is missing here; the full series goes up to 2022) that make up about 1.2 TB
    • Each daily NetCDF file contains 48 half-hourly maps of 2600 x 2600 pixels
    • Noted here: mixed chunking shapes between years (e.g. time x lat x lon: 1 x 2600 x 2600, 1 x 1300 x 1300, and maybe more)
  • The daily NetCDF files are rechunked to 1 x 32 x 32; these now occupy 1.25 TB
  • A first set of JSON reference files (one reference file per rechunked input NetCDF file) is about 377 GB
  • A second set of yearly JSON reference files (24 in total, based on the first reference set) is about 300 GB
  • Finally, the goal is to create a single reference file to cover the complete time series

@martindurant
Member

Obviously it would be problematic to load a whole set of JSONs into a single python process, although I am surprised by the specific error you are seeing - I have not seen it before.

Are you aware of the relatively new, more efficient parquet storage format for references? It should dramatically decrease the on-disc and in-memory size of the reference sets, even during the combine phase.
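
For reference, a minimal sketch of combining directly into the parquet reference format via fsspec's LazyReferenceMapper; the output directory name and record_size are illustrative, and this assumes a recent fsspec and kerchunk:

import os
import fsspec
from fsspec.implementations.reference import LazyReferenceMapper
from kerchunk.combine import MultiZarrToZarr

out_dir = "combined.parq"                      # illustrative output directory
os.makedirs(out_dir, exist_ok=True)
fs = fsspec.filesystem("file")
out = LazyReferenceMapper.create(root=out_dir, fs=fs, record_size=100_000)

mzz = MultiZarrToZarr(
    reference_file_paths,
    concat_dims=["time"],
    identical_dims=["lat", "lon"],
    out=out,                                   # stream references into parquet as they are produced
)
mzz.translate()
out.flush()                                    # write any remaining partitions to disk

This way the combined references never have to exist as one giant JSON document in memory.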

@martindurant
Member

The daily NetCDF files are rechunked 1 x 32 x 32

Was this a choice specifically for later kerchunking, or was there another motivation? Small chunks allow for random access to single values, but they of course mean many many more references and big reference sets, as well as worse data throughput when loading contiguous data.

@NikosAlexandris
Author

P.S. I want to experiment with different initial chunking shapes. The above is my first attempt, with the 1 x 32 x 32 shape suggested in "Chunking Data: Choosing Shapes" by Russ Rew (see my implementation: suggest.py, which will go into https://github.com/NikosAlexandris/rekx).

@NikosAlexandris
Author

NikosAlexandris commented Nov 3, 2023

Obviously it would be problematic to load a whole set of JSONs into a single python process, although I am surprised by the specific error you are seeing - I have not seen it before.

Are you aware of the relatively new, more efficient parquet storage format for references? It should dramatically decrease the on-disc and in-memory size of the reference sets, even during the combine phase.

I am aware in the sense that I know it exists. The question is whether it is suitable for the time series I work with and whether it'll be fine to retrieve complete time series for a location. Question: how do others then crunch large time series which occupy hundreds of TBs on the disk?

@NikosAlexandris
Author

Ideally, I will end up with a chunk size of 8760 or 8784 (or, say, 8800) along time, to cover a year, and some reasonable chunk size for the spatial dimensions lat and lon. I'd favor retrieval of yearly data or of the complete time series for a single pixel location (as this is the most common access pattern for an application I work on). But I thought of testing various shapes.

@martindurant
Member

@rsignell-usgs , do you have the time to go through making a big parquet reference set?

how do others then crunch large time series which occupy hundreds of TBs on the disk

The limiting factor for the size of the reference sets is not the total number of bytes but the total number of references, so the chunking scheme is perhaps more important here.
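
As a rough, back-of-the-envelope illustration using the figures quoted above (1 x 32 x 32 chunks, 48 half-hourly maps per day, 2600 x 2600 pixels, roughly 23 years):

chunks_per_map = -(-2600 // 32) ** 2           # ceil(2600 / 32) ** 2 = 82 ** 2 = 6,724
refs_per_daily_file = 48 * chunks_per_map      # ~322,752 references per daily file
total_refs = refs_per_daily_file * 365 * 23    # ~2.7 billion references overall

At the order of a hundred bytes of JSON per reference, that is consistent with the ~300 GB of yearly JSON reported above, and it is this count of references, not the bytes of data they point to, that has to be handled during combining.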

@NikosAlexandris
Author

Obviously it would be problematic to load a whole set of JSONs into a single python process, although I am surprised by the specific error you are seeing - I have not seen it before.

My immediate reaction would be to combine them in pairs, until I finally reach a single reference file. Will this work? I hope the error will give hints for improvements or recommendations on practices to avoid.

@NikosAlexandris
Author

@rsignell-usgs , do you have the time to go through making a big parquet reference set?

I can certainly try this, as it is part of my exercise, at least with "my" data. As a reminder, I work on an HPC system.

@martindurant
Member

Yes, you can adopt a pair-wise tree to do combining, but the exception sounds like you cannot load any reference set into memory (I note it fails on the first file).
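
A minimal sketch of such a pair-wise (tree) reduction, assuming each intermediate result still fits in memory; the helper names are illustrative, not part of the kerchunk API:

from kerchunk.combine import MultiZarrToZarr

def combine_pair(refs_a, refs_b):
    # Each argument may be a reference file path or an in-memory reference dict.
    mzz = MultiZarrToZarr(
        [refs_a, refs_b],
        concat_dims=["time"],
        identical_dims=["lat", "lon"],
    )
    return mzz.translate()

def tree_combine(reference_sets):
    # Combine neighbours level by level until a single reference set remains.
    while len(reference_sets) > 1:
        combined = [
            combine_pair(a, b)
            for a, b in zip(reference_sets[::2], reference_sets[1::2])
        ]
        if len(reference_sets) % 2:
            combined.append(reference_sets[-1])   # carry an unpaired trailing set over
        reference_sets = combined
    return reference_sets[0]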

@NikosAlexandris
Author

(I note it fails on the first file).

@martindurant Please have a look at NikosAlexandris/rekx#3 (comment) for the complete error message.

@martindurant
Member

It doesn't strictly matter, but I would have expected the files to be presented in order, when there is an obvious order to them. In any case, even on a big HPC, I don't think you can have enough memory to process such a big reference set without the parquet route.

@NikosAlexandris
Author

Is the discussion at #240 still relevant?

@martindurant
Member

#240 is about opening the datasets with zarr/xarray, not relevant here.

@NikosAlexandris
Author

It doesn't strictly matter, but I would have expected the files to be presented in order, when there is an obvious order to them. In any case, even on a big HPC, I don't think you can have enough memory to process such a big reference set without the parquet route.

The order of the file paths? These can be sorted, if it matters at all. They are collected via

from pathlib import Path

source_directory = Path(source_directory)
reference_file_paths = list(source_directory.glob(pattern))
reference_file_paths = list(map(str, reference_file_paths))

where Path is pathlib.Path and pattern is the glob pattern matching the reference files.
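
For what it's worth, sorting would be a one-line change (assuming the year is the only varying part of the filenames, so lexicographic order matches chronological order):

reference_file_paths = sorted(map(str, source_directory.glob(pattern)))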

I might get access to a largemem partition with 4 TB of memory. The problem then would be that a successful process there would not work on systems without such a large memory capacity.

@NikosAlexandris
Author

#240 is about opening the datasets with zarr/xarray, not relevant here.

Well, that would then be the main goal in the end: how fast we can retrieve a time series for a location.

@martindurant
Member

The problem then would be that a successful process there would not work on systems without such a large memory capacity.

Also, any successful output could probably not be used by anyone :)
Critically, the parquet storage is not only more efficient, but also partitioned and lazy, so users only read references they need as they need them.
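
To illustrate the lazy aspect, a sketch of opening such a parquet reference set; the directory name is illustrative, and this assumes a recent fsspec that recognises a directory of parquet references:

import fsspec
import xarray as xr

fs = fsspec.filesystem(
    "reference",
    fo="combined.parq",          # the parquet reference directory written above
    remote_protocol="file",      # protocol of the original NetCDF files
    skip_instance_cache=True,
)
ds = xr.open_dataset(
    fs.get_mapper(""), engine="zarr", backend_kwargs={"consolidated": False}
)

Only the reference partitions needed for a given selection are read, instead of the whole reference set.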

@martindurant
Member

#240 is about opening the datasets with zarr/xarray, not relevant here.

Well, that would then be the main goal in the end

Of course, but you are not at that point yet

@NikosAlexandris
Author

#240 is about opening the datasets with zarr/xarray, not relevant here.

Well, that would then be the main goal in the end

Of course, but you are not at that point yet

To this end, however, and from my limited experience so far, I see that most of the time is spent reading the reference file into memory. After that, retrieving data just happens in a snap. I wonder whether, given:

  • a system with enough RAM
  • a decent chunking shape for large time series like the example in this discussion, one that can serve various access patterns
  • the data loaded in-memory, even if that takes several minutes

reading the data will then happen in a fraction of a second, or at least very fast, and whether it can stay like that for as long as the operating system runs?

@martindurant
Member

Yes, probably! But still, to prevent you from making reference sets that are too big to handle, I really do think it should be done in parquet.

@NikosAlexandris
Author

I am aware in the sense that I know it exists. The question is whether it is suitable for the time series I work with and whether it'll be fine to retrieve complete time series for a location.

(Answering myself now, hopefully in a way useful to others)

Yes, the Parquet storage format for references is suitable for large time series like the SARAH3 products, which come in the form of NetCDF files, and it appears to be more than fine for retrieving the complete time series for a location; see also #345 (comment).
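
As a concrete sketch of that access pattern, once the combined dataset has been opened as ds (as in the snippets above); the variable name SID and the coordinates are illustrative:

series = (
    ds["SID"]
    .sel(lat=52.5, lon=13.4, method="nearest")   # pick the closest pixel
    .compute()                                   # load only that pixel's chunks
)

With a chunking shape that is long along time and small along lat/lon, this touches only a narrow column of chunks rather than whole maps.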

@NikosAlexandris
Author

The problem then would be that a successful process there would not work on systems without such a large memory capacity.

Also, any successful output could probably not be used by anyone :) Critically, the parquet storage is not only more efficient, but also partitioned and lazy, so users only read references they need as they need them.

For my specific use case, it could be useful, provided there is a powerful machine with sufficient memory to read the large reference set and be ready to support asynchronous calls for retrieving data. If this approach proves to be efficient not only in terms of speed and scalability, but also in terms of reduced electricity consumption -- are there any relevant studies in this regard? -- then it would be worth pursuing.

I am closing this issue as an item to revisit if need be.
