
Failure to combine multiple JSON reference files via MultiZarrToZarr() #388

Closed
NikosAlexandris opened this issue Nov 3, 2023 · 23 comments

@NikosAlexandris

NikosAlexandris commented Nov 3, 2023

I have the following JSON reference files:

13G Nov  2 10:07 sarah3_sid_reference_1999.json
13G Nov  2 09:58 sarah3_sid_reference_2000.json
13G Nov  2 11:00 sarah3_sid_reference_2001.json
13G Nov  2 11:08 sarah3_sid_reference_2002.json
13G Nov  2 12:04 sarah3_sid_reference_2003.json
13G Nov  2 12:12 sarah3_sid_reference_2004.json
13G Nov  2 13:07 sarah3_sid_reference_2005.json
13G Nov  2 14:29 sarah3_sid_reference_2006.json
13G Nov  2 15:27 sarah3_sid_reference_2007.json
13G Nov  2 16:45 sarah3_sid_reference_2008.json
13G Nov  2 17:43 sarah3_sid_reference_2009.json
13G Nov  2 19:02 sarah3_sid_reference_2010.json
13G Nov  2 19:58 sarah3_sid_reference_2011.json
13G Nov  2 21:25 sarah3_sid_reference_2012.json
13G Nov  2 22:13 sarah3_sid_reference_2013.json
13G Nov  2 23:43 sarah3_sid_reference_2014.json
13G Nov  3 00:36 sarah3_sid_reference_2015.json
13G Nov  3 02:03 sarah3_sid_reference_2016.json
13G Nov  3 02:58 sarah3_sid_reference_2017.json
13G Nov  3 04:24 sarah3_sid_reference_2018.json
13G Nov  3 05:21 sarah3_sid_reference_2019.json
13G Nov  3 06:48 sarah3_sid_reference_2020.json
13G Nov  3 07:41 sarah3_sid_reference_2021.json

I am trying to combine them, essentially via:

        from pathlib import Path

        import fsspec
        import ujson
        from kerchunk.combine import MultiZarrToZarr

        # reference_file_paths : the list of yearly JSON reference files above
        mzz = MultiZarrToZarr(
            reference_file_paths,
            concat_dims=['time'],
            identical_dims=['lat', 'lon'],
        )
        multifile_kerchunk = mzz.translate()

        # combined_reference : the output filename for the combined reference set
        combined_reference_filename = Path(combined_reference)
        local_fs = fsspec.filesystem('file')
        with local_fs.open(combined_reference_filename, 'wb') as f:
            f.write(ujson.dumps(multifile_kerchunk).encode())

(replace the self-explanatory variables with the actual file paths and output filename)
on an HPC system with

❯ free -hm
              total        used        free      shared  buff/cache   available
Mem:          503Gi       4.7Gi       495Gi       2.8Gi       3.1Gi       494Gi
Swap:            0B          0B          0B

and it fails, raising the following error:

│ ╭─────────────────────────────────────────── locals ───────────────────────────────────────────╮ │
│ │         cache_size = 128                                                                     │ │
│ │                dic = {'protocol': None}                                                      │ │
│ │                  f = <fsspec.implementations.local.LocalFileOpener object at 0x148d7e531270> │ │
│ │                 fo = '/project/home/p200206/data/sarah3_sid_reference_1999.json'             │ │
│ │                fo2 = '/project/home/p200206/data/sarah3_sid_reference_1999.json'             │ │
│ │                 fs = None                                                                    │ │
│ │             kwargs = {}                                                                      │ │
│ │          max_block = 256000000                                                               │ │
│ │            max_gap = 64000                                                                   │ │
│ │             ref_fs = <fsspec.implementations.local.LocalFileSystem object at 0x14f59fd1db10> │ │
│ │   ref_storage_args = None                                                                    │ │
│ │     remote_options = {}                                                                      │ │
│ │    remote_protocol = None                                                                    │ │
│ │               self = <fsspec.implementations.reference.ReferenceFileSystem object at         │ │
│ │                      0x148d7e531120>                                                         │ │
│ │   simple_templates = True                                                                    │ │
│ │             target = None                                                                    │ │
│ │     target_options = None                                                                    │ │
│ │    target_protocol = None                                                                    │ │
│ │ template_overrides = None                                                                    │ │
│ ╰──────────────────────────────────────────────────────────────────────────────────────────────╯ │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
JSONDecodeError: Could not reserve memory block

Any hints?

@martindurant
Member

Each reference file is 13GB?

@NikosAlexandris
Author

Each reference file is 13GB?

Yes, indeed. These stem from the following:

  • Daily NetCDF files from 1999 to 2021 (actually, one year is missing here; the full series goes up to 2022) that make up about 1.2 TB
    • Each daily NetCDF file contains 48 half-hourly maps of 2600 x 2600 pixels
    • Noted here: mixed chunking shapes between years (e.g. time x lat x lon: 1 x 2600 x 2600, 1 x 1300 x 1300, and maybe more)
  • The daily NetCDF files are rechunked to 1 x 32 x 32; these now occupy 1.25 TB
  • A first set of JSON reference files (one reference file per rechunked input NetCDF file) is about 377 GB
  • A second set of yearly JSON reference files (24 in total, based on the first reference set) is about 300 GB
  • Finally, the goal is to create a single reference file to cover the complete time series

@martindurant
Member

Obviously it would be problematic to load a whole set of JSONs into a single python process, although I am surprised by the specific error you are seeing - I have not seen it before.

Are you aware of the relatively new, more efficient parquet storage format for references? It should dramatically decrease the on-disc and in-memory size of the reference sets, even during the combine phase.
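
For reference, a minimal sketch of combining directly into the parquet reference format via fsspec's LazyReferenceMapper; the output directory name and record_size are illustrative, and this assumes a recent fsspec and kerchunk:

import os
import fsspec
from fsspec.implementations.reference import LazyReferenceMapper
from kerchunk.combine import MultiZarrToZarr

out_dir = "combined.parq"                      # illustrative output directory
os.makedirs(out_dir, exist_ok=True)
fs = fsspec.filesystem("file")
out = LazyReferenceMapper.create(root=out_dir, fs=fs, record_size=100_000)

mzz = MultiZarrToZarr(
    reference_file_paths,
    concat_dims=["time"],
    identical_dims=["lat", "lon"],
    out=out,                                   # stream references into parquet as they are produced
)
mzz.translate()
out.flush()                                    # write any remaining partitions to disk

This way the combined references never have to exist as one giant JSON document in memory.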

@martindurant
Member

The daily NetCDF files are rechunked 1 x 32 x 32

Was this a choice specifically for later kerchunking, or was there another motivation? Small chunks allow for random access to single values, but they of course mean many many more references and big reference sets, as well as worse data throughput when loading contiguous data.

@NikosAlexandris
Author

P.S. I want to experiment with different initial chunking shapes. The above is my first attempt, with the 1 x 32 x 32 shape suggested in "Chunking Data: Choosing Shapes" by Russ Rew (see my implementation: suggest.py, which will go into https://github.com/NikosAlexandris/rekx).

@NikosAlexandris
Author

NikosAlexandris commented Nov 3, 2023

Obviously it would be problematic to load a whole set of JSONs into a single python process, although I am surprised by the specific error you are seeing - I have not seen it before.

Are you aware of the relatively new, more efficient parquet storage format for references? It should dramatically decrease the on-disc and in-memory size of the reference sets, even during the combine phase.

I am aware in the sense that I know it exists. The question is whether it is suitable for the time series I work with and whether it'll be fine to retrieve complete time series for a location. Question: how do others then crunch large time series which occupy hundreds of TBs on the disk?

@NikosAlexandris
Author

Ideally, I will end up with a chunk size of 8760 or 8784 (or, say, 8800) along time, to cover a year, and some reasonable chunk size for the spatial dimensions lat and lon. I'd favor retrieval of yearly data or of the complete time series for a single pixel location (as this is the most common access pattern for an application I work on). But I thought of testing various shapes.

@martindurant
Member

@rsignell-usgs , do you have the time to go through making a big parquet reference set?

how do others then crunch large time series which occupy hundreds of TBs on the disk

The limiting factor for the size of the reference sets is not the total number of bytes but the total number of references, so the chunking scheme is perhaps more important here.
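
As a rough, back-of-the-envelope illustration using the figures quoted above (1 x 32 x 32 chunks, 48 half-hourly maps per day, 2600 x 2600 pixels, roughly 23 years):

chunks_per_map = -(-2600 // 32) ** 2           # ceil(2600 / 32) ** 2 = 82 ** 2 = 6,724
refs_per_daily_file = 48 * chunks_per_map      # ~322,752 references per daily file
total_refs = refs_per_daily_file * 365 * 23    # ~2.7 billion references overall

At the order of a hundred bytes of JSON per reference, that is consistent with the ~300 GB of yearly JSON reported above, and it is this count of references, not the bytes of data they point to, that has to be handled during combining.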

@NikosAlexandris
Author

Obviously it would be problematic to load a whole set of JSONs into a single python process, although I am surprised by the specific error you are seeing - I have not seen it before.

My immediate reaction would be to combine them in pairs, until I finally reach a single reference file. Will this work? I hope the error will give hints for improvements or recommendations on practices to avoid.

@NikosAlexandris
Author

@rsignell-usgs , do you have the time to go through making a big parquet reference set?

I can certainly try this, as it is part of my exercise, at least with "my" data. As a reminder, I work on an HPC system.

@martindurant
Member

Yes, you can adopt a pair-wise tree to do combining, but the exception sounds like you cannot load any reference set into memory (I note it fails on the first file).
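
A minimal sketch of such a pair-wise (tree) reduction, assuming each intermediate result still fits in memory; the helper names are illustrative, not part of the kerchunk API:

from kerchunk.combine import MultiZarrToZarr

def combine_pair(refs_a, refs_b):
    # Each argument may be a reference file path or an in-memory reference dict.
    mzz = MultiZarrToZarr(
        [refs_a, refs_b],
        concat_dims=["time"],
        identical_dims=["lat", "lon"],
    )
    return mzz.translate()

def tree_combine(reference_sets):
    # Combine neighbours level by level until a single reference set remains.
    while len(reference_sets) > 1:
        combined = [
            combine_pair(a, b)
            for a, b in zip(reference_sets[::2], reference_sets[1::2])
        ]
        if len(reference_sets) % 2:
            combined.append(reference_sets[-1])   # carry an unpaired trailing set over
        reference_sets = combined
    return reference_sets[0]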

@NikosAlexandris
Author

(I note it fails on the first file).

@martindurant Please have a look at NikosAlexandris/rekx#3 (comment) for the complete error message.

@martindurant
Member

It doesn't strictly matter, but I would have expected the files to be presented in order, when there is an obvious order to them. In any case, even on a big HPC, I don't think you can have enough memory to process such a big reference set without the parquet route.

@NikosAlexandris
Author

Is the discussion at #240 still relevant?

@martindurant
Member

#240 is about opening the datasets with zarr/xarray, not relevant here.

@NikosAlexandris
Author

It doesn't strictly matter, but I would have expected the files to be presented in order, when there is an obvious order to them. In any case, even on a big HPC, I don't think you can have enough memory to process such a big reference set without the parquet route.

The order of the file paths? These can be sorted, if it matters at all. They are collected via

from pathlib import Path

source_directory = Path(source_directory)
reference_file_paths = list(source_directory.glob(pattern))
reference_file_paths = list(map(str, reference_file_paths))

where Path is pathlib.Path and pattern is the glob pattern matching the reference files.
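
For what it's worth, sorting would be a one-line change (assuming the year is the only varying part of the filenames, so lexicographic order matches chronological order):

reference_file_paths = sorted(map(str, source_directory.glob(pattern)))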

I might get access to a largemem partition with 4 TB of memory. The problem then would be that a successful process there would not work on systems without such a large memory capacity.

@NikosAlexandris
Author

#240 is about opening the datasets with zarr/xarray, not relevant here.

Well, that would then be the main goal in the end: how fast we can retrieve a time series for a location.

@martindurant
Member

The problem then would be that a successful process there would not work on systems without such a large memory capacity.

Also, any successful output could probably not be used by anyone :)
Critically, the parquet storage is not only more efficient, but also partitioned and lazy, so users only read references they need as they need them.
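
To illustrate the lazy aspect, a sketch of opening such a parquet reference set; the directory name is illustrative, and this assumes a recent fsspec that recognises a directory of parquet references:

import fsspec
import xarray as xr

fs = fsspec.filesystem(
    "reference",
    fo="combined.parq",          # the parquet reference directory written above
    remote_protocol="file",      # protocol of the original NetCDF files
    skip_instance_cache=True,
)
ds = xr.open_dataset(
    fs.get_mapper(""), engine="zarr", backend_kwargs={"consolidated": False}
)

Only the reference partitions needed for a given selection are read, instead of the whole reference set.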

@martindurant
Member

#240 is about opening the datasets with zarr/xarray, not relevant here.

Well, that would then be the main goal in the end

Of course, but you are not at that point yet

@NikosAlexandris
Author

#240 is about opening the datasets with zarr/xarray, not relevant here.

Well, that would then be the main goal in the end

Of course, but you are not at that point yet

To this end, however, and from my limited experience so far, I see that most of the time is spent reading the reference file into memory. After that, retrieving data just happens in a snap. I wonder whether, given:

  • a system with enough RAM
  • a decent chunking shape for large time series like the example in this discussion, one that can serve various access patterns
  • the data loaded in-memory, even if that takes several minutes

reading the data will then happen in a fraction of a second, or at least very fast, and whether it can stay like that for as long as the operating system runs?

@martindurant
Member

Yes, probably! But still, to prevent you from making reference sets that are too big to handle, I really do think it should be done in parquet.

@NikosAlexandris
Author

I am aware in the sense that I know it exists. The question is whether it is suitable for the time series I work with and whether it'll be fine to retrieve complete time series for a location.

(Answering myself now, hopefully in a way useful to others)

Yes, the Parquet storage format for references is suitable for large time series like the SARAH3 products, which come in the form of NetCDF files, and it appears to be more than fine for retrieving the complete time series for a location; see also #345 (comment).
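
As a concrete sketch of that access pattern, once the combined dataset has been opened as ds (as in the snippets above); the variable name SID and the coordinates are illustrative:

series = (
    ds["SID"]
    .sel(lat=52.5, lon=13.4, method="nearest")   # pick the closest pixel
    .compute()                                   # load only that pixel's chunks
)

With a chunking shape that is long along time and small along lat/lon, this touches only a narrow column of chunks rather than whole maps.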

@NikosAlexandris
Author

The problem then would be that a successful process there would not work on systems without such a large memory capacity.

Also, any successful output could probably not be used by anyone :) Critically, the parquet storage is not only more efficient, but also partitioned and lazy, so users only read references they need as they need them.

For my specific use case, it could be useful, provided there is a powerful machine with sufficient memory to read the large reference set and be ready to support asynchronous calls for retrieving data. If this approach proves to be efficient not only in terms of speed and scalability, but also in terms of reduced electricity consumption -- are there any relevant studies in this regard? -- then it would be worth pursuing.

I am closing this issue as an item to revisit if need be.
