Failure to combine multiple JSON reference files via MultiZarrToZarr() #388
Each reference file is 13 GB?
Yes, indeed. These stem from the following:
Obviously it would be problematic to load a whole set of such JSONs into a single Python process, although I am surprised by the specific error you are seeing; I have not seen it before. Are you aware of the relatively new, more efficient Parquet storage format for references? It should dramatically decrease the on-disk and in-memory size of the reference sets, even during the combine phase.
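(For context, a minimal sketch of writing combined references straight to the Parquet format, assuming local files; the paths, dimension names and `record_size` below are illustrative, and the exact `LazyReferenceMapper.create` signature may differ slightly between fsspec versions.)

```python
from glob import glob

import fsspec
from fsspec.implementations.reference import LazyReferenceMapper
from kerchunk.combine import MultiZarrToZarr

# Placeholder pattern for the per-file JSON references.
reference_file_paths = sorted(glob("sarah3_*.json"))

fs = fsspec.filesystem("file")
# Create an (initially empty) Parquet reference store to stream references into.
out = LazyReferenceMapper.create(root="combined.parq", fs=fs, record_size=100_000)

mzz = MultiZarrToZarr(
    reference_file_paths,
    concat_dims=["time"],            # assumed concatenation dimension
    identical_dims=["lat", "lon"],   # assumed identical dimensions
    out=out,                         # write into the Parquet store instead of a dict
)
mzz.translate()
out.flush()                          # finalise the Parquet reference set
```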
Was this a choice specifically for later kerchunking, or was there another motivation? Small chunks allow for random access to single values, but they of course mean many more references and much bigger reference sets, as well as worse data throughput when loading contiguous data.
PS: I want to experiment with different initial chunking shapes; the above is my first attempt.
I am aware in the sense that I know it exists. The question is whether it is suitable for the time series I work with, and whether it will be fine for retrieving complete time series for a location. Question: how do others crunch large time series which occupy hundreds of TBs on disk?
Ideally, I will end up with a chunk size of 8760 or 8784, or say 8800, to cover a year.
@rsignell-usgs, do you have the time to go through making a big Parquet reference set?
The limiting factor for the size of the reference sets is not the total number of bytes but the total number of references, so the chunking scheme is perhaps more important here.
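(To make the "number of references" point concrete, a back-of-envelope sketch with purely illustrative numbers; none of these values come from this issue.)

```python
# Every chunk of every variable contributes one reference to the combined set.
hours_per_year = 8760           # hourly steps in a non-leap year
years = 20                      # length of the archive (assumed)
variables = 1                   # data variables per file (assumed)
chunks_per_time_step = 100      # spatial chunks covering the grid at each step (assumed)

total_references = hours_per_year * years * variables * chunks_per_time_step
print(f"{total_references:,} chunk references")  # 17,520,000 with these numbers
```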
My immediate reaction to solving this would be to combine them in pairs, until I finally reach a single reference file. Will this work? I hope that the error will give hints for improvements, or recommendations on practices to avoid.
I can certainly try this, as it is part of my exercise, at least with "my" data. As a reminder, I work on an HPC system.
Yes, you can adopt a pair-wise tree to do the combining, but the exception sounds like you cannot load any reference set into memory (I note it fails on the first file).
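(For illustration, pair-wise tree combining could look roughly like the sketch below; `concat_dims`, `identical_dims` and the helper names are assumptions rather than code from this thread, and each intermediate result must still fit in memory.)

```python
from kerchunk.combine import MultiZarrToZarr

def combine_pair(refs_a, refs_b):
    """Combine two reference sets (JSON paths or dicts) along the time dimension."""
    mzz = MultiZarrToZarr(
        [refs_a, refs_b],
        concat_dims=["time"],            # assumed concatenation dimension
        identical_dims=["lat", "lon"],   # assumed identical dimensions
    )
    return mzz.translate()

def tree_combine(reference_sets):
    """Reduce a list of reference sets pair-wise until a single one remains."""
    while len(reference_sets) > 1:
        combined = [
            combine_pair(a, b)
            for a, b in zip(reference_sets[::2], reference_sets[1::2])
        ]
        if len(reference_sets) % 2:      # odd one out: carry it to the next round
            combined.append(reference_sets[-1])
        reference_sets = combined
    return reference_sets[0]
```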
@martindurant Please have a look at NikosAlexandris/rekx#3 (comment) for the complete error message. |
It doesn't strictly matter, but I would have expected the files to be presented in order, when there is an obvious order to them. In any case, even on a big HPC, I don't think you can have enough memory to process such a big reference set without the Parquet route.
How relevant is the discussion at #240 still?
#240 is about opening the datasets with zarr/xarray, not relevant here. |
The order of the file paths? These may be sorted, if it matters at all. They are collected via:

```python
source_directory = Path(source_directory)
reference_file_paths = list(source_directory.glob(pattern))
reference_file_paths = list(map(str, reference_file_paths))
```

where `Path` is `pathlib.Path`. I might get access to a …
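(If the ordering matters, the collection above could sort the paths explicitly, since `pathlib.Path.glob` does not guarantee any particular order; this sketch reuses the same `source_directory` and `pattern` variables.)

```python
reference_file_paths = sorted(source_directory.glob(pattern))    # lexicographic order
reference_file_paths = [str(path) for path in reference_file_paths]
```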
Well, that would be the main goal then in the end: how fast we can retrieve time series for a location.
Also, any successful output could probably not be used by anyone :)
Of course, but you are not at that point yet.
To this end, however, and from my limited experience so far, I see that most of the time is spent reading the reference file into memory. After that, retrieving data happens in a snap. I wonder if, once the references are loaded, reading the data will happen in a fraction of a second, or at least very fast, and whether it can stay like that for as long as the operating system runs?
Yes, probably! But still, to prevent you from making reference sets that are too big to handle, I really do think it should be done in Parquet.
(Answering myself now, hopefully useful for others in some way.) Yes, the Parquet storage format for references is suitable for large time series like the SARAH3 products, which come in the form of NetCDF files, and it appears to be more than fine for retrieving complete time series for a location; see also #345 (comment).
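(A sketch of retrieving a single-location time series from a Parquet reference set, assuming it lives at `combined.parq`, that the referenced NetCDF files are on the local filesystem, and that the variable name `SIS` and the coordinates are placeholders.)

```python
import xarray as xr

ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "combined.parq",        # the Parquet reference set
            "remote_protocol": "file",    # where the NetCDF chunks actually live
        },
    },
)
# Complete time series at the grid point nearest to one location:
series = ds["SIS"].sel(lat=52.5, lon=13.4, method="nearest")
```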
For my specific use case, it could be useful, provided there is a powerful machine with sufficient memory to read the large reference set and be ready to support asynchronous calls for retrieving data. If this approach proves to be efficient not only in terms of speed and scalability, but also in terms of reduced electricity consumption (are there any relevant studies in this regard?), then it would be worth pursuing. I am closing this issue as an item to revisit if need be.
I have the following JSON reference files:
Trying to combine them, essentially via a MultiZarrToZarr() call (replace the self-explained variables with file paths and an output filename; a hedged sketch of such a call is given at the end of this post),
in an HPC system with
and it fails, raising the following error:
Any hints?
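(A hedged sketch of what such a combine call typically looks like; `reference_file_paths`, `output_filename` and the dimension names stand in for the self-explained variables mentioned above and are not the exact code from this report.)

```python
import json
from kerchunk.combine import MultiZarrToZarr

mzz = MultiZarrToZarr(
    reference_file_paths,            # list of single-file JSON reference files
    concat_dims=["time"],            # assumed concatenation dimension
    identical_dims=["lat", "lon"],   # assumed identical dimensions
)
combined = mzz.translate()

with open(output_filename, "w") as f:
    json.dump(combined, f)
```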