When running the derivation, the derive preprocessor saves the loaded netCDF back to disk as a netCDF in the preproc directory. This isn't ideal: writing to disk is extremely slow and it also results in huge, bloated preproc directories. Furthermore, as the data isn't realised, I don't see why we can't keep it lazy and avoid saving it to disk.
For instance, I'm running a multi-model comparison of @tillku's derived variable Ocean Heat Content (ohc.py) between the years 1960 and 2014 (or 2005 for CMIP5). For each model in my recipe, the derivation preprocessor loads a thetao variable, applies the relevant fix, then saves the data as a netCDF. Each of these is on the order of 20 GB and takes ~10 minutes on JASMIN. A small run would be around 5 models, so an hour of waiting and 100 GB of disk space. A real publishable run would be closer to 300 GB and 2-3 hours.
The frustration is compounded by the fact that the preprocessor loads all the thetao files first (which all work), then tries to load the volcello files (which don't work due to several issues - ask @valeriupredoi); that part is fixed by iris 3.
It seems that this saving to disk is unnecessary. Is there any way we can avoid it? We don't need to save to disk at every other preprocessing stage. Why do we need to do it here?
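To illustrate the laziness point: if the derived variable were built from lazy arrays, the expensive computation (and any disk write) could be deferred until the final result is actually needed, instead of materialising each intermediate netCDF. A minimal pure-Python sketch of that idea, not the actual ESMValCore/iris API (`LazyCube` and its methods are hypothetical stand-ins):

```python
class LazyCube:
    """Toy stand-in for a lazily loaded cube (hypothetical; not the iris API)."""

    def __init__(self, compute_fn):
        self._compute_fn = compute_fn  # deferred computation, not run yet
        self.realised = False

    def map(self, fn):
        # Chain another deferred operation without realising the data.
        return LazyCube(lambda: fn(self._compute_fn()))

    def realise(self):
        # Only here is the data actually computed (or read from disk).
        self.realised = True
        return self._compute_fn()


# Simulate the derive chain: lazy load, fix, derive -- no I/O happens yet.
thetao = LazyCube(lambda: [10.0, 11.0, 12.0])                # pretend: lazy netCDF load
fixed = thetao.map(lambda data: [x + 273.15 for x in data])  # apply a fix, still lazy
ohc = fixed.map(sum)                                         # derive the variable, still lazy

result = ohc.realise()  # only the final save step would trigger the computation
print(result)
```

In iris/dask terms, the same effect comes from keeping `cube.lazy_data()` unrealised through the derivation, so only the final preprocessed product is ever written out.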
This continues a conversation with @bouweandela in issue #42.