When running the derivation, the derive preprocessor saves the loaded netCDF back to disk as a netCDF in the preproc directory. This isn't ideal: writing to disk is extremely slow and it also results in huge, bloated preproc directories. Furthermore, as the data isn't realised, I don't see why we can't keep it lazy and avoid saving it to disk.
For instance, I'm running a multi-model comparison of @tillku's derived variable Ocean Heat Content (ohc.py) between the years 1960 and 2014 (or 2005 for CMIP5). For each model in my recipe, the derivation preprocessor loads a thetao variable, applies the relevant fix, then saves the data as a netCDF. Each of these is on the order of 20 GB and takes ~10 minutes on JASMIN. A small run would be around 5 models, so an hour of waiting and 100 GB of disk space. A real publishable run would be closer to 300 GB and 2-3 hours.
The frustration is compounded by the fact that the preprocessor loads all the thetao files first (which all work), then tries to load the volcello files (which don't work due to several issues - ask @valeriupredoi); that part is fixed by iris 3.
It seems that this saving to disk is unnecessary. Is there any way we can avoid it? We don't need to save to disk at every other preprocessing stage. Why do we need to do it here?
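To illustrate the laziness point: if the derived variable were built from lazy arrays, the expensive computation (and any disk write) could be deferred until the final result is actually needed, instead of materialising each intermediate netCDF. A minimal pure-Python sketch of that idea, not the actual ESMValCore/iris API (`LazyCube` and its methods are hypothetical stand-ins):

```python
class LazyCube:
    """Toy stand-in for a lazily loaded cube (hypothetical; not the iris API)."""

    def __init__(self, compute_fn):
        self._compute_fn = compute_fn  # deferred computation, not run yet
        self.realised = False

    def map(self, fn):
        # Chain another deferred operation without realising the data.
        return LazyCube(lambda: fn(self._compute_fn()))

    def realise(self):
        # Only here is the data actually computed (or read from disk).
        self.realised = True
        return self._compute_fn()


# Simulate the derive chain: lazy load, fix, derive -- no I/O happens yet.
thetao = LazyCube(lambda: [10.0, 11.0, 12.0])                # pretend: lazy netCDF load
fixed = thetao.map(lambda data: [x + 273.15 for x in data])  # apply a fix, still lazy
ohc = fixed.map(sum)                                         # derive the variable, still lazy

result = ohc.realise()  # only the final save step would trigger the computation
print(result)
```

In iris/dask terms, the same effect comes from keeping `cube.lazy_data()` unrealised through the derivation, so only the final preprocessed product is ever written out.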
This continues a conversation with @bouweandela in issue #42.