Do not realize data and use dask in derivation functions #42

Closed
3 of 4 tasks
mattiarighi opened this issue Mar 8, 2019 · 13 comments
Labels: enhancement (New feature or request), preprocessor (Related to the preprocessor)

@mattiarighi
Contributor

mattiarighi commented Mar 8, 2019

As reported by @bouweandela, many derivation functions still realize the data and use numpy instead of dask. This hurts performance and should be changed.

Affected variables:

  • amoc
  • gtfgco2 should not be needed anymore, there is a preprocessor function for this
  • sm
  • toz
@mattiarighi changed the title from "Do not realize data and use dask in derivation function" to "Do not realize data and use dask in derivation functions" Mar 8, 2019
@bouweandela
Member

Introduction to lazy data: https://scitools.org.uk/iris/docs/latest/userguide/real_and_lazy_data.html

If you need array functions, use `from dask import array as da` instead of `import numpy as np`. See here for a description of the dask array options: http://docs.dask.org/en/latest/array-api.html
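
As a minimal sketch of what that import swap means in practice (the array shape, chunking, and variable names below are made up for illustration and are not taken from any derivation script), a dask expression stays lazy until `compute()` is called explicitly:

```python
import numpy as np
from dask import array as da

# Hypothetical stand-in for a variable's data; shape and chunks are arbitrary.
lazy = da.ones((4, 100, 100), chunks=(1, 100, 100))

# Building the expression only records a task graph; nothing is computed yet.
anomaly = lazy - da.mean(lazy, axis=0)
is_lazy = isinstance(anomaly, da.Array)

# The data are realized only when compute() is called explicitly.
result = anomaly.compute()
is_real = isinstance(result, np.ndarray)
```

Most `np.` functions used in this way have a `da.` counterpart with the same name, so the change is often limited to the import and the call prefix.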

@bouweandela
Member

See here for an example on multiplying a cube with a number without realizing data:
https://github.com/ESMValGroup/ESMValTool/blob/822941f52780dde2b0b122a9a8a99f23e313ef30/esmvaltool/cmor/_fixes/CMIP5/BNU_ESM.py#L154

And here for an example on using dask arrays:
https://github.com/ESMValGroup/ESMValTool/blob/822941f52780dde2b0b122a9a8a99f23e313ef30/esmvaltool/cmor/_fixes/CMIP5/BNU_ESM.py#L177-L178
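
The linked lines are not reproduced here. As a rough sketch of the same two ideas using plain dask arrays (the scaling factor and the value `0.5` used as a missing-data marker are assumptions chosen for illustration), both a scalar multiplication and a mask can be applied without realizing the data:

```python
import numpy as np
from dask import array as da

# Hypothetical input; in a real fix this would be the cube's lazy data.
raw = da.from_array(np.array([[1.0, 0.5], [2.0, 0.5]]), chunks=(1, 2))

# Multiplying by a number keeps the array lazy (assumed unit conversion factor).
scaled = raw * 1000.0

# Masking a marker value with dask's masked-array helpers also stays lazy.
masked = da.ma.masked_equal(raw, 0.5)

# Nothing has been computed up to this point.
still_lazy = isinstance(scaled, da.Array) and isinstance(masked, da.Array)

scaled_values = scaled.compute()
mask_values = np.ma.getmaskarray(masked.compute())
```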

@valeriupredoi
Contributor

just want to say, I ❤️ this thread 😁

@mattiarighi
Contributor Author

Great! Then it's yours 👍

@valeriupredoi
Contributor

yay, more crap for me!
FYI to us all: at the meeting with the iris folk last week I asked if they could explicitly say which iris functions realize the data and which keep it lazy, and Corinne has already started working on this (very important) info: SciTools/iris#3292

@valeriupredoi
Contributor

valeriupredoi commented Mar 11, 2019

As reported by @bouweandela, many derivation functions still realize the data and use numpy instead of dask. This hurts performance and should be changed.

Affected variables:

* [] `amoc`
  • this should be fine, the data member is never actually accessed
* [] `gtfgco2`
  • this one needs two changes: remove the data access and mask construction, and stop building a list of numpy arrays (bad wolf!)
* [] `sm`
  • just the last bit that builds the mask from the data
* [] `toz`
  • total mess, how do you set the dtype of a dask array?
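
The three patterns in this list each have a lazy counterpart in dask. The sketch below is not the code of any of these derivation scripts: the per-step fields, the `<= 0.0` mask condition, and the `float32` target are all hypothetical. It shows `da.stack` in place of a list of numpy arrays, `astype` for the dtype question, and `da.ma.masked_where` for a mask built from the data:

```python
import numpy as np
from dask import array as da

# Hypothetical per-step fields that are already lazy (values 0.0, 1.0, 2.0).
fields = [da.full((2, 3), float(i), chunks=(1, 3)) for i in range(3)]

# Stack lazily instead of collecting numpy arrays in a list.
stacked = da.stack(fields, axis=0)

# astype() on a dask array is lazy and answers the dtype question for toz.
as_float32 = stacked.astype("float32")

# Build the mask from the data without touching the real values.
masked = da.ma.masked_where(as_float32 <= 0.0, as_float32)

# Only this call realizes the data.
result = masked.compute()
result_mask = np.ma.getmaskarray(result)
```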

@valeriupredoi
Contributor

valeriupredoi commented Mar 14, 2019

@zklaus would you be too angry with me if I asked you to look at this? I have a metric ton of crap that I need to take care of and I feel I am going to sideline this - plus you are working closely with the iris stuff anyways. Beer from me when we next meet 🍺

@mattiarighi transferred this issue from ESMValGroup/ESMValTool Jun 11, 2019
@mattiarighi added the preprocessor and enhancement labels Jun 11, 2019
@ledm
Contributor

ledm commented Nov 14, 2019

Similar to this discussion, is there a way to switch off writing the derived variables to disk? It seems to slow everything down and shouldn't be necessary.

@bouweandela
Member

You probably mean the input variables needed to derive a variable? In that case the answer is no.

@ledm
Contributor

ledm commented Nov 15, 2019

Is it not possible to load the cubes into dask arrays, instead of saving them?

In the case of the derivation of OHC, it loads a 4D variable and saves it exactly as it is. It basically copies 20GB of data into the working directory for each dataset before doing any calculations! All I want is a scalar field, it should be a few kb!

Furthermore, we only have 100GB space in our home directories on jasmin, this means that there's only space for a few models using this method. (I will move my working directory somewhere with more space, but this still doesn't seem like a great method!)
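
For illustration only, this is the kind of lazy reduction the question has in mind, written with plain dask. The dimensions and chunking are invented, and the sketch says nothing about how ESMValTool handles the input variables of a derivation:

```python
from dask import array as da

# Hypothetical 4D field with dimensions (time, depth, lat, lon),
# chunked so that one time step is processed at a time.
field = da.ones((12, 10, 20, 30), chunks=(1, 10, 20, 30))

# Reduce over depth, lat and lon; this only builds a task graph.
total = field.sum(axis=(1, 2, 3))

# Only the small result is realized, one value per time step.
series = total.compute()

input_bytes = field.nbytes
output_bytes = series.nbytes
```

Because the reduction runs chunk by chunk, the full 4D array never has to be held in memory at once, and the realized result is far smaller than the input.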

@bouweandela
Member

> Is it not possible to load the cubes into dask arrays, instead of saving them?

Maybe in the future, but not at the moment. Do you feel like implementing this yourself?

> Furthermore, we only have 100GB space in our home directories on jasmin, this means that there's only space for a few models using this method. (I will move my working directory somewhere with more space, but this still doesn't seem like a great method!)

The Jasmin user guide recommends using a group workspace, not your home directory, for storing large amounts of data: https://help.jasmin.ac.uk/article/176-storage. I started on pull request #265, which will make it possible to store preprocessor and other temporary data on a special temporary file system, but this is not ready yet.

@ledm
Contributor

ledm commented May 29, 2020

Just a comment that gtfgco2 may still be needed. I've commented on the merged PR here #418 (comment)

but happy to continue the discussion here if needed.

@bouweandela
Member

Up-to-date overview and discussion in #2451.

@bouweandela closed this as not planned Jun 10, 2024