Parallelize map_over_subtree #252
Comments
Good idea @Illviljan!

Do you actually need to pass it through at all? Couldn't you just do this:

```python
def map_over_subtree(func: Callable, parallel=False) -> Callable:
    @functools.wraps(func)
    def _map_over_subtree(*args, **kwargs) -> DataTree | Tuple[DataTree, ...]:
        from .datatree import DataTree

        if parallel:
            import dask
```

Or ideally just do this optimization automatically (if dask is installed, I guess)? I'm wondering how xarray normally does this optimization when you apply an operation to every data variable in a Dataset, for instance. Is it related to #196?
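A minimal sketch of the automatic variant suggested above, assuming a three-state `parallel=None` default and a `dask_available` helper (both hypothetical names, not datatree's actual API): `importlib.util.find_spec` checks whether a package is installed without importing it.

```python
import functools
import importlib.util
from typing import Callable, Optional


def dask_available() -> bool:
    # find_spec returns None when the package is not installed, so this
    # detects dask without paying the cost of actually importing it.
    return importlib.util.find_spec("dask") is not None


def map_over_subtree(func: Callable, parallel: Optional[bool] = None) -> Callable:
    # parallel=None means "decide automatically": parallelize iff dask exists.
    use_dask = dask_available() if parallel is None else parallel

    @functools.wraps(func)
    def _map_over_subtree(*args, **kwargs):
        if use_dask:
            # ... here the per-node calls would be wrapped in dask.delayed
            # and resolved with a single dask.compute(...) at the end
            pass
        # sequential fallback: apply func directly
        return func(*args, **kwargs)

    return _map_over_subtree
```

With this shape, callers who never pass `parallel` silently get the optimization whenever dask is importable, while `parallel=False` remains an explicit opt-out.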
I tried a version with `parallel` as an argument, but it isn't passed correctly via the normal methods:

Maybe we could always use this optimization. Dask usually adds some overhead, though, and I just haven't played around enough to know where that threshold is or whether it is significant.
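To probe that threshold empirically, a tiny harness can compare sequential and pooled execution of the same per-node work (a sketch; `bench` is a hypothetical name, and the standard library's `ThreadPoolExecutor` stands in for a dask scheduler here):

```python
import time
from concurrent.futures import ThreadPoolExecutor


def bench(func, items, parallel: bool):
    """Apply func to every item, sequentially or via a thread pool,
    and report the wall-clock time taken."""
    start = time.perf_counter()
    if parallel:
        with ThreadPoolExecutor() as pool:
            results = list(pool.map(func, items))
    else:
        results = [func(item) for item in items]
    return results, time.perf_counter() - start
```

Comparing the two timings while the per-item cost grows shows where dispatch overhead stops dominating and parallelism starts paying off.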
I think the only place this trick is used is

I don't fully understand all the changes in #196; I see that one as being able to trigger computation of all the dask arrays inside the DataArrays. My suggestion is earlier in that chain: setting up those chunked DataArrays in parallel.
You have real datasets with 2000+ variables?!?

Now that I understand that this is not about triggering computation of dask arrays but about building the dask arrays in parallel, I'm less sure that this is a good idea. I guess one way to look at it is through consistency:
Yes, the example code is quite realistic. Those are my kind of datasets, and there's still always something missing...
Are you saying that we already do some parallelization like this within

We discussed this briefly in the xarray dev call today. Stephan had a few comments, chiefly that he would be surprised if this gave a significant speedup in most cases because of restrictions imposed by the GIL. Possibly once Python removes the GIL we might want to revisit this question for all of xarray.
Closed and moved upstream to pydata/xarray#9502
I think there are some good opportunities to run `map_over_subtree` in parallel using `dask.delayed`.

Consider this example data:

Here's my modded `map_over_subtree`:

I'm a little unsure how to get the `parallel` argument down to `map_over_subtree`, though?
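The modded code itself is not reproduced in this thread, but the shape of the idea can be sketched as follows. The issue proposes `dask.delayed`; this sketch uses the standard library's `ThreadPoolExecutor` instead so it runs without dask, and it models a DataTree as a flat dict mapping node names to data (`map_over_tree` and the dict model are illustrative stand-ins, not datatree's API):

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict


def map_over_tree(
    func: Callable, tree: Dict[str, Any], parallel: bool = False
) -> Dict[str, Any]:
    # Sequential path: the plain loop map_over_subtree does today.
    if not parallel:
        return {name: func(node) for name, node in tree.items()}
    # Parallel path: submit every node up front, then gather the results --
    # the same structure as collecting dask.delayed(func)(node) objects
    # and resolving them with a single dask.compute(...) call.
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(func, node) for name, node in tree.items()}
        return {name: fut.result() for name, fut in futures.items()}
```

Because both paths return the same mapping, `parallel` can be flipped per call without changing any caller code, which is exactly the property the `parallel` argument to `map_over_subtree` would need.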