Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alternative dask-powered histogram algorithm using xarray.groupby and numpy_groupies #60

Open
TomNicholas opened this issue Jun 10, 2021 · 3 comments

Comments

@TomNicholas
Copy link
Member

TomNicholas commented Jun 10, 2021

After @shoyer mentioned earlier today that he had an example of dealing with with the ND-histogram problem in xarray by using xarray.apply_ufunc and numpy_groupies, I made this notebook to try it out for creating histograms in xarray.

The basic idea is that da.groupby_bins(bins).apply(count) essentially creates a histogram, and numpy_groupies can speed up the groupby_bins hugely.

I think its pretty cool that it even works, but you'll see in the notebook that I don't think the performance compares favourably with xhistogram's dask.blockwise implementation (see #49), though I didn't manage to get numba-powered groupies working yet. The dask task graphs are also not as nice.

@rabernat this is the sort of thing I had in mind originally.

@gjoseph92 you might find this interesting as an alternate solution to your blockwise one.

@dcherian and @max-sixty you might find this example interesting as I know you've been working on using numpy_groupies in pydata/xarray#4473 .

@TomNicholas TomNicholas changed the title Alternative dask-power histogram algorithm using xarray.groupby and numpy_groupies Alternative dask-powered histogram algorithm using xarray.groupby and numpy_groupies Jun 10, 2021
@dcherian
Copy link

dcherian commented Jun 10, 2021

Thanks @TomNicholas

Well this looks familiar :). If you start solving all the questions in your notebook, you'll probably end up with something like https://github.com/dcherian/dask_groupby/blob/ec6f13400ab8ccc9269099076a31b44354e8ecf6/dask_groupby/core.py#L167

In any case it's hard to do better (in terms of complexity) than bincount because it handles weights too (also see https://jakevdp.github.io/blog/2017/03/22/group-by-from-scratch/)

What you could do in xhistogram is apply_ufunc the function you pass to blockwise and then call sum on the result (maybe this works, haven't tried it); that will take care of all the annoying bookkeeping around broadcasting etc. We'll need a new flag for apply_ufunc that allows chunked core dimensions (dask="blockwise" maybe, See pydata/xarray#1995 under crusader's "proposal 2")

( I spent a lot of time thinking about this, so we can chat some time if you want )

@TomNicholas
Copy link
Member Author

you'll probably end up with something like

Wow a lot of work has gone into that @dcherian !

What you could do in xhistogram is apply_ufunc the function you pass to blockwise and then call sum on the result

I think you could do this, but the effect would be similar to Ryan's reshape logic. I think using apply_ufunc like that would probably be clearer but slower. I'll try to write pydata/xarray#5400 in such a way that we could drop either solution in though.

We'll need a new flag for apply_ufunc that allows chunked core dimensions (dask="blockwise" maybe, See pydata/xarray#1995 under crusader's "proposal 2")

Can't we just use allow_rechunk=True? That's what I did in this notebook - see my comment on the same issue.

@dcherian
Copy link

Can't we just use allow_rechunk=True

no that's potentially so expensive it takes down the cluster. It's better to have an algorithm that knows how to deal with the chunks like #49

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants