Alternative dask-powered histogram algorithm using xarray.groupby and numpy_groupies #60

TomNicholas · 2021-06-10T04:10:42Z

After @shoyer mentioned earlier today that he had an example of dealing with with the ND-histogram problem in xarray by using xarray.apply_ufunc and numpy_groupies, I made this notebook to try it out for creating histograms in xarray.

The basic idea is that da.groupby_bins(bins).apply(count) essentially creates a histogram, and numpy_groupies can speed up the groupby_bins hugely.

I think its pretty cool that it even works, but you'll see in the notebook that I don't think the performance compares favourably with xhistogram's dask.blockwise implementation (see #49), though I didn't manage to get numba-powered groupies working yet. The dask task graphs are also not as nice.

@rabernat this is the sort of thing I had in mind originally.

@gjoseph92 you might find this interesting as an alternate solution to your blockwise one.

@dcherian and @max-sixty you might find this example interesting as I know you've been working on using numpy_groupies in pydata/xarray#4473 .

The text was updated successfully, but these errors were encountered:

dcherian · 2021-06-10T15:17:03Z

Thanks @TomNicholas

Well this looks familiar :). If you start solving all the questions in your notebook, you'll probably end up with something like https://github.com/dcherian/dask_groupby/blob/ec6f13400ab8ccc9269099076a31b44354e8ecf6/dask_groupby/core.py#L167

In any case it's hard to do better (in terms of complexity) than bincount because it handles weights too (also see https://jakevdp.github.io/blog/2017/03/22/group-by-from-scratch/)

What you could do in xhistogram is apply_ufunc the function you pass to blockwise and then call sum on the result (maybe this works, haven't tried it); that will take care of all the annoying bookkeeping around broadcasting etc. We'll need a new flag for apply_ufunc that allows chunked core dimensions (dask="blockwise" maybe, See pydata/xarray#1995 under crusader's "proposal 2")

( I spent a lot of time thinking about this, so we can chat some time if you want )

TomNicholas · 2021-06-10T16:07:21Z

you'll probably end up with something like

Wow a lot of work has gone into that @dcherian !

What you could do in xhistogram is apply_ufunc the function you pass to blockwise and then call sum on the result

I think you could do this, but the effect would be similar to Ryan's reshape logic. I think using apply_ufunc like that would probably be clearer but slower. I'll try to write pydata/xarray#5400 in such a way that we could drop either solution in though.

We'll need a new flag for apply_ufunc that allows chunked core dimensions (dask="blockwise" maybe, See pydata/xarray#1995 under crusader's "proposal 2")

Can't we just use allow_rechunk=True? That's what I did in this notebook - see my comment on the same issue.

dcherian · 2021-06-10T16:50:28Z

Can't we just use allow_rechunk=True

no that's potentially so expensive it takes down the cluster. It's better to have an algorithm that knows how to deal with the chunks like #49

TomNicholas changed the title ~~Alternative dask-power histogram algorithm using xarray.groupby and numpy_groupies~~ Alternative dask-powered histogram algorithm using xarray.groupby and numpy_groupies Jun 10, 2021

TomNicholas mentioned this issue Jun 10, 2021

Add histogram method pydata/xarray#4610

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Alternative dask-powered histogram algorithm using xarray.groupby and numpy_groupies #60

Alternative dask-powered histogram algorithm using xarray.groupby and numpy_groupies #60

TomNicholas commented Jun 10, 2021 •

edited

Loading

dcherian commented Jun 10, 2021 •

edited

Loading

TomNicholas commented Jun 10, 2021

dcherian commented Jun 10, 2021

Alternative dask-powered histogram algorithm using xarray.groupby and numpy_groupies #60

Alternative dask-powered histogram algorithm using xarray.groupby and numpy_groupies #60

Comments

TomNicholas commented Jun 10, 2021 • edited Loading

dcherian commented Jun 10, 2021 • edited Loading

TomNicholas commented Jun 10, 2021

dcherian commented Jun 10, 2021

TomNicholas commented Jun 10, 2021 •

edited

Loading

dcherian commented Jun 10, 2021 •

edited

Loading