Optimize min_count when expected_groups is not provided. #236

Merged
dcherian merged 2 commits into main from optimize-more on May 2, 2024


dcherian
Collaborator

@dcherian dcherian commented Apr 26, 2023

xref #363

Skip reindexing when expected_groups is not provided. In this case, we detect all available groups anyway.

This is less impactful than it seems because Xarray always sets expected_groups = (pd.RangeIndex(...),).

One solution is to set min_count=0 upstream for UniqueGrouper.
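
To illustrate why reindexing is redundant when the groups are detected from the data, here is a minimal sketch in plain NumPy. It is not flox's code: `reindex` is a hypothetical helper written only for this example, and the labels and values are made up. When the "expected" groups are exactly the groups that were found, reindexing returns the result unchanged; it only matters when a group is expected but absent.

```python
import numpy as np

labels = np.array(["b", "a", "b", "c"])
values = np.array([1.0, 2.0, 3.0, 4.0])

# Groups detected from the data: each one has at least one member.
found, codes = np.unique(labels, return_inverse=True)
sums = np.bincount(codes, weights=values, minlength=len(found))


def reindex(result, found, expected, fill_value=np.nan):
    """Hypothetical helper: align per-group results to `expected` groups.

    Groups that are expected but were not found get `fill_value`.
    """
    position = {group: i for i, group in enumerate(found.tolist())}
    out = np.full(len(expected), fill_value, dtype=float)
    for j, group in enumerate(expected):
        if group in position:
            out[j] = result[position[group]]
    return out


# Expected groups == found groups: reindexing changes nothing.
same = reindex(sums, found, found.tolist())

# An extra expected group "d" is the only case where reindexing does work.
padded = reindex(sums, found, ["a", "b", "c", "d"])
```

In the first call the reindexing step allocates and copies without changing anything, which is the work this PR proposes to skip.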

@dcherian dcherian force-pushed the optimize-more branch 3 times, most recently from 66010f7 to fe8d3c9 on April 28, 2023 15:38
@dcherian dcherian marked this pull request as draft May 1, 2023 21:39
dcherian added 2 commits May 2, 2024 07:41
For pure numpy arrays, min_count=1 (xarray default) is the same
as min_count=None, with the right fill_value. This avoids
one useless pass over the data, and one useless copy.

We need to always accumulate count with dask, to make sure we
get the right values at the end.
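
The two commit messages above can be illustrated with a small sketch in plain NumPy. This is an assumption-laden illustration, not the code in this PR: the function names are hypothetical, and it only shows that the results agree, not the performance difference. The first part compares a grouped nansum masked with `min_count=1` against the same sum where groups without any non-NaN value receive `fill_value=np.nan`. The second part shows one way to see why counts have to be accumulated across chunks: a group may have no valid values in one chunk and still have valid values in another, so the mask can only be applied after combining.

```python
import numpy as np


def nansum_min_count(values, codes, ngroups, min_count=1):
    """Grouped nansum, then mask groups with too few non-NaN values."""
    valid = ~np.isnan(values)
    sums = np.bincount(codes[valid], weights=values[valid], minlength=ngroups)
    counts = np.bincount(codes[valid], minlength=ngroups)  # extra pass over the data
    return np.where(counts >= min_count, sums, np.nan)  # extra copy


def nansum_fill_value(values, codes, ngroups, fill_value=np.nan):
    """Grouped nansum where groups with no non-NaN values get `fill_value`."""
    valid = ~np.isnan(values)
    out = np.full(ngroups, fill_value, dtype=float)
    present = np.unique(codes[valid])
    sums = np.bincount(codes[valid], weights=values[valid], minlength=ngroups)
    out[present] = sums[present]
    return out


labels = np.array(["a", "b", "a", "b", "c"])
values = np.array([1.0, np.nan, 2.0, np.nan, 4.0])
groups, codes = np.unique(labels, return_inverse=True)

with_count = nansum_min_count(values, codes, len(groups))
with_fill = nansum_fill_value(values, codes, len(groups))

# Chunked case: group 1 has no valid value in the first chunk but has one
# in the second, so per-chunk sums and counts are combined before masking.
chunks = [
    (np.array([1.0, np.nan]), np.array([0, 1])),
    (np.array([2.0, 5.0]), np.array([0, 1])),
]
total = np.zeros(2)
count = np.zeros(2, dtype=np.intp)
for chunk_values, chunk_codes in chunks:
    valid = ~np.isnan(chunk_values)
    total += np.bincount(chunk_codes[valid], weights=chunk_values[valid], minlength=2)
    count += np.bincount(chunk_codes[valid], minlength=2)
chunked = np.where(count >= 1, total, np.nan)
```

If the first chunk had filled group 1 with NaN on its own, the combined sum for that group would have been NaN instead of 5.0.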
@dcherian dcherian marked this pull request as ready for review May 2, 2024 14:33
@dcherian dcherian changed the title from "Optimizations" to "Optimize min_count when expected_groups is not provided." May 2, 2024
@dcherian dcherian merged commit 0083ab2 into main May 2, 2024
15 checks passed
@dcherian dcherian deleted the optimize-more branch May 2, 2024 14:43
dcherian added a commit that referenced this pull request May 2, 2024
* main: (64 commits)
  import `normalize_axis_index` from `numpy.lib` on `numpy>=2` (#364)
  Optimize `min_count` when `expected_groups` is not provided. (#236)
  Use threadpool for finding labels in chunk (#327)
  Manually fuse reindexing intermediates with blockwise reduction for cohorts. (#300)
  Bump codecov/codecov-action from 4.1.1 to 4.3.1 (#362)
  Add cubed notebook for hourly climatology example using "map-reduce" method (#356)
  Optimize bitmask finding for chunk size 1 and single chunk cases (#360)
  Edits to climatology doc (#361)
  Fix benchmarks (#358)
  Trim CI (#355)
  [pre-commit.ci] pre-commit autoupdate (#350)
  Initial minimal working Cubed example for "map-reduce" (#352)
  Bump codecov/codecov-action from 4.1.0 to 4.1.1 (#349)
  `method` heuristics: Avoid dot product as much as possible (#347)
  Fix nanlen with strings (#344)
  Fix direct quantile reduction (#343)
  Fix upstream-dev CI, silence warnings (#341)
  Bump codecov/codecov-action from 4.0.0 to 4.1.0 (#338)
  Fix direct reductions of Xarray objects (#339)
  Test with py3.12 (#336)
  ...
dcherian added a commit that referenced this pull request Jun 30, 2024
* main:
  Bump codecov/codecov-action from 4.3.1 to 4.4.1 (#366)
  Cubed blockwise (#357)
  Remove errant print statement
  import `normalize_axis_index` from `numpy.lib` on `numpy>=2` (#364)
  Optimize `min_count` when `expected_groups` is not provided. (#236)
  Use threadpool for finding labels in chunk (#327)
  Manually fuse reindexing intermediates with blockwise reduction for cohorts. (#300)
  Bump codecov/codecov-action from 4.1.1 to 4.3.1 (#362)
  Add cubed notebook for hourly climatology example using "map-reduce" method (#356)
  Optimize bitmask finding for chunk size 1 and single chunk cases (#360)
  Edits to climatology doc (#361)
  Fix benchmarks (#358)