Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support reindexing in simple_combine #177

Merged
merged 3 commits into from
Oct 25, 2022
Merged

Support reindexing in simple_combine #177

merged 3 commits into from
Oct 25, 2022

Conversation

dcherian
Copy link
Collaborator

@dcherian dcherian commented Oct 19, 2022

  • add "grouped" or "simple" in task name and check layer names to make sure we don't regress

This PR allows reindexing inside _simple_combine so that we don't need to rerun
the groupby-reduce at every combination stage.

This means we are down to three options (good improvement!)

  1. map-reduce with reindex=True: reindex to expected_groups at blockwise stage, then map-reduce
  2. map-reduce with reindex=False: reindex at combine stage of map-reduce
  3. cohorts : start with reindex=False at blockwise, and then reindex to cohorts specifically, then tree reduce with reindex=True

For 1D combine, we see great improvements for reindex=True
reindex=False uses 2x more memory but similar time.

Note that the reindex=False intermediates are a worst case where there are
no shared groups between the chunks being combined.
This case is actually optimized in _group_combine where reindexing is
skipped for reducing along a single axis.

[ 68.75%] ··· =========== ========= =========
              --                combine
              ----------- -------------------
                  kind     grouped   combine
              =========== ========= =========
                reindexed      760M      631M
               not-reindexed    981M     1.81G
              =========== ========= =========

[ 75.00%] ··· =========== ========== ===========
              --                 combine
              ----------- ----------------------
                  kind     grouped     combine
              =========== ========== ===========
                reindexed    393±10ms    137±10ms
               not-reindexed   652±10ms   611±400ms
              =========== ========== ===========

@dcherian dcherian force-pushed the reindex-simple-combine branch from e691c3f to 9b8a8cb Compare October 19, 2022 19:41
@dcherian dcherian marked this pull request as ready for review October 19, 2022 20:03
@dcherian dcherian force-pushed the reindex-simple-combine branch 3 times, most recently from 9fa4712 to 74dcf56 Compare October 20, 2022 15:19
For 1D combine, great improvement for cohorts-type reductions
More memory but similar time for map-reduce.

Note that the map-reduce intermediates are a worst case where there are
no shared groups between the chunks being combined.
This case is actually optimized in _group_combine where reindexing is
skipped for reducing along a single axis.

[ 68.75%] ··· =========== ========= =========
              --                combine
              ----------- -------------------
                  kind     grouped   combine
              =========== ========= =========
                cohorts      760M      631M
               mapreduce     981M     1.81G
              =========== ========= =========

[ 75.00%] ··· =========== ========== ===========
              --                 combine
              ----------- ----------------------
                  kind     grouped     combine
              =========== ========== ===========
                cohorts    393±10ms    137±10ms
               mapreduce   652±10ms   611±400ms
              =========== ========== ===========

Fix bug in unique
@dcherian dcherian force-pushed the reindex-simple-combine branch from 74dcf56 to 394925e Compare October 20, 2022 16:47
@dcherian
Copy link
Collaborator Author

image

Looks like reindex=False is a major improvement when worker-saturation=inf but when worker-saturation=1.2, then it's all basically the same :(

@dcherian dcherian merged commit c2c4e1d into main Oct 25, 2022
@dcherian dcherian deleted the reindex-simple-combine branch October 25, 2022 19:41
dcherian added a commit that referenced this pull request Nov 27, 2022
commit 51fb6e9
Author: Deepak Cherian <dcherian@users.noreply.github.com>
Date:   Tue Nov 15 13:16:11 2022 -0700

    Some typing (#190)

    * Some typing

    * fixes.

commit 9b01c48
Author: Illviljan <14371165+Illviljan@users.noreply.github.com>
Date:   Sat Nov 5 09:02:07 2022 +0100

    Add windows CI (#151)

    * Add windows CI

    * Update ci.yaml

    * Update ci.yaml

    * Make arg input the same as shown in pytest

    * Add dtype check

    * [pre-commit.ci] auto fixes from pre-commit.com hooks

    for more information, see https://pre-commit.ci

    * have expected and actual results on the same side

    * use np.intp for count expected

    * [revert] minimize test

    * specify dtypes

    * more fixers

    * more.

    * Fix groupby_reduce

    * [revert] only wiindows tests

    * more fixes?

    * more fixes.

    * more fix

    * Last fix?

    * Update .github/workflows/ci.yaml

    * revert

    * Better fix

    * Revert "revert"

    This reverts commit 3b79f6e.

    * better comment.

    * clean up test

    * Revert "Revert "revert""

    This reverts commit 38438a2.

    * xfail labels dtype test

    * Revert "[revert] only wiindows tests"

    This reverts commit 232cf15.

    * Revert "[revert] minimize test"

    This reverts commit f993b31.

    * fix bad revert

    Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
    Co-authored-by: dcherian <deepak@cherian.net>
    Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>

commit e3ea0e7
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Tue Nov 1 08:26:30 2022 -0600

    Bump mamba-org/provision-with-micromamba from 13 to 14 (#184)

    Bumps [mamba-org/provision-with-micromamba](https://github.com/mamba-org/provision-with-micromamba) from 13 to 14.
    - [Release notes](https://github.com/mamba-org/provision-with-micromamba/releases)
    - [Commits](mamba-org/provision-with-micromamba@v13...v14)

    ---
    updated-dependencies:
    - dependency-name: mamba-org/provision-with-micromamba
      dependency-type: direct:production
      update-type: version-update:semver-major
    ...

    Signed-off-by: dependabot[bot] <support@github.com>

    Signed-off-by: dependabot[bot] <support@github.com>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

commit c440148
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Tue Nov 1 08:26:17 2022 -0600

    Bump xarray-contrib/ci-trigger from 1.1 to 1.2 (#183)

    Bumps [xarray-contrib/ci-trigger](https://github.com/xarray-contrib/ci-trigger) from 1.1 to 1.2.
    - [Release notes](https://github.com/xarray-contrib/ci-trigger/releases)
    - [Commits](xarray-contrib/ci-trigger@v1.1...v1.2)

    ---
    updated-dependencies:
    - dependency-name: xarray-contrib/ci-trigger
      dependency-type: direct:production
      update-type: version-update:semver-minor
    ...

    Signed-off-by: dependabot[bot] <support@github.com>

    Signed-off-by: dependabot[bot] <support@github.com>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

commit 471aa94
Author: Deepak Cherian <dcherian@users.noreply.github.com>
Date:   Wed Oct 26 19:39:44 2022 -0600

    Use blockwise to extract final result (#182)

    * Use blockwise to extract final result for method="blockwise"

    * FOr all methods

    * bugfix

    * Try return_array from _finalize_results

    * Revert "Try return_array from _finalize_results"

    This reverts commit cb25e38.

    * Fixes.

commit df0da40
Author: Deepak Cherian <dcherian@users.noreply.github.com>
Date:   Tue Oct 25 16:09:04 2022 -0600

    Fix blockwise sort optimization (#181)

commit c2c4e1d
Author: Deepak Cherian <dcherian@users.noreply.github.com>
Date:   Tue Oct 25 13:41:40 2022 -0600

    Support reindexing in simple_combine (#177)

    * Support reindexing in simple_combine

    For 1D combine, great improvement for cohorts-type reductions
    More memory but similar time for map-reduce.

    Note that the map-reduce intermediates are a worst case where there are
    no shared groups between the chunks being combined.
    This case is actually optimized in _group_combine where reindexing is
    skipped for reducing along a single axis.

    [ 68.75%] ··· =========== ========= =========
                  --                combine
                  ----------- -------------------
                      kind     grouped   combine
                  =========== ========= =========
                    cohorts      760M      631M
                   mapreduce     981M     1.81G
                  =========== ========= =========

    [ 75.00%] ··· =========== ========== ===========
                  --                 combine
                  ----------- ----------------------
                      kind     grouped     combine
                  =========== ========== ===========
                    cohorts    393±10ms    137±10ms
                   mapreduce   652±10ms   611±400ms
                  =========== ========== ===========

    Fix bug in unique

    * Fix bug with all NaN blocks

commit 0db264a
Author: Deepak Cherian <dcherian@users.noreply.github.com>
Date:   Mon Oct 24 12:27:12 2022 -0600

    Update visualize for main changes (#179)

    * Update visualize for main changes

    * Update tests/__init__.py

commit ccf578d
Author: Deepak Cherian <dcherian@users.noreply.github.com>
Date:   Mon Oct 24 09:26:07 2022 -0600

    Use npg sum_of_squares (#135)

    Bump to npg >= 0.9.19

    Closes #107

commit c370b5d
Author: dcherian <deepak@cherian.net>
Date:   Thu Oct 20 10:07:36 2022 -0600

    Fix unique

commit 336e2bb
Author: dcherian <deepak@cherian.net>
Date:   Wed Oct 19 14:32:12 2022 -0600

    Remove split-reduce to reduce some test time.

commit 5431813
Author: dcherian <deepak@cherian.net>
Date:   Wed Oct 19 20:28:15 2022 -0600

    Type more utility functions

commit aa14656
Author: dcherian <deepak@cherian.net>
Date:   Wed Oct 19 20:05:56 2022 -0600

    Simplify code for blockwise

commit b1b43d9
Author: dcherian <deepak@cherian.net>
Date:   Wed Oct 19 20:01:47 2022 -0600

    Test argmax early

commit 47e0b38
Author: Deepak Cherian <dcherian@users.noreply.github.com>
Date:   Wed Oct 19 17:46:54 2022 -0600

    Force reindex to be bool always (#176)

    * Force reindex to be bool always

    Closes #155

    Turns out we weren't using the more efficient simple_combine with
    map_reduce in all cases because do_simple_combine was None when reindex
    was None.

    Now the default for map-reduce is reindex=True when (expected_groups is not None)
    or (expected_groups is None and by_is_dask is False)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant