Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use drop_operation_state to avoid stack overflows #1004

Merged
merged 6 commits into from
Dec 12, 2023

Conversation

aurianer
Copy link
Collaborator

@aurianer aurianer commented Oct 6, 2023

Will need the pika 0.19.1 patch release (it will probably be released next week), in the meantime I'm temporarily using pika@main in CI
Attempt to fix #1005. Fixes #665.

@aurianer aurianer self-assigned this Oct 6, 2023
@aurianer aurianer force-pushed the use_drop_operation_state branch from 08dd214 to d3a6452 Compare October 6, 2023 13:51
@aurianer
Copy link
Collaborator Author

aurianer commented Oct 6, 2023

cscs-ci run

@aurianer aurianer force-pushed the use_drop_operation_state branch from d3a6452 to cb788ba Compare October 6, 2023 14:02
@aurianer aurianer changed the title Use drop operation state Use drop_operation_state to avoid stack overflows Oct 6, 2023
@msimberg
Copy link
Collaborator

msimberg commented Oct 9, 2023

cscs-ci run

1 similar comment
@aurianer
Copy link
Collaborator Author

aurianer commented Oct 9, 2023

cscs-ci run

@aurianer aurianer marked this pull request as ready for review October 10, 2023 09:17
@aurianer
Copy link
Collaborator Author

cscs-ci run

@aurianer aurianer force-pushed the use_drop_operation_state branch from cb788ba to acf6364 Compare October 10, 2023 09:45
@aurianer
Copy link
Collaborator Author

cscs-ci run

Copy link
Collaborator

@rasolca rasolca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR still depend on pika@main. I think it should be updated to pika@0.19.1 before considering it ready.

CMakeLists.txt Outdated Show resolved Hide resolved
spack/packages/dla-future/package.py Outdated Show resolved Hide resolved
@rasolca rasolca added Type:Bug Something isn't working Priority:High labels Oct 10, 2023
Copy link
Collaborator

@msimberg msimberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before approving I'd like to hear if you were able to reproduce the bug locally and if the previously failing tests have been run enough times in CI? It was after all the band_to_tridiag miniapp that was failing that revealed the stack overflows, and now it was the C API eigensolver tests...

Copy link
Collaborator

@msimberg msimberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also needs a bump to the spack commit. It's currently installing pika@main.

@aurianer
Copy link
Collaborator Author

Before approving I'd like to hear if you were able to reproduce the bug locally and if the previously failing tests have been run enough times in CI? It was after all the band_to_tridiag miniapp that was failing that revealed the stack overflows, and now it was the C API eigensolver tests...

So concerning the stack overflow in the band_to_tridiag miniapp, it's solved with this PR. Concerning the failures that happened in CI with stdexec in Debug mode (the ones that @rasolca reported on in slack), I couldn't reproduce them in my environment on daint. I will now try to use the same jfrog container used in this CI job to see if I have more luck. I'll keep you updated

@aurianer aurianer force-pushed the use_drop_operation_state branch from acf6364 to d92c8e6 Compare October 10, 2023 14:14
@aurianer
Copy link
Collaborator Author

So I tried to reproduce the failure happening with the C API tests using the exact same sarus image and I was unable to. I let it run by night, 70 runs happened successfully. And I'm not sure why but my ssh connection dropped in the middle of the 71st run. I would say, I first publish the benchmarks for this PR, we then merge it and I will investigate again in case the failure is happening again :)

@aurianer
Copy link
Collaborator Author

aurianer commented Dec 7, 2023

I posted the benchmarks on confluence, the performance is very similar, there is a slight regression for the tridiagonal solver but it appears to be minimal and this PR fixes the segfault we have seen in test_band_to_tridiag. I will rebase on master and after it's good to go on my side, let me know what you think after looking at the benchmarks.

@aurianer aurianer force-pushed the use_drop_operation_state branch from d92c8e6 to 3c50016 Compare December 7, 2023 17:24
@aurianer
Copy link
Collaborator Author

aurianer commented Dec 7, 2023

cscs-ci run

@msimberg
Copy link
Collaborator

msimberg commented Dec 8, 2023

Looks like there's a fun new linker(!) error on the latest rebuild: https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/4700071344751697/7514005670787789/-/jobs/5705139557#L1220. Perhaps we try:

  • Rebase on/merge latest DLA-Future master
  • Updating spack for a newer pika (0.21.0 from the other day)
  • If the above doesn't work, try bumping the stdexec commit to the tag of 23.09.rc4 (I think we may be waiting in vain for the non-rc tag)
  • If these don't change anything, we'll need to investigate further...

The changes look good to me otherwise though, and the benchmarks also look good enough!

Copy link
Collaborator

@msimberg msimberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving changes, but CI build failure needs to be fixed/worked around.

@aurianer
Copy link
Collaborator Author

aurianer commented Dec 8, 2023

cscs-ci run

@aurianer aurianer force-pushed the use_drop_operation_state branch from 7e61555 to dd9c398 Compare December 8, 2023 10:17
@aurianer
Copy link
Collaborator Author

aurianer commented Dec 8, 2023

cscs-ci run

@aurianer aurianer force-pushed the use_drop_operation_state branch from dd9c398 to e6c39ea Compare December 11, 2023 10:49
@aurianer
Copy link
Collaborator Author

cscs-ci run

@msimberg
Copy link
Collaborator

cscs-ci run

@rasolca rasolca merged commit 3c70c1d into eth-cscs:master Dec 12, 2023
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Priority:High Type:Bug Something isn't working
Projects
Archived in project
3 participants