Use drop_operation_state to avoid stack overflows #1004
cscs-ci run |
cscs-ci run |
cscs-ci run |
cscs-ci run |
cscs-ci run |
This PR still depends on pika@main. I think it should be updated to pika@0.19.1 before considering it ready.
Before approving, I'd like to hear whether you were able to reproduce the bug locally and whether the previously failing tests have been run enough times in CI. After all, it was the failing band_to_tridiag miniapp that revealed the stack overflows, and now it was the C API eigensolver tests...
This also needs a bump to the spack commit. It's currently installing pika@main.
So concerning the stack overflow in the |
So I tried to reproduce the failure in the C API tests using the exact same sarus image, and I was unable to. I let it run overnight, and 70 runs completed successfully. I'm not sure why, but my ssh connection dropped in the middle of the 71st run. I suggest that I first publish the benchmarks for this PR, then we merge it, and I will investigate again if the failure happens again :)
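For anyone who wants to repeat this kind of overnight soak run, here is a minimal sketch of a retry loop. It is not the script used above: the function name `run_until_failure` and the placeholder command are assumptions for illustration, and the real test invocation (for example, a test launched inside the container image) would need to be substituted.

```python
import subprocess
import sys


def run_until_failure(cmd, max_runs):
    """Run cmd up to max_runs times, stopping at the first non-zero exit.

    Returns (successful_runs, failing_returncode), where failing_returncode
    is None if every run succeeded. On POSIX, a process killed by a signal
    reports a negative return code (e.g. -11 for SIGSEGV).
    """
    for run in range(1, max_runs + 1):
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            return run - 1, result.returncode
    return max_runs, None


# Placeholder command (hypothetical): replace with the actual test invocation.
placeholder_cmd = [sys.executable, "-c", "pass"]
successes, failure = run_until_failure(placeholder_cmd, max_runs=5)
print(f"{successes} successful runs, failing return code: {failure}")
```

Running the loop under `nohup` or a terminal multiplexer would keep it alive if the ssh connection drops mid-run.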
I posted the benchmarks on Confluence. The performance is very similar: there is a slight regression for the tridiagonal solver, but it appears to be minimal, and this PR fixes the segfault we have seen in |
cscs-ci run |
Looks like there's a fun new linker(!) error on the latest rebuild: https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/4700071344751697/7514005670787789/-/jobs/5705139557#L1220. Perhaps we try:
The changes look good to me otherwise though, and the benchmarks also look good enough!
Approving changes, but CI build failure needs to be fixed/worked around.
cscs-ci run |
cscs-ci run |
cscs-ci run |
cscs-ci run |
Will need the pika 0.19.1 patch release (it will probably be released next week). In the meantime, I'm temporarily using pika@main in CI.
Attempt to fix #1005. Fixes #665.