-
Notifications
You must be signed in to change notification settings - Fork 191
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reinterpolate to all elements for new target points (fix BBH deadlocks) #5531
Conversation
// elements that contained any of the previous target points. However, | ||
// these elements may now contain some new target points. | ||
if (vars_infos.count(temporal_id) == 0) { | ||
vars_infos.emplace(std::make_pair( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
vars_infos.insert_or_assign
? (Haven't checked exactly what type this is, but most similar classes have it.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm let me think if we need to keep around the other members of the Info
class when we receive a new set of points. I'm thinking no.
// Add the new target interpolation points at this temporal_id. If we | ||
// already have some target points at this temporal_id, we overwrite | ||
// them with the new target points. This is because the interpolation | ||
// has already finished for the previous set of target points if we | ||
// are receiving new target points. This core must not have had any | ||
// elements that contained any of the previous target points. However, | ||
// these elements may now contain some new target points. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this happen in the opposite order? We receive the second set of points (which need interpolation) first, and then overwrite them with the first set (which didn't)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I believe that could technically happen in the opposite order. I wonder if we should somehow send along the horizon find iteration so we know which set of points is the correct one (always the higher iteration)
ec07609
to
42fca67
Compare
@wthrowe Pushed a fixup that covers the scenario of receiving points in the wrong order. |
That should do it. Go ahead and squash. |
42fca67
to
a8af8b0
Compare
if (vars_infos.count(temporal_id) == 0 or | ||
vars_infos.at(temporal_id).iteration < iteration) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make it error if the new iteration is the same as the old iteration. This is a bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wthrowe Pushed a fixup for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good.
68928c3
to
e0db394
Compare
@wthrowe Squashed |
I've been testing this branch for a few different head on BBH's (different resolutions, enabling checkpoints and quiet/debug output) and they've all successfully reached a common horizon :). So I would say this has fixed the current deadlocks we've been encountering. These runs are on Ocean at |
Hoorayyy 🙌 |
This is believed to be what was causing a lot of deadlocks in BBH simulations.
e0db394
to
62a7326
Compare
@wthrowe can you please take a look at this again? |
It's waiting on @markscheel? |
? |
I removed Mark because this is a pretty important bug fix that I think we should get in quickly. |
If the PR author requests a review, I'm going to respect that. |
That doesn't mean you can't give feedback post-squash, though. |
Proposed changes
This is believed to be what was causing a lot of deadlocks in BBH simulations. Between horizon find iterations, a target point of the new horizon surface would move from one element on one core, to a different element on a second core. Because of some implicit assumption made about message ordering (which wasn't consciously made, but just overlooked), this target point never got interpolated to. Thus the target was waiting for the interpolated data from this point while the interpolator was waiting for its next set of points and a deadlock happened.
Fixes #5487.
Upgrade instructions
Code review checklist
make doc
to generate the documentation locally intoBUILD_DIR/docs/html
.Then open
index.html
.code review guide.
bugfix
ornew feature
if appropriate.Further comments