Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(iroh-net): Clear the recent pong time when pong is lost #2743

Merged
merged 1 commit into from
Sep 24, 2024

Conversation

flub
Copy link
Contributor

@flub flub commented Sep 23, 2024

Description

When we do not receive a pong in the timeframe we should have, we remove the last ping time and clear the best_addr. However the path will be re-selected from the candidates as long as the pong time is set.

This additionally clears the pong time so that there is no longer an indication this path works. While the DISCO ping-pong is not loss-protected there is already an existing check that ensures this clear is not triggered when there are still other signs of life from the remote node.

Breaking Changes

None

Notes & open questions

Note that when sending call-me-maybe the pong times are also cleared. So while I'm not a particular fan of this, it is the intended way to use this. Ideally the PathState would be clearer about this data and store this as separately. But the PathState data needs a general cleanup and I do not want to mix in extra state for now.

Fixes a bug found by @zh522130

Change checklist

  • Self-review.
  • [ ] Documentation updates following the style guide, if relevant.
  • [ ] Tests if relevant.
  • [ ] All breaking changes documented.

When we do not receive a pong in the timeframe we should have we
remove the last ping time and clear the best_addr.  However the path
will be re-selected from the candidates as long as the pong time is
set.

This additionally clears the pong time so that there is no longer an
indication this path works.  While the DISCO ping-pong is not
loss-protected there is already an existing check that ensures this
clear is not triggered when there are still other signs of life from
the remote node.

Note that when sending call-me-maybe the pong times are also cleared.
So while I'm not a particular fan of this this is the intended way to
use this.  Ideally the pathstate would be clearer about this data and
store this as separately.  But the PathState data needs a general
cleanup and I do not want to mix in extra state for now.
Copy link

Documentation for this PR has been generated and is available at: https://n0-computer.github.io/iroh/pr/2743/docs/iroh/

Last updated: 2024-09-23T15:23:47Z

@zh522130
Copy link
Contributor

@flub Thank you for resolving this issue so quickly. I was able to reproduce the problem using the old version where, after stopping the server and restarting it after a while, the client could no longer connect to the server. After switching to the current branch for testing, regardless of when the server is restarted, the client can successfully connect to the server within a few to a dozen seconds. Both local area network (LAN) and wide area network (WAN) tests have shown that this fix has resolved my issue.

@flub flub marked this pull request as ready for review September 24, 2024 08:48
Copy link
Contributor

@matheus23 matheus23 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, IIUC, the recent_pong setting interacts with assign_best_addr_from_candidates_if_empty. So previously it wasn't set to None, which meant that clearing the best_addr didn't have the desired effect, since it was re-set to a bad address in assign_best_addr_from_candidates_if_empty, right?

@flub
Copy link
Contributor Author

flub commented Sep 24, 2024

Okay, IIUC, the recent_pong setting interacts with assign_best_addr_from_candidates_if_empty. So previously it wasn't set to None, which meant that clearing the best_addr didn't have the desired effect, since it was re-set to a bad address in assign_best_addr_from_candidates_if_empty, right?

correct

@flub flub added this pull request to the merge queue Sep 24, 2024
Merged via the queue into main with commit 8fb92f3 Sep 24, 2024
31 checks passed
@divagant-martian
Copy link
Contributor

late to the party I guess. Docs for recent_pong still state

    /// Last [`PongReply`] received.

which is now factually wrong.

The field's naming is good enough to allow for this, but docs should get adjusted accordingly

@flub flub deleted the flub/reselect-bad-addr branch September 24, 2024 13:21
@flub
Copy link
Contributor Author

flub commented Sep 24, 2024

late to the party I guess. Docs for recent_pong still state

    /// Last [`PongReply`] received.

which is now factually wrong.

The field's naming is good enough to allow for this, but docs should get adjusted accordingly

Fair, but staring at this to try and improve I have not the faintest idea on how to document this right now. The best I can do is look at the callsites and describe them. But like...??

/// Last [`PongReply`] received.
    ///
    /// There are times when this gets cleared: specifically when we think this path is no
    /// longer good at all.  This happens when we are no longer receiving any traffic on it
    /// and pongs no longer arrive.  But also when we are are re-sending pings along all of
    /// the network paths (a "full ping").
    ///
    /// Clearing it ensures that the path won't be considered a candidate because it has a
    /// "recent" pong.

But that's pretty rubbish. I could make a PR with it though if you like.

What probably should happen is that the PathState cleanup that I wanted to start sometime should continue and instead of what "seleect_from_candidates" does now it should use some form of "does_this_path_work" function (but there are already several that are all weird and wrong in some ways). But that is also a larger project and I feel more like pushing multipath forward at the moment before deciding which bits of PathState need to be cleaned up.

Sorry, excuses. I know...

@divagant-martian
Copy link
Contributor

I understand and share the frustration. I'm just unhappy how we are still haunted by go code we didn't design and thus it seems we won't ever fully understand. Rant aside, to reduce the ...awkwardness.. maybe we can just say

/// The most recent [`PongReply`] considered relevant for this path.

/// Previous replies might be cleared when they can't be used to determine the 
/// aliveness of this path, or are generally considered no longer relevant.

Basically I'm still leaning on being somewhat vague so that anyone reading this code leaves at least with "I don't fully understand this field, I'd rather have some care using it" vs "I fully understand this field" with an outright wrong idea of its use.

@matheus23
Copy link
Contributor

How about

/// The most recent [`PongReply`], if more recent than [`PING_TIMEOUT_DURATION`]

?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

4 participants