Skip to content
This repository has been archived by the owner on May 26, 2022. It is now read-only.

Fix simultaneous dials #250

Merged
merged 29 commits into from
Apr 1, 2021
Merged

Fix simultaneous dials #250

merged 29 commits into from
Apr 1, 2021

Conversation

vyzo
Copy link
Contributor

@vyzo vyzo commented Mar 30, 2021

Alternative to #246

This implements a join/split algorithm for synchronizing simultaneous dials through a dial worker.

@vyzo vyzo requested a review from Stebalien March 30, 2021 14:17
@vyzo vyzo mentioned this pull request Mar 30, 2021
@vyzo vyzo changed the title Fix simultaneous dial Fix simultaneous dials Mar 30, 2021
@vyzo vyzo marked this pull request as ready for review March 30, 2021 18:36
@vyzo
Copy link
Contributor Author

vyzo commented Mar 30, 2021

TestConnectionGating appears to be flaky, it intermittently fails on the base branch as well.
Opened issue #251.

Copy link
Member

@Stebalien Stebalien left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In terms of testing, the codecov comments are actually pretty useful.

dial_sync.go Outdated Show resolved Hide resolved
swarm_dial.go Outdated Show resolved Hide resolved
dial_sync.go Show resolved Hide resolved
swarm_dial.go Show resolved Hide resolved
swarm_dial.go Outdated Show resolved Hide resolved
swarm_dial.go Outdated Show resolved Hide resolved
swarm_dial.go Show resolved Hide resolved
}

// the request has some pending or new dials, track it and schedule new dials
reqno++
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we can also just use the request object as a key, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really. We need to pack request ids in the addrDial struct and then check to see if they have been complete (removed from the tracking).
If we used the request object as key we'd have to copy the requests all over the place and then do extra management stuff to keep them up-to-date.

What's wrong with using an integer? It doesn't cost anything.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It works, it's just something else to track.

If we used the request object as key we'd have to copy the requests all over the place and then do extra management stuff to keep them up-to-date.

Can't we use the pointers? No need to copy anything and should be the same size as an int.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(obviously not a huge deal)

Copy link
Contributor Author

@vyzo vyzo Apr 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, we could use a req pointer but then we'd have to reify the struct and move it to the heap -- which is strictly worse than using integers, because there is one more thing for the gc to track. Not a huge impact though.

I would agree with tracking overhead, but the way we use the integers is really simple,
The reqno is incremented at exactly one place, right before we add to the tracking map, so there is very little mental overhead involved.

I am really inclined to leave it as is, unless you feel very strongly about it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to leave it as it is, but I think I'm confused about something. The request is allocated here:

https://github.com/libp2p/go-libp2p-swarm/pull/250/files#diff-3025a253202bbb3d70750d06692c51f9361993ecafcce2bddbebd5508337faaaR415-R419

and we're storing it by reference in the requests map. It should be possible to just change the requests map to map[*pendRequest]struct{}. and ad.requests to []*pendRequest.

Actually... I think we can even get rid of the requests map.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't that easily, we will have to track what request is still active.
That's the main reason for the indirection, and that's why I like the integers as pointers approach -- it forces you to check for validity when you want to use it.

swarm_dial.go Show resolved Hide resolved
swarm_dial.go Show resolved Hide resolved
swarm_dial.go Outdated Show resolved Hide resolved
swarm_dial.go Outdated Show resolved Hide resolved
swarm_dial.go Outdated Show resolved Hide resolved
swarm_dial.go Outdated Show resolved Hide resolved
@Stebalien
Copy link
Member

First pass review. I like the approach.

@vyzo vyzo requested a review from Stebalien March 31, 2021 13:44
Copy link
Member

@Stebalien Stebalien left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good but I want to take a closer look when I'm more awake (got a bit distracted today).

However, can we please break the batching into a separate PR? This PR is already very large and the batching going to trigger another few rounds of discussion.

}

// the request has some pending or new dials, track it and schedule new dials
reqno++
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(obviously not a huge deal)

@vyzo
Copy link
Contributor Author

vyzo commented Apr 1, 2021

However, can we please break the batching into a separate PR? This PR is already very large and the batching going to trigger another few rounds of discussion.

Sure, I'll do that after fixing the test package situation.

@vyzo vyzo mentioned this pull request Apr 1, 2021
@vyzo
Copy link
Contributor Author

vyzo commented Apr 1, 2021

Removed the dial batching -- follow up pr in #252.

so that we exercise the dialWorker dial to self error path
they work with private data types, so there is no point in having them public
Copy link
Member

@Stebalien Stebalien left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! We'll have to deploy this on infra to test but we might as well merge it now then back out if we find an issue.

swarm_dial.go Outdated Show resolved Hide resolved
}

// the request has some pending or new dials, track it and schedule new dials
reqno++
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to leave it as it is, but I think I'm confused about something. The request is allocated here:

https://github.com/libp2p/go-libp2p-swarm/pull/250/files#diff-3025a253202bbb3d70750d06692c51f9361993ecafcce2bddbebd5508337faaaR415-R419

and we're storing it by reference in the requests map. It should be possible to just change the requests map to map[*pendRequest]struct{}. and ad.requests to []*pendRequest.

Actually... I think we can even get rid of the requests map.

return goodAddrs, nil
}

func (s *Swarm) mergeDialContexts(a, b context.Context) context.Context {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really don't like this but let's punt on making any more changes till a followup.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, fair enough.

swarm_dial.go Outdated Show resolved Hide resolved
// NonWS > WS
// Private > Public
// UDP > TCP
func (s *Swarm) rankAddrs(addrs []ma.Multiaddr) []ma.Multiaddr {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really any better?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm. Yes it is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's consistent, the code is not a jumble mess of ifs, and it is extensible...

// Private > Public
// UDP > TCP
func (s *Swarm) rankAddrs(addrs []ma.Multiaddr) []ma.Multiaddr {
addrTier := func(a ma.Multiaddr) (tier int) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we could make this more generic by ranking according to a list of functions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I.e.:

var rankFunctions = []func(ma.Multiaddr) bool{}

func rank(m ma.Multiaddr) int {
    rank := 0
    for i, f := range rankFunctions {
        if f(m) {
           rank |= (1<<i)
        }
    }
    return rank
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, we could -- let's keep this in mind for subsequent iterations; it's just 4 bits now.

@Stebalien
Copy link
Member

@vyzo merge when ready. This is a great step forward.

@vyzo vyzo merged commit fe4ee0b into fix/direct-connect Apr 1, 2021
@vyzo vyzo deleted the fix/simultaneous-dial branch April 1, 2021 20:36
@vyzo vyzo mentioned this pull request Apr 1, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants