[FEA] Use clientId to reject connection requests for peers with existing UCX endpoints #3106
Labels
feature request
New feature or request
P2
Not required for release
shuffle
things that impact the shuffle plugin
This is a tracking issue for work that is going to go into UCX 1.12, so it won't be done anytime soon. That said, I'd like to use this to track our progress testing it.
The issue is that a connection to a peer executor can be established in two ways: we connect to the peer's UcpListener, or the peer connects to our UcpListener. When we handle an incoming connection we know nothing about the remote peer (we just get a "connection request" object from UCX, and it doesn't carry any id). Because of this, we have to create a UCX endpoint and exchange handshake data before we can identify the peer, so we can lose a race and end up with extra UCX endpoints to the same peer. This is not a functional bug, but it is a resource waste we'd like to fix.
In UCX 1.12, executor A should be able to send its executorId with the connection request to a peer (B), and B may reject the request if it has already initiated a request to executor A.
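To make the intended behavior concrete, here is a minimal sketch of the accept/reject decision on executor B. It does not use any UCX or JUCX calls: the class `PeerConnections` and its methods are hypothetical names for illustration, and the sketch assumes the client id arrives with the connection request as described above.

```python
class PeerConnections:
    """Hypothetical bookkeeping for one executor (not a UCX/JUCX API)."""

    def __init__(self, local_executor_id):
        self.local_executor_id = local_executor_id
        self.initiated = set()  # peers we have sent a connection request to
        self.accepted = set()   # peers whose connection request we accepted

    def connect_to(self, peer_id):
        # Record that we started connecting to this peer's listener.
        self.initiated.add(peer_id)

    def on_connection_request(self, client_id):
        """Return True to accept the incoming request, False to reject it."""
        if client_id in self.initiated or client_id in self.accepted:
            # We already have (or are creating) an endpoint for this peer,
            # so a second endpoint would only waste resources.
            return False
        self.accepted.add(client_id)
        return True


# Executor B (id 2) already initiated a connection to executor A (id 1).
b = PeerConnections(local_executor_id=2)
b.connect_to(1)
rejected = b.on_connection_request(1)    # A's request arrives afterwards
accepted = b.on_connection_request(3)    # a peer B has not contacted yet
duplicate = b.on_connection_request(3)   # the same peer connecting twice
```

Note that if A and B both initiate at the same time, this rule alone would make each side reject the other. Some tie-break would be needed for that case (for example, comparing executor ids), and that is an assumption on my part, not something settled in this issue.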
This is blocked by openucx/ucx#7136, and by the JUCX jar + UCX native libraries for 1.12 being available.