[FEA] Use clientId to reject connection requests for peers with existing UCX endpoints #3106

abellina · 2021-07-30T18:33:01Z

This is a tracking issue for work that is going to go into UCX 1.12, so it won't be done anytime soon. That said, I'd like to use this to track our progress testing it.

The issue is that there are two ways a connection to a peer executor gets established: we connect to the peer's UcpListener, or the peer connects to our UcpListener. When we handle a connection from a peer, we know nothing about the remote peer (we just get a "connection request" object from UCX, but it doesn't carry any id). Because of this, we need to create UCX endpoints and handshake data before we can tell who the peer is, which can cause us to lose a race and add extra UCX endpoints. This is not a functional bug, but a resource waste we'd like to fix.

In UCX 1.12, executor A should be able to send its executorId with the connection request to a peer (B), and B may reject the request if it has already initiated a request to executor A.
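
As a rough illustration of the decision executor B would make, here is a minimal sketch that models only the bookkeeping, not UCX itself. The class `PeerConnectionTracker`, its methods, and the `Decision` enum are hypothetical names, and representing the executor id as a `long` is an assumption; none of this is the JUCX API.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of executor B's bookkeeping; this is not the JUCX API.
class PeerConnectionTracker {
    enum Decision { ACCEPT, REJECT }

    // Ids of peers we have already initiated a connection to,
    // or already accepted a connection request from.
    private final Set<Long> connectedPeers = new HashSet<>();

    // Record that we initiated a connection to a peer's listener.
    synchronized void onOutgoingConnection(long peerExecutorId) {
        connectedPeers.add(peerExecutorId);
    }

    // Decide what to do with an incoming connection request that
    // carries the sender's executor id (assumed to be a long here).
    synchronized Decision onConnectionRequest(long peerExecutorId) {
        // Set.add returns false when the id was already present,
        // meaning an endpoint for this peer already exists or is pending.
        if (!connectedPeers.add(peerExecutorId)) {
            return Decision.REJECT;
        }
        return Decision.ACCEPT;
    }

    public static void main(String[] args) {
        PeerConnectionTracker b = new PeerConnectionTracker();
        b.onOutgoingConnection(1L);                    // B connects to executor 1
        System.out.println(b.onConnectionRequest(1L)); // executor 1 also connects to B
        System.out.println(b.onConnectionRequest(2L)); // first contact from executor 2
    }
}
```

With the id available at request time, the duplicate request from executor 1 can be turned away before any endpoint or handshake data is created for it. The sketch only covers B's side; how the two executors agree on which request survives when both connect at the same moment is not specified here.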

This is blocked by openucx/ucx#7136 and by the availability of the JUCX jar and UCX native libraries for 1.12.

abellina added the feature request, ? - Needs Triage, shuffle, and P2 labels on Jul 30, 2021

sameerz removed the ? - Needs Triage label on Aug 3, 2021

abellina mentioned this issue Dec 14, 2021

Use the new ucx clientId apis #4357

Closed
