Distributed worker manager doesn't use socket connection to infer worker ip #85

Moelf · 2022-10-01T03:26:49Z

For some reason we don't take advantage of the fact that we can call Sockets.getpeername() here; instead we read the stdout of the worker process.

This is problematic mainly because:

- the worker node always reports the address of its first IPv4 interface, regardless of whether that is the interface it actually used to contact the main node:
  https://github.com/JuliaLang/julia/blob/0d00660a38f4d4049e12a97399e4ef613bf0d7dc/stdlib/Sockets/src/addrinfo.jl#L272-L276
- the worker node may be running inside a container (or may, for whatever reason, have a virtual interface listed before everything else)

My question: can we add a specialization of read_worker_host_port for the case where config.io :: Sockets.TCPSocket?
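
A minimal sketch of what such a specialization could rely on, assuming the manager holds a connected TCPSocket for the worker. `peer_host_port` is a hypothetical helper name, not an existing Distributed function, and the loopback server below only stands in for the main node. Note that the port returned by `getpeername` is the worker's source port for this connection, which is not necessarily the port the worker listens on.

```julia
using Sockets

# Hypothetical helper: read the remote address from the connected socket
# instead of parsing what the worker prints to stdout.
function peer_host_port(sock::TCPSocket)
    ip, port = getpeername(sock)   # returns (IPAddr, UInt16)
    return string(ip), Int(port)
end

# Loopback demo: `server` plays the main node, `client` plays the worker.
listen_port, server = listenany(ip"127.0.0.1", 9000)
accept_task = @async accept(server)
client = connect(ip"127.0.0.1", listen_port)
conn = fetch(accept_task)

# The address seen here is the one the peer really used to reach us.
host, peer_port = peer_host_port(conn)
println("worker reached us from ", host, ":", peer_port)

close(client); close(conn); close(server)
```

The point of the design is that the address comes from the connection itself, so it reflects the interface the worker used rather than whichever interface happens to be listed first.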


Moelf · 2022-10-01T04:37:03Z

bash-4.2$ route | grep '^default' | grep -o '[^ ]*$'
ens1f0.3604

shows that we should be using:

192.170.240.0

but the first IP address libuv comes up with is 192.168.240.0.

I couldn't find a way to look up the default interface in libuv.

Moelf · 2022-10-01T04:44:14Z

to filter out private IP range
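
One way to read the suggestion above, sketched under the assumption that the candidates are IPv4 addresses and that "private" means the RFC 1918 ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16). `is_private_ipv4` is a hypothetical helper, and the two addresses are the ones quoted earlier in this thread. This is only a heuristic: on many clusters the private address is the one workers should actually advertise.

```julia
using Sockets

# Hypothetical helper: true if `ip` falls in one of the RFC 1918 private ranges.
function is_private_ipv4(ip::IPv4)
    # Take the first two octets of the dotted-quad representation.
    a, b = parse.(Int, split(string(ip), '.'))
    return a == 10 || (a == 172 && 16 <= b <= 31) || (a == 192 && b == 168)
end

# The address libuv reported first, and the one on the default route.
candidates = [ip"192.168.240.0", ip"192.170.240.0"]

# Keep only the addresses outside the private ranges.
public_addrs = filter(!is_private_ipv4, candidates)
println(public_addrs)
```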

Moelf mentioned this issue Oct 1, 2022

[WIP] use TCPSocket to determine worker ip and port JuliaLang/julia#46996

Closed

Moelf mentioned this issue Jun 16, 2023

Update ElasticManager constructor auto IP config JuliaParallel/ClusterManagers.jl#190

Merged

vtjnash transferred this issue from JuliaLang/julia Feb 11, 2024
