Started jobs hang at "Running calculations ..." #261
-
Hello,

I am currently having the problem that jobs are sent to the workers but never seem to actually start, so they get canceled when they hit the time limit. The code itself should be okay, since it runs without problems using e.g. the

My first guess is that the workers cannot communicate because they cannot find zeromq. I tried to set the

Worker log
SSH log
Thank you very much
Replies: 3 comments 5 replies
-
This looks less like a library issue and more like a network (SSH) forwarding issue. Can you tell me:
-
Hey,

Interesting that this might be an SSH issue.

Mmh... running on the login node doesn't work, and clustermq gets stuck during this step:

Which is the same step where it gets stuck when using SSH.
-
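Ok, that makes it easier, because now we know the issue is a connection problem from the workers to the login node and not related to SSH.

Your login node likely has multiple network interfaces, and if a worker tries to connect to `Sys.info()["nodename"]`, it resolves to the wrong interface. You likely need to set `options(clustermq.host="<interface that accepts worker connections>")`.

You can list your network interfaces using the `ifconfig` command, which will look something like the following:

Here,

You can use this code to check which interface the node name resolves to: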
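A rough shell sketch of that check, not the snippet from the original reply. It assumes a Linux login node where `uname`, `getent` and the iproute2 `ip` command are available (`ip` is used in place of `ifconfig` only because its one-line output is easier to parse), and it looks at IPv4 addresses only. All function names are made up for illustration.

```shell
# Read `ip -o -4 addr show` output on stdin and print "interface address" pairs.
list_ipv4_interfaces() {
    awk '{ split($4, a, "/"); print $2, a[1] }'
}

# Read "interface address" pairs on stdin and print the interface(s)
# whose address equals the first argument.
interface_for_address() {
    awk -v addr="$1" '$2 == addr { print $1 }'
}

# Show which address the node name resolves to and which interface owns it.
# `uname -n` should report the same node name that R sees in Sys.info().
show_node_resolution() {
    node="$(uname -n)"
    resolved="$(getent hosts "$node" | awk 'NR == 1 { print $1 }')"
    echo "node name:   $node"
    echo "resolves to: ${resolved:-(no address found)}"
    echo "interface:   $(ip -o -4 addr show | list_ipv4_interfaces | interface_for_address "$resolved")"
}
```

Calling `show_node_resolution` on the login node prints the node name, the address it resolves to, and the matching interface, if any. If that interface is a loopback or management interface that the compute nodes cannot reach, that would be consistent with the workers connecting to the wrong place.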
To decide which interface to use instead, either (1) check manually for incoming connections, e.g. using netcat, or (2) try different interfaces until it works.
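Option (1) could be sketched in shell as follows. This is only an illustration under assumptions: the port `6000` and the addresses in the usage note are placeholders, the function names are hypothetical, and `nc` is assumed to support `-z` (connect without sending data) and `-w` (timeout in seconds), which differs between netcat variants.

```shell
# Succeed if a TCP connection to address $1 on port $2 opens within 5 seconds.
probe_with_nc() {
    nc -z -w 5 "$1" "$2"
}

# Print every candidate address that the given probe command can reach.
# Usage: reachable_addresses PROBE_COMMAND PORT ADDRESS...
reachable_addresses() {
    probe="$1"
    port="$2"
    shift 2
    for addr in "$@"; do
        if "$probe" "$addr" "$port" >/dev/null 2>&1; then
            echo "$addr"
        fi
    done
}
```

To use it, start a listener on the login node first, e.g. `nc -l 6000` with OpenBSD netcat (traditional netcat expects `nc -l -p 6000`). Then, on a compute node, run `reachable_addresses probe_with_nc 6000 10.0.0.5 192.168.1.2` with the login node's actual interface addresses. Any address it prints is one the workers can reach, and so a candidate for `clustermq.host`. Depending on the netcat variant, the listener may exit after the first connection and need restarting between probes.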