Started jobs hang at "Running calculations ..." #261

mhesselbarth · 2021-04-16T13:12:11Z

Hello,

I am currently having the problem that jobs are sent to the workers, but they never seem to actually start and thus get canceled due to the time limit. The code itself should be okay, since it runs without problems using e.g. the future package, and all I'm doing is getting the node name (fx <- function(x) {Sys.sleep(30); Sys.info()["nodename"]}).

My first guess is that the workers cannot communicate because they can't find zeromq. I tried setting LD_LIBRARY_PATH to the zeromq installation, but this didn't help (setenv ('LD_LIBRARY_PATH', 'home/mhessel/zeromq-4.0.3/')).

Worker log

2021-04-16 08:40:25.777142 | Master: tcp://gl-login2.arc-ts.umich.edu:7313
2021-04-16 08:40:25.798204 | WORKER_UP to: tcp://gl-login2.arc-ts.umich.edu:7313
slurmstepd: error: *** JOB 19291379 ON gl3031 CANCELLED AT 2021-04-16T08:42:39 DUE TO TIME LIMIT ***

SSH log

> clustermq:::ssh_proxy(ctl=51896, job=50915)
master ctl listening at: tcp://127.0.0.1:51896
forwarding local network from: tcp://gl-login2.arc-ts.umich.edu:7313
sent PROXY_UP to master ctl
received common data:function (x) {    Sys.sleep(30)    Sys.info()["nodename"]}
setting up qsys: SLURM
sent PROXY_READY to master ctl
received: PROXY_CMDqsys$submit_jobs(job_name = "clustermq", service = "short", mem_cpu = 512, walltime = "00:02:00", log_file = "clustermq.log", n_jobs = 3, log_worker = TRUE, verbose = TRUE)
Submitting 3 worker jobs (ID: clustermq) ...
received: PROXY_STOPTRUE
shutting down and cleaning up
Master: [247.2s 0.0% CPU]; Worker: [avg NA% CPU, max NA Mb]

Thank you very much

mschubert · 2021-04-20T11:51:28Z

Maintainer

This looks less like a library issue, more like a network (SSH) forwarding issue.

Can you tell me:

Does your code work if you run it on your login node instead of via SSH?
Which version of clustermq are you using?
Did this work before? If yes, what changed? (e.g. package update from version X to version Y)

mhesselbarth · 2021-04-26T17:49:08Z

Author

Hey,

Interesting that this might be an SSH issue.

I am using clustermq_0.8.95.1
I have used clustermq before, but on a different HPC. On the HPC I am currently using, I have never used clustermq, and I am not aware of anybody else who has.

Mmh... running on the login node doesn't work either, and clustermq gets stuck during this step:

Submitting 3 worker jobs (ID: clustermq) ...
Running 3 calculations (0 objs/0 Mb common; 1 calls/chunk) ...

Which is the same step where it gets stuck when using SSH.

mschubert · 2021-04-27T07:42:02Z

Maintainer

Ok, that makes it easier because now we know the issue is a connection problem from the workers to the login node, and not related to ssh.

Your login node likely has multiple network interfaces, and if a worker tries to connect to Sys.info()["nodename"] it resolves to the wrong interface.

You likely need to set options(clustermq.host="<interface that accepts worker connections>").

You can list your network interfaces using the ifconfig command, which will look something like the following:

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        ...

em3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.23.44.3  netmask 255.255.252.0  broadcast 172.23.47.255
        inet6 fe80::eef4:bbff:fece:2514  prefixlen 64  scopeid 0x20<link>
        ...

ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 65520
        inet 172.23.52.3  netmask 255.255.252.0  broadcast 172.23.55.255
        inet6 fe80::f652:1403:79:8a11  prefixlen 64  scopeid 0x20<link>
        ...

Here, Sys.info()["nodename"] resolved to the em3 interface, which did not accept incoming connections. Setting options(clustermq.host="ib0") solved the issue.
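A minimal sketch of how that setting might be applied, assuming it is set in the main R session before any jobs are submitted. The interface name "ib0" comes from the example above and will differ on other clusters; the commented-out Q() call only illustrates where a submission would follow.

```r
# Tell clustermq which host/interface workers should connect back to.
# "ib0" is the interface from the example above; replace it with yours.
options(clustermq.host = "ib0")

# Confirm that the option is set in this session.
getOption("clustermq.host")

# Then submit as usual (illustrative; needs clustermq and a scheduler):
# fx <- function(x) Sys.info()["nodename"]
# clustermq::Q(fx, x = 1:3, n_jobs = 3)
```

To avoid repeating this in every session, the options() line can go into ~/.Rprofile on the machine that runs the main process.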

You can use this code to check which interface the node name resolves to:

R -e 'system(paste("nslookup", Sys.info()["nodename"]))'
> Name:	your.node.name
> Address: 172.23.44.3 # <- this matches the inet resolved from the node name
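As a rough sketch of that matching step, the helper below looks up which interface carries a given address in ifconfig-style output. The function name find_interface is made up for illustration, and the parsing assumes the layout shown above (an unindented "name:" header followed by indented "inet" lines); other ifconfig versions may format their output differently.

```r
# Map an IPv4 address to the interface that carries it, given the text
# output of `ifconfig` as a character vector (one element per line).
find_interface <- function(ifconfig_lines, address) {
  current <- NA_character_
  for (line in ifconfig_lines) {
    if (grepl("^[^[:space:]]+:", line)) {
      # An unindented "name:" header starts a new interface block.
      current <- sub(":.*$", "", line)
    } else {
      # Indented "inet a.b.c.d" lines belong to the current interface;
      # "inet6" lines do not match because whitespace must follow "inet".
      m <- regmatches(
        line,
        regexec("^[[:space:]]+inet[[:space:]]+([0-9.]+)", line)
      )[[1]]
      if (length(m) == 2 && identical(m[2], address)) {
        return(current)
      }
    }
  }
  NA_character_
}

# Sample input copied from the ifconfig output above.
example <- c(
  "lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536",
  "        inet 127.0.0.1  netmask 255.0.0.0",
  "em3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500",
  "        inet 172.23.44.3  netmask 255.255.252.0  broadcast 172.23.47.255",
  "ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 65520",
  "        inet 172.23.52.3  netmask 255.255.252.0  broadcast 172.23.55.255"
)
find_interface(example, "172.23.44.3")  # "em3"
```

On a live system the input could come from system("ifconfig", intern = TRUE), with the address taken from the nslookup output.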

To decide which interface to use instead, either (1) check manually for incoming connections e.g. using netcat, or (2) try different interfaces until it works.
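If netcat is not available on the compute nodes, option (1) could also be approximated from R. The helper below is only a sketch: can_connect is a made-up name, and the address and port in the usage comment are placeholders (the address is the ib0 one from the example above). It assumes something is already listening on that port on the login node, for example a netcat listener, whose exact flags vary between netcat versions.

```r
# Return TRUE if a TCP connection to host:port can be opened within
# `timeout` seconds, FALSE otherwise.
can_connect <- function(host, port, timeout = 5) {
  tryCatch({
    con <- suppressWarnings(
      socketConnection(host = host, port = port, server = FALSE,
                       blocking = TRUE, open = "r+", timeout = timeout)
    )
    close(con)
    TRUE
  }, error = function(e) FALSE)
}

# Hypothetical usage from inside a job on a compute node, once per
# candidate interface address of the login node:
# can_connect("172.23.52.3", 6000)
```

A FALSE result for every interface would suggest that the login node does not accept incoming connections from jobs at all.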

5 replies

c1au6i0 Nov 30, 2021

Yes, I have the same problem just trying to run the user guide example using Slurm. It submits the jobs but hangs at the same step.
I have an extremely naive follow-up question: how do I actually determine the "interface that accepts worker connections"?
Thanks!

mschubert Nov 30, 2021
Maintainer

I added more explanation above.

c1au6i0 Nov 30, 2021

@mschubert thank you for your help, it is very appreciated! In my case, I have lo, ib0, em1-4. I have tried all of them without success. I am working on a clinically graded node, not sure if that adds an extra level of security and/or has any connection with this.

mschubert Nov 30, 2021
Maintainer

There may well be (login) nodes that do not accept incoming connections from jobs at all (in which case that's a limitation imposed by your sys admins that will unfortunately block clustermq entirely). An alternative might be to run your main process in a job as well (if job-job connections are allowed), or use e.g. the batchtools package that transmits data via the file system instead.

c1au6i0 Nov 30, 2021

Yes, it seems that job-job connections work and are a viable workaround. Thanks @mschubert !!
