TORQUE trouble #12

Closed
wlandau-lilly opened this issue Oct 30, 2017 · 4 comments

Comments

@wlandau-lilly

I am having a problem similar to #11 with TORQUE (on Ubuntu 16.04). I also posted on Stack Overflow. A test.pbs script seems to work just fine.

#PBS -N test
#PBS -l nodes=1:ppn=1
#PBS -l walltime=0:01:00

cd $PBS_O_WORKDIR
touch done.txt
echo "done"

qsub test.pbs

But my jobs hang when I call future_lapply() via future.batchtools.

library(future.batchtools)
plan(batchtools_torque(template = "torque.tmpl"))
future_lapply(1:2, cat)

With torque.tmpl:

## Job name:
#PBS -N <%= if (exists("job.name", mode = "character")) job.name else job.hash %>

## Direct streams to logfile:
#PBS -o <%= log.file %>

## Merge standard error and output:
#PBS -j oe

## Launch R and evaluate the batchtools R job
#Rscript -e 'batchtools::doJobCollection("<%= uri %>")'
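
One detail worth checking in the template as pasted above: the Rscript line begins with `#`, so the shell would treat it as a comment and the submitted script would do no work. The thread does not confirm whether that is in the real file or only in the paste, so treat this as a possible cause rather than a diagnosis. Here is a minimal sketch of a check. `has_active_launch_line` is a hypothetical helper name, not part of batchtools or TORQUE.

```shell
#!/bin/sh
# has_active_launch_line TEMPLATE
# Hypothetical helper: succeeds (exit status 0) if TEMPLATE contains a
# doJobCollection call on a line whose first non-blank character is not '#',
# i.e. a line the shell will actually execute. Fails otherwise.
has_active_launch_line() {
  grep -Eq '^[[:space:]]*[^#[:space:]].*doJobCollection' "$1"
}
```

For example, `has_active_launch_line torque.tmpl && echo "launch line is active"` prints the message only when the launch line is not commented out.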

My qstat shows

Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
65.localhost              ...0f2c92668271a wlandau         00:00:00 E batch          
66.localhost              ...a920933c92e31 wlandau         00:00:00 E batch  

And tracejob -n2 65:

/var/spool/torque/server_priv/accounting/20171029: No matching job records located
/var/spool/torque/server_logs/20171029: No matching job records located
/var/spool/torque/mom_logs/20171029: No matching job records located
/var/spool/torque/sched_logs/20171029: No matching job records located

Job: 65.localhost

10/30/2017 17:27:25  S    enqueuing into batch, state 1 hop 1
10/30/2017 17:27:25  S    Job Queued at request of wlandau@localhost, owner =
                          wlandau@localhost, job name = job9b479ac148b11cd2e300f2c92668271a,
                          queue = batch
10/30/2017 17:27:25  S    Job Modified at request of Scheduler@Haggunenon
10/30/2017 17:27:25  S    Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=0kb
                          resources_used.vmem=0kb resources_used.walltime=00:00:00
10/30/2017 17:27:25  L    Job Run
10/30/2017 17:27:25  S    Job Run at request of Scheduler@Haggunenon
10/30/2017 17:27:25  S    Not sending email: User does not want mail of this type.
10/30/2017 17:27:25  S    Not sending email: User does not want mail of this type.
10/30/2017 17:27:25  M    job was terminated
10/30/2017 17:27:25  M    obit sent to server
10/30/2017 17:27:25  A    queue=batch
10/30/2017 17:27:25  M    scan_for_terminated: job 65.localhost task 1 terminated, sid=11430
10/30/2017 17:27:25  A    user=wlandau group=wlandau
                          jobname=job9b479ac148b11cd2e300f2c92668271a queue=batch
                          ctime=1509398845 qtime=1509398845 etime=1509398845 start=1509398845
                          owner=wlandau@localhost exec_host=localhost/0 Resource_List.ncpus=1
                          Resource_List.neednodes=1 Resource_List.nodect=1
                          Resource_List.nodes=1 Resource_List.walltime=01:00:00 
10/30/2017 17:27:25  A    user=wlandau group=wlandau
                          jobname=job9b479ac148b11cd2e300f2c92668271a queue=batch
                          ctime=1509398845 qtime=1509398845 etime=1509398845 start=1509398845
                          owner=wlandau@localhost exec_host=localhost/0 Resource_List.ncpus=1
                          Resource_List.neednodes=1 Resource_List.nodect=1
                          Resource_List.nodes=1 Resource_List.walltime=01:00:00 session=11430
                          end=1509398845 Exit_status=0 resources_used.cput=00:00:00
                          resources_used.mem=0kb resources_used.vmem=0kb
                          resources_used.walltime=00:00:00

and qstat -f:

Job Id: 65.localhost
    Job_Name = job9b479ac148b11cd2e300f2c92668271a
    Job_Owner = wlandau@localhost
    resources_used.cput = 00:00:00
    resources_used.mem = 0kb
    resources_used.vmem = 0kb
    resources_used.walltime = 00:00:00
    job_state = E
    queue = batch
    server = haggunenon
    Checkpoint = u
    ctime = Mon Oct 30 17:27:25 2017
    Error_Path = localhost:/home/wlandau/Desktop/torque/job9b479ac148b11cd2e30
        0f2c92668271a.e65
    exec_host = localhost/0
    Hold_Types = n
    Join_Path = oe
    Keep_Files = n
    Mail_Points = a
    mtime = Mon Oct 30 17:27:25 2017
    Output_Path = Haggunenon:/home/wlandau/Desktop/torque/.future/20171030_172
        725-6xjpT8/batchtools_516383701/logs/job9b479ac148b11cd2e300f2c9266827
        1a.log
    Priority = 0
    qtime = Mon Oct 30 17:27:25 2017
    Rerunable = True
    Resource_List.ncpus = 1
    Resource_List.nodect = 1
    Resource_List.nodes = 1
    Resource_List.walltime = 01:00:00
    session_id = 11430
    Variable_List = PBS_O_QUEUE=batch,PBS_O_HOST=localhost,
        PBS_O_HOME=/home/wlandau,PBS_O_LANG=en_US.UTF-8,PBS_O_LOGNAME=wlandau,
        PBS_O_PATH=/home/wlandau/bin:/home/wlandau/.local/bin:/usr/local/sbin
        :/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/ga
        mes:/snap/bin,PBS_O_SHELL=/bin/bash,PBS_SERVER=localhost,
        PBS_O_WORKDIR=/home/wlandau/Desktop/torque
    comment = Job started on Mon Oct 30 at 17:27
    etime = Mon Oct 30 17:27:25 2017
    exit_status = 0
    submit_args = /tmp/Rtmp6xjpT8/job9b479ac148b11cd2e300f2c92668271a.job
    start_time = Mon Oct 30 17:27:25 2017
    Walltime.Remaining = 3548
    start_count = 1
    fault_tolerant = False

Job Id: 66.localhost
    Job_Name = jobe0c18b35ffbe546515ea920933c92e31
    Job_Owner = wlandau@localhost
    resources_used.cput = 00:00:00
    resources_used.mem = 0kb
    resources_used.vmem = 0kb
    resources_used.walltime = 00:00:00
    job_state = E
    queue = batch
    server = haggunenon
    Checkpoint = u
    ctime = Mon Oct 30 17:27:27 2017
    Error_Path = localhost:/home/wlandau/Desktop/torque/jobe0c18b35ffbe546515e
        a920933c92e31.e66
    exec_host = localhost/1
    Hold_Types = n
    Join_Path = oe
    Keep_Files = n
    Mail_Points = a
    mtime = Mon Oct 30 17:27:27 2017
    Output_Path = Haggunenon:/home/wlandau/Desktop/torque/.future/20171030_172
        725-6xjpT8/batchtools_344093744/logs/jobe0c18b35ffbe546515ea920933c92e
        31.log
    Priority = 0
    qtime = Mon Oct 30 17:27:27 2017
    Rerunable = True
    Resource_List.ncpus = 1
    Resource_List.nodect = 1
    Resource_List.nodes = 1
    Resource_List.walltime = 01:00:00
    session_id = 11455
    Variable_List = PBS_O_QUEUE=batch,PBS_O_HOST=localhost,
        PBS_O_HOME=/home/wlandau,PBS_O_LANG=en_US.UTF-8,PBS_O_LOGNAME=wlandau,
        PBS_O_PATH=/home/wlandau/bin:/home/wlandau/.local/bin:/usr/local/sbin
        :/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/ga
        mes:/snap/bin,PBS_O_SHELL=/bin/bash,PBS_SERVER=localhost,
        PBS_O_WORKDIR=/home/wlandau/Desktop/torque
    comment = Job started on Mon Oct 30 at 17:27
    etime = Mon Oct 30 17:27:27 2017
    exit_status = 0
    submit_args = /tmp/Rtmp6xjpT8/jobe0c18b35ffbe546515ea920933c92e31.job
    start_time = Mon Oct 30 17:27:27 2017
    Walltime.Remaining = 3550
    start_count = 1
    fault_tolerant = False

@HenrikBengtsson
Owner

This might be better suited for https://github.com/mllg/batchtools/. To find out, check whether you observe the same problem when you run it in pure batchtools (without the future framework). See my minimal code in #11 (comment) for an example, with the obvious tweaks to run on TORQUE.

@wlandau-lilly
Author

wlandau-lilly commented Oct 31, 2017

All went smoothly, and I figured out that it was the configuration file. I had modified it, and I reverted to yours.

I am still having trouble with the drake example, though: as I explain here, all the jobs hang in the E state in qstat, and any jobs beyond the first 4 just stay queued. To keep things from hanging indefinitely, I have to run sudo qdel -p on each one manually, which I think is what keeps the final report.md target out of date no matter how many times I run make().
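
Purging each stuck job by hand gets tedious, so here is a rough sketch of how the ids of jobs in a given state could be pulled out of qstat. It assumes the default qstat column layout shown earlier in this thread (two header lines, state in the fifth column). `jobs_in_state` is a hypothetical helper name, not a TORQUE command.

```shell
#!/bin/sh
# jobs_in_state STATE
# Hypothetical helper: reads default `qstat` output on stdin and prints the
# id (first column) of every job whose state column (fifth field) equals
# STATE. The two header lines are skipped.
jobs_in_state() {
  awk -v state="$1" 'NR > 2 && $5 == state { print $1 }'
}

# Dry run: print the purge commands instead of executing them.
# qstat | jobs_in_state E | while read -r id; do echo sudo qdel -p "$id"; done
```

The commented-out dry run only echoes the `qdel -p` commands, so they can be reviewed before anything is actually purged.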

Should I migrate this thread to batchtools?

@HenrikBengtsson
Owner

If you can reproduce using batchtools alone, then yes migrate there. (future.batchtools is just a "futurizing" wrapper for batchtools.)

@wlandau-lilly
Author

Closing in favor of #12. This is not a problem with future.batchtools.
