TORQUE trouble #149

Open
wlandau-lilly opened this issue Nov 1, 2017 · 4 comments

wlandau-lilly commented Nov 1, 2017

I am having trouble running batchtools jobs on a local installation of TORQUE on Ubuntu 16.04. I think TORQUE is working because the following test.pbs produces the expected output.

#PBS -N test
#PBS -l nodes=1:ppn=1
#PBS -l walltime=0:01:00

cd $PBS_O_WORKDIR
touch done.txt
echo "done"

However, all my jobs hang in the E state. For example, the following R script waits indefinitely.

library("batchtools")
cf <- makeClusterFunctionsTORQUE("torque.tmpl") 
reg <- makeRegistry(NA)
reg$cluster.functions <- cf
batchMap(fun = identity, x = 1:4)
submitJobs()
waitForJobs() # waits here indefinitely
reduceResultsList() # not reached

In my case, the console message of waitForJobs()

Waiting (S:4 R:4 D:0 E:0) [-------------------]   0% eta:  ?s

does not match qstat, which shows jobs hanging in the E state.

Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
98.localhost              ...8d7bd98804b04 wlandau         00:00:00 E batch          
99.localhost              ...fcce12fedcace wlandau         00:00:00 E batch          
100.localhost             ...dc63017b37ac6 wlandau         00:00:00 E batch          
101.localhost             ...b060e52879b8e wlandau         00:00:00 E batch 
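To cross-check what batchtools reports against the scheduler, the state column of the qstat listing above can be filtered directly. The following is only a sketch: `jobs_in_state` is a hypothetical helper name, and it assumes the default qstat layout shown above, with six whitespace-separated fields per job row and the state in the fifth.

```shell
#!/bin/sh
# Hypothetical helper (not part of TORQUE or batchtools): print the IDs of
# jobs whose state column equals $1. Reads default-format `qstat` output on
# stdin. Assumes each job row has exactly six whitespace-separated fields
# (id, name, user, time use, state, queue), as in the listing above.
jobs_in_state() {
    # The header row has more than six fields, and the dashed separator row
    # has "-" in the fifth field, so neither matches a real state letter.
    awk -v state="$1" 'NF == 6 && $5 == state { print $1 }'
}
```

With TORQUE available, `qstat | jobs_in_state E` would list the jobs that are stuck exiting.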

I am using @HenrikBengtsson's torque.tmpl from future.batchtools.

Related: see my Stack Overflow post here and HenrikBengtsson/future.batchtools#12.

Owner

mllg commented Nov 3, 2017

Looks like the system is not set up properly. Can you submit and run jobs manually?

Author

wlandau-lilly commented Nov 3, 2017

Pretty much. For jobs that do not depend on other jobs (as opposed to drake with the future-powered parallel backend), the following test.pbs script generates the correct output.

#PBS -N test
#PBS -l nodes=1:ppn=1
#PBS -l walltime=0:01:00

cd $PBS_O_WORKDIR
touch done.txt
echo "done"

Then the job hangs in the E state indefinitely.

Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
46.localhost              test             wlandau         00:00:00 E batch   

I was just using a simple qsub test.pbs.
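For reproducing the hang without waiting indefinitely, a manual submission can be polled with a timeout. This is a rough sketch rather than a recommended procedure: `wait_for_job` is a hypothetical helper, and it assumes that `qstat -f` prints a `job_state = X` line, that state `C` means completed, and that a nonzero exit status from qstat means the server no longer knows the job.

```shell
#!/bin/sh
# Hypothetical helper: poll a job until it leaves the queue or a timeout
# elapses. Usage: wait_for_job JOB_ID [TIMEOUT_SECONDS] [POLL_INTERVAL_SECONDS]
# The poll interval should be at least 1 so that the loop can terminate.
wait_for_job() {
    job_id=$1
    timeout=${2:-60}
    interval=${3:-5}
    elapsed=0
    state=unknown
    while [ "$elapsed" -le "$timeout" ]; do
        # Assumption: a failing qstat means the job is gone from the server.
        if ! info=$(qstat -f "$job_id" 2>/dev/null); then
            echo "finished"
            return 0
        fi
        # Extract X from the "job_state = X" line of the full listing.
        state=$(printf '%s\n' "$info" | awk '$1 == "job_state" { print $3 }')
        if [ "$state" = "C" ]; then
            echo "finished"
            return 0
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    echo "timed out in state $state"
    return 1
}
```

A possible use is `wait_for_job "$(qsub test.pbs)" 120 5`, which relies on qsub printing the new job ID. For the job above, this would be expected to report a timeout in state E instead of blocking.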

Owner

mllg commented Nov 6, 2017

So the manual job also gets stuck in the E state (E for exiting)? Then this is a configuration issue.

@wlandau-lilly
Author

Seems about right. I just wish I knew what the right configuration was.
