Submit the SoS job submitter to a compute node #1407
I'd like to add that the login nodes at my cluster are often congested, so the SoS job submitter runs very slowly. Forwarding the job submitter to a compute node would likely considerably speed up my runs and not annoy my fellow users.
For this particular request, what you want to do is run the workflow itself on the cluster so that tasks are submitted from there. I have never tried this, but it should work with the right setup. Also, I think the cluster-executing mode was designed to avoid this trouble, and it would be conceptually easier and more efficient if there are many small substeps.

If the purpose is to submit tasks from computing nodes, we may have to submit the entire workflow to the cluster, which will then submit jobs to the cluster. These should all in theory work after the recent improvements to remote execution (#1418, not yet released). I will create a test case and try it on our cluster.
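As a rough illustration of what this looks like in practice (the host alias hpc is a placeholder and the resource options mirror the example further down this thread): running the workflow with -r hpc executes the workflow itself on the remote host, and -q hpc makes it submit its tasks to the cluster queue from there, e.g. from a SoS notebook:

%run -q hpc -r hpc mem=2GB cores=1 walltime=00:10:00 nodes=1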
It is working. Here is how to reproduce it. Note that the cluster (in this case a SLURM cluster) is defined as a remote host in my configuration. What will happen is that:
a. A script will be submitted as a workflow to the cluster.
b. When the workflow is executed, a second job will be submitted with the task script.
This example is simple, but I suppose the key elements are there. @gaow Let me know if this scenario helps.
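To make steps a. and b. concrete, here is a hedged sketch of the two-level submission; the file names are hypothetical, and the actual scripts are generated by SoS from the workflow_template and task_template entries of the host configuration (shown further down), so the details will differ:

# a. a job wrapping the entire workflow is submitted to the cluster;
#    its body contains the sos command that executes the workflow there
qsub workflow_job.sh

# b. while that job runs on a compute node, each task: statement is rendered
#    into its own job script and submitted to the scheduler in turn
qsub <task_job_file>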
@pgcudahy I got an email notification but did not see your post here. The remote execution feature is currently not very user friendly because there is no way to check the status and stdout/err of remote workflows. I am aware of the problem (#1420) but do not know how to address it yet. That said, you can manually check the status by looking at the output of the workflow captured by SLURM and see what went wrong. It is also possible that the job was not submitted correctly. I will try again today on a PBS system, post a configuration file for a real cluster, and have a harder look at #1420.
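For example, this manual check could look like the following (assuming a SLURM cluster and a workflow_template that captures stdout/stderr under ~/.sos/workflows/, as the PBS configuration below does; <job_name> is a placeholder):

# is the workflow job still queued or running?
squeue -u $USER
# what did the workflow actually print? (paths come from workflow_template)
cat ~/.sos/workflows/<job_name>.out
cat ~/.sos/workflows/<job_name>.err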
Thank you. I deleted my post because I'm having some issues with my jobs, using both the task queue method I was using before and this new method, since upgrading to 0.22.3, and I wanted to untangle what was specific to this issue.
I am sorry about that. There have been some changes to the use of named paths; basically, absolute local paths are no longer translated (#1417), but the change might have introduced other bugs. I will be happy to have a look at your script / configuration if you cannot figure out what caused the problem.
Just to report my progress. On a true PBS cluster, I have the following definition:

hosts:
  hpc:
    based_on: hosts.cluster
    queue_type: pbs
    status_check_interval: 30
    wait_for_task: false
    modules: []
    max_running_jobs: 500
    submit_cmd: qsub {job_file}
    status_cmd: qstat {job_id}
    kill_cmd: qdel {job_id}
    task_template: |
      #!/bin/bash
      #PBS -N {task}
      #PBS -l nodes={nodes}:ppn={cores}
      #PBS -l walltime={walltime}
      #PBS -l vmem={mem//10**9}GB
      #PBS -o /home/{user_name}/.sos/tasks/{task}.out
      #PBS -e /home/{user_name}/.sos/tasks/{task}.err
      #PBS -m ae
      #PBS -M {user_name}@bcm.edu
      #PBS -v {workdir}
      module load {' '.join(modules)}
      {command}
    workflow_template: |
      #!/bin/bash
      #PBS -N {job_name}
      #PBS -l nodes={nodes}:ppn={cores}
      #PBS -l walltime={walltime}
      #PBS -l vmem={mem}
      #PBS -o /home/{user_name}/.sos/workflows/{job_name}.out
      #PBS -e /home/{user_name}/.sos/workflows/{job_name}.err
      #PBS -m ae
      #PBS -M {user_name}@bcm.edu
      module load {' '.join(modules)}
      {command}

From a SoS notebook, I ran the following workflow:

%run -q hpc -r hpc mem=2GB cores=1 walltime=00:10:00 nodes=1 -s force
input: for_each=dict(i=range(2))
output: f'test_{i}.txt'
task: walltime='10m', cores=1, mem='1G'
sh: expand=True
  echo `pwd` > {_output}
  echo I am {i} >> {_output}

The workflow is correctly submitted to the cluster, which results in a directory being created, but I got an error.
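For reference, with the task options above (walltime='10m', cores=1, mem='1G'), the rendered task_template header should look roughly like the following; the task id and user name are placeholders, and the exact way SoS normalizes walltime and mem is an assumption on my part:

#!/bin/bash
# walltime '10m' and mem '1G' below are expanded by SoS; exact formatting assumed
#PBS -N <task_id>
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:10:00
#PBS -l vmem=1GB
#PBS -o /home/<user>/.sos/tasks/<task_id>.out
#PBS -e /home/<user>/.sos/tasks/<task_id>.err
#PBS -m ae
#PBS -M <user>@bcm.edu
#PBS -v <workdir>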
OK, just to update: with #1420, it is now possible to check the status and stdout/stderr of remote workflows submitted via the -r option. Now we are in a better position to figure out the remaining details of this issue.
OK, the problem is that, when the workflow is submitted to a computing node, it still gets hold of the specified configuration file and thinks it is on the headnode. So the key is to let the computing node know that it is not the headnode.
The solution to this problem is to let the computing node "think" that it is not on the headnode. The trick is to add an alias for the compute node to ~/.sos/hosts.yml so that it does not identify itself as the headnode. In this way, when the workflow is executed on a computing node, it submits its jobs to the cluster as intended, and the workflow produces the expected output.
I do not know if I should do anything to avoid the need to define this alias.
The case was specifically handled so that no aliasing is needed, and the setup described above now works.
Similar to cumc/dsc#5, in SoS we often have to submit a job to a cluster compute node that then submits SoS tasks from there. An interface is proposed in the DSC ticket, and an implementation may not need to involve SoS at all, but I'm opening this ticket in case it is something better solved at the SoS level.
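For completeness, a minimal hand-rolled version of this pattern under PBS would wrap the SoS submitter itself in a cluster job so that it runs on a compute node rather than the login node; the file names, the hpc queue alias, and the resource values below are placeholders:

cat > submit_sos.pbs <<'EOF'
#!/bin/bash
#PBS -N sos_submitter
#PBS -l nodes=1:ppn=1
#PBS -l walltime=24:00:00
cd $PBS_O_WORKDIR
# the submitter now runs on a compute node and submits its tasks to the queue
sos run workflow.sos -q hpc
EOF
qsub submit_sos.pbs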