
run cactus on the grid_engine cluster #5022

Closed
minglibio opened this issue Jul 18, 2024 · 19 comments · Fixed by #5061

Comments

@minglibio

minglibio commented Jul 18, 2024

Hi,
I am trying to run cactus v2.8.4 on a grid_engine cluster with the following command:

cactus ./js ./test.txt test.hal --workDir ./temp --maxCores 24 --maxMemory 200G --doubleMem true  --realTimeLogging True --batchSystem grid_engine

I keep getting a NotImplementedError:

[2024-06-28T18:44:37+0200] [MainThread] [W] [toil.common] Batch system does not support auto-deployment. The user script ModuleDescriptor(dirPath='/data/scc3/ming.li/software/cactus-bin-v2.8.4/venv-cactus-v2.8.4/lib/python3.10/site-packages', name='cactus.progressive.cactus_progressive', fromVirtualEnv=True) will have to be present at the same location on every worker.
[2024-06-28T18:44:37+0200] [MainThread] [I] [toil] Running Toil version 7.0.0-d569ea5711eb310ffd5703803f7250ebf7c19576 on host scc2.
[2024-06-28T18:44:37+0200] [MainThread] [I] [toil.realtimeLogger] Starting real-time logging.
[2024-06-28T18:44:37+0200] [MainThread] [I] [toil.leader] Issued job 'progressive_workflow' kind-progressive_workflow/instance-4tt4gtdk v1 with job batch system ID: 1 and disk: 2.0 Gi, memory: 2.0 Gi, cores: 1, accelerators: [], preemptible: False
[2024-06-28T18:44:38+0200] [Thread-2] [E] [toil.lib.retry] Got a <class 'NotImplementedError'>:  which is not retriable according to <function AbstractGridEngineBatchSystem.with_retries.<locals>.<lambda> at 0x7fc5f96b3490>
[2024-06-28T18:44:39+0200] [MainThread] [I] [toil.leader] 0 jobs are running, 1 jobs are issued and waiting to run
[2024-06-28T18:44:39+0200] [Thread-2] [E] [toil.lib.retry] Got a <class 'NotImplementedError'>:  which is not retriable according to <function AbstractGridEngineBatchSystem.with_retries.<locals>.<lambda> at 0x7fc5f96b3490>
[2024-06-28T18:44:40+0200] [Thread-2] [E] [toil.lib.retry] Got a <class 'NotImplementedError'>:  which is not retriable according to <function AbstractGridEngineBatchSystem.with_retries.<locals>.<lambda> at 0x7fc5f96b3490>
[2024-06-28T18:44:41+0200] [Thread-2] [E] [toil.lib.retry] Got a <class 'NotImplementedError'>:  which is not retriable according to <function AbstractGridEngineBatchSystem.with_retries.<locals>.<lambda> at 0x7fc5f96b3490>
[2024-06-28T18:44:42+0200] [Thread-2] [E] [toil.lib.retry] Got a <class 'NotImplementedError'>:  which is not retriable according to <function AbstractGridEngineBatchSystem.with_retries.<locals>.<lambda> at 0x7fc5f96b3490>
[2024-06-28T18:44:43+0200] [Thread-2] [E] [toil.lib.retry] Got a <class 'NotImplementedError'>:  which is not retriable according to <function AbstractGridEngineBatchSystem.with_retries.<locals>.<lambda> at 0x7fc5f96b3490>
[2024-06-28T18:44:44+0200] [Thread-2] [E] [toil.lib.retry] Got a <class 'NotImplementedError'>:  which is not retriable according to <function AbstractGridEngineBatchSystem.with_retries.<locals>.<lambda> at 0x7fc5f96b3490>

Should I do anything to avoid this error?

Another question: in this cluster we have several different queues, and I want to run cactus on specific queues because they are faster. Which parameter should I add to achieve this? Should I set the following environment variables?

export TOIL_GRIDENGINE_PE='smp'
export TOIL_GRIDENGINE_ARGS='-q queue1,queue2'

Best,
Ming

┆Issue is synchronized with this Jira Story
┆Issue Number: TOIL-1617

@stxue1
Contributor

stxue1 commented Jul 23, 2024

@minglibio Thanks for reporting; I don't think the NotImplementedError should be happening. Do you happen to have a traceback from this workflow that you could provide?

Regarding running cactus on specific queues: Cactus uses Toil to run its entire workflow, so any Toil argument also applies to the cactus run itself. Those two environment variables should be the only ones you need to set. I'm not sure whether passing multiple queues to the batch system is supported, though (this likely depends on the batch system rather than Toil). If export TOIL_GRIDENGINE_ARGS='-q queue1,queue2' does not work, you can try a single queue, e.g. export TOIL_GRIDENGINE_ARGS='-q queue1'.

@unito-bot

➤ Adam Novak commented:

I think part of fixing this might be trawling the GridEngineBatchSystem for any required methods that aren't implemented. We don't use actual Grid Engine itself in CI, so it's possible we added a NotImplementedError abstract method to AbstractGridEngineBatchSystem and forgot the implementation here.

@DustinSokolowski

Hey!
Just as a quick follow-up, I got the exact same error. My command was:

source /.mounts/labs/simpsonlab/users/dsokolowski/projects/annotation_pipeline/external/cactus-bin-v2.8.4/venv-cactus-v2.8.4/bin/activate

export PATH="/.mounts/labs/simpsonlab/users/dsokolowski/projects/annotation_pipeline/external/cactus-bin-v2.8.4/bin:$PATH"
export TOIL_GRIDENGINE_ARGS='-V -q all.q -P simpsonlab'  

cactus jobstore cactus_in.txt target_ref.hal --binariesMode local --maxMemory 24G --realTimeLogging True --batchSystem grid_engine --consCores 4 --workDir /.mounts/labs/simpsonlab/users/dsokolowski/projects/annotation_pipeline/mHetGlaV3_test/sge_cactus_binary

The error is thrown basically immediately.
Could it be due to a deprecated toil method?
[Screenshot: 2024-08-13 at 11:45:04 PM]

The summary of the toil requirements also seems fine, so I'm not sure why it would have deprecated stuff.

toil_requirement.txt

Best,
Dustin

@DustinSokolowski

I'm not sure if this is helpful or relevant, but I tried re-installing everything with cactus 2.9 in an otherwise empty environment.

The toil requirements text file contains:

backports.zoneinfo[tzdata];python_version<"3.9"
toil[aws]==7.0.0

While I didn't get any notes or warnings, I did see this line in the installation:
Ignoring backports.zoneinfo: markers 'python_version < "3.9"' don't match your environment

Could this be leading to an issue with running toil on SGE?

Best,
Dustin

@stxue1
Contributor

stxue1 commented Aug 15, 2024

I think the NotImplementedError may be a bit misleading. The function coalesce_job_exit_codes may not be implemented, but the exception is supposed to be caught; the with_retries function just logs the exception even when it is caught. So even though an error appears in the log, the default fallback behavior should still work.
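
To make this concrete, here is a minimal sketch of the pattern being described (not Toil's actual code; the function names are just stand-ins):

import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("retry_sketch")

def with_retries(operation, *args):
    # Simplified stand-in for the retry helper: it logs any exception it
    # decides not to retry, then re-raises it for the caller to handle.
    retriable = lambda e: False  # a NotImplementedError is never retriable
    try:
        return operation(*args)
    except Exception as e:
        if not retriable(e):
            log.error("Got a %r: %s which is not retriable according to %r",
                      type(e), e, retriable)
            raise

def coalesce_job_exit_codes(batch_ids):
    # Stand-in for an optional batch-system method left unimplemented.
    raise NotImplementedError

def get_job_exit_code(batch_id):
    return 0  # pretend the per-job fallback succeeds

def check_exit_codes(batch_ids):
    try:
        return with_retries(coalesce_job_exit_codes, batch_ids)
    except NotImplementedError:
        # The exception is handled and the fallback path is taken, but the
        # error line has already been written to the log by with_retries.
        return [with_retries(get_job_exit_code, b) for b in batch_ids]

print(check_exit_codes(["job-1", "job-2"]))  # prints [0, 0], plus one logged error

So the workflow keeps going; the logged error just looks worse than it is.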

The backports message should be fine when running Python 3.9 and above, as the backports.zoneinfo library only exists to support the zoneinfo module that was added in Python 3.9's standard library.

#5061 should fix this logging error though. You can install Toil from source at that branch by running

pip3 install 'toil @ git+https://github.com/DataBiosphere/toil.git@issues/5022-not-implemented-gridengine'

or

pip3 install git+https://github.com/DataBiosphere/toil.git@issues/5022-not-implemented-gridengine#egg=toil

You may need to uninstall toil first with pip uninstall toil. If you need extras, something like

pip3 install 'toil[aws] @ git+https://github.com/DataBiosphere/toil.git@issues/5022-not-implemented-gridengine'

should work.

@DustinSokolowski @minglibio We don't have an SGE cluster here to test on, so please do tell us whether the fix works.

@DustinSokolowski

Hey!

Thank you for the quick response. So far so good!

[Screenshot: 2024-08-15 at 6:50:26 PM]

It also looks like cactus was able to run with the "singleMachine" option when submitted as a job to the SGE, which is encouraging. I will open a new issue if I run into problems running cactus with "grid_engine". We're assembling a number of species in the same family and hoping to do an MSA with cactus on our SGE cluster, so it will be nice to get cactus working in both modes.

best,
Dustin

@minglibio
Author

Hey @stxue1

Thanks for your help.

After installing the new Toil, the NotImplementedError was fixed, but there are still some errors preventing me from running cactus.
My command:

#!/bin/bash
#$ -N cactus    #task name
#$ -wd /data/scc3/ming.li/software/cactus-bin-v2.8.4/test    #work dir
#$ -pe smp 2    #slot
#$ -l h_vmem=10G    #memory
#$ -l h_rt=960:00:00    #run time

export TOIL_GRIDENGINE_PE='smp'
export TOIL_GRIDENGINE_ARGS='-q long'

source /data/scc3/ming.li/software/cactus-bin-v2.8.4/venv-cactus-v2.8.4/bin/activate

mkdir -p ./temp

cactus ./js ../examples/evolverMammals.txt ./evolverMammals.hal --workDir ./temp --maxCores 12 --maxMemory 100G --doubleMem true --realTimeLogging True --batchSystem grid_engine

Attached is the log.
cactus.e5297895.txt

@DustinSokolowski

Hey @minglibio

I was looking through the error file and noticed that your SGE is rejecting the first job outright:

Job failed with exit value 2: 'progressive_workflow' kind-progressive_workflow/instance-ilu1jvv0 v1 Exit reason: None

Is there a possibility of an incompatibility between your queue's requirements and Toil's submission commands? For example, we need to assign a project and a job name or else the cluster rejects us outright (albeit it tells us why). Do you know if you have to set a minimum h_vmem or something like that?
Another question is about export TOIL_GRIDENGINE_ARGS='-q long'. Can your long queue support the huge number of jobs that cactus runs? Given that a mammalian genome submits around 10k jobs, I think our -q long would kick us off. That said, since cactus is crashing on your first job, I doubt it has to do with forcing "long".

@minglibio
Author

Hey @DustinSokolowski

I tested it, and it can run a job without any parameters; in that case the job name is taken from the shell file name.

I ran the built-in example of cactus (only a few species and a small part of chromosome 5), so I don't expect it to generate that many jobs... I have never submitted 10k jobs at once, but 1k should be fine in our long queue.

I noticed that in your last run you used cactus v2.9; maybe I need to update my cactus version...
How about your jobs, do they run smoothly in SGE cluster mode?

Best,
Ming

@DustinSokolowski

Hey!

Yeah, I was sort of trying everything regarding the cactus version. I'm not sure it made a difference.

I'm not sure I have a great answer for your question. I haven't made it to the end of a pipeline in grid_engine mode (though I now have on singleMachine). Here's a screenshot of the current log: it's able to run a lot of jobs, and some jobs also fail, which I think is the expected behaviour (and toil retries).
[Screenshot: 2024-08-16 at 12:11:19 PM]

@stxue1
Contributor

stxue1 commented Aug 16, 2024

The line from the log (import: unable to open X server `' @ error/import.c/ImportImageCommand/349.) kind of sounds like it's trying to run a Python script as a bash script.

Since the _toil_worker seems to be invoked on the cactus side as some sort of binary/pointer, this could be more of a cactus issue. @glennhickey @diekhans Does the log look more like a cactus related problem to you?

=========>
	import: unable to open X server `' @ error/import.c/ImportImageCommand/349.
	import: unable to open X server `' @ error/import.c/ImportImageCommand/349.
	/data/scc3/ming.li/software/cactus-bin-v2.8.4/venv-cactus-v2.8.4/bin/_toil_worker: line 5: from: command not found
	/data/scc3/ming.li/software/cactus-bin-v2.8.4/venv-cactus-v2.8.4/bin/_toil_worker: _toil_worker: line 7: syntax error near unexpected token `('
	/data/scc3/ming.li/software/cactus-bin-v2.8.4/venv-cactus-v2.8.4/bin/_toil_worker: _toil_worker: line 7: `    sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])'
<=========

@diekhans
Collaborator

diekhans commented Aug 16, 2024 via email

@minglibio
Author

@diekhans

You mean the _toil_worker?

Here is the file:

#!/data/scc3/ming.li/software/cactus-bin-v2.8.4/venv-cactus-v2.8.4/bin/python3
# -*- coding: utf-8 -*-
import re
import sys
from toil.worker import main
if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
    sys.exit(main())

@stxue1
Contributor

stxue1 commented Aug 21, 2024

Reopening as the issue itself isn't fully resolved.

Here is the file:

#!/data/scc3/ming.li/software/cactus-bin-v2.8.4/venv-cactus-v2.8.4/bin/python3
# -*- coding: utf-8 -*-
import re
import sys
from toil.worker import main
if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
    sys.exit(main())

Looks like it is being executed as a bash/shell script instead of Python. The toil worker file matches the logged error: from is the import on line 5, and sys.argv[0] = re.sub... is on line 7:

=========>
	/data/scc3/ming.li/software/cactus-bin-v2.8.4/venv-cactus-v2.8.4/bin/_toil_worker: line 5: from: command not found
	/data/scc3/ming.li/software/cactus-bin-v2.8.4/venv-cactus-v2.8.4/bin/_toil_worker: _toil_worker: line 7: syntax error near unexpected token `('
	/data/scc3/ming.li/software/cactus-bin-v2.8.4/venv-cactus-v2.8.4/bin/_toil_worker: _toil_worker: line 7: `    sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])'
<=========
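
For illustration only, here is a hypothetical way to reproduce this class of error locally (not the actual worker invocation): write a small script with a Python shebang and force it through sh, so the #! line is ignored, which is roughly what seems to be happening to the worker on the cluster.

import os
import subprocess
import tempfile
import textwrap

# A tiny script with the same shape as _toil_worker (the import of
# toil.worker never actually runs, so Toil does not need to be installed).
script = textwrap.dedent("""\
    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    import re
    import sys
    from toil.worker import main
    print("only reached when interpreted by Python")
""")

with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as handle:
    handle.write(script)
    path = handle.name
os.chmod(path, 0o755)

# sh reads the file itself and never consults the shebang.
result = subprocess.run(["sh", path], capture_output=True, text=True)
print(result.stderr)  # expect errors like "line 5: from: command not found"
os.unlink(path)

If the stderr here matches the worker log, then whatever starts the job on the compute node is treating the file as a shell script rather than letting the shebang pick the interpreter.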

@stxue1 stxue1 reopened this Aug 21, 2024
@minglibio
Author

Should I open an issue in cactus to resolve this one?

@stxue1
Contributor

stxue1 commented Aug 23, 2024

Should I open an issue in cactus to resolve this one?

At least I don't think there's much we can do on the Toil side for this specific issue, though my hunch is that this is more of a configuration issue than a Toil/cactus issue. The shebang at the top of the script should ensure that the file is run by the referenced interpreter when it is executed directly. Does /data/scc3/ming.li/software/cactus-bin-v2.8.4/venv-cactus-v2.8.4/bin/python3 point to a valid Python runtime? Does a basic Python script with the same shebang work? For example, define a file test.py:

#!/data/scc3/ming.li/software/cactus-bin-v2.8.4/venv-cactus-v2.8.4/bin/python3
# -*- coding: utf-8 -*-
import re
import sys
print("Hello world")

and running chmod +x ./test.py && ./test.py.

@minglibio
Author

I tested it, and everything went well. I also tested the _toil_worker, and it works fine on the login node.

What I will do is check with our cluster manager whether this is caused by our cluster's settings. I will let you know if we can solve this problem.

@stxue1
Contributor

stxue1 commented Aug 28, 2024

I tested it, and everything went well. I also tested the _toil_worker, and it works fine on the login node.

It's likely also worth testing whether the _toil_worker works from all cluster nodes, to check that the path is executable/accessible (or, if it is a symlink, whether it can be followed). (Though from local testing these cases should return some other error, so I still doubt this is the issue. Perhaps something different happens on the cluster?)
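
If it helps, here is a hypothetical per-node check (not something Toil provides) that could be run with the venv's python3 on each execution host; the path is the one from your logs:

# Check that the worker script is reachable, resolves through any symlinks,
# is executable, and starts with the expected Python shebang.
import os
import sys

path = "/data/scc3/ming.li/software/cactus-bin-v2.8.4/venv-cactus-v2.8.4/bin/_toil_worker"
real = os.path.realpath(path)
print("resolves to:", real)
print("exists:", os.path.exists(real))
print("executable:", os.access(real, os.X_OK))
if os.path.exists(real):
    with open(real) as handle:
        print("first line:", handle.readline().rstrip())
sys.exit(0 if os.access(real, os.X_OK) else 1)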

@minglibio
Author

It was because of the cluster's system settings: all queues on our cluster have shell_start_mode set to posix_compliant, which ignores the #! line. Adding the -S parameter resolved the issue:
export TOIL_GRIDENGINE_ARGS='-S /data/scc3/ming.li/software/cactus-bin-v2.8.4/venv-cactus-v2.8.4/bin/python'
