Add Instanovo as de novo peptide sequencing algorithm #2
Draft: BioGeek wants to merge 38 commits into yc386:main from BioGeek:instanovo
…ith the same name
Command used was: jupytext --set-formats ipynb,py --sync orthrus_v1/annotated_orthrus_pt1.ipynb
* Move docstrings inside functions
* Apply linting
* Add Makefile
* Add Dockerfile
* add environment.yml to create conda environment
* add manifest.yaml for aichor run
Hello,

This PR adds support for the InstaNovo model.

When I tried to run the original code on a free Google Colab instance, I ran out of memory when matching the InstaNovo predictions against the `uniprot_sprot.fasta` file, so I refactored the code so that it can run on other platforms. The MLOps platform I have access to is called AIchor, but my modifications should be generic enough to run on any HPC or Kubernetes-based infrastructure.

An overview of the changes I made:
* The original code used `pip` to install most of the packages, but `conda` to install `sage`. I have created an `environment.yaml` that creates a conda environment with all packages installed.
* Added a `Dockerfile` to containerize the application. It uses the previously mentioned `environment.yaml` to create a `conda` environment inside the Docker container. This makes it possible to run the code on HPC infrastructure or Kubernetes-based platforms.
* The notebooks now detect whether they are being run on Google Colab. If they are, they create a `conda` environment using the `environment.yaml`. If they are not, the conda environment is added to the notebook as an `ipykernel` so that the packages from the `conda` environment are available.
* Added a `Makefile`. A Makefile is a configuration file for a command-line tool used to automate steps. When run on a system with access to a GPU, you should be able to run the pipeline with `make`. I used `PXD027613` (ancient human coprolite spectra) for my experiments, but researchers who want to adapt the pipeline to other datasets only need to change the `PRIDE_ID` variable in the Makefile.
* Added a `.pre-commit-config.yaml` file. Pre-commit is a git hook that runs every time you commit code. The rules I've added check and automatically reformat the code so that it conforms to the PEP 8 style guide. There is also a rule that checks for typos (this is what I used for my first PR).
* Added a `.gitignore` to avoid uploading the raw or processed data, the Swiss-Prot FASTA file, or the checkpoints to the repository.
* Used `jupytext` to keep the notebooks and the Python files in sync.

Things that still need to be done:
* Please check whether you're happy with all the changes I made, or suggest things you want me to change.
* Please check whether you can run the adapted code on your end and that I haven't broken anything.
* I have only run the code with the default settings. I still need to test running InstaNovo with a custom config file.
* There is a function to filter the Casanovo output. I don't completely follow what's happening there (why don't you just do `df = df[df["search_engine_score[1]"] >= 0]`?). InstaNovo outputs log probabilities, so it is not clear to me how I should adapt that function.
* In `orthrus_v1/annotated_orthrus_pt2.ipynb`, the variable `json_file_path = '/content/drive/MyDrive/casanovo/sage/config_general_MQ_fixed_CAM_v1.json'` is hardcoded, but that file is not included in the repository. I used this example (Open-Search) config file from the Sage documentation and added it as `config.json`, but if you used a different config file, that may influence my results.
* When ready to merge, the URL to download the `environment.yaml` needs to be updated.
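
For reviewers who want to see the idea behind the Colab check mentioned in the overview, here is a minimal sketch of how such a detection can be written. It is not necessarily the exact code in the notebooks; the function names and the two returned labels are hypothetical and only illustrate the branching.

```python
import importlib.util
import sys


def running_on_colab() -> bool:
    """Return True when the google.colab package is available."""
    if "google.colab" in sys.modules:
        return True
    try:
        return importlib.util.find_spec("google.colab") is not None
    except (ImportError, ValueError):
        # ImportError is raised when the parent package "google" is missing.
        return False


def choose_setup(on_colab: bool) -> str:
    """Pick a setup path; the labels are hypothetical placeholders."""
    # On Colab: build the conda environment from environment.yaml.
    # Elsewhere: register the existing conda environment as an ipykernel.
    return "create-conda-env" if on_colab else "register-ipykernel"


if __name__ == "__main__":
    print(choose_setup(running_on_colab()))
```

Keeping the detection in one small function means the notebooks only branch in a single place.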
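
On the open question about filtering InstaNovo output: a log probability is always less than or equal to zero, so a `>= 0` cutoff would not be meaningful for it. One possible approach, sketched below, is to exponentiate the log probability back to a value between 0 and 1 and apply a threshold to that. The field name `log_probability`, the threshold of 0.8, and the example rows are all assumptions made for illustration; they are not taken from InstaNovo or from the existing filter function.

```python
import math


def log_prob_to_prob(log_prob: float) -> float:
    """Convert a natural-log probability back to a probability."""
    return math.exp(log_prob)


def filter_by_confidence(rows, min_prob=0.8, key="log_probability"):
    """Keep rows whose probability is at least min_prob.

    `key` and `min_prob` are hypothetical defaults, not InstaNovo settings.
    """
    return [row for row in rows if log_prob_to_prob(row[key]) >= min_prob]


# Made-up example predictions, for illustration only.
predictions = [
    {"peptide": "PEPTIDEK", "log_probability": -0.05},
    {"peptide": "SAMPLER", "log_probability": -2.30},
]
kept = filter_by_confidence(predictions)
print([row["peptide"] for row in kept])
```

The same comparison can be applied to a DataFrame column; the plain-Python version is used here only to keep the sketch free of dependencies.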
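
For the hardcoded `json_file_path`, one option would be to resolve the Sage config path at runtime and fall back to the `config.json` added in this PR. The sketch below assumes a hypothetical environment variable named `SAGE_CONFIG`; neither the variable nor the helper exists in the repository.

```python
import os
from pathlib import Path

# Fallback: the config.json added to the repository in this PR.
DEFAULT_CONFIG = Path("config.json")


def resolve_sage_config(env_var: str = "SAGE_CONFIG",
                        default: Path = DEFAULT_CONFIG) -> Path:
    """Return the config path from the environment, or the default.

    `SAGE_CONFIG` is a hypothetical variable name used for illustration.
    """
    return Path(os.environ.get(env_var, default))


if __name__ == "__main__":
    json_file_path = resolve_sage_config()
    print(json_file_path)
```

With this, a Colab user could point the variable at a file on Google Drive, while HPC or Kubernetes runs would pick up the file shipped in the repository.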