Add Instanovo as de novo peptide sequencing algorithm #2

Draft · wants to merge 38 commits into main
Conversation

@BioGeek (Contributor) commented on Nov 25, 2024

Hello,

This PR adds support for the InstaNovo model.

When I tried to run the original code on a free Google Colab instance, I ran out of memory while matching the InstaNovo predictions against the uniprot_sprot.fasta file, so I refactored the code to run on other platforms as well. The MLOps platform I have access to is called AIchor, but my modifications should be generic enough to run on any HPC or Kubernetes-based infrastructure.

An overview of the changes I made:

  • The original code used pip to install most of the packages, but conda to install sage. I added an environment.yaml that sets up a single conda environment with all packages installed.
  • I added a Dockerfile to containerize the application. It uses the previously mentioned environment.yaml to create a conda environment inside the Docker container, which makes it possible to run the code on HPC infrastructure or Kubernetes-based platforms.
    The notebooks now detect whether they are running on Google Colab. If they are, they create a conda environment from environment.yaml; if not, the existing conda environment is registered as an ipykernel so that the notebook has access to its packages (see the sketch after this list).
  • For reproducibility I added a Makefile. A Makefile is a configuration file for the make command-line tool, which automates a sequence of steps. When run on a system with access to a GPU, you should be able to just do:
    make download_raw_data
    make convert_raw_data_to_mzml
    make run_part1
    make run_part2
    
    to reproduce this study. I used PRIDE dataset PXD027613 (ancient human coprolite spectra) for my experiments, but researchers who want to adapt the pipeline to other datasets only need to change the PRIDE_ID variable in the Makefile.
  • To enforce a consistent coding style I added a .pre-commit-config.yaml file. pre-commit is a framework for git hooks that run every time you commit code. The rules I've added check the code and automatically reformat it so that it conforms to the PEP 8 style guide. There is also a rule that checks for typos (this is what I used for my first PR).
  • I added a .gitignore to avoid uploading the raw or processed data, the Swiss-Prot FASTA file, or the checkpoints to the repository.
  • For my use case it was easier to work with Python files instead of notebooks, so I exported the notebooks to Python files. You can use jupytext to keep the notebooks and the Python files in sync.
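
A minimal sketch of the Colab-detection logic described in the notebook bullet above, assuming a hypothetical environment name orthrus and a condacolab-based bootstrap; the actual notebook code in this PR may differ in the details:

    import subprocess
    import sys

    # Detect Google Colab by trying to import its runtime package.
    try:
        import google.colab  # noqa: F401
        IN_COLAB = True
    except ImportError:
        IN_COLAB = False

    ENV_NAME = "orthrus"  # hypothetical name; environment.yaml defines the real one

    if IN_COLAB:
        # No pre-built environment on Colab: bootstrap conda with condacolab
        # (note: condacolab.install() restarts the kernel) and install the
        # packages from environment.yaml into the runtime.
        subprocess.run([sys.executable, "-m", "pip", "install", "-q", "condacolab"], check=True)
        import condacolab
        condacolab.install()
        subprocess.run(["conda", "env", "update", "-n", "base", "-f", "environment.yaml"], check=True)
    else:
        # Outside Colab the environment already exists (e.g. built from the
        # Dockerfile), so register it as a Jupyter kernel instead.
        subprocess.run(
            ["conda", "run", "-n", ENV_NAME, "python", "-m", "ipykernel",
             "install", "--user", "--name", ENV_NAME],
            check=True,
        )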

Things that still need to be done:

  • Please check if you're happy with all the changes I made or suggest things you want me to change.

  • Please check if you can run the adapted code on your end and if I haven't broken anything.

  • I have only run the code with the default settings. I still need to test running InstaNovo with a custom config file.

  • There is a function to filter the Casanovo output. I'm not completely following what's happening there (why don't you just do df = df[df["search_engine_score[1]"] >= 0]?). InstaNovo outputs log probabilities, so it is not clear to me how I should adapt that function (a possible adaptation is sketched after this list).

  • In orthrus_v1/annotated_orthrus_pt2.ipynb, the variable json_file_path = '/content/drive/MyDrive/casanovo/sage/config_general_MQ_fixed_CAM_v1.json' is hardcoded, but that file is not included in the repository. I used this example (Open-Search) config file from the Sage documentation and added it as config.json, but if you used another config file that may influence my results.

  • When ready to merge, the URL to download the environment.yaml needs to be updated.
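
Regarding the Casanovo-filter question above, a minimal sketch of one way the filtering could be adapted for InstaNovo, assuming the predictions are loaded into a pandas DataFrame and that the score column is called log_probability (both the file name and the column name here are hypothetical):

    import numpy as np
    import pandas as pd

    df = pd.read_csv("instanovo_predictions.csv")  # hypothetical file name

    # The original function apparently keeps rows with a non-negative
    # Casanovo score. InstaNovo reports log probabilities, which are always
    # <= 0, so that cutoff would drop everything; exponentiating and
    # thresholding on the resulting probability is one possible equivalent.
    MIN_CONFIDENCE = 0.9  # illustrative threshold, not taken from this PR

    df["probability"] = np.exp(df["log_probability"])  # hypothetical column name
    df = df[df["probability"] >= MIN_CONFIDENCE]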

Command used was:

    jupytext --set-formats ipynb,py --sync orthrus_v1/annotated_orthrus_pt1.ipynb

Commits (excerpt):
  • Move docstrings inside functions
  • Apply linting
  • Add Makefile
  • Add Dockerfile
  • Add environment.yml to create conda environment
  • Add manifest.yaml for AIchor run
@yc386 self-assigned this on Nov 25, 2024
@BioGeek marked this pull request as draft on Nov 25, 2024, 14:08