Add Instanovo as de novo peptide sequencing algorithm #2

Draft · wants to merge 38 commits into main
Conversation

@BioGeek (Contributor) commented on Nov 25, 2024

Hello,

This PR adds support for the InstaNovo model.

When I tried to run the original code on a free Google Colab instance, I ran out of memory while matching the InstaNovo predictions against the uniprot_sprot.fasta file, so I refactored the code to run on other platforms as well. The MLOps platform I have access to is called AIchor, but my modifications should be generic enough to run on any HPC or Kubernetes-based infrastructure.

An overview of the changes I made:

  • The original code used pip to install most of the packages, but conda to install sage. I added an environment.yaml that sets up a single conda environment with all packages installed.
  • I added a Dockerfile to containerize the application. It uses the previously mentioned environment.yaml to create a conda environment inside the Docker container, which makes it possible to run the code on HPC infrastructure or Kubernetes-based platforms.
    The notebooks now detect whether they are running on Google Colab. If they are, they create a conda environment from environment.yaml; if not, the existing conda environment is registered as an ipykernel so that the notebook has access to its packages (see the sketch after this list).
  • For reproducibility I added a Makefile. A Makefile is a configuration file for the make command-line tool, which automates a sequence of steps. When run on a system with access to a GPU, you should be able to just do:
    make download_raw_data
    make convert_raw_data_to_mzml
    make run_part1
    make run_part2
    
    to reproduce this study. I used PRIDE dataset PXD027613 (ancient human coprolite spectra) for my experiments, but researchers who want to adapt the pipeline to other datasets only need to change the PRIDE_ID variable in the Makefile.
  • To enforce a consistent coding style I added a .pre-commit-config.yaml file. pre-commit is a framework for git hooks that run every time you commit code. The rules I've added check the code and automatically reformat it so that it conforms to the PEP 8 style guide. There is also a rule that checks for typos (this is what I used for my first PR).
  • I added a .gitignore to avoid uploading the raw or processed data, the Swiss-Prot FASTA file, or the checkpoints to the repository.
  • For my use case it was easier to work with Python files instead of notebooks, so I exported the notebooks to Python files. You can use jupytext to keep the notebooks and the Python files in sync.
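
A minimal sketch of the Colab-detection logic described in the notebook bullet above, assuming a hypothetical environment name orthrus and a condacolab-based bootstrap; the actual notebook code in this PR may differ in the details:

    import subprocess
    import sys

    # Detect Google Colab by trying to import its runtime package.
    try:
        import google.colab  # noqa: F401
        IN_COLAB = True
    except ImportError:
        IN_COLAB = False

    ENV_NAME = "orthrus"  # hypothetical name; environment.yaml defines the real one

    if IN_COLAB:
        # No pre-built environment on Colab: bootstrap conda with condacolab
        # (note: condacolab.install() restarts the kernel) and install the
        # packages from environment.yaml into the runtime.
        subprocess.run([sys.executable, "-m", "pip", "install", "-q", "condacolab"], check=True)
        import condacolab
        condacolab.install()
        subprocess.run(["conda", "env", "update", "-n", "base", "-f", "environment.yaml"], check=True)
    else:
        # Outside Colab the environment already exists (e.g. built from the
        # Dockerfile), so register it as a Jupyter kernel instead.
        subprocess.run(
            ["conda", "run", "-n", ENV_NAME, "python", "-m", "ipykernel",
             "install", "--user", "--name", ENV_NAME],
            check=True,
        )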

Things that still need to be done:

  • Please check if you're happy with all the changes I made or suggest things you want me to change.

  • Please check if you can run the adapted code on your end and if I haven't broken anything.

  • I have only run the code with the default settings. I still need to test running InstaNovo with a custom config file.

  • There is a function to filter the Casanovo output. I'm not completely following what's happening there (why don't you just do df = df[df["search_engine_score[1]"] >= 0]?). InstaNovo outputs log probabilities, so it is not clear to me how I should adapt that function (a possible adaptation is sketched after this list).

  • In orthrus_v1/annotated_orthrus_pt2.ipynb, the variable json_file_path = '/content/drive/MyDrive/casanovo/sage/config_general_MQ_fixed_CAM_v1.json' is hardcoded, but that file is not included in the repository. I used this example (Open-Search) config file from the Sage documentation and added it as config.json, but if you used another config file that may influence my results.

  • When ready to merge, the URL to download the environment.yaml needs to be updated.
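
Regarding the Casanovo-filter question above, a minimal sketch of one way the filtering could be adapted for InstaNovo, assuming the predictions are loaded into a pandas DataFrame and that the score column is called log_probability (both the file name and the column name here are hypothetical):

    import numpy as np
    import pandas as pd

    df = pd.read_csv("instanovo_predictions.csv")  # hypothetical file name

    # The original function apparently keeps rows with a non-negative
    # Casanovo score. InstaNovo reports log probabilities, which are always
    # <= 0, so that cutoff would drop everything; exponentiating and
    # thresholding on the resulting probability is one possible equivalent.
    MIN_CONFIDENCE = 0.9  # illustrative threshold, not taken from this PR

    df["probability"] = np.exp(df["log_probability"])  # hypothetical column name
    df = df[df["probability"] >= MIN_CONFIDENCE]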

Command used was:

    jupytext --set-formats ipynb,py --sync orthrus_v1/annotated_orthrus_pt1.ipynb

Commits (excerpt):
  • Move docstrings inside functions
  • Apply linting
  • Add Makefile
  • Add Dockerfile
  • Add environment.yml to create conda environment
  • Add manifest.yaml for AIchor run
@yc386 self-assigned this on Nov 25, 2024
@BioGeek marked this pull request as draft on Nov 25, 2024, 14:08