Using Deep Learning to Annotate the Protein Universe

Understanding the relationship between amino acid sequence and protein function is a long-standing problem in molecular biology with far-reaching scientific implications. Despite six decades of progress, state-of-the-art techniques cannot annotate one third of microbial protein sequences, which hampers our ability to exploit sequences collected from diverse organisms. In this repository, I explore an alternative methodology based on deep learning that learns the relationship between unaligned amino acid sequences and their functional annotations across all 17,929 families of the Pfam database.
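To make "unaligned amino acid sequences" concrete, here is a minimal sketch of how a raw sequence could be turned into a fixed-length list of integer ids before being fed to a network. The vocabulary, the padding and unknown ids, the `max_len` value, and the function name `encode_sequence` are all assumptions for illustration; they are not taken from this repository's tokenizer.

```python
# Hypothetical vocabulary: the 20 standard amino acids.
# Id 0 is reserved for padding, id 1 for any other symbol (e.g. X, B, Z).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD_ID, UNK_ID = 0, 1
VOCAB = {aa: i + 2 for i, aa in enumerate(AMINO_ACIDS)}


def encode_sequence(sequence, max_len=100):
    """Map an amino acid string to a list of exactly max_len integer ids."""
    # Truncate long sequences, then look up each residue in the vocabulary.
    ids = [VOCAB.get(aa, UNK_ID) for aa in sequence.upper()[:max_len]]
    # Right-pad short sequences so every example has the same length.
    return ids + [PAD_ID] * (max_len - len(ids))


encoded = encode_sequence("MKVLA", max_len=8)
print(encoded)
```

No alignment step is involved: each sequence is encoded on its own, and only truncation and padding make the lengths uniform.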

My study is limited to 600 of the families included in the dataset.
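The README does not say how the 600 families were chosen. One plausible way to build such a subset is to keep the families with the most example sequences. The sketch below works under that assumption; the function name `keep_top_families`, the record layout, and the toy data are hypothetical.

```python
from collections import Counter


def keep_top_families(records, n_families=600):
    """Keep only records whose family is among the n most frequent ones.

    `records` is a list of (sequence, family_id) pairs.
    Selecting by frequency is an assumption, not the documented method.
    """
    counts = Counter(family for _, family in records)
    kept = {family for family, _ in counts.most_common(n_families)}
    return [(seq, fam) for seq, fam in records if fam in kept]


# Toy data: made-up sequences with placeholder Pfam-style accession strings.
toy = [
    ("MKVLA", "PF00001"),
    ("MKILA", "PF00001"),
    ("GSHMT", "PF00002"),
    ("GSHMS", "PF00002"),
    ("AAAAA", "PF00003"),
]
subset = keep_top_families(toy, n_families=2)
print(len(subset))
```

Restricting the label set this way reduces the classifier's output layer to one unit per kept family, which keeps the problem tractable on modest hardware.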

Model Architecture

[Figure: model architecture]

Result:

[Figure: training accuracy vs. validation accuracy (left); training loss vs. validation loss (right)]

Model Evaluation

Notice:

Pre-trained model: https://drive.google.com/file/d/12ZsTkRlEPG8DL50Wb_tdDmHINv9pKTbj/view?usp=share_link

Pre-trained model weights: https://drive.google.com/file/d/1bj4uJBu7rbO6OaIZg--IkOC5yke_WiLn/view?usp=share_link

Tokenizer: https://drive.google.com/file/d/1-01g2VBsa6hMSCRB-DGylfffJDrCRXu4/view?usp=share_link

References:

https://www.biorxiv.org/content/10.1101/626507v4.full.pdf
