# Protein language models from scratch

Video lecture series on building generative models of protein sequences from scratch. In this series, we build the generative models for protein sequences found in the protein transformers reference implementation on GitHub.

| Lecture | Description | Duration |
| --- | --- | --- |
| Lecture 1: Choosing a dataset | Selecting a protein sequence dataset suitable for modeling | 15 minutes |
| Lecture 2: A simple protein language model | Building a simple protein language model that predicts the next amino acid given the previous three amino acids | 30 minutes |
| Lecture 3: Implementing a protein transformer | Building a protein transformer model from scratch | 30 minutes |
| Lecture 4: Evaluating protein language models | Adding in evaluations for protein transformer models | 1 hour |
| Lecture 5: Scaling protein language models | Modeling all the protein diversity in the UniRef50 dataset by scaling up our models with GPU compute | 1 hour |

## Lectures

### Lecture 1. Choosing an appropriate problem for generative modeling of protein sequences

In this lecture, we'll choose a dataset that we think we can model with a protein transformer. We'll discuss how I chose this dataset and why it makes a good example for this tutorial series. We'll dive into the structure and function of the AcyP enzyme and discuss the aspects of biological symmetry and enzyme function that make this an interesting problem. We'll examine the structure in PyMOL and look at predicted structures of some of our training data to get an idea of what we want the model to learn.

Why I chose this particular family for this tutorial:

- enzyme with functional features (such as catalytic residues)
- short sequence (efficient to train on a laptop)
- large number of examples in UniProt (including an annotated active site); see the sketch after this list
- future datasets of functional scans (from the Fordyce Lab)
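
As a rough illustration of the kind of dataset pull involved, the sketch below queries UniProt's REST search API for short acylphosphatase-like sequences using `requests`. The query string, length filter, and page size are assumptions for illustration only; the lecture covers how the dataset was actually selected and curated.

```python
# Hedged sketch: download candidate AcyP-family sequences from UniProt as FASTA.
# The query and filters below are illustrative, not the exact ones used in the lecture.
import requests

URL = "https://rest.uniprot.org/uniprotkb/search"
params = {
    "query": "acylphosphatase AND length:[50 TO 150]",  # short, AcyP-like sequences
    "format": "fasta",
    "size": 500,  # maximum results per page
}

response = requests.get(URL, params=params, timeout=30)
response.raise_for_status()

with open("acyp_sequences.fasta", "w") as handle:
    handle.write(response.text)
```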

### Lecture 2. Building a simple language model

In this lecture, we'll build a simple language model to model the AcyP proteins in our dataset. We'll discuss representing the protein sequences as integer tokens and build a tokenizer. We'll draw from the foundational NLP literature to build a simple multilayer perceptron (MLP) model that predicts the next amino acid given the previous three. We'll briefly discuss how evaluating protein language models differs from evaluating natural language models.
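
To make the setup concrete, here is a minimal sketch (not the repository's exact code) of the Lecture 2 idea: amino acids are mapped to integer tokens, and a small MLP embeds the previous three residues and predicts the next one. The vocabulary, embedding width, and hidden size are illustrative choices; the dataset handling and training loop are omitted.

```python
# Minimal sketch of a character-level tokenizer and a Bengio-style MLP
# that predicts the next amino acid from the previous three.
import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
stoi = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}  # 0 reserved for padding/start
itos = {i: aa for aa, i in stoi.items()}

def encode(seq: str) -> list[int]:
    """Map a protein sequence to a list of integer tokens."""
    return [stoi[aa] for aa in seq]

class NextResidueMLP(nn.Module):
    """Embed the previous `context` residues, concatenate, and predict the next one."""
    def __init__(self, vocab_size: int = 21, context: int = 3,
                 d_embed: int = 16, d_hidden: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_embed)
        self.net = nn.Sequential(
            nn.Linear(context * d_embed, d_hidden),
            nn.Tanh(),
            nn.Linear(d_hidden, vocab_size),
        )

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        # idx: (batch, context) integer tokens -> (batch, vocab_size) logits
        x = self.embed(idx)   # (batch, context, d_embed)
        x = x.flatten(1)      # (batch, context * d_embed)
        return self.net(x)

# Example: distribution over the next residue after the context "MKT"
model = NextResidueMLP()
context = torch.tensor([encode("MKT")])  # shape (1, 3)
probs = torch.softmax(model(context), dim=-1)
```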

### Lecture 3. Building a protein language model with a transformer architecture

In this lecture, we'll expand our language model to a decoder-only transformer architecture, building on the work we have done so far to train a model that's capable of generating new proteins like the AcyP homologs in our training set. We'll build every aspect of the model from scratch (including the multi-head sequence attention mechanism), train the model, and then show how to sample new sequences from it. I'd definitely recommend watching Andrej Karpathy's Makemore series, which does an amazing job of building up the transformer model conceptually, step by step.
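
For orientation, here is a minimal sketch of the causal multi-head self-attention block at the heart of a decoder-only transformer, written in plain PyTorch. The names, widths, and maximum length are illustrative and not the repository's exact implementation; the full lecture also covers the MLP blocks, layer norms, and training loop.

```python
# Sketch of multi-head self-attention with a causal mask, so each position
# only attends to earlier residues in the protein sequence.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, d_model: int = 128, n_heads: int = 4, max_len: int = 512):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # project to queries, keys, values
        self.proj = nn.Linear(d_model, d_model)      # output projection
        mask = torch.tril(torch.ones(max_len, max_len)).view(1, 1, max_len, max_len)
        self.register_buffer("mask", mask)           # lower-triangular causal mask

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # reshape to (B, n_heads, T, head_dim)
        q = q.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
        k = k.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
        v = v.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
        # scaled dot-product attention with the causal mask applied
        att = (q @ k.transpose(-2, -1)) / (k.size(-1) ** 0.5)
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        out = (att @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(out)
```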

### Lecture 4. Adding in evaluations for protein transformer models

In this lecture, we'll start talking about evaluating our model. With our new ability to score the likelihood of a sequence under our model and to generate new sequences, we have new questions: how can we tell if our model is any good? We'll do two main things: set up an automatic ESMFold pipeline, and add a numerical evaluation of the model's ability to predict the effects of mutations using a dataset from the Fordyce lab.
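
One way a mutation-effect evaluation like this can be scored, sketched under assumptions rather than as the repository's exact eval code: compare the autoregressive log-likelihood of the mutant sequence with that of the wild type under the trained model. The `model` here is assumed to be a decoder-only transformer mapping token ids of shape (1, T) to logits of shape (1, T, vocab), and `encode` is the tokenizer from the earlier lectures.

```python
# Hedged sketch: mutation-effect scoring by log-likelihood difference.
import torch
import torch.nn.functional as F

@torch.no_grad()
def sequence_log_likelihood(model, tokens: torch.Tensor) -> float:
    """Sum of log p(x_t | x_<t) over the sequence for a decoder-only model."""
    logits = model(tokens[:, :-1])                 # (1, T-1, vocab)
    log_probs = F.log_softmax(logits, dim=-1)
    targets = tokens[:, 1:]                        # next-token targets
    return log_probs.gather(-1, targets.unsqueeze(-1)).sum().item()

def mutation_effect_score(model, encode, wild_type: str, mutant: str) -> float:
    """Positive scores mean the model prefers the mutant over the wild type."""
    wt = torch.tensor([encode(wild_type)])
    mt = torch.tensor([encode(mutant)])
    return sequence_log_likelihood(model, mt) - sequence_log_likelihood(model, wt)
```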

### Lecture 5. Scaling training to 150M parameters on UniRef50 using Lambda Labs

Using all the tricks for efficient model training, we'll scale up our decoder-only transformer to the size of the smaller ESM and ProGen models (about 150M parameters) and train on the UniRef50 dataset. To do this, we'll use GPU compute from Lambda Labs.
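
As a back-of-the-envelope check on what "150M parameters" means for a decoder-only transformer, the sketch below estimates the parameter count from the layer count and model width. The configuration shown (30 layers, width 640) is illustrative, chosen only because it lands near 150M; it is not necessarily the configuration used in the lecture.

```python
# Rough parameter-count estimate for a decoder-only transformer:
# attention projections (~4 * d^2 per layer) plus the feed-forward block
# (~2 * 4 * d^2 per layer), ignoring biases and layer norms.
def approx_param_count(n_layers: int, d_model: int,
                       vocab_size: int = 25, d_ff_mult: int = 4) -> int:
    attn = 4 * d_model * d_model                # Q, K, V, and output projections
    ffn = 2 * d_ff_mult * d_model * d_model     # two linear layers in the MLP block
    embeddings = vocab_size * d_model           # token embedding (output head tied)
    return n_layers * (attn + ffn) + embeddings

print(approx_param_count(n_layers=30, d_model=640))  # ~147M, in the 150M range
```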
