Recent years have seen a flurry of generative nucleotide models, most of which remain of limited utility. In this short paper, we extend the theoretical unification of ecological and evolutionary change by Duthie & Luque to the problem of synthetic DNA models. Through this extension, we provide, from first principles, methods for training improved models and grouping species, as well as a road map for scaling.
The original repo was a private branch hosted on Hassan's account, which held the original collaborators' contributions, so please be aware that this repo is not a true quantitative representation of contributions.
Contributors:
- Hassan Ahmed Hassan
- Kyle Puhger
- Ali Saadat
- Alexander Chen
- Maximilian Sprang
By exploring these work packages...
- Ideal Loss: What is the ideal loss function for a nucleotide model? Are there trade-offs with regard to model architecture? Which sections of a sequence does each loss emphasize? Can we combine different losses (see the sketch after this list)? The losses in our experiments include, but are not limited to, Cross Entropy, Reverse Complement, Headless Loss, and 2D Loss, as well as energy-based losses such as total, direct, and potential energy by nucleotide. We also plan to include a wavelet-based loss and a persistent-homology loss. See “Can we learn to generate viral genomes?” for a chart representation of some of these metrics.
- Ideal Model Architecture: We are interested in testing different models in combination with our losses, so the questions above will apply to the different architectures, too. Model types we plan to use include Transformer-based, SSM-based, and mixed models, as well as convolution-based models such as the Multiresolution Convolutional memory model (MultiresConv).
- Ideal Dataset: How much redundancy is there in a genome dataset? What is the optimum learning strategy? How well is topology preserved between different samples and species?
- Maximum Scale, optimal parameters: How do each of the previous steps change with scale? Are there clear scaling laws, and can these be applied to get an optimal large foundation model?
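As a minimal sketch of how two of these objectives can be combined, the code below adds a reverse-complement term to standard next-token cross entropy. The model interface, the A/C/G/T vocabulary indices, and the `rc_weight` parameter are illustrative assumptions, not our actual training setup.

```python
import torch
import torch.nn.functional as F

# Assumed vocabulary for illustration: 0=A, 1=C, 2=G, 3=T.
COMPLEMENT = torch.tensor([3, 2, 1, 0])  # A<->T, C<->G

def combined_loss(model, tokens, rc_weight=0.5):
    """Next-token cross entropy plus a reverse-complement term.

    `model` is any causal LM mapping (batch, seq_len) token ids to
    (batch, seq_len, vocab) logits; `rc_weight` balances the two terms.
    """
    # Standard next-token cross entropy on the forward strand.
    logits = model(tokens[:, :-1])
    ce = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        tokens[:, 1:].reshape(-1),
    )

    # The same objective on the reverse complement: the biological
    # signal is strand-symmetric, so both strands are modelled.
    rc_tokens = COMPLEMENT.to(tokens.device)[tokens].flip(dims=[1])
    rc_logits = model(rc_tokens[:, :-1])
    rc_ce = F.cross_entropy(
        rc_logits.reshape(-1, rc_logits.size(-1)),
        rc_tokens[:, 1:].reshape(-1),
    )

    return ce + rc_weight * rc_ce
```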
...This project aims to:
- Create a dataset pairing DNA sequences with natural-language descriptions for diverse species, combining publicly available sequences with their associated texts/research papers.
- Build homologically/topologically optimised DNA models that outperform the current state of the art
- Build DNA models capable of generating biologically viable whole genomes
Potential downstream applications for nucleotide-only language models include:
- Encoders for sequence comparisons and classifications
- Base models for fine-tuned sequence generators and predictors, such as:
- DNA sequence risk scoring
- Bacteria-specific phage generators
- Whole genome generators for de novo organisms
- Antibiotic resistance classifiers based on bacterial plasmid DNA
- Sequence quality predictors for both small sequences (as found in functional genomics) and large sequences (as found in whole-genome sequencing projects)
Figure 1. All samples of a representative viral family converted to a 2D density plot.
Figure 2. All samples of viruses contained within NCBI represented by a 2D density plot.
Figure 3. All samples of the Norwalk-virus species available in our dataset as of 2024. A clear pattern can be observed across all sequences; it deteriorates towards the end, owing to the cumulative nature of this representation.
Figure 4. Exemplary runs of the Transformer-architecture model Pythia. Model parameter sizes range from 1.2M to 302M.
Figure 5. Comparison of generated and natural sequences in the Pythia model trained with different pretraining losses. Natural sequences are colored blue; sequences generated from Pythia with CE loss are orange, complement loss green, headless loss red, 2D plus Gaussian-distance loss violet, and 2D plus MSE loss brown.
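The 2D density plots in Figures 1-3 can be reproduced in spirit with a cumulative nucleotide walk; the base-to-direction mapping below is one common choice and an assumption about the exact construction used for the figures. The cumulative sum is also why the pattern deteriorates towards the end of the sequences, as noted in Figure 3.

```python
import numpy as np

# Illustrative base-to-step mapping; the figures may use a different one.
STEPS = {"A": (1, 0), "C": (0, 1), "G": (-1, 0), "T": (0, -1)}

def sequence_to_density(seq: str, bins: int = 64) -> np.ndarray:
    """Cumulative 2D walk over a sequence, binned into a density grid."""
    steps = np.array([STEPS[b] for b in seq if b in STEPS])
    walk = np.cumsum(steps, axis=0)  # running position after each base
    density, _, _ = np.histogram2d(walk[:, 0], walk[:, 1], bins=bins)
    return density / density.sum()

density = sequence_to_density("ACGTGGCATT" * 50)
```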
For our initial nucleotide models, we will use the RefSeq dataset:
Type | Tokens | Size | Huggingface |
---|---|---|---|
Fungi | 18B | 5.4GB | Fungi Genomes |
Bacteria | 1368B | 402GB | Bacteria Genomes Part 1, Bacteria Genomes Part 2, Bacteria Genomes Part 3, Bacteria Genomes Part 4 |
Invertebrate | 369B | 108GB | Invertebrate Genomes |
Mammals | 859B | 252GB | Mammal Genomes Part 1, Mammal Genomes Part 2 |
Vertebrate Other | 867B | 255GB | Non-mammal Vertebrate Genomes Part 1, Non-mammal Vertebrate Genomes Part 2 |
Protozoa | 3.7B | 1GB | Protozoa Genomes |
Plasmids | 6.4B | 1.89GB | Plasmid Genomes |
Plastids | 2.1B | 0.63GB | Plastid Genomes |
Archaea | 5.4B | 1.588GB | Archaea Genomes |
Viruses | 0.54B | 0.161GB | Viral Genomes |
Plants | 299B | 88.2GB | Plant Genomes |
Mitochondrion | 0.537B | 0.158GB | Mitochondrion Genomes |
Total | 3.8T | 1.12TB | |
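Because the splits are hosted on Hugging Face, training pipelines can stream them instead of downloading hundreds of gigabytes up front. A minimal sketch using the `datasets` library; the repo id is a placeholder for the actual dataset names linked in the table:

```python
from datasets import load_dataset

# "ORG/refseq-fungi-genomes" is a placeholder; substitute the real
# Hugging Face dataset id behind, e.g., the "Fungi Genomes" link above.
fungi = load_dataset("ORG/refseq-fungi-genomes", split="train", streaming=True)

# Streaming yields an IterableDataset, so records are fetched lazily.
first = next(iter(fungi))
print(first.keys())
```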
In addition to the RefSeq dataset, we'll create a DNA-natural language description dataset. The main reason for this is that in-context learning is a direct result of parallel structure in the pre-training data. Therefore, to generate sequences based on natural-language input, it is not sufficient to fine-tune the model on a question-answer dataset alone; we must also encode the desired output structure during the pre-training step.
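Concretely, each pre-training example should already interleave the description and the sequence in the layout the model is expected to produce at inference time. A minimal sketch; the tag names and the placeholder sequence are illustrative, not a fixed format:

```python
def format_pretraining_example(description: str, sequence: str) -> str:
    """Interleave a natural-language description with its DNA sequence,
    so the description:sequence structure is seen during pre-training."""
    return f"<desc>{description}</desc>\n<dna>{sequence}</dna>"

example = format_pretraining_example(
    "Norwalk virus complete genome.",  # illustrative description
    "ACGTACGTACGT",                    # placeholder sequence
)
print(example)
```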
See Benegas et al. 2024 for a review of all generative models of DNA sequences using a language modelling paradigm for training.