Recent years have seen a flurry of generative nucleotide models, most of which remain of limited utility. In this short paper, we extend the theoretical unification of ecological and evolutionary change by Duthie & Luque to the problem of synthetic DNA models. Through this extension, we provide, from first principles, methods for training improved models and grouping species, as well as a road map for scaling.
The original repo was a private branch hosted on Hassan's account, which held the original collaborators' contributions, so please be aware that this repo is not a true quantitative representation of contributions.
Contributors:
- Hassan Ahmed Hassan
- Kyle Puhger
- Ali Saadat
- Alexander Chen
- Maximilian Sprang
By exploring these work packages...
- Ideal Loss: What is the ideal loss function for a nucleotide model? Are there trade-offs with regard to model architecture? Which sections of a sequence does each loss emphasize? Can we combine different losses (see the sketch after this list)? The losses in our experiments include, but are not limited to, Cross Entropy, Reverse Complement, Headless Loss, and 2D Loss, as well as energy-based losses such as total, direct, and potential energy by nucleotide. We also plan to include a wavelet-based loss and a persistent-homology loss. See “Can we learn to generate viral genomes?” for a chart representation of some of these metrics.
- Ideal Model Architecture: We are interested in testing different models in combination with our losses, so the questions above will apply to the different architectures, too. Model types we plan to use include Transformer-based, SSM-based, and mixed models, as well as convolution-based models such as the Multiresolution Convolutional memory model (MultiresConv).
- Ideal Dataset: How much redundancy is there in a genome dataset? What is the optimum learning strategy? How well is topology preserved between different samples and species?
- Maximum Scale, optimal parameters: How do each of the previous steps change with scale? Are there clear scaling laws, and can these be applied to get an optimal large foundation model?
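As a minimal sketch of how two of these objectives can be combined, the code below adds a reverse-complement term to standard next-token cross entropy. The model interface, the A/C/G/T vocabulary indices, and the `rc_weight` parameter are illustrative assumptions, not our actual training setup.

```python
import torch
import torch.nn.functional as F

# Assumed vocabulary for illustration: 0=A, 1=C, 2=G, 3=T.
COMPLEMENT = torch.tensor([3, 2, 1, 0])  # A<->T, C<->G

def combined_loss(model, tokens, rc_weight=0.5):
    """Next-token cross entropy plus a reverse-complement term.

    `model` is any causal LM mapping (batch, seq_len) token ids to
    (batch, seq_len, vocab) logits; `rc_weight` balances the two terms.
    """
    # Standard next-token cross entropy on the forward strand.
    logits = model(tokens[:, :-1])
    ce = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        tokens[:, 1:].reshape(-1),
    )

    # The same objective on the reverse complement: the biological
    # signal is strand-symmetric, so both strands are modelled.
    rc_tokens = COMPLEMENT.to(tokens.device)[tokens].flip(dims=[1])
    rc_logits = model(rc_tokens[:, :-1])
    rc_ce = F.cross_entropy(
        rc_logits.reshape(-1, rc_logits.size(-1)),
        rc_tokens[:, 1:].reshape(-1),
    )

    return ce + rc_weight * rc_ce
```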
...This project aims to:
- Create a dataset pairing DNA sequences with natural-language descriptions for diverse species, combining publicly available sequences with their associated texts/research papers.
- Build homologically/topologically optimised DNA models that outperform the current state of the art
- Build DNA models capable of generating biologically viable whole genomes
Potential downstream applications for nucleotide-only language models include:
- Encoders for sequence comparisons and classifications
- Base models for fine-tuned sequence generators and predictors, such as:
- DNA sequence risk scoring
- Bacteria-specific phage generators
- Whole genome generators for de novo organisms
- Antibiotic resistance classifiers based on bacterial plasmid DNA
- Sequence quality predictors for both small sequences (as found in functional genomics) and large sequences (as found in whole-genome sequencing projects)
Figure 1. All samples of a representative viral family converted to a 2D density plot.
Figure 2. All samples of viruses contained within NCBI represented by a 2D density plot.
Figure 3. All samples of the Norwalk-virus species available in our dataset as of 2024. A clear pattern can be observed across all sequences; it deteriorates towards the end, owing to the cumulative nature of this representation.
Figure 4. Exemplary runs of the Transformer-architecture model Pythia. Model parameter sizes range from 1.2M to 302M.
Figure 5. Comparison of generated and natural sequences in the Pythia model trained with different pretraining losses. Natural sequences are colored blue; sequences generated from Pythia with CE loss are orange, complement loss green, headless loss red, 2D plus Gaussian-distance loss violet, and 2D plus MSE loss brown.
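The 2D density plots in Figures 1-3 can be reproduced in spirit with a cumulative nucleotide walk; the base-to-direction mapping below is one common choice and an assumption about the exact construction used for the figures. The cumulative sum is also why the pattern deteriorates towards the end of the sequences, as noted in Figure 3.

```python
import numpy as np

# Illustrative base-to-step mapping; the figures may use a different one.
STEPS = {"A": (1, 0), "C": (0, 1), "G": (-1, 0), "T": (0, -1)}

def sequence_to_density(seq: str, bins: int = 64) -> np.ndarray:
    """Cumulative 2D walk over a sequence, binned into a density grid."""
    steps = np.array([STEPS[b] for b in seq if b in STEPS])
    walk = np.cumsum(steps, axis=0)  # running position after each base
    density, _, _ = np.histogram2d(walk[:, 0], walk[:, 1], bins=bins)
    return density / density.sum()

density = sequence_to_density("ACGTGGCATT" * 50)
```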
For our initial nucleotide models, we will use the RefSeq dataset:
Type | Tokens | Size | Huggingface |
---|---|---|---|
Fungi | 18B | 5.4GB | Fungi Genomes |
Bacteria | 1368B | 402GB | Bacteria Genomes Part 1, Bacteria Genomes Part 2, Bacteria Genomes Part 3, Bacteria Genomes Part 4 |
Invertebrate | 369B | 108GB | Invertebrate Genomes |
Mammals | 859B | 252GB | Mammal Genomes Part 1, Mammal Genomes Part 2 |
Vertebrate Other | 867B | 255GB | Non-mammal Vertebrate Genomes Part 1, Non-mammal Vertebrate Genomes Part 2 |
Protozoa | 3.7B | 1GB | Protozoa Genomes |
Plasmids | 6.4B | 1.89GB | Plasmid Genomes |
Plastids | 2.1B | 0.63GB | Plastid Genomes |
Archaea | 5.4B | 1.588GB | Archaea Genomes |
Viruses | 0.54B | 0.161GB | Viral Genomes |
Plants | 299B | 88.2GB | Plant Genomes |
Mitochondrion | 0.537B | 0.158GB | Mitochondrion Genomes |
Total | 3.8T | 1.12TB | |
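Because the splits are hosted on Hugging Face, training pipelines can stream them instead of downloading hundreds of gigabytes up front. A minimal sketch using the `datasets` library; the repo id is a placeholder for the actual dataset names linked in the table:

```python
from datasets import load_dataset

# "ORG/refseq-fungi-genomes" is a placeholder; substitute the real
# Hugging Face dataset id behind, e.g., the "Fungi Genomes" link above.
fungi = load_dataset("ORG/refseq-fungi-genomes", split="train", streaming=True)

# Streaming yields an IterableDataset, so records are fetched lazily.
first = next(iter(fungi))
print(first.keys())
```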
In addition to the RefSeq dataset, we'll create a DNA-natural language description dataset. The main reason for this is that in-context learning is a direct result of parallel structure in the pre-training data. Therefore, to generate sequences based on natural-language input, it is not sufficient to fine-tune the model on a question-answer dataset alone; we must also encode the desired output structure during the pre-training step.
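Concretely, each pre-training example should already interleave the description and the sequence in the layout the model is expected to produce at inference time. A minimal sketch; the tag names and the placeholder sequence are illustrative, not a fixed format:

```python
def format_pretraining_example(description: str, sequence: str) -> str:
    """Interleave a natural-language description with its DNA sequence,
    so the description:sequence structure is seen during pre-training."""
    return f"<desc>{description}</desc>\n<dna>{sequence}</dna>"

example = format_pretraining_example(
    "Norwalk virus complete genome.",  # illustrative description
    "ACGTACGTACGT",                    # placeholder sequence
)
print(example)
```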
See Benegas et al. 2024 for a review of all generative models of DNA sequences using a language modelling paradigm for training.