A comparison of next-token-prediction models for DNA sequences. We provide a new loss function to inform models of DNA 1D-topology.

dna-llm/learning-nucleotides

nucleotide-model-experiments

Abstract

Recent years have seen a flurry of generative nucleotide models, still mostly of limited utility. In this short paper, we extend the theoretical unification of ecological and evolutionary change by Duthie & Luque to the problem of synthetic DNA models. Through this extension, we provide, from first principles, methods for training improved models and grouping species, as well as a road map for scaling.

Authors and contributors

The original repo was a private branch hosted on Hassan's account, which contained the original collaborators' contributions, so please be aware that this repo is not a true quantitative representation of contributions.

Contributors:

  • Hassan Ahmed Hassan
  • Kyle Puhger
  • Ali Saadat
  • Alexander Chen
  • Maximilian Sprang

Project Objectives and downstream plan

By exploring these work packages...

  1. Ideal Loss: What is the ideal loss function for a nucleotide model? Are there trade-offs with regard to model architecture? Which sections of a sequence does each loss emphasize? Can we combine different losses? The losses in our experiments include, but are not limited to, Cross Entropy, Reverse Complements, Headless Loss and 2D Loss, as well as energy-based losses such as total, direct and potential energy by nucleotide (a minimal sketch of a reverse-complement term is shown after the project aims below). We plan to include a wavelet-based loss and a persistent-homology loss. See “Can we learn to generate viral genomes?” for a chart representation of some of those metrics.
  2. Ideal Model Architecture: We are interested in testing different models in combination with our losses, so the questions above apply to the different architectures, too. Model types we plan to use include Transformer-based, SSM-based, and mixed models, as well as convolution-based models such as the Multiresolution Convolutional Memory model (MultiresConv).
  3. Ideal Dataset: How much redundancy is there in a genome dataset? What is the optimum learning strategy? How well is topology preserved between different samples and species?
  4. Maximum Scale, optimal parameters: How do each of the previous steps change with scale? Are there clear scaling laws, and can these be applied to get an optimal large foundation model?

...This project aims to:

  • Create a DNA-sequence:natural-language-description dataset of diverse species combining publicly available sequences with their associated texts/research papers.
  • Build homologically/topologically optimised DNA models that outperform the current state of the art
  • Build DNA models capable of generating biologically viable whole genomes
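As a concrete illustration of the first work package above, the following is a minimal sketch of a reverse-complement loss term. It assumes a 4-token vocabulary and a model that maps token ids directly to next-token logits; the id-to-base mapping, the helper name, and the weighting `alpha` are illustrative assumptions rather than the project's implementation.

```python
import torch
import torch.nn.functional as F

# Hypothetical base ids: 0=A, 1=C, 2=G, 3=T; the real tokenizer may differ.
COMPLEMENT = {0: 3, 1: 2, 2: 1, 3: 0}  # A<->T, C<->G

def reverse_complement_ids(ids: torch.Tensor) -> torch.Tensor:
    """Reverse a batch of nucleotide id sequences and complement each base."""
    comp = ids.clone()
    for base, partner in COMPLEMENT.items():
        comp[ids == base] = partner
    return comp.flip(dims=[-1])

def rc_augmented_loss(model, ids: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Cross-entropy on the forward strand plus a weighted cross-entropy
    on its reverse complement."""
    def ce(x: torch.Tensor) -> torch.Tensor:
        logits = model(x[:, :-1])  # (batch, length-1, vocab) next-token logits
        return F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), x[:, 1:].reshape(-1)
        )
    return ce(ids) + alpha * ce(reverse_complement_ids(ids))
```

The intuition is simply that a sequence and its reverse complement describe the same molecule, so the model is penalised on both strands.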

Potential downstream applications for nucleotide-only language models include:

  • Encoders for sequence comparisons and classifications
  • Base models for fine-tuned sequence generators and predictors, such as:
    • DNA sequence risk scoring
    • Bacteria-specific phage generators
    • Whole genome generators for de novo organisms
    • Antibiotic resistance classifiers based on bacterial plasmid DNA
  • Sequence quality predictors for both small sequences (as found in functional genomics) and large sequences (as found in whole-genome sequencing projects)

Initial results

2D representations of viral genomes visualize the function space a viral family occupies.

Figure 1. All samples of a representative viral family converted to a 2D density plot.

Figure 2. All samples of viruses contained within NCBI represented by a 2D density plot.

Figure 3. All samples of the Norwalk virus species available in our dataset as of 2024. A clear pattern can be observed across all sequences; it deteriorates towards the end, owing to the cumulative nature of this representation.
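The exact construction behind Figures 1–3 is not reproduced here, but the "cumulative nature" noted above suggests a cumulative 2D walk over the sequence, binned into a density grid. Below is a minimal sketch under that assumption; the per-base step directions and the bin count are illustrative choices, not the project's exact mapping.

```python
import numpy as np

# Illustrative step directions per base (an assumption, not the project's mapping).
STEPS = {"A": (1, 1), "C": (1, -1), "G": (-1, 1), "T": (-1, -1)}

def sequence_to_density(seq: str, bins: int = 128) -> np.ndarray:
    """Cumulative 2D walk over a DNA sequence, binned into a density grid."""
    pos = np.zeros(2)
    points = []
    for base in seq:
        step = STEPS.get(base.upper())
        if step is None:          # skip ambiguous bases such as N
            continue
        pos = pos + step
        points.append(pos.copy())
    points = np.array(points)
    hist, _, _ = np.histogram2d(points[:, 0], points[:, 1], bins=bins, density=True)
    return hist
```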

A loss based on the 2D representation works and converges similarly to classical losses such as cross entropy (CE).

Figure 4. Exemplary runs of the Pythia Transformer architecture. Model parameter counts range from 1.2M to 302M.

2D loss allows the generation of sequences that are more similar to natural ones.

Figure 5. Comparison of generated and natural sequences for the Pythia model trained with different pretraining losses. Natural sequences are colored blue; sequences generated from Pythia with CE loss are orange, complement loss green, headless loss red, 2D + Gaussian distance violet, and 2D + MSE brown.
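As a hedged illustration of how a 2D-representation term could be combined with CE during pretraining (Figure 5 mentions Gaussian-distance and MSE variants), the sketch below uses an MSE variant and assumes the 2D representation is a cumulative walk. The base-to-step mapping and the weight `beta` are assumptions, not the project's implementation.

```python
import torch
import torch.nn.functional as F

# Illustrative step vectors for base ids 0..3 (A, C, G, T); the mapping is an assumption.
STEP_VECTORS = torch.tensor([[1., 1.], [1., -1.], [-1., 1.], [-1., -1.]])

def ce_plus_2d_loss(logits: torch.Tensor, targets: torch.Tensor,
                    beta: float = 0.1) -> torch.Tensor:
    """Cross-entropy plus an MSE term between the cumulative 2D walk of the true
    sequence and the expected walk under the model's predicted distribution.

    logits:  (batch, length, 4) next-token logits over the four bases
    targets: (batch, length) true base ids
    """
    steps = STEP_VECTORS.to(logits.device)
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    expected_walk = (logits.softmax(-1) @ steps).cumsum(dim=1)  # expected walk
    true_walk = steps[targets].cumsum(dim=1)                    # actual walk
    return ce + beta * F.mse_loss(expected_walk, true_walk)
```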

Datasets

For our initial nucleotide models, we will use the RefSeq dataset:

| Type             | Tokens | Size    | Huggingface |
|------------------|--------|---------|-------------|
| Fungi            | 18B    | 5.4GB   | Fungi Genomes |
| Bacteria         | 1368B  | 402GB   | Bacteria Genomes Part 1, Part 2, Part 3, Part 4 |
| Invertebrate     | 369B   | 108GB   | Invertebrate Genomes |
| Mammals          | 859B   | 252GB   | Mammal Genomes Part 1, Part 2 |
| Vertebrate Other | 867B   | 255GB   | Non-mammal Vertebrate Genomes Part 1, Part 2 |
| Protozoa         | 3.7B   | 1GB     | Protozoa Genomes |
| Plasmids         | 6.4B   | 1.89GB  | Plasmid Genomes |
| Plastids         | 2.1B   | 0.63GB  | Plastid Genomes |
| Archaea          | 5.4B   | 1.588GB | Archaea Genomes |
| Viruses          | 0.54B  | 0.161GB | Viral Genomes |
| Plants           | 299B   | 88.2GB  | Plant Genomes |
| Mitochondrion    | 0.537B | 0.158GB | Mitochondrion Genomes |
| Total            | 3.8T   | 1.12TB  | |
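For illustration, any of these splits could be streamed with the Hugging Face `datasets` library. The dataset id below is a hypothetical placeholder; substitute the actual repository linked in the table (e.g. the Viral Genomes entry).

```python
from datasets import load_dataset

# "dna-llm/refseq-viruses" is a hypothetical placeholder id.
viruses = load_dataset("dna-llm/refseq-viruses", split="train", streaming=True)

# Inspect a few records without downloading the full split.
for record in viruses.take(3):
    print({k: str(v)[:60] for k, v in record.items()})
```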

In addition to the RefSeq dataset, we'll create a DNA–natural-language description dataset. The main reason for this is that in-context learning is a direct result of parallel structure in the training data. Therefore, to generate sequences based on natural-language input, it is not sufficient to fine-tune the model on a question-answer dataset alone; the desired output structure must also be encoded during the pre-training step.
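A sketch of what such a parallel-structure pre-training document could look like is shown below; the tags and example text are purely illustrative assumptions, not the project's format.

```python
# Hypothetical formatting for a paired description/sequence pre-training document.
# The tags "<description>" and "<sequence>" are illustrative, not the project's format.
def format_paired_example(description: str, sequence: str) -> str:
    return f"<description> {description.strip()} <sequence> {sequence.strip()}"

sample = format_paired_example(
    "Norovirus capsid coding region (illustrative placeholder text).",
    "ATGGCG...",  # truncated placeholder sequence
)
print(sample)
```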

Prior Work

See Benegas et al. 2024 for a review of all generative models of DNA sequences using a language modelling paradigm for training.
