cdsFM: Repository of Codon-based Foundation Models (EnCodon & DeCodon)

This repository contains the code for EnCodon and DeCodon, codon-resolution large language models pre-trained on the NCBI Genomes database, as described in the paper "A Suite of Foundation Models Captures the Contextual Interplay Between Codons".

Get started 🚀

Installation

From source

Installing from source is currently the only way to install the package; a pip-installable release will be published soon. To install from source, run:

pip install git+https://github.com/goodarzilab/cdsFM.git
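
If you plan to modify the code, an editable install from a local clone is another option. The commands below are a standard sketch and assume the repository ships a regular Python build configuration:

git clone https://github.com/goodarzilab/cdsFM.git
cd cdsFM
pip install -e .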

Applications

Now that you have cdsFM installed, you can use the AutoEnCodon and AutoDeCodon classes, which serve as wrappers around the pre-trained models. Here are some examples of how to use them:

Sequence Embedding Extraction with EnCodon

The following example shows how to use the EnCodon model to extract sequence embeddings:

from cdsFM import AutoEnCodon

# Load your dataframe containing sequences
seqs = ...

# Load a pre-trained EnCodon model
model = AutoEnCodon.from_pretrained("goodarzilab/encodon-620M")

# Extract embeddings
embeddings = model.get_embeddings(seqs, batch_size=32)
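
For reference, here is a minimal end-to-end sketch. It assumes get_embeddings accepts a plain list of coding-sequence strings and returns one embedding per sequence; check your installed version for the exact input format and return type:

from cdsFM import AutoEnCodon

# Hypothetical toy inputs: coding sequences as nucleotide strings
seqs = [
    "ATGGCTGCTAAGTAA",
    "ATGGGCTCCGGCTAA",
]

# Load a pre-trained EnCodon model
model = AutoEnCodon.from_pretrained("goodarzilab/encodon-620M")

# Assumption: returns one embedding vector per input sequence
embeddings = model.get_embeddings(seqs, batch_size=2)
print(len(embeddings))  # expected: 2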

Sequence Generation with DeCodon

You can generate organism-specific coding sequences with DeCodon simply by:

from cdsFM import AutoDeCodon

# Load a pre-trained DeCodon model
model = AutoDeCodon.from_pretrained("goodarzilab/DeCodon-200M")

# Generate!
gen_seqs = model.generate(
    taxid=9606, # NCBI Taxonomy ID for Homo sapiens
    num_return_sequences=32, # Number of sequences to return
    max_length=1024, # Maximum length of the generated sequence
    batch_size=8, # Batch size for generation
)
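
A common next step is to save the generated sequences, for example to FASTA. The snippet below assumes gen_seqs is an iterable of nucleotide strings; verify this against the return type of generate in your installed version:

# Assumption: gen_seqs is an iterable of nucleotide strings
with open("decodon_taxid9606.fa", "w") as fh:
    for i, seq in enumerate(gen_seqs):
        fh.write(f">decodon_taxid9606_seq{i}\n{seq}\n")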

Tokenization

EnCodon and DeCodon are pre-trained on coding sequences of up to 2048 codons (i.e., 6144 nucleotides), including the <CLS> token automatically prepended to the beginning of the sequence and the <SEP> token appended at the end. The tokenizer's vocabulary consists of the 64 codons plus 5 special tokens: <CLS>, <SEP>, <PAD>, <MASK>, and <UNK>.
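
To make the scheme concrete, the snippet below is an illustrative sketch of codon-level tokenization (it is not the package's tokenizer API): the coding sequence is split into non-overlapping 3-mers and wrapped with <CLS> and <SEP>.

# Illustrative sketch only -- not the cdsFM tokenizer API
def codon_tokenize(cds: str):
    # Split the coding sequence into non-overlapping codons (3-mers)
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    # Map any trailing partial codon to <UNK>
    codons = [c if len(c) == 3 else "<UNK>" for c in codons]
    # Prepend <CLS> and append <SEP>, as the models expect
    return ["<CLS>"] + codons + ["<SEP>"]

print(codon_tokenize("ATGGCTAAGTAA"))
# ['<CLS>', 'ATG', 'GCT', 'AAG', 'TAA', '<SEP>']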


HuggingFace 🤗

A collection of pre-trained EnCodon and DeCodon checkpoints is available on HuggingFace 🤗. The following table lists the available models:

| Model   | Name             | Num. params | Description            | Weights |
|---------|------------------|-------------|------------------------|---------|
| EnCodon | encodon-80M      | 80M         | Pre-trained checkpoint | 🤗      |
| EnCodon | encodon-80M-euk  | 80M         | Eukaryotic-expert      | 🤗      |
| EnCodon | encodon-620M     | 620M        | Pre-trained checkpoint | 🤗      |
| EnCodon | encodon-620M-euk | 620M        | Eukaryotic-expert      | 🤗      |
| DeCodon | decodon-200M     | 200M        | Pre-trained checkpoint | 🤗      |
| DeCodon | decodon-200M-euk | 200M        | Eukaryotic-expert      | 🤗      |

Citation

@article{Naghipourfar2024,
  title = {A Suite of Foundation Models Captures the Contextual Interplay Between Codons},
  url = {http://dx.doi.org/10.1101/2024.10.10.617568},
  DOI = {10.1101/2024.10.10.617568},
  publisher = {Cold Spring Harbor Laboratory},
  author = {Naghipourfar, Mohsen and Chen, Siyu and Howard, Mathew and Macdonald, Christian and Saberi, Ali and Hagen, Timo and Mofrad, Mohammad and Coyote-Maestas, Willow and Goodarzi, Hani},
  year = {2024},
  month = oct
}
