
5M decoder-only transformer for molecule generation (SMILES), using rotary positional encoding.


antoinebcx/molecule-generation-transformer


MolDecod: Transformer for molecule generation

MolDecod is a 5M-parameter decoder-only transformer for molecule generation (SMILES).

Repository

This repository contains:

  • the trained MolDecod weights (.pth file) and the SentencePiece tokenizer in /models
  • a Streamlit app to interact with the model in streamlit.py
  • a report with technical details in TechnicalReport.md
  • the notebooks to train, evaluate, and use the model in /notebooks
  • helper functions for using the model in different ways in /utils

Model characteristics

📖 TechnicalReport.md

MolDecod is a decoder-only transformer model (GPT-like) using rotary positional encoding. The trained model here has a model dimension of 256, 4 attention heads, and 4 decoder layers, resulting in ~5 million parameters.
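To illustrate what rotary positional encoding does, here is a minimal pure-Python sketch. It is not the repository's implementation: the function name `rotary_embed`, the base of 10000, and the pairing of adjacent dimensions are assumptions made for illustration, and real implementations operate on batched tensors per attention head (64 dimensions per head if the 256 model dimensions are split evenly across 4 heads).

```python
import math

def rotary_embed(vec, position, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]) by an angle proportional to position.

    Hypothetical helper for illustration; pair j uses frequency base**(-2j/d).
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector dimension must be even")
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # i = 2j, so this is base**(-2j/d)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy query and key vectors (4 dimensions instead of 64).
q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.3, -0.7, 1.0]

# Same relative offset (2) at different absolute positions.
score_a = dot(rotary_embed(q, 3), rotary_embed(k, 1))
score_b = dot(rotary_embed(q, 12), rotary_embed(k, 10))
```

Because each pair is rotated, vector norms are preserved, and the attention score between a rotated query and key depends only on their relative offset: `score_a` and `score_b` agree up to floating-point error.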

It was trained on ~2.7M molecules (~90M tokens) from the high-quality MOSES and ChEMBL datasets [+], and performs well for its size, as the metrics below show.

MolDecod comes with its own tokenizer, a SentencePiece model trained on the same dataset.

On 10,000 generated molecules at each temperature level, it obtains the following metrics:

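| Temperature | Validity | Uniqueness | Diversity | Novelty | KL Divergence | Fragment Similarity | Scaffold Diversity |
|-------------|----------|------------|-----------|---------|---------------|---------------------|--------------------|
| 0.1         | 1.00     | 0.04       | 0.76      | 0.9455  | 6.4742        | 0.0545              | 0.0148             |
| 0.25        | 1.00     | 0.49       | 0.81      | 0.8347  | 4.3664        | 0.1653              | 0.1398             |
| 0.5         | 0.98     | 0.95       | 0.85      | 0.8768  | 5.7033        | 0.1237              | 0.4556             |
| 0.7         | 0.96     | 0.95       | 0.87      | 0.9240  | 5.6936        | 0.0778              | 0.6540             |
| 0.9         | 0.88     | 0.88       | 0.88      | 0.9562  | 5.3179        | 0.0502              | 0.7524             |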
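The temperature in the table controls how sharply the model's next-token distribution is peaked during sampling. The following is a minimal sketch of temperature-scaled sampling, not the repository's generation code: the function names and the toy logits are hypothetical, and a real decoder would apply this to the model's output logits at every step.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

def sample_token(logits, temperature, rng=random):
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

# Toy logits over a 3-token vocabulary (illustrative values only).
logits = [2.0, 1.0, 0.1]
low = softmax_with_temperature(logits, 0.1)   # nearly deterministic
high = softmax_with_temperature(logits, 0.9)  # noticeably flatter
token = sample_token(logits, 0.7, random.Random(0))
```

At low temperature almost all probability mass sits on the top token, so repeated samples tend to coincide. This is consistent with the table, where uniqueness is 0.04 at temperature 0.1, while higher temperatures trade some validity for more varied outputs.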

App

To interact with the model, launch the Streamlit app: download the repo, open a terminal window, install the requirements, and run the following command:

streamlit run streamlit.py

 

You can generate molecules from a prompt and visualize their structure and properties.

You can also see how the tokenizer works on the "Visualize Tokenization" page.
