MolDecod: Transformer for molecule generation

MolDecod is a 5M-parameter decoder-only transformer for molecule generation (SMILES).

Repository

This repository contains:

The trained MolDecod .pth file and SentencePiece tokenizer in /models
A Streamlit app to interact with the model in streamlit.py
A report with technical details in TechnicalReport.md
The notebooks to train, evaluate and use the model in /notebooks
Helper functions to use the model in different ways in /utils

Model characteristics

MolDecod is a decoder-only transformer model (GPT-like) using rotary positional encoding. The trained model here has a model dimension of 256, 4 attention heads, and 4 decoder layers, resulting in ~5 million parameters.
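As a rough sanity check on the stated size, the sketch below estimates the weight count from the dimensions given above. The feed-forward width (`d_ff`) and the vocabulary size are not stated in this README, so the values used here are assumptions chosen only for illustration; biases and normalization weights are ignored. The number of attention heads does not change the count, and rotary positional encoding adds no learned parameters.

```python
def decoder_block_params(d_model: int, n_layers: int, d_ff: int) -> int:
    """Approximate weight count of the decoder layers (biases and norms ignored)."""
    attention = 4 * d_model * d_model   # query, key, value and output projections
    feed_forward = 2 * d_model * d_ff   # up-projection and down-projection
    return n_layers * (attention + feed_forward)


def embedding_params(vocab_size: int, d_model: int, tied: bool = False) -> int:
    """Token embedding plus output projection (counted once if weights are tied)."""
    return vocab_size * d_model * (1 if tied else 2)


d_model, n_layers = 256, 4   # values stated in this README
d_ff = 4 * d_model           # ASSUMPTION: conventional 4x expansion, not confirmed
vocab_size = 1000            # ASSUMPTION: placeholder, the real tokenizer size is not given

blocks = decoder_block_params(d_model, n_layers, d_ff)
total = blocks + embedding_params(vocab_size, d_model)
print(f"decoder layers: {blocks:,} weights, with embeddings: {total:,} weights")
```

With these placeholder values the estimate lands below the stated ~5 million, so the actual feed-forward width or vocabulary is presumably larger than assumed here; TechnicalReport.md is the place to check the real configuration.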

It was trained on ~2.7M molecules (~90M tokens) from the high-quality MOSES and ChEMBL datasets [+], and performs well for its size, as the metrics below show.

MolDecod comes with its own tokenizer, a SentencePiece model trained on the same data.

On 10,000 generated molecules at each sampling temperature, it obtains the following metrics:

Temperature	Validity	Uniqueness	Diversity	Novelty	KL Divergence	Fragment Similarity	Scaffold Diversity
0.1	1.00	0.04	0.76	0.9455	6.4742	0.0545	0.0148
0.25	1.00	0.49	0.81	0.8347	4.3664	0.1653	0.1398
0.5	0.98	0.95	0.85	0.8768	5.7033	0.1237	0.4556
0.7	0.96	0.95	0.87	0.9240	5.6936	0.0778	0.6540
0.9	0.88	0.88	0.88	0.9562	5.3179	0.0502	0.7524
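To make the first columns of the table concrete, here is a minimal sketch of how validity, uniqueness and novelty are commonly defined for generated SMILES. It is not the evaluation code from the notebooks, and the exact definitions used for the table may differ. The `toy_is_valid` check is a hypothetical stand-in that only tests balanced parentheses; in practice one would parse each string with a chemistry toolkit (for example, RDKit's `Chem.MolFromSmiles(s) is not None`) and compare canonical SMILES rather than raw strings.

```python
from typing import Callable, List, Set


def toy_is_valid(smiles: str) -> bool:
    """Hypothetical stand-in for a real SMILES parser: checks parentheses only."""
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def validity(generated: List[str], is_valid: Callable[[str], bool]) -> float:
    """Fraction of generated strings that pass the validity check."""
    return sum(1 for s in generated if is_valid(s)) / len(generated)


def uniqueness(valid: List[str]) -> float:
    """Fraction of valid molecules that are distinct."""
    return len(set(valid)) / len(valid)


def novelty(valid: List[str], training_set: Set[str]) -> float:
    """Fraction of distinct valid molecules that do not appear in the training data."""
    unique = set(valid)
    return len(unique - training_set) / len(unique)


# Toy data, invented for illustration only.
generated = ["CCO", "CCO", "c1ccccc1", "C(C", "CCN"]
training_set = {"CCO"}

valid = [s for s in generated if toy_is_valid(s)]
v = validity(generated, toy_is_valid)
u = uniqueness(valid)
n = novelty(valid, training_set)
print(f"validity={v:.2f} uniqueness={u:.2f} novelty={n:.2f}")
```

Under these definitions, a model that repeats the same few molecules can score perfect validity with very low uniqueness, which is the pattern in the 0.1 temperature row.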

App

Launch the Streamlit app to interact with the model. Download the repo, open a terminal window in the repository folder, install the requirements listed in requirements.txt, and run the following command:

streamlit run streamlit.py

You can generate molecules from a prompt and visualize their structure and properties.
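Generation from a prompt in a decoder-only model typically means sampling one token at a time from the model's next-token scores, with the temperature controlling how sharply the distribution is peaked. The sketch below illustrates temperature-scaled sampling in plain Python. It is not the app's code: the function names and the example logits are hypothetical.

```python
import math
import random
from typing import List, Sequence


def softmax_with_temperature(logits: Sequence[float], temperature: float) -> List[float]:
    """Convert raw scores to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits: Sequence[float], temperature: float, rng: random.Random) -> int:
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.5]  # hypothetical next-token scores for three tokens
rng = random.Random(0)

for t in (0.1, 0.5, 0.9):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs], "sampled:", sample_token(logits, t, rng))
```

At low temperature almost all probability mass sits on the highest-scoring token, so repeated generations tend to produce the same molecules; higher temperatures spread the mass and increase variety at the cost of more invalid outputs. This matches the trend in the metrics table, where uniqueness rises and validity falls as the temperature increases.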

You can also visualize how the tokenizer works on the page "Visualize Tokenization".

antoinebcx/molecule-generation-transformer
