MolDecod is a 5M-parameter decoder-only transformer for molecule generation (SMILES).
This repository contains:
- the trained MolDecod .pth file and SentencePiece tokenizer in /models
- a Streamlit app to interact with the model in streamlit.py
- a report with technical details in TechnicalReport.md
- the notebooks to train, evaluate and use the model in /notebooks
- some functions to use the model in different ways in /utils
MolDecod is a decoder-only transformer model (GPT-like) using rotary positional encoding. The trained model here has a model dimension of 256, 4 attention heads, and 4 decoder layers, resulting in ~5 million parameters.
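As a rough sanity check on the size quoted above, the parameter count of a GPT-style decoder can be estimated from its dimensions. The sketch below is not taken from the repository: the vocabulary size (`1000`) and the feed-forward width (`4 * d_model`) are placeholder assumptions, since neither is stated here, so the printed figure will not necessarily match the ~5 million reported. Rotary positional encoding adds no learned parameters, and the number of attention heads does not change the count.

```python
def estimate_decoder_params(d_model, n_layers, vocab_size,
                            d_ff=None, tie_embeddings=False):
    """Estimate learned parameters of a GPT-style decoder-only transformer."""
    if d_ff is None:
        d_ff = 4 * d_model  # assumption: the common 4x feed-forward expansion

    # Q, K, V and output projections, each d_model x d_model plus a bias.
    # Splitting into heads only reshapes these matrices, so the head count
    # does not appear here.
    attention = 4 * (d_model * d_model + d_model)
    # Two linear layers with biases: d_model -> d_ff -> d_model.
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two layer norms per block, each with a scale and a shift vector.
    layer_norms = 2 * (2 * d_model)
    per_layer = attention + feed_forward + layer_norms

    embedding = vocab_size * d_model
    # The output projection can share weights with the embedding table.
    output_head = 0 if tie_embeddings else vocab_size * d_model
    final_norm = 2 * d_model

    return n_layers * per_layer + embedding + output_head + final_norm


# d_model=256 and n_layers=4 come from the description above;
# vocab_size=1000 is a hypothetical placeholder.
total = estimate_decoder_params(d_model=256, n_layers=4, vocab_size=1000)
print(f"{total:,} parameters")
```

Changing `vocab_size`, `d_ff`, or `tie_embeddings` shows how strongly the embedding tables and feed-forward width influence the total for a model this small.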
It was trained on ~2.7M molecules (~90M tokens) from the high-quality MOSES and ChEMBL datasets, and achieves strong performance for its size (see the metrics table below).
MolDecod comes with its own custom tokenizer, a SentencePiece model trained on the same dataset.
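To inspect how a SentencePiece tokenizer splits a SMILES string, something like the following sketch may help. It uses the `sentencepiece` Python package (`SentencePieceProcessor(model_file=...)` and `encode(..., out_type=...)`); the helper names `load_tokenizer` and `show_tokens` are hypothetical and are not the functions shipped in /utils, and the tokenizer file name inside /models is not given here, so it has to be supplied by the reader.

```python
def load_tokenizer(model_path):
    """Load a SentencePiece model from disk (requires `pip install sentencepiece`)."""
    import sentencepiece as spm  # third-party; imported lazily
    return spm.SentencePieceProcessor(model_file=model_path)


def show_tokens(tokenizer, smiles):
    """Return (piece, id) pairs for a SMILES string.

    `tokenizer` is any object exposing SentencePiece's
    `encode(text, out_type=...)` method.
    """
    pieces = tokenizer.encode(smiles, out_type=str)
    ids = tokenizer.encode(smiles, out_type=int)
    return list(zip(pieces, ids))
```

Usage would look like `show_tokens(load_tokenizer(path_to_tokenizer_model), "CCO")`, where `path_to_tokenizer_model` points at the tokenizer file in /models.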
On 10,000 generated molecules at different sampling temperatures, it obtains the following metrics:
Temperature | Validity | Uniqueness | Diversity | Novelty | KL Divergence | Fragment Similarity | Scaffold Diversity |
---|---|---|---|---|---|---|---|
0.1 | 1.00 | 0.04 | 0.76 | 0.9455 | 6.4742 | 0.0545 | 0.0148 |
0.25 | 1.00 | 0.49 | 0.81 | 0.8347 | 4.3664 | 0.1653 | 0.1398 |
0.5 | 0.98 | 0.95 | 0.85 | 0.8768 | 5.7033 | 0.1237 | 0.4556 |
0.7 | 0.96 | 0.95 | 0.87 | 0.9240 | 5.6936 | 0.0778 | 0.6540 |
0.9 | 0.88 | 0.88 | 0.88 | 0.9562 | 5.3179 | 0.0502 | 0.7524 |
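The trend in the table, where low temperatures give valid but repetitive molecules and higher temperatures give more varied but less often valid ones, follows from how temperature reshapes the next-token distribution. The snippet below is a generic illustration of temperature-scaled softmax sampling, not the repository's generation code; the logits are made-up values and the function names are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw logits into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng=random):
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Hypothetical logits for three candidate next tokens.
logits = [2.0, 1.0, 0.5]
for t in (0.1, 0.5, 0.9):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At low temperature nearly all probability mass sits on the top-scoring token, so repeated sampling tends to reproduce the same sequences. As the temperature rises, the distribution flattens and less likely tokens are chosen more often.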
Launch the Streamlit app to interact with the model: download the repository, open a terminal, install the requirements, and run the following command:
streamlit run streamlit.py
You can generate molecules from a prompt and visualize their structure and properties.
You can also visualize how the tokenizer works on the page "Visualize Tokenization".