This repository demonstrates how to implement a molecular generative model with the Hugging Face library. We used a Hugging Face notebook as a reference and updated it to the current version. It is meant as a quick overview that needs only a small amount of code.
Note that we have created another repository with a more general, script-based version for training a language-based generative model. Please take a look at HfMolGen.
pip install -r requirements.txt
We have created an empty folder, ./dataset; please place your datasets in this folder.
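As a minimal sketch of reading such a dataset, assuming a plain-text file with one SMILES string per line: the file name `smiles.txt` and the helper `load_smiles` are hypothetical and are not part of this repository, and the notebook may expect a different format.

```python
from pathlib import Path


def load_smiles(path):
    """Read one SMILES string per line, skipping blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


if __name__ == "__main__":
    # Hypothetical file name; replace with the file you placed in ./dataset.
    dataset_file = Path("./dataset") / "smiles.txt"
    if dataset_file.exists():
        smiles = load_smiles(dataset_file)
        print(f"Loaded {len(smiles)} SMILES strings")
```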
In this repository we have implemented atom-based and BPE tokenizers. We use the pre-trained BPE vocabulary file and merge file from SmilesPE. You may override these two files if needed.
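To illustrate what atom-based tokenization means, here is a minimal sketch that splits a SMILES string with a regular expression commonly used for this purpose. It is an assumption about the general technique, not necessarily the implementation in training.ipynb; the names `SMILES_ATOM_PATTERN` and `atom_tokenize` are hypothetical.

```python
import re

# Bracket atoms, two-letter halogens, single atoms, bonds, branches, and
# ring-closure digits each become one token.
SMILES_ATOM_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/"
    r"|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def atom_tokenize(smiles):
    """Split a SMILES string into atom-level tokens."""
    tokens = SMILES_ATOM_PATTERN.findall(smiles)
    # Fail loudly if some character was not covered by the pattern.
    if "".join(tokens) != smiles:
        raise ValueError(f"Could not tokenize SMILES: {smiles!r}")
    return tokens


if __name__ == "__main__":
    print(atom_tokenize("ClCCBr"))  # ['Cl', 'C', 'C', 'Br']
    print(atom_tokenize("C[C@H](N)C(=O)O"))
```

A BPE tokenizer starts from the same kind of small units and merges frequent adjacent pairs according to the merge file, so common substructures end up as single tokens.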
All details about loading the dataset, setting up the tokenizer, and setting up training are in training.ipynb.
Please take a look at generation.ipynb.
You may train the model using the run_clm-4.8.0.sh script; please modify run_clm-4.8.0.py and run_clm-4.8.0.sh if needed. Note that if you need to use BPE encoding, please merge the BPE encoding class (in training.ipynb) into run_clm-4.8.0.sh.
bash run_clm-4.8.0.sh