Exploring Data-Driven Chemical SMILES Tokenization Approaches to Identify Key Protein-Ligand Binding Moieties
This repository contains the code for "Exploring Data-Driven Chemical SMILES Tokenization Approaches to Identify Key Protein-Ligand Binding Moieties", published in Molecular Informatics. The accompanying materials are available on Zenodo.
To use the scripts, clone the repository and install the required dependencies:
git clone https://github.com/boun-tabi/exploring-chemical-words.git
cd exploring-chemical-words
pip install -r requirements.txt
Train subword tokenization models to identify chemical vocabularies.
python word_identification.py --model_type [model_type] --corpus [corpus_path] --save_name [save_name] --vocab_size [vocab_size]
--model_type: Tokenization model to train; choose from 'bpe', 'unigram', 'wordpiece'
--corpus: Filepath of the corpus containing SMILES strings
--save_name: Filename for the output vocabulary file
--vocab_size: Desired size of the vocabulary
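To give a feel for what a 'bpe' model learns from a SMILES corpus, the sketch below implements a toy byte-pair-encoding merge loop in plain Python. It is not the code in word_identification.py: the function names learn_bpe_merges and merge_pair are hypothetical, the stopping rule is a number of merges rather than a vocabulary size, and SMILES strings are split into single characters (so two-letter atoms such as Cl or Br start as two symbols), which is a simplification.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_bpe_merges(smiles_list, num_merges):
    """Learn up to `num_merges` merges (hypothetical helper, toy BPE).

    Returns the ordered list of merged pairs and the segmented corpus.
    """
    # Assumption: start from single characters rather than chemically aware atoms.
    corpus = [list(smiles) for smiles in smiles_list]
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs over the whole corpus.
        pair_counts = Counter()
        for symbols in corpus:
            pair_counts.update(zip(symbols, symbols[1:]))
        if not pair_counts:
            break
        best_pair, best_count = pair_counts.most_common(1)[0]
        # Stop once no pair repeats; merging it would not generalize.
        if best_count < 2:
            break
        merges.append(best_pair)
        corpus = [merge_pair(symbols, best_pair) for symbols in corpus]
    return merges, corpus


# Example usage on a tiny, made-up corpus.
example_merges, example_segmented = learn_bpe_merges(["CCO", "CCN", "CCC"], 5)
print(example_merges)
print(example_segmented)
```

Each learned merge corresponds to one new entry in the vocabulary, so a larger --vocab_size roughly means more merges and longer chemical words.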
Identify significant words in chemical documents using the specified vocabulary.
python highlighter.py --dataset [dataset_name] --vocabulary [vocabulary_name]
--dataset: Dataset name (e.g., 'lit_pcba', 'bdb', or others)
--vocabulary: Name or path of the vocabulary file
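One common way to rank words by significance within a document is TF-IDF. The sketch below is an illustration under stated assumptions, not a description of highlighter.py: it assumes a chemical document is the list of chemical words drawn from the ligands of one target, and the names tfidf_scores and top_words are hypothetical. Whether the script uses this exact scoring is not stated here.

```python
import math
from collections import Counter


def tfidf_scores(documents):
    """Score each word in each document with a plain TF-IDF (hypothetical helper).

    `documents` maps a document name (assumed: one protein target) to the
    list of chemical words found in it.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for words in documents.values():
        doc_freq.update(set(words))

    scores = {}
    for name, words in documents.items():
        counts = Counter(words)
        total = len(words)
        scores[name] = {
            # Term frequency times log inverse document frequency.
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return scores


def top_words(scores, name, k=3):
    """Return the k highest-scoring words of one document, ties broken alphabetically."""
    ranked = sorted(scores[name].items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


# Example usage with made-up documents; the target names are placeholders.
example_docs = {
    "target_A": ["CC", "O", "CC", "c1ccccc1"],
    "target_B": ["CC", "N", "N"],
}
example_scores = tfidf_scores(example_docs)
print(top_words(example_scores, "target_A", k=2))
```

Words that occur in every document, such as "CC" in this toy input, receive a score of zero, so the ranking favors words that are characteristic of a single target.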
Perform a comprehensive analysis of chemical words, deriving key statistics and insights.
python analyzer.py --dataset [dataset_name] --vocabulary [vocabulary_name]
--dataset: Dataset name (e.g., 'lit_pcba', 'bdb', or others)
--vocabulary: Name or path of the vocabulary file
Launch an interactive Streamlit application that shows the key chemical words for a selected target, along with its associated binders and drugs.
streamlit run app.py
@article{https://doi.org/10.1002/minf.202300249,
author = {Temizer, Asu Busra and Uludoğan, Gökçe and Özçelik, Rıza and Koulani, Taha and Ozkirimli, Elif and Ulgen, Kutlu O. and Karali, Nilgun and Özgür, Arzucan},
title = {Exploring Data-Driven Chemical SMILES Tokenization Approaches to Identify Key Protein-Ligand Binding Moieties},
journal = {Molecular Informatics},
doi = {https://doi.org/10.1002/minf.202300249},
url = {https://onlinelibrary.wiley.com/doi/abs/10.1002/minf.202300249},
eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1002/minf.202300249},
}