Exploring Data-Driven Chemical SMILES Tokenization Approaches to Identify Key Protein-Ligand Binding Moieties

About

This repository contains code for "Exploring Data-Driven Chemical SMILES Tokenization Approaches to Identify Key Protein-Ligand Binding Moieties", published in Molecular Informatics. The accompanying materials are available on Zenodo.

Installation

To use the scripts, clone the repository and install the required dependencies:

git clone https://github.com/boun-tabi/exploring-chemical-words.git
cd exploring-chemical-words
pip install -r requirements.txt

Usage

Identifying chemical vocabularies

Train subword tokenization models to identify chemical vocabularies.

python word_identification.py --model_type [model_type] --corpus [corpus_path] --save_name [save_name] --vocab_size [vocab_size]

--model_type: Choose from 'bpe', 'unigram', 'wordpiece'
--corpus: Filepath of the corpus containing SMILES strings
--save_name: Filename for the output vocabulary file
--vocab_size: Desired size of the vocabulary
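
To make the idea of a learned chemical vocabulary concrete, the following is a minimal, pure-Python sketch of byte-pair-encoding (BPE) style merge learning on SMILES strings. It is an illustration only, not the code in word_identification.py: the function names are hypothetical, and it splits SMILES character by character (so multi-character atoms such as Cl start out as two tokens), which a real tokenizer may handle differently.

```python
from collections import Counter


def merge_pair(seq, pair):
    """Replace each non-overlapping occurrence of `pair` in `seq` with the joined token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe_merges(smiles_list, num_merges):
    """Learn up to `num_merges` merge rules from a list of SMILES strings."""
    # Start from character-level tokens; each SMILES string is one sequence.
    sequences = [list(s) for s in smiles_list]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent token pair across the corpus.
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        # Pick the most frequent pair; ties go to the pair seen first.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        sequences = [merge_pair(seq, best) for seq in sequences]
    return merges


def tokenize(smiles, merges):
    """Apply the learned merges, in order, to split a SMILES string into words."""
    seq = list(smiles)
    for pair in merges:
        seq = merge_pair(seq, pair)
    return seq
```

With a toy corpus such as `["CCO", "CCN", "CCC"]` and a single merge, the most frequent pair is `("C", "C")`, so `tokenize("CCN", merges)` yields `["CC", "N"]`. The number of merges plays a role similar to `--vocab_size`: more merges produce longer, more specific chemical words.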

Selecting key chemical words

Identify significant words in chemical documents using the specified vocabulary.

python highlighter.py --dataset [dataset_name] --vocabulary [vocabulary_name]

--dataset: Specify dataset name (e.g., 'lit_pcba', 'bdb', or others)
--vocabulary: Name or path of the vocabulary file
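
This README does not state how highlighter.py scores words, so the following sketch assumes one common choice, TF-IDF, purely for illustration. It treats each document as the list of chemical words from the molecules associated with one target; the function name, the target IDs, and the scoring formula are all assumptions, not the repository's API.

```python
import math
from collections import Counter


def tfidf_top_words(documents, top_k=3):
    """Rank the words of each document by TF-IDF and return the top_k per document.

    `documents` maps a document name (e.g. a hypothetical target ID) to a
    list of chemical words.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter()
    for words in documents.values():
        doc_freq.update(set(words))

    ranked = {}
    for name, words in documents.items():
        counts = Counter(words)
        total = len(words)
        # Term frequency times log inverse document frequency.
        scores = {
            w: (c / total) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()
        }
        # Sort by descending score, then alphabetically for stable output.
        ranked[name] = sorted(scores, key=lambda w: (-scores[w], w))[:top_k]
    return ranked
```

Under this scoring, a word that occurs in every document gets a score of zero, so the top-ranked words are those frequent for one target but rare across the others.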

Computing chemical vocabulary statistics

Analyze the chemical words of a dataset under the specified vocabulary and compute summary statistics.

python analyzer.py --dataset [dataset_name] --vocabulary [vocabulary_name]

--dataset: Choose the dataset (e.g., 'lit_pcba', 'bdb', or others)
--vocabulary: Name or path of the vocabulary file
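
The README does not list which statistics analyzer.py reports. As a rough sketch of the kind of summary one might compute over tokenized molecules, the hypothetical function below takes a non-empty list of molecules, each already split into chemical words, and returns a few basic counts. The chosen statistics and all names are assumptions for illustration.

```python
from collections import Counter
from statistics import mean


def vocabulary_statistics(tokenized_molecules):
    """Summarize a non-empty list of molecules, each given as a list of words."""
    all_words = [w for mol in tokenized_molecules for w in mol]
    counts = Counter(all_words)
    return {
        "n_molecules": len(tokenized_molecules),
        # Number of distinct chemical words actually used in the dataset.
        "n_unique_words": len(counts),
        "mean_words_per_molecule": mean(len(mol) for mol in tokenized_molecules),
        # Average length in characters, taken over distinct words.
        "mean_word_length": mean(len(w) for w in counts),
        "most_common": counts.most_common(3),
    }
```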

Streamlit app

Launch an interactive Streamlit application illustrating the key chemical words for particular targets along with associated binders and drugs.

streamlit run app.py

Citation

@article{https://doi.org/10.1002/minf.202300249,
    author = {Temizer, Asu Busra and Uludoğan, Gökçe and Özçelik, Rıza and Koulani, Taha and Ozkirimli, Elif and Ulgen, Kutlu O. and Karali, Nilgun and Özgür, Arzucan},
    title = {Exploring Data-Driven Chemical SMILES Tokenization Approaches to Identify Key Protein-Ligand Binding Moieties},
    journal = {Molecular Informatics},
    doi = {10.1002/minf.202300249},
    url = {https://onlinelibrary.wiley.com/doi/abs/10.1002/minf.202300249},
    eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1002/minf.202300249},
}
