WE4LKD

Word Embeddings For Latent Knowledge Discovery

Accelerating Discoveries in Medicine using Distributed Vector Representations of Words

Berto, Matheus V. V.; De Freitas, Breno L.; Scarton, Carolina E.; Neto, João A. M.; Almeida, Tiago A.

This study extends a recently proposed strategy by combining different unsupervised models to accelerate discoveries in medicine. We trained distributed vector representations of words on a large corpus of medical papers related to Acute Myeloid Leukemia (AML), a highly malignant form of cancer, and show that established therapies could have been identified years before their first proposal. The results open new avenues toward faster medical discoveries through more effective drug and gene testing, enabling better treatments that promote a healthier, longer life for patients.

Starting from 1963, the first explicit occurrence of AML in our corpus, we generated yearly prediction rankings for a set of 21 target compounds. We then calculated the percentage of these predicted AML treatments later reported in the literature, either considering only the compounds in the top-3 predictions (orange curve) or without that restriction (blue curve). Compared with testing drugs at random, using only the top-3 predictions would accelerate the discovery of treatments by up to 2.3x five years after the first predictions.
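The evaluation described above can be illustrated with a minimal sketch. It is not the code used in the paper: the function name `top_k_hit_rate`, the placeholder compound names, and the exact counting rule (a prediction counts as a hit when the compound is first reported within `horizon` years) are assumptions made for illustration only.

```python
from typing import Dict, List


def top_k_hit_rate(
    rankings: Dict[int, List[str]],
    first_reported: Dict[str, int],
    k: int,
    horizon: int,
) -> float:
    """Fraction of top-k yearly predictions later reported within `horizon` years.

    rankings: year -> compounds ordered from most to least promising (hypothetical input).
    first_reported: compound -> year it first appears as a treatment in the literature.
    """
    hits = 0
    total = 0
    never = float("inf")
    for year, ranked in rankings.items():
        # Only compounds not yet reported by `year` count as predictions.
        candidates = [c for c in ranked if first_reported.get(c, never) > year]
        for compound in candidates[:k]:
            total += 1
            if first_reported.get(compound, never) <= year + horizon:
                hits += 1
    return hits / total if total else 0.0


# Toy data with placeholder names; these are not results from the paper.
toy_rankings = {
    1990: ["compound_a", "compound_b", "compound_c", "compound_d"],
    1991: ["compound_b", "compound_d", "compound_a", "compound_c"],
}
toy_first_reported = {"compound_a": 1992, "compound_b": 1999, "compound_d": 1993}

print(top_k_hit_rate(toy_rankings, toy_first_reported, k=2, horizon=5))  # 0.5
```

Comparing this value for a small `k` against the value obtained with `k` set to the full list length gives the kind of contrast the two curves describe.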

Finally, our models were able to identify and suggest testing of some of the currently known compounds used to treat AML up to 11 years before they were explicitly mentioned in the literature, as illustrated below. The remainder of this repository describes the evolution of the project.

Contributing

We encourage you to contribute to our project! Please check out the Issues page.

Built With

Getting Started

This section provides a high-level quick start guide.

Prerequisites

To use this project, you need to have Python installed on your machine. This project used Python version 3.6. You will also need pip, the Python package manager, to install the project's other requirements.

Clone the repository

git clone https://github.com/matheusvvb-19/WE4LKD-leukemia_w2v.git
cd WE4LKD-leukemia_w2v/

Set up a Python virtual environment

# create venv
python3 -m venv venv
# activate venv
source venv/bin/activate
# install requirements
pip3 install --ignore-installed -r requirements.txt

Usage

1. Optionally, change the search phrases in the /data/search_strings.txt file.

2. Run crawler.py:

    mkdir results
    python3 crawler.py

   or download, decompress, and place this file into /pubchem/. If you do this, skip to step 5.

3. Execute the script merge_txt.py, which generates the .txt files with all articles between periods:

    mkdir results_aggregated
    python3 merge_txt.py

4. Execute the script /pubchem/clean_summaries.py, which cleans the merged .txt files:

    python3 clean_summaries.py

5. Train the Word2Vec or FastText incremental models:

    cd word2vec
    python3 train_yoy.py
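The merge and training steps above work on one corpus per period. This README does not show how merge_txt.py builds those periods, so the sketch below only illustrates one plausible reading: each year's corpus contains every abstract published up to and including that year. The function name `cumulative_corpora` and the in-memory input format are hypothetical.

```python
from typing import Dict, List


def cumulative_corpora(abstracts_by_year: Dict[int, List[str]]) -> Dict[int, List[str]]:
    """Map each year to all abstracts published up to and including that year.

    Assumption: periods are cumulative; the repository scripts may differ.
    """
    corpora: Dict[int, List[str]] = {}
    seen: List[str] = []
    for year in sorted(abstracts_by_year):
        seen.extend(abstracts_by_year[year])
        # Copy the list so later years do not mutate earlier snapshots.
        corpora[year] = list(seen)
    return corpora


# Toy input with placeholder abstracts.
toy = {1964: ["abstract b"], 1963: ["abstract a"], 1966: ["abstract c", "abstract d"]}
for year, docs in cumulative_corpora(toy).items():
    print(year, len(docs))
```

Under this reading, a model for a given year would be trained only on text available by that year, so its predictions can be compared against what the literature reported afterwards.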

Streamlit web app

To complement this project, we developed two web applications using the Streamlit Python package. The Embeddings Viewer allows users to explore the vector space of our Word2Vec models by searching for specific tokens and analyzing their neighborhood, applying filters to refine the results if necessary.
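As a rough illustration of what exploring a token's neighborhood involves, the sketch below ranks tokens by cosine similarity to a query token. It uses plain Python and made-up two-dimensional vectors; the function names and toy tokens are hypothetical, and the actual app may rely on the trained models' own similarity routines instead.

```python
import math
from typing import Dict, List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(
    vectors: Dict[str, List[float]], token: str, top_n: int = 3
) -> List[Tuple[str, float]]:
    """Return the `top_n` tokens most similar to `token`, best first."""
    query = vectors[token]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != token  # skip the query token itself
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Toy embedding space; the values are invented for illustration.
toy_vectors = {
    "aml": [1.0, 0.0],
    "drug_x": [0.9, 0.1],
    "drug_y": [0.0, 1.0],
    "drug_z": [-1.0, 0.0],
}
print(nearest_neighbors(toy_vectors, "aml", top_n=2))
```

Filters such as the ones mentioned above could be applied to `scored` before sorting, for example by keeping only tokens from a list of known compounds.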

Acknowledgements

This work was supported by the Brazilian agencies FAPESP (grant 2021/13054-8), Capes, and CNPq. The authors thank Priscila Portela Costa for helping conceptualize this project. We also thank the Computer Science Department at the University of Sheffield for hosting Matheus during his research internship for this project.

Contact

Please do not hesitate to contact us through any of the links below.

Matheus Vargas Volpon Berto,
Computer Science B.Sc. student, Federal University of São Carlos (UFSCar), Sorocaba, Brazil.

Citation

@article{BERTO2024123566,
  title = {Accelerating discoveries in medicine using distributed vector representations of words},
  journal = {Expert Systems with Applications},
  volume = {250},
  pages = {123566},
  year = {2024},
  issn = {0957-4174},
  doi = {https://doi.org/10.1016/j.eswa.2024.123566},
  url = {https://www.sciencedirect.com/science/article/pii/S0957417424004317},
  author = {Matheus V.V. Berto and Breno L. Freitas and Carolina Scarton and João A. Machado-Neto and Tiago A. Almeida},
}

References

"Unsupervised word embeddings capture latent knowledge from materials science literature", Nature 571, 95–98 (2019)

