Search-Engine-for-COVID-19-literature

A semantic search engine for research papers on COVID-19, covering categories such as symptoms, influential factors, and similar diseases and viruses.
This is a solution to the CORD-19 challenge on Kaggle. The dataset was created in response to the COVID-19 pandemic and contains over 500,000 scholarly articles, including over 200,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses.

COVID-19 Open Research Dataset Challenge (CORD-19)

The challenge is to build a search engine or data mining tool that can accurately answer high-priority scientific questions in this domain. The dataset was 46.71 GB at the time of writing, and more literature is added periodically. The dataset can be found here and here.
The traditional approach is keyword-based search using ranking functions such as TF-IDF or BM25. Although these methods give solid results, they ignore the order of the words in the query. Moreover, the vector size grows with the vocabulary (sparse vectors can mitigate this problem). The major drawback, however, is that these approaches fail to capture the semantics of the query or the data.
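
To make the drawback concrete, here is a minimal BM25 sketch in plain Python. It is not the code used in this repository: the function name `bm25_scores`, the whitespace tokenizer, the parameters `k1=1.5` and `b=0.75`, and the toy documents are all assumptions chosen for illustration, and the IDF term is one common smoothed variant.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with a basic BM25 formula."""
    if not documents:
        return []
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue  # only exact keyword matches contribute
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


# Toy corpus (hypothetical sentences, not taken from CORD-19).
docs = [
    "fever and cough are common symptoms",
    "the virus spreads through droplets",
    "high temperature was reported in patients",
]
scores = bm25_scores("fever symptoms", docs)
reversed_scores = bm25_scores("symptoms fever", docs)
```

In this toy corpus the third document is about a raised temperature but shares no keyword with the query, so it scores zero, and reordering the query words leaves every score unchanged.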

To overcome this, word or sentence embeddings can be used, which capture the meaning of the query much more accurately.

With word embeddings, the query and the data are each represented by a weighted sum of the embeddings of the words in a sentence (BM25 can be used to weight the words).
With sentence embeddings, no such processing is needed, as the vectors already capture the semantics of the entire text.
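
The weighted-sum idea for word embeddings can be sketched as follows. This is an illustration only: `weighted_sentence_vector`, the two-dimensional toy vectors, and the weights are hypothetical stand-ins for real word embeddings and real BM25 term weights.

```python
import numpy as np


def weighted_sentence_vector(tokens, word_vectors, weights):
    """Return the weighted sum of the embeddings of the known tokens."""
    dim = len(next(iter(word_vectors.values())))
    total = np.zeros(dim)
    for token in tokens:
        if token in word_vectors:
            # Tokens without an explicit weight fall back to 1.0 (an assumption).
            weight = weights.get(token, 1.0)
            total += weight * np.asarray(word_vectors[token], dtype=float)
    return total


# Hypothetical 2-dimensional embeddings and term weights.
toy_vectors = {"the": [0.5, 0.5], "viral": [1.0, 0.0], "shedding": [0.0, 1.0]}
toy_weights = {"the": 0.1, "viral": 2.0, "shedding": 3.0}

sentence_vector = weighted_sentence_vector(
    ["the", "viral", "shedding"], toy_vectors, toy_weights
)
```

The low weight on "the" keeps a frequent, uninformative word from dominating the sentence vector, which is the role the BM25 weights play in the description above.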

Our solution uses sentence embeddings to encode the query and the data, then computes similarity scores between the two to rank the documents. To generate sentence embeddings, we use BioBERT, a pre-trained biomedical language representation model for biomedical text mining, which produces 768-dimensional vectors. The model can be found here and the corresponding paper here. We use the Sentence Transformers framework (Multilingual Sentence, Paragraph, and Image Embeddings using BERT & Co.) to load the BioBERT model.
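
The ranking step can be sketched with NumPy once the embeddings exist. The sketch assumes cosine similarity as the similarity score, which the text above does not specify, and it uses hypothetical 3-dimensional vectors in place of the 768-dimensional BioBERT embeddings; `rank_by_cosine` is an illustrative name, not a function from this repository.

```python
import numpy as np


def rank_by_cosine(query_vector, doc_vectors, top_n=3):
    """Return (document index, cosine similarity) pairs, best match first."""
    query = np.asarray(query_vector, dtype=float)
    docs = np.asarray(doc_vectors, dtype=float)
    # Normalise to unit length so the dot product equals the cosine similarity.
    query = query / np.linalg.norm(query)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    similarities = docs @ query
    order = np.argsort(-similarities)[:top_n]
    return [(int(i), float(similarities[i])) for i in order]


# Hypothetical embeddings: document 2 points in the same direction as the query.
toy_doc_vectors = [[0.0, 1.0, 0.0], [1.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
toy_query_vector = [1.0, 0.0, 0.0]

ranked = rank_by_cosine(toy_query_vector, toy_doc_vectors)
```

In a real pipeline the vectors would come from the sentence encoder, and the function would only be practical for small collections because it compares the query with every document.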

The main bottleneck for a search engine is the time taken to compute similarity scores between the query and every document in the dataset. To overcome this, we use Faiss, a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that may not fit in RAM. It builds an index over the entire dataset and efficiently returns the top 'n' results.

Question	Search Results
What do models for transmission predict?
What is the longest duration of viral shedding?
Effectiveness of case isolation/isolation of exposed individuals

All of the above questions are part of the CORD-19 challenge.

