The coronavirus pandemic created enormous health, economic, environmental, and social challenges for the entire human population. The research community worked tirelessly toward a vaccine, but could we help speed up these efforts even more?
In response to the COVID-19 pandemic, the White House and a coalition of leading research groups prepared a COVID-19 Open Research Dataset (CORD-19). It is a resource of over 1 million scholarly articles, including over 400,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. This freely available dataset was provided to the global research community to apply recent advances in natural language processing and other AI techniques to generate new insights in support of the ongoing fight against this infectious disease.
This project aims to help researchers navigate this fast-growing body of coronavirus literature to efficiently find relevant and up-to-date information. This is done by using various topic modeling algorithms to cluster similar papers together. We leverage Hadoop for data storage management and PySpark for building ML and DL pipelines.
The dataset consists of JSON and CSV files. Each paper is saved in a nested JSON file, while some additional metadata is available in a CSV file. A detailed description is available here. The image below summarizes the data preprocessing pipeline.
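As a rough illustration of the flattening step, the sketch below turns one nested paper record into a single flat row. It assumes the CORD-19 full-text layout with `paper_id`, `metadata.title`, `abstract`, and `body_text` fields. The function name `flatten_paper` and the sample record are hypothetical and are not taken from `cord19-parser.py`.

```python
import json


def flatten_paper(paper: dict) -> dict:
    """Flatten one nested CORD-19-style paper record into a single row."""
    # Each section is assumed to be a list of paragraph objects with a "text" field.
    abstract = " ".join(p.get("text", "") for p in paper.get("abstract", []))
    body = " ".join(p.get("text", "") for p in paper.get("body_text", []))
    return {
        "paper_id": paper.get("paper_id", ""),
        "title": paper.get("metadata", {}).get("title", ""),
        "abstract": abstract,
        "body_text": body,
    }


# Hypothetical sample record, for illustration only.
sample = json.loads("""
{
  "paper_id": "abc123",
  "metadata": {"title": "A sample title"},
  "abstract": [{"text": "First abstract paragraph."}],
  "body_text": [{"text": "Intro text."}, {"text": "Methods text."}]
}
""")

row = flatten_paper(sample)
```

Missing fields fall back to empty strings, so a paper without an abstract still yields a complete row that can be written to CSV.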
- Graph databases provide a way to generate and visualize relationships between entities
- Both PySpark GraphFrames and neo4j support graph-based data storage; we explored both tools
- Each author, paper, and journal acts as a node
- Nodes are connected by one of two relationships: “has_published” or “has_paper”
- Data was prepared using Python for import into neo4j
- Docker was used to install neo4j (version 5.2.0)
- A Bash script (start_neo4j.sh) starts the Docker container and the neo4j server, then imports the data
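The data preparation step described above could look roughly like the following sketch, which derives node lists and relationship rows from flattened paper records. The record fields, the use of author names as identifiers, the relationship directions, and the CSV header are all assumptions made for illustration; the project's actual import files may differ.

```python
import csv
import io

# Hypothetical flattened records; the field names are assumptions.
papers = [
    {"paper_id": "p1", "title": "Paper one", "journal": "Journal A",
     "authors": ["Ann Lee", "Bo Chen"]},
    {"paper_id": "p2", "title": "Paper two", "journal": "Journal A",
     "authors": ["Ann Lee"]},
]


def build_graph_rows(records):
    """Return de-duplicated author and journal nodes plus relationship rows."""
    authors, journals, rels = set(), set(), []
    for rec in records:
        journals.add(rec["journal"])
        # Assumed direction: (Journal)-[has_paper]->(Paper)
        rels.append((rec["journal"], rec["paper_id"], "has_paper"))
        for name in rec["authors"]:
            authors.add(name)
            # Assumed direction: (Author)-[has_published]->(Paper)
            rels.append((name, rec["paper_id"], "has_published"))
    return sorted(authors), sorted(journals), rels


def to_csv(header, rows):
    """Serialize rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


authors, journals, rels = build_graph_rows(papers)
rels_csv = to_csv(["start", "end", "type"], rels)
```

Collecting authors and journals in sets keeps each entity as a single node even when it appears in many papers, which is what lets the graph link papers through shared authors or journals.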
Below are a few sample results of topic modeling
- Topic 1 seems to be concerned with immune response and antibodies
- Topic 2 seems to cover the effects of the pandemic on society, mental health (stress, anxiety), and the work environment (behavior, support)
- Topic 3 papers could be related to infection detection, antibody sequencing, and the virus itself
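Topic labels like these are usually read off the highest-weighted terms of each topic. The sketch below shows that step in plain Python, assuming a topic model has already produced one weight per vocabulary term for each topic. The helper `top_terms`, the vocabulary, and the weights are toy values invented for illustration; they are not output from this project's models.

```python
def top_terms(topic_weights, vocab, k=3):
    """Return the k highest-weighted vocabulary terms for each topic."""
    result = []
    for weights in topic_weights:
        # Rank term indices by descending weight within this topic.
        ranked = sorted(range(len(vocab)), key=lambda i: weights[i], reverse=True)
        result.append([vocab[i] for i in ranked[:k]])
    return result


# Toy vocabulary and weights (hypothetical, not real model output).
vocab = ["antibody", "immune", "anxiety", "stress", "sequencing", "detection"]
topic_weights = [
    [0.40, 0.35, 0.02, 0.03, 0.10, 0.10],
    [0.03, 0.02, 0.45, 0.40, 0.05, 0.05],
]

terms = top_terms(topic_weights, vocab, k=2)
```

A human reader then assigns a short description to each list of terms, which is why the interpretations are phrased tentatively.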
covid19-literature-analysis
|
|--- data_prep: Code for preprocessing the raw data
|    |--- cord19-parser.py: A Python parser to convert the raw data into a structured CSV file
|    |--- Data-Preprocessing.ipynb: The same data parser, implemented with PySpark
|--- data_viz: Some visualizations to understand the data better
|--- graph_db: Post-project exploratory work to store and represent data using neo4j and PySpark GraphFrames
|--- images: README file images
|--- modeling: Modeling work
|--- ppt: Contains a presentation describing the whole project