A project to cluster documents from PubMed based on the disease concepts described in them. The input to this class is a list of PMIDs, and the output is the cluster assigned to each of the PubMed documents in the input.
The steps followed in this project are:
- Fetch XML documents from PubMed
- Parse XML documents to extract title and abstract
- Pre-process the fetched documents to extract disease words
- Create a document matrix based on TF-IDF
- Perform clustering
- Evaluate results
Environment:
- Python 3.7 or Anaconda (preferred) for Python 3.7.
Python libraries (with pip):
- BioPython (for fetching PubMed data):
pip install -U biopython
- NumPy (for matrix operations):
pip install -U numpy scipy matplotlib
- SciKit-Learn (for clustering algorithms):
pip install -U scikit-learn
- Gensim (for TF-IDF model):
pip install -U gensim
- BeautifulSoup4 (to parse XML):
pip install -U beautifulsoup4
Other tools:
- UMLS MetaMap (to extract disease words). Note: A free UMLS account is required to download the MetaMap binary from the installation link.
- The first step is to fetch the documents from PubMed and parse them. They are parsed using BeautifulSoup's XML parser, and the text in the ArticleTitle and Abstract tags is extracted.
- The extracted text is then piped to MetaMap. MetaMap queries are run only to identify the disease words in the articles. For this, the MetaMap query is limited to results from the "disease or syndrome" and "neoplastic process" semantic types. At the end, the ontology concept that a disease word is matched to is used instead of the original word that appears in the article. This standardizes concepts and replaces acronyms with the concept they stand for.
- TF-IDF is used to generate the weight of each word in each document. Words appearing in fewer than 3 documents are dropped since they produce noise. The TF-IDF weights are used to generate the document matrix, which is the input to the clustering algorithm.
- The clustering algorithm used is Affinity Propagation. The damping factor was set to its minimum (0.5) after experimenting with various values and observing the results, and the number of iterations was set to 500 so that the algorithm runs to completion and is not terminated early. Affinity Propagation was chosen over other clustering algorithms because the number of clusters is not known in advance, the number of samples is small, and it converges quickly. However, this algorithm does not scale to a large number of samples. If the number of clusters is provided, or internal or external metrics are used for evaluation, k-means or hierarchical clustering are better alternatives.
- When labels are provided, the clustering is evaluated using purity and F-measure.
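The TF-IDF and clustering steps above can be sketched roughly as follows. This is a minimal illustration, not the project's code: it swaps Gensim's TF-IDF model for scikit-learn's TfidfVectorizer (whose weighting formula may differ), and the names toy_docs, build_tfidf_matrix and cluster_documents are hypothetical. The toy documents stand in for MetaMap output and are made up for the example.

```python
from sklearn.cluster import AffinityPropagation
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents, each already reduced to disease concepts
# (in the real pipeline this would be the MetaMap output).
toy_docs = [
    "diabetes hypertension",
    "diabetes obesity",
    "hypertension obesity",
    "diabetes hypertension obesity",
    "carcinoma lymphoma",
    "carcinoma melanoma",
    "lymphoma melanoma",
    "carcinoma lymphoma melanoma",
]


def build_tfidf_matrix(docs, min_df=3):
    """Return a dense TF-IDF matrix and its vocabulary.

    min_df=3 drops terms that appear in fewer than 3 documents.
    """
    vectorizer = TfidfVectorizer(min_df=min_df)
    matrix = vectorizer.fit_transform(docs).toarray()
    # Order the vocabulary by column index so it lines up with the matrix.
    vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
    return matrix, vocab


def cluster_documents(matrix, damping=0.5, max_iter=500):
    """Cluster the rows of the matrix with Affinity Propagation."""
    model = AffinityPropagation(
        damping=damping, max_iter=max_iter, random_state=0
    )
    return model.fit_predict(matrix)


matrix, vocab = build_tfidf_matrix(toy_docs)
labels = cluster_documents(matrix)
print(vocab)
print(list(labels))
```

Affinity Propagation computes pairwise similarities between all samples, which is why it suits small inputs like this one but not large collections.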
An example script is provided as a part of main.py. This script contains the sample parameters to be passed to the class and function. An example of how to use the provided project is also shown in the Jupyter Notebook example.ipynb.
The class PubMedClustering contains the methods and variables that are used in executing the whole pipeline. During initialization, the following parameters are to be passed to it:
- pubmed_ids (required): Either a list of PMID strings, a string containing comma-separated PMIDs, or the disk location of a text file containing a list of PMIDs. If it is a disk location, ensure the is_file flag is set to True.
- is_file (optional): Boolean flag to specify if pubmed_ids is a text file that should be read. Default: False.
- metamap (optional): Location of the MetaMap binary. Without MetaMap, the pre-processing will not run and the program will terminate. Default: /opt/public_mm/bin/metamap16.
- email (optional): Email ID. Required to query PubMed using the BioPython library. Default: "Your.Name.Here@example.org".
- labels (optional): Either a dictionary containing PMID -> cluster pairs or the location of a text file in which each line is of the form PMID\tlabel. If it is a text file, ensure the labels_is_file flag is set to True. Default: None.
- labels_is_file (optional): Boolean flag to specify if labels is a text file that should be read. Default: False.
- write_to_file (optional): Boolean flag to specify if the output should be written to a file where each line is of the form PMID\tlabel. Default: False.
- output_file (optional): Location of the output file. If not specified and both write_to_file and is_file are set, output_file will be set to pubmed_ids, with _clustered appended to its end.
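To make the accepted input forms concrete, here is a small sketch of how the three forms of pubmed_ids and a PMID\tlabel file could be normalized. The function names normalize_pmids and read_labels are hypothetical and not part of the project, and the sketch assumes the PMID text file holds one PMID per line, which the description above does not state.

```python
from pathlib import Path


def normalize_pmids(pubmed_ids, is_file=False):
    """Return a list of PMID strings from a list, a comma-separated
    string, or (when is_file is True) a text file."""
    if is_file:
        # Assumption: one PMID per line; blank lines are skipped.
        text = Path(pubmed_ids).read_text(encoding="utf-8")
        return [line.strip() for line in text.splitlines() if line.strip()]
    if isinstance(pubmed_ids, str):
        return [p.strip() for p in pubmed_ids.split(",") if p.strip()]
    return [str(p).strip() for p in pubmed_ids]


def read_labels(path):
    """Read a file of PMID<TAB>label lines into a dict of PMID -> label."""
    labels = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        pmid, label = line.split("\t", 1)
        labels[pmid.strip()] = label.strip()
    return labels
```

For example, normalize_pmids("123, 456") and normalize_pmids(["123", "456"]) both yield the same list of two PMID strings.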
After initializing the class, use the run() method of the class to execute the whole pipeline. Upon completion, the results can be seen by querying the clustering_results_dict property (a dictionary of PMID -> cluster). If the labels are provided, the evaluation results can be seen by querying the purity, precision, recall and f_measure properties of the object.
- Running time: The running time of the pipeline is O(N^2 T), where N is the number of samples and T is the number of iterations, because of the clustering algorithm used. All of the other steps are O(N), with the exception of the MetaMap queries. The running time of the MetaMap queries depends on the hyper-parameters chosen for MetaMap's execution, but in general they are slow to execute.
- Design choices: I created a primary class for the user to interact with and kept its interface simple. I moved all of the helper functions (fetching and parsing PubMed XML, running and parsing MetaMap) outside the class so that only essential class-related functions are within the class. I used private methods within the class, but made all of the parameters available so that they can be queried and observed.
- Performance Evaluation: Purity and F-measure (both implemented) are metrics chosen to evaluate the clustering performance.
- Larger datasets: Given a larger dataset, a combination of TF-IDF with word embeddings would improve the clustering results. A larger dataset is needed to train a word embeddings model and capture syntactic variations in the dataset.
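As a rough illustration of the purity and F-measure metrics named under Performance Evaluation, the sketch below computes both from two PMID -> id dictionaries. It assumes the pair-counting definition of precision, recall and F-measure (a pair of documents counts as a true positive when it shares both a cluster and a label); the project's own implementation may use a different definition, and the function names here are hypothetical.

```python
from collections import Counter
from itertools import combinations


def purity(clusters, labels):
    """Fraction of documents that carry the majority label of their cluster."""
    per_cluster = {}
    for pmid, cluster_id in clusters.items():
        per_cluster.setdefault(cluster_id, Counter())[labels[pmid]] += 1
    majority_total = sum(max(c.values()) for c in per_cluster.values())
    return majority_total / len(clusters)


def pairwise_prf(clusters, labels):
    """Pair-counting precision, recall and F-measure (an assumed definition)."""
    tp = fp = fn = 0
    for a, b in combinations(sorted(clusters), 2):
        same_cluster = clusters[a] == clusters[b]
        same_label = labels[a] == labels[b]
        if same_cluster and same_label:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_label:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Made-up example: five documents, two clusters, two gold labels.
example_clusters = {"1": 0, "2": 0, "3": 0, "4": 1, "5": 1}
example_labels = {"1": "a", "2": "a", "3": "b", "4": "b", "5": "b"}
print(purity(example_clusters, example_labels))        # 0.8
print(pairwise_prf(example_clusters, example_labels))  # (0.5, 0.5, 0.5)
```

Purity alone rewards many small clusters, so reporting it together with a pair-based F-measure gives a more balanced view.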