method.Rmd

---
title: "Method Summary"
author:
  - name: Nan Xiao
    url: https://nanx.me/
    affiliation: Seven Bridges
    affiliation_url: https://www.sevenbridges.com/
  - name: Soner Koc
    url: https://github.com/skoc
    affiliation: Seven Bridges
    affiliation_url: https://www.sevenbridges.com/
  - name: Kaushik Ghose
    url: https://kaushikghose.wordpress.com/
    affiliation: Seven Bridges
    affiliation_url: https://www.sevenbridges.com/
date: "`r Sys.Date()`"
output:
  distill::distill_article:
    toc: yes
bibliography: rankv.bib
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```

# Motivation

Numerous signal detection methods have been proposed over the past decades for pharmacovigilance monitoring in large databases. These methods often produce a ranked list of detected signals (anomalies) that warrant further investigation. However, a holistic view that integrates the results from these approaches is still lacking. Here, we developed a new signal detection method that ensembles the vaccine safety signals detected by any set of signal detection methods and generates an aggregated list of top signals based on distance metric optimization. Our method can potentially enhance the weak signals from individual methods and, at the same time, reduce the number of false-positive signals discovered by chance.

A flow diagram of our method:

<img src="assets/flow.svg" alt="">

# Base signal detectors and signal rankers

We included four mainstream signal detection methods that have already been applied in real-world pharmacovigilance monitoring by regulatory agencies globally. These methods (or metrics) are:

- GPS - Gamma Poisson Shrinker
- PRR - Proportional Reporting Ratio
- ROR - Reporting Odds Ratio
- BCPNN - Bayesian Confidence Propagation Neural Network

As expected, the signal detection results from these methods are broadly similar (in terms of high-ranking vaccine-adverse event pairs) but differ in the ranking details. We used the implementations available from the R packages [openEBGM](https://cran.r-project.org/package=openEBGM) [@canida2017] and [PhViD](https://cran.r-project.org/package=PhViD) [@ahmed2010].
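To make the two frequentist metrics concrete, here is a minimal sketch of how PRR and ROR are computed from a 2x2 contingency table using their textbook definitions. The counts and variable names are made up for illustration; this is not the code path used by openEBGM or PhViD, and the Bayesian shrinkage estimates behind GPS and BCPNN are not shown.

```r
# Hypothetical 2x2 counts for one vaccine-adverse event pair (made-up numbers).
#                        event of interest   all other events
# vaccine of interest         n_a = 20          n_b = 180
# all other vaccines          n_c = 100         n_d = 9700
n_a <- 20
n_b <- 180
n_c <- 100
n_d <- 9700

# PRR: share of reports mentioning the event for this vaccine,
# divided by the same share among all other vaccines.
prr <- (n_a / (n_a + n_b)) / (n_c / (n_c + n_d))

# ROR: odds of the event for this vaccine divided by the odds for all others.
ror <- (n_a / n_b) / (n_c / n_d)

# Approximate 95% confidence interval for the ROR, built on the log scale.
se_log_ror <- sqrt(1 / n_a + 1 / n_b + 1 / n_c + 1 / n_d)
ror_ci <- exp(log(ror) + c(-1, 1) * qnorm(0.975) * se_log_ror)

round(c(PRR = prr, ROR = ror, ROR_lower = ror_ci[1], ROR_upper = ror_ci[2]), 2)
```

Ranking all vaccine-adverse event pairs by such a metric, or by a lower confidence bound, yields one ranked list per method.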

# Rank aggregation for ensembled safety signal detection

To aggregate the ranked safety signal lists detected by the multiple methods above, we formulate the aggregation as an optimization problem:

$$
\delta^* = \arg\min_{\delta} \sum_{i=1}^{m} d(\delta, L_i)
$$

where $\delta$ is a candidate aggregated ranked list of safety signals of length $k = |L_i|$, $d$ is a distance function between rankings (the Spearman footrule distance here), $L_i$ is the ranked list of detected signals generated by the $i$-th base method, and $m$ is the number of base methods. The idea is to find the "ideal" list $\delta^*$ that minimizes the total distance between $\delta$ and the lists $L_i$.

A similar rank aggregation problem also exists in gene list prioritization for high-throughput data analysis, where a number of ordered gene lists discovered by statistical tests are aggregated. The R package [RankAggreg](https://cran.r-project.org/package=RankAggreg) [@pihur2007] was repurposed to solve the same optimization problem here.
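The optimization problem in this section can be illustrated on a toy example. The sketch below uses base R only, with three made-up ranked lists over four placeholder items, and finds the minimizer by exhaustively scoring every permutation. Exhaustive search is an assumption made for clarity and is only feasible for very short lists; it stands in for, and is not, the stochastic search that a package such as RankAggreg performs on realistic problem sizes. The helper names `footrule` and `permutations` are hypothetical.

```r
# Three hypothetical ranked lists over the same four items (best first).
# The labels are placeholders, not real safety signals.
lists <- list(
  c("A", "B", "C", "D"),
  c("B", "A", "C", "D"),
  c("A", "C", "B", "D")
)

# Spearman footrule distance: sum over items of the absolute difference
# between the item's position in x and its position in y.
footrule <- function(x, y) {
  sum(abs(seq_along(x) - match(x, y)))
}

# Enumerate all permutations of a vector (suitable for tiny examples only).
permutations <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    for (p in permutations(v[-i])) {
      out[[length(out) + 1]] <- c(v[i], p)
    }
  }
  out
}

candidates <- permutations(sort(unique(unlist(lists))))

# Objective: total footrule distance from a candidate list to every input list.
total_distance <- vapply(
  candidates,
  function(delta) sum(vapply(lists, footrule, numeric(1), y = delta)),
  numeric(1)
)

best <- candidates[[which.min(total_distance)]]
best
min(total_distance)
```

For these toy lists, the minimizer is the ordering A, B, C, D with a total distance of 4: it coincides with the first list and is at footrule distance 2 from each of the other two.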

# Data

The raw data used in this solution is downloaded from the [VAERS database](https://vaers.hhs.gov/data.html), covering 30 years (1990-2019) of domestic vaccine adverse event reports in the United States. The raw data is then cleaned up and transformed into an analyzable format. About 3.44 million vaccine-adverse event pairs are extracted and included in the analysis.
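As a rough illustration of what extracting vaccine-adverse event pairs can look like, the sketch below joins a toy vaccine table to a toy symptom table on the report ID and counts the resulting pairs. The column names `VAERS_ID`, `VAX_TYPE`, and `SYMPTOM1`, `SYMPTOM2` follow the public VAERS file layout, but the rows are made up, and the pairing and de-duplication rules are assumptions rather than a description of the cleaning steps actually used in this analysis.

```r
# Toy stand-ins for the VAERS vaccine and symptom tables (made-up rows).
vax <- data.frame(
  VAERS_ID = c(1, 1, 2, 3),
  VAX_TYPE = c("FLU3", "HEPB", "FLU3", "MMR"),
  stringsAsFactors = FALSE
)
symptoms <- data.frame(
  VAERS_ID = c(1, 2, 3),
  SYMPTOM1 = c("Pyrexia", "Pyrexia", "Rash"),
  SYMPTOM2 = c("Headache", NA, NA),
  stringsAsFactors = FALSE
)

# Reshape the wide symptom columns into one row per (report, symptom).
symptom_cols <- grep("^SYMPTOM[0-9]+$", names(symptoms), value = TRUE)
symptoms_long <- do.call(rbind, lapply(symptom_cols, function(col) {
  data.frame(
    VAERS_ID = symptoms$VAERS_ID,
    SYMPTOM = symptoms[[col]],
    stringsAsFactors = FALSE
  )
}))
symptoms_long <- symptoms_long[!is.na(symptoms_long$SYMPTOM), ]

# Join on the report ID: each vaccine in a report pairs with each symptom.
pairs <- unique(merge(vax, symptoms_long, by = "VAERS_ID"))

# Count reports per vaccine-adverse event pair and drop empty combinations.
pair_counts <- as.data.frame(
  table(vaccine = pairs$VAX_TYPE, event = pairs$SYMPTOM),
  stringsAsFactors = FALSE
)
pair_counts <- pair_counts[pair_counts$Freq > 0, ]
pair_counts
```

Counts of this kind are the input from which the 2x2 tables used by disproportionality metrics are built.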

# Code and website

We created a companion website detailing our approach, analysis pipeline, and findings in this challenge. The site is accessible at https://nanx.me/rankv/. All code is available on GitHub: https://github.com/nanxstats/rankv.

# Conclusions

Besides the potential health-impacting signals, a considerable proportion of the top-ranked vaccine-adverse event pairs we found point to opportunities to improve the vaccine administration process, vaccine product labeling, the quality of the upstream reporting data, and the data ingestion procedures.

This solution also supports the idea that, by harnessing the power of open data and high-quality open source data analysis software, we can quickly develop new analytical approaches and flexible pipelines for extracting new insights from public health information, and present both the process and the results to the community, thus increasing computational transparency and reproducibility.