Paper_Figures.Rmd

---
title: "DNA Methylation Hierarchical Variance Model\nPaper Figures"
author: "Kevin Murgas"
date: "`r Sys.Date()`"
output:
  pdf_document:
    fig_caption: true
    number_sections: true
    toc: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(ggpubr)

library(gt)
library(rlang)
library(knitr)
library(gtools)

# set working directory
setwd("/Users/kevinmurgas/Documents/Data+ project/EPIC data")
library(here)

select <- dplyr::select
rename <- dplyr::rename
filter <- dplyr::filter
slice <- dplyr::slice

set.seed(1234)
```

```{r Load_data, echo=FALSE, include=TRUE, message=FALSE}
# Load raw methylation data
load(here("myFA_bulkonly.Rdata"))

stanFile = "FullResultsTCGA_relaxgamma3.Rdata"
load(here(stanFile))

prepdatafolder = "data_final_rg3"
cat(sprintf("Load prepped data dir: %s (needs to be pre-processed in StanAnalysis.Rmd)...",prepdatafolder))
gened_data <- read_csv(here(prepdatafolder, "gened_data.csv"), col_types = list("CHR" = col_character()))
prepped_data <- read_csv(here(prepdatafolder, "prepped_data.csv"))
permed_data <- read_csv(here(prepdatafolder, "permed_data.csv"))
```

\newpage
\section{Model Overview, Priors, Example Site Fit}

\subsection{Data}
Data are logit-transformed beta values from the EPIC methylation array, pre-processed with the minfi Noob algorithm. \newline
\newline Sample counts (total): 48 samples from 21 patients. \newline
By tissue: 6 normal samples and 42 paired bulk tumor samples (two per patient).

\subsection{Model equation}
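Let $i$ index patients and $j$ index samples within a patient ($j=0$: normal; $j=1,2$: tumor samples 1 and 2). $\delta_{j0}$ is the Kronecker delta, so the tumor terms enter only when $j \neq 0$.
$$ y_{ij} = \mu + \alpha_i + (1-\delta_{j0})(\nu + \beta_i + \gamma_{ij}) + \epsilon_{ij} $$
$$ \alpha_i \sim N(0,\sigma_P) , \quad \beta_i \sim N(0,\sigma_{PT}) , \quad \gamma_{ij} \sim N(0,\sigma_T) , \quad \epsilon_{ij} \sim N(0,\sigma_E) $$
\newline The model is fit independently at each CpG site using RStan's MCMC sampling algorithm. \newline
Stan parameters: 4 chains x 2000 iterations each (200 warmup iterations discarded per chain) = 4 x 1800 = 7200 post-warmup draws; adapt_delta = 0.999.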
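
To make the structure of the model equation concrete, the following is a minimal simulation sketch for a single CpG site. It is not part of the analysis code: the helper `simulate_site` and all parameter values are hypothetical, the residual is assumed to be Gaussian with standard deviation sigmaE, and every patient is given one normal sample for simplicity (the real data contain only 6 normal samples).

```r
# Hypothetical helper, not part of the analysis code: simulate logit-beta
# values for a single CpG site under the hierarchical model.
# All parameter values are arbitrary and chosen for illustration only.
simulate_site <- function(n_pat = 21, mu = 0, nu = -0.5,
                          sigmaP = 0.3, sigmaPT = 0.5,
                          sigmaT = 0.2, sigmaE = 0.1) {
  alpha <- rnorm(n_pat, 0, sigmaP)   # patient effect, shared by normal and tumor
  beta  <- rnorm(n_pat, 0, sigmaPT)  # patient-specific tumor effect
  per_patient <- lapply(seq_len(n_pat), function(i) {
    j <- 0:2                         # 0 = normal, 1 and 2 = tumor samples
    is_tumor <- as.numeric(j != 0)   # equals (1 - delta_{j0})
    gamma <- rnorm(3, 0, sigmaT)     # sample-level tumor effect
    eps   <- rnorm(3, 0, sigmaE)     # residual error (assumed Gaussian)
    data.frame(patient = i, sample = j,
               y = mu + alpha[i] + is_tumor * (nu + beta[i] + gamma) + eps)
  })
  do.call(rbind, per_patient)
}

set.seed(1)
sim <- simulate_site()
head(sim, 3)
```

In this sketch the two tumor samples of a patient share `alpha[i]` and `beta[i]` and differ only through `gamma` and `eps`, which is what lets the model separate within-tumor variation (sigmaT) from between-patient variation.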

\subsection{Prior Probabilities (Fig S1)}
Priors for each parameter (mu, nu, sigmaP/PT/T) were defined using TCGA 450k array data from colorectal tumor patients with multiple paired tumor samples and/or paired normal and tumor samples. \newline Empirical means and standard deviations were assessed for all 450k CpG sites. \newline The resulting distributions were fit with a bimodal (two-component) Gaussian (mu), a Cauchy distribution (nu), or gamma distributions (sigmas). \newline
The SigmaE (residual error) prior was defined using a weakly informative gamma distribution: gamma(2,2).
\newline Additionally, the priors were relaxed by algebraically increasing the spread of each fitted distribution while keeping its mode (peak) fixed: the variance of each gamma prior and of each Gaussian component of mu was tripled, and the scale of the Cauchy prior for nu was tripled.
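
The mode-preserving relaxation of a gamma prior can be sketched as follows. The helper `relax_gamma` is hypothetical (it is not taken from the analysis code) and assumes a shape/rate parameterization with shape > 1, where the mode is (shape - 1)/rate and the variance is shape/rate squared. Holding the mode $m$ fixed and setting the new variance to $v$ gives the quadratic $v b^2 - m b - 1 = 0$ for the new rate $b$, with new shape $1 + m b$.

```r
# Hypothetical helper: inflate the variance of a gamma(shape, rate) prior by a
# factor k while keeping its mode fixed (requires shape > 1).
relax_gamma <- function(shape, rate, k = 3) {
  mode <- (shape - 1) / rate          # mode of the original gamma
  v    <- k * shape / rate^2          # target variance: k times the original
  # positive root of v * rate_new^2 - mode * rate_new - 1 = 0
  rate_new  <- (mode + sqrt(mode^2 + 4 * v)) / (2 * v)
  shape_new <- 1 + mode * rate_new    # keeps (shape_new - 1) / rate_new == mode
  c(shape = shape_new, rate = rate_new)
}

# fitted sigmaP prior from the chunk below
relaxed_P <- relax_gamma(5.1989, 16.6192, k = 3)
relaxed_P
```

With the fitted sigmaP parameters this gives values close to the relaxed gamma(2.7693, 7.0029) overlaid as the "3x Relax" curve in the chunk below. For the Gaussian components of mu the same 3x variance corresponds to multiplying each standard deviation by sqrt(3); for the Cauchy prior on nu, which has no finite variance, the scale parameter is multiplied by 3 instead.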

```{r Priors, echo=FALSE, fig.fullwidth=TRUE, warning=FALSE, comment=NA}
# load TCGA estimates for mu, nu, sigmaP, sigmaPT, sigmaT
load(file="/Users/kevinmurgas/Documents/Data+ project/TCGA data/estimates_allpats.Rdata")

# drop exact zeros; the sigma estimates are stored as variances, so take the
# square root to get standard deviations (mu and nu stay on their original scale)
sigmaP_ests <- sqrt(sigmaP_ests[which(sigmaP_ests != 0)])
sigmaPT_ests <- sqrt(sigmaPT_ests[which(sigmaPT_ests != 0)])
sigmaT_ests <- sqrt(sigmaT_ests[which(sigmaT_ests != 0)])
mu_ests <- mu_ests[which(mu_ests != 0)]
nu_ests <- nu_ests[which(nu_ests != 0)]

# for each parameter, plot the distribution and overlay the fits
fBG <- function(x, a, m1, s1, m2, s2) a*dnorm(x,m1,s1) + (1-a)*dnorm(x,m2,s2) # bimodal gaussian
post_prior_histogram <- function(vals,priorfun1,priorfun2,bw,xl,titlestr,legFlag=FALSE) {
  # density-scaled histogram of the empirical TCGA estimates, with the
  # fitted prior (priorfun1) and the 3x-relaxed prior (priorfun2) overlaid
  plot_df <- data.frame(val=vals)
  p1 <- ggplot(plot_df, aes(x=val)) +
    geom_histogram(aes(y=..count../(length(vals)*bw), fill="Empirical"), binwidth=bw) +
    stat_function(fun=function(x) priorfun1(x),geom="line",aes(color="Fit")) +
    stat_function(fun=function(x) priorfun2(x),geom="line",aes(color="3x Relax")) +
    ggtitle(titlestr) + xlim(xl) + theme_classic() + xlab(titlestr) + ylab("Freq.") +
    scale_colour_manual(NULL, values = c("Empirical" = "gray","Fit" ="red","3x Relax" = "blue")) +
    scale_fill_manual(NULL,values = c("Empirical"="gray")) +
    theme(legend.position = c(0.75, 0.65))
  if (!legFlag) {p1 <- p1 + theme(legend.position = "none")}
  p1
}

# plot TCGA estimates for mu,nu,sigmaE,sigmaP,sigmaPT,sigmaT; with prior fits overlaid
pM <- post_prior_histogram(mu_ests, function(x) fBG(x,0.3557,-3.0675, 0.75214,1.1290, 1.39197), function(x) fBG(x,0.3557,-3.0675, sqrt(3)*0.75214,1.1290, sqrt(3)*1.39197), 0.1, c(-5,5), "Mu")
pN <- post_prior_histogram(nu_ests, function(x) dcauchy(x,-0.04051, 0.19330), function(x) dcauchy(x,-0.04051, 0.5799), 0.1, c(-5,5), "Nu")
pP <- post_prior_histogram(sigmaP_ests, function(x) dgamma(x, 5.1989, 16.6192), function(x) dgamma(x, 2.7693, 7.0029), 0.05, c(0,2), "SigmaP")
pPT <- post_prior_histogram(sigmaPT_ests, function(x) dgamma(x, 3.5539, 5.5324), function(x) dgamma(x, 2.1457, 2.4819), 0.05, c(0,2), "SigmaPT")
pT <- post_prior_histogram(sigmaT_ests, function(x) dgamma(x, 6.8383, 25.7560), function(x) dgamma(x, 3.3643, 10.430), 0.05, c(0,2), "SigmaT",legFlag=TRUE)
pE <- ggplot() + stat_function(fun=function(x) dgamma(x, 2, 2),geom="line",color="blue") + ggtitle("SigmaE") + xlim(c(0,2)) + theme_classic() + theme(legend.position = "none") + xlab("SigmaE") + ylab("Freq.")

pBlank <- ggplot(data=data.frame(len=c(1,2)),aes(x=len)) + theme_void() + geom_bar(aes(fill="Empirical")) +
    stat_function(fun=function(x) 3,geom="line",aes(color="1x variance")) +
    stat_function(fun=function(x) 4,geom="line",aes(color="3x variance")) + xlim(0,1) + ylim(0,1) + scale_colour_manual(NULL, values = c("Empirical" = "gray","1x variance" ="red","3x variance" = "blue")) + scale_fill_manual(NULL,values = "gray") + theme(legend.position = c(2, 0.75))

p_all <- ggarrange(pM,pN,pE,pP,pPT,pT,hjust=0,nrow=2,ncol=3)
print(p_all)
rm(pM,pN,pBlank,pP,pPT,pT,pE)

```


\newpage
\subsection{Figure 2A: Example site fit, 2B: Example site posteriors}
![Fit](/Users/kevinmurgas/Documents/Data+ project/Manuscript/Paper Figures (screenshots)/example_fit.png)
\newline \newline
![Posteriors](/Users/kevinmurgas/Documents/Data+ project/Manuscript/Paper Figures (screenshots)/example_posteriors.png)

\newpage
\subsection{Summary of neff and Rhat for each parameter}
```{r neffRhat, echo=FALSE, warning=FALSE}
# for each parameter (mu, nu, sigmaP, sigmaPT, sigmaT, sigmaE, lp__)
# reporting: mean n_eff, min n_eff, # n_eff<500, %, # n_eff<100, %
n_eff_mat <-
  data.frame(
  mu = mu_full$n_eff,
  nu = nu_full$n_eff,
  sigmaP = sigmaP_full$n_eff,
  sigmaPT = sigmaPT_full$n_eff,
  sigmaT = sigmaT_full$n_eff,
  sigmaE = sigmaE_full$n_eff,
  lp = lp_full$n_eff
  ) %>% gather(param,value,mu:lp) %>% group_by(param) %>% summarize(
  mean = mean(value),
  min = min(value),
  s500 = sum(value < 500),
  p500 = 100 * mean(value < 500),
  s100 = sum(value < 100),
  p100 = 100 * mean(value < 100),
  .groups = "drop"
  ) %>% arrange(match(.$param,c("mu","nu","sigmaP","sigmaPT","sigmaT","sigmaE","lp")))

kable(n_eff_mat, col.names=c("Parameter", "Mean n_eff", "Min n_eff", "# <500", "% <500", "# <100", "% <100"), digits = 2)

# again for Rhat
Rhat_mat <-
  data.frame(
  mu = mu_full$Rhat,
  nu = nu_full$Rhat,
  sigmaP = sigmaP_full$Rhat,
  sigmaPT = sigmaPT_full$Rhat,
  sigmaT = sigmaT_full$Rhat,
  sigmaE = sigmaE_full$Rhat,
  lp = lp_full$Rhat
  ) %>% gather(param,value,mu:lp) %>% group_by(param) %>% summarize(
  mean = mean(value),
  max = max(value),
  s11 = sum(value > 1.1),
  p11 = 100 * mean(value > 1.1),
  s101 = sum(value > 1.01),
  p101 = 100 * mean(value > 1.01),
  .groups = "drop"
  ) %>% arrange(match(.$param,c("mu","nu","sigmaP","sigmaPT","sigmaT","sigmaE","lp")))

kable(Rhat_mat, col.names=c("Parameter", "Mean Rhat", "Max Rhat", "# >1.1", "% >1.1", "# >1.01", "% >1.01"), digits = 2)
cat("  \n")
```

\subsection{Filter CpG sites}
Sites are removed if Rhat > 1.1 for lp (the log-posterior of the entire site fit), which indicates poor convergence. CpG sites mapping to the sex chromosomes X and Y are removed to eliminate sex-specific bias, and sites flagged as likely SNPs by the MethylToSNP package are removed as well.

Later, a gene filter is applied to remove all genes with <5 CpG sites.
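
The site filters above combine into a single set of rows to drop. A minimal sketch on toy data (the data frame `toy` and its column names are hypothetical) shows the union of the index sets and a logical keep-mask, which remains correct even when no site is flagged, whereas negative indexing with an empty index vector would return zero rows.

```r
# toy annotation table; ids, chromosomes and Rhat values are made up
toy <- data.frame(id = paste0("cg", 1:6),
                  chr = c("1", "X", "2", "Y", "3", "4"),
                  rhat_lp = c(1.00, 1.02, 1.30, 1.01, 1.05, 1.00),
                  stringsAsFactors = FALSE)

qc_inds <- which(toy$rhat_lp > 1.1)          # poorly converged fits
xy_inds <- which(toy$chr %in% c("X", "Y"))   # sex chromosomes
remove_inds <- union(qc_inds, xy_inds)

# logical mask: TRUE for rows to keep
keep <- !(seq_len(nrow(toy)) %in% remove_inds)
filtered <- toy[keep, ]
filtered$id
```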

```{r filter_sites, echo=FALSE, warning=FALSE}
# indices to filter out sites with Rhat > 1.1 (lp)
QCinds <- which(lp_full$Rhat > 1.1)
# indices to filter out X and Y chromosome
XYinds <- which(FullAnnotation$CHR %in% c("X","Y"))

# use MethylToSNP package to detect likely SNPs
# (from paper: LaBarre et al. 2019, Epigenetics & Chromatin)
# the methyltosnp_results file should exist from running StanAnalysis.Rmd
snps <- read.csv(here("methyltosnp_results.csv"))
rownames(snps) <- snps[,1]
snps<-snps[,-1]
SNPinds <- which(FullAnnotation$IlmnID %in% rownames(snps))

remove_inds = union(QCinds, union(XYinds, SNPinds))

# build a logical keep-mask rather than using negative indexing:
# x[-integer(0),] would drop every row if no site were flagged
keep_mask <- !(seq_len(nrow(FullAnnotation)) %in% remove_inds)

# remove the corresponding rows from the results dataframes
FullAnnotation <- FullAnnotation[keep_mask,]
mu_full <- mu_full[keep_mask,]
nu_full <- nu_full[keep_mask,]
sigmaP_full <- sigmaP_full[keep_mask,]
sigmaPT_full <- sigmaPT_full[keep_mask,]
sigmaT_full <- sigmaT_full[keep_mask,]
sigmaE_full <- sigmaE_full[keep_mask,]
lp_full <- lp_full[keep_mask,]
CpGscore_full <- CpGscore_full[keep_mask,]
```

\subsection{Figure 2C: Posterior medians over all CpG sites}
```{r posteriors, echo=FALSE, fig.fullwidth=TRUE, warning=FALSE, comment=NA}
# for each parameter, plot the distribution and overlay the fits
fBG <- function(x, a, m1, s1, m2, s2) a*dnorm(x,m1,s1) + (1-a)*dnorm(x,m2,s2) # bimodal gaussian
post_prior_histogram <- function(postvals,priorfun,bw,xl,titlestr) {
  # histogram with prior curve overlaid
  plot_df <- data.frame(val=postvals)
  p1 <- ggplot(plot_df, aes(x=val)) +
    geom_histogram(aes(y=..density..), binwidth=bw, position="dodge", fill="lightgray") +
    stat_function(fun=function(x) priorfun(x), geom="line", color="black") +
    ggtitle(titlestr) + xlim(xl) + theme_classic() + theme(legend.position = "none") + xlab(titlestr) + ylab("Freq.")
  p1
}

# plot mu,nu,sigmaE,sigmaP,sigmaPT,sigmaT with priors overlaid
pM <- post_prior_histogram(mu_full$p50, function(x) 1.3*fBG(x,0.3557,-3.0675, 1.3027, 1.1290, 2.4110), 0.25, c(-5,5), "Mu") + theme(aspect.ratio = 1)
pN <- post_prior_histogram(nu_full$p50, function(x) 1.5*dcauchy(x,-0.04051, 0.5799), 0.25, c(-5,5), "Nu") + theme(aspect.ratio = 1)
pE <- post_prior_histogram(sigmaE_full$p50, function(x) 3.3*dgamma(x,2,2), 0.05, c(0,2), "SigmaE") + theme(aspect.ratio = 1)
pP <- post_prior_histogram(sigmaP_full$p50, function(x) 1.52*dgamma(x, 2.7693, 7.0029), 0.05, c(0,2), "SigmaP") + theme(aspect.ratio = 1)
pPT <- post_prior_histogram(sigmaPT_full$p50, function(x) 1.6*dgamma(x, 2.1457, 2.4819), 0.05, c(0,2), "SigmaPT") + theme(aspect.ratio = 1)
pT <- post_prior_histogram(sigmaT_full$p50, function(x) 0.97*dgamma(x, 3.3643, 10.430), 0.05, c(0,2), "SigmaT") + theme(aspect.ratio = 1)

p_all <- ggarrange(pM,pN,pE,pP,pPT,pT,hjust=0,nrow=2,ncol=3)
print(p_all)
rm(pM,pN,pE,pP,pPT,pT)

ggsave(filename = "posteriors.svg", plot=p_all)
```

\newpage
\section{Model Fit Results}
\subsection{Fig 3C: Parameter Distributions}
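Histograms show the posterior median of each parameter at each CpG site (squared for the sigma parameters), with the rescaled prior overlaid as a black curve.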
```{r Params_site, echo=FALSE, fig.fullwidth=TRUE, warning=FALSE, comment=NA}
genemask = !(FullAnnotation$UCSC_RefGene_Name=="")
cat(sprintf("N sites: total=%i, gene-associated=%i, non-gene=%i  \n",length(genemask),sum(genemask),length(genemask)-sum(genemask)))

fBG <- function(x, a, m1, s1, m2, s2) a*dnorm(x,m1,s1) + (1-a)*dnorm(x,m2,s2) # bimodal gaussian
post_prior_histogram <- function(postvals,priorfun,bw,xl,titlestr) {
  # histogram and prior curve overlaid
  plot_df <- data.frame(val=postvals)
  p1 <- ggplot(plot_df, aes(x=val)) +
    geom_histogram(aes(y=..density..), binwidth=bw) +
    stat_function(fun=function(x) bw*priorfun(x),geom="line",color="black") +
    ggtitle(titlestr) + xlim(xl) + theme_classic() + theme(legend.position = "none") + xlab(titlestr) + ylab("Freq.")
  p1
}

# plot hierarchical model posterior medians for mu,nu,sigmaE,sigmaP,sigmaPT,sigmaT; with priors overlaid
pM <- post_prior_histogram(mu_full$p50, function(x) 10*fBG(x,0.3557,-3.0675, 0.75214,1.1290, 1.39197), 0.1, c(-5,5), "Mu")
pN <- post_prior_histogram(nu_full$p50, function(x) 15*dcauchy(x,-0.04051, 0.5799), 0.1, c(-5,5), "Nu")
#pE <- post_prior_histogram(sigmaE_full$p50, function(x) 4*dgamma(x,2,2), 0.05, c(0,2), "SigmaE")
#pP <- post_prior_histogram(sigmaP_full$p50^2, function(x) 2*dgamma(x, 2.7693, 7.0029), 0.05, c(0,2), "SigmaP")
#pPT <- post_prior_histogram(sigmaPT_full$p50^2, function(x) 1.8*dgamma(x, 2.1457, 2.4819), 0.05, c(0,2), "SigmaPT")
#pT <- post_prior_histogram(sigmaT_full$p50^2, function(x) 1.2*dgamma(x, 3.3643, 10.430), 0.05, c(0,2), "SigmaT")
pE <- post_prior_histogram(sigmaE_full$p50^2, function(x) 100*dgamma(sqrt(x),2,2), 0.05, c(0,2), "SigmaE^2")
pP <- post_prior_histogram(sigmaP_full$p50^2, function(x) 250*dgamma(sqrt(x), 2.7693, 7.0029), 0.01, c(0,1), "SigmaP^2")
pPT <- post_prior_histogram(sigmaPT_full$p50^2, function(x) 40*dgamma(sqrt(x), 2.1457, 2.4819), 0.05, c(0,2), "SigmaPT^2")
pT <- post_prior_histogram(sigmaT_full$p50^2, function(x) 220*dgamma(sqrt(x), 3.3643, 10.430), 0.01, c(0,1), "SigmaT^2")

# arrange each parameter as a subplot # ,labels=c("M","B","E","P","PT","T","lP","lPT","lT")
p_all <- ggarrange(pM,pN,pE,pP,pPT,pT,hjust=0,nrow=2,ncol=3)
print(p_all)
rm(pM,pN,pE,pP,pPT,pT)
```

\subsection{CpG Conservation score}
We want to understand which sites of DNA methylation are fundamentally conserved within tumors, i.e. show lower variation in methylation within a tumor (sigmaT) relative to the inter-patient variation in normal healthy tissue (sigmaP). We therefore use the log-ratio (base 2) of the corresponding variances, $\sigma_P^2/\sigma_T^2$, as a relative conservation score, which is positive when tumor variation is lower than normal variation. For each site we take the median of the posterior distribution of this score, constructed from the sigmaP and sigmaT posteriors.

$$\text{score}_\text{med} = \text{Median} \left( \log_2\frac{\sigma_P^2}{\sigma_T^2} \right) $$
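
As an illustration of the score definition, the sketch below computes the median log-ratio from posterior draws for one site. The draws here are stand-ins generated from gamma distributions so that the example runs without the Stan output; pairing the sigmaP and sigmaT draws one-to-one is an assumption of this sketch, not a statement about how the stored scores were computed.

```r
set.seed(1)
n_draws <- 7200
# stand-in posterior draws for one site (hypothetical; drawn from gamma
# distributions only so the example is self-contained)
sigmaP_draws <- rgamma(n_draws, shape = 2.7693, rate = 7.0029)
sigmaT_draws <- rgamma(n_draws, shape = 3.3643, rate = 10.430)

# per-draw log2 variance ratio, then its posterior median
score_draws <- log2(sigmaP_draws^2 / sigmaT_draws^2)
score_med   <- median(score_draws)
score_med
```

Because $\log_2(\sigma_P^2/\sigma_T^2) = 2\log_2(\sigma_P/\sigma_T)$, a score of 2 corresponds to a 4-fold lower variance (2-fold lower standard deviation) within tumors than across patients in normal tissue.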

\subsection{Fig 3A: Conservation score}
```{r Fig3A, echo=FALSE, fig.height=2, fig.width=6.5, fig.fullwidth=TRUE, warning=FALSE}

# dodged (side-by-side) histograms
post_histogram <- function(postvals,mask,bw,titlestr) {
  # density histogram with side-by-side bars colored by gene association
  plot_df <- data.frame(val=postvals, gene_assoc=mask)
  p1 <- ggplot(plot_df, aes(x=val)) +
    geom_histogram(aes(y=..density..,fill=gene_assoc), binwidth=bw, position="dodge") +
    ggtitle(titlestr) + theme_classic() + theme(legend.position = "none") + xlab(titlestr) + ylab("Freq.")
  p1 # return the plot explicitly
}

# overlaid transparent histograms
# post_histogram <- function(postvals,mask,bw,titlestr) {
#   # transparent histogram colored by gene-associated sites
#   plot_df <- data.frame(val=postvals, gene_assoc=mask)
#   gene_assoc_df <- plot_df %>% filter(gene_assoc)
#   not_gene_assoc_df <- plot_df %>% filter(!gene_assoc)
#   p1 <- ggplot(plot_df, aes(x=val)) +
#     geom_histogram(data=not_gene_assoc_df, aes(y=..density..,fill="#00BFC4",alpha=0.4), binwidth=bw) +
#     geom_histogram(data=gene_assoc_df, aes(y=..density..,fill="#F8766D",alpha=0.4), binwidth=bw) +
#     ggtitle(titlestr) + xlab(titlestr) + ylab("Density") + theme_classic() 
#   p1
# }

# plot conservation score
genemask = !(FullAnnotation$UCSC_RefGene_Name=="")
p_scores <- post_histogram(CpGscore_full$median,genemask,0.2,"CpG Conservation score") +
  theme(aspect.ratio=1/4,legend.position = "none") +
  scale_x_continuous(limits=c(-8,8),breaks=c(-8,-4,0,4,8)) + 
  scale_y_continuous(breaks=c(0,0.1,0.2,0.3,0.4,0.5))
print(p_scores)
ggsave("CpGscores_dodge.svg")

# same data but as boxplot
df = data.frame(med=CpGscore_full$median, mask=genemask)
p_scores <- ggplot(data=df) + geom_boxplot(aes(x=med, y = mask, fill = mask), outlier.shape = NA) +
  scale_x_continuous(limits=c(-8,8),breaks=c(-8,-4,0,4,8)) + xlab("CpG score") + ylab("Gene-associated") + theme_classic() + theme(aspect.ratio=1/4,legend.position = "none")
print(p_scores)

# statistics: two-sample wilcoxon rank-sum
wilcox.test(x=CpGscore_full$median[genemask], y=CpGscore_full$median[!genemask], alternative = "greater")

# Fig 4B: plot gene scores and adjusted bootstrap distribution
p1 <- permed_data %>% ggplot(aes(x=score_median)) + geom_histogram(aes(y=..count..), binwidth=0.2, fill="lightgray",color = "grey50") + theme_classic() + ylab("Number of genes") + theme(aspect.ratio = 1) + scale_x_continuous(breaks = c(-2,0,2,4,6), limits = c(-3,7)) + scale_y_continuous(breaks = c(0,1000,2000,3000), limits = c(0,3000))
p2 <- permed_data %>% ggplot(aes(x=score_median_v3)) + geom_histogram(aes(y=..count..), binwidth=0.02, fill="lightgray", color="grey50") + xlim(c(0,1)) + theme_classic() + ylab("Number of genes") + theme(aspect.ratio = 1)
p_genes <- ggarrange(p1,p2,nrow=1,ncol=2)
ggsave("gene_scores_median.svg", plot=p1)
ggsave("gene_scores_bootstrap.svg", plot=p2)

# save significant bootstrapped genes as csv
sig_genes = permed_data %>% filter(score_median_v3>=0.95) %>% pull(UCSC_RefGene_Name)
write_csv(data.frame(Gene = sig_genes), file = "sig_cons_genes.csv")
# for Reactome pathway analysis at https://reactome.org/PathwayBrowser
```


\newpage
\subsection{Intra-gene conservation}
We examine intra-gene conservation in two ways: tables comparing CpG sites grouped by gene regulatory region, and a CpG-distance analysis relative to the gene start.

\subsection{CpG Island Relation (Fig 3B)}
Table comparing different groups of CpG sites based on CpG Island relationship: Island, Shore, Shelf, Sea
```{r CpGIslandRelation, echo=FALSE, warning=FALSE}
#, include=TRUE, results='asis', fig.keep='all', fig.show='asis', fig.height=3, fig.width=5

# for a table summary of sigmas and conservation score
prepped_data %>%
  group_by(`Island relation` = Relation_to_UCSC_CpG_Island) %>%
  summarise_at(vars(sigmaP:sigmaT,score_median), mean, na.rm = TRUE) %>%
  dplyr::rename(ir = `Island relation`) %>%
  mutate(ir = case_when(is.na(ir) ~ "Sea",
                        str_detect(ir, "^S") ~ str_replace(ir, "^S_", "South "),
                        str_detect(ir, "^N") ~ str_replace(ir, "^N_", "North "),
                        TRUE ~ ir)) %>%
  dplyr::rename(`Island relation` = ir) %>%
  arrange(c(1,4,2,5,3,6)) %>% # for island, N shore, S shore, N shelf, S shelf, sea
  rename(`CpG Conservation Score`=score_median) %>%
  gt() %>%
  fmt_number(columns = vars(sigmaP, sigmaPT, sigmaT, `CpG Conservation Score`), decimals = 3) %>%
  tab_spanner(label = "Hierarchical Variances", columns = vars(sigmaP, sigmaPT, sigmaT)) %>% cols_align("center")

# stats for cpg island relations
my_data = prepped_data %>% select(score_median, Relation_to_UCSC_CpG_Island) %>%
  group_by(ir = Relation_to_UCSC_CpG_Island) %>%
  mutate(ir = case_when(is.na(ir) ~ "Sea",
                        str_detect(ir, "^S") ~ str_replace(ir, "^S_", ""),
                        str_detect(ir, "^N") ~ str_replace(ir, "^N_", ""),
                        TRUE ~ ir))
# summarize mean of each group and convert to fold change
my_data %>% summarise_at(vars(score_median), mean, na.rm = TRUE) %>% mutate(fc = 2^score_median)
# run Kruskal-Wallis test (nonparametric one-way ANOVA) and Tukey-type post-hoc comparisons
kruskal.test(score_median ~ ir, data = my_data)
#install.packages("MultNonParam")
library(MultNonParam)
tukey.kruskal.test(my_data$score_median, my_data$ir, alpha = 0.001)

# Fig 3B: plot CpG island relation score distributions
prepped_data %>% select(Name, Relation_to_UCSC_CpG_Island, score_median) %>%
  dplyr::rename(ir = Relation_to_UCSC_CpG_Island, score=score_median) %>%
  mutate(ir = case_when(is.na(ir) ~ "Sea",
                        str_detect(ir, "^S") ~ str_replace(ir, "^S_", ""),
                        str_detect(ir, "^N") ~ str_replace(ir, "^N_", ""),
                        TRUE ~ ir)) %>%
  #mutate(ir = factor(ir, levels = c("Island", "Shore", "Shelf", "Sea"))) %>%
  #ggplot() + geom_violin(aes(x=ir,y=score,fill=ir),draw_quantiles = 0.5) + theme_classic() + 
  mutate(ir = factor(ir, levels = c("Sea", "Shelf", "Shore", "Island"))) %>%
  ggplot() + geom_boxplot(aes(y=ir,x=score, fill=ir), outlier.shape = NA) + theme_classic() +
  scale_x_continuous(limits=c(-5,5),breaks=c(-4,-2,0,2,4)) + theme(aspect.ratio=1/4,legend.position = "none")
# ggsave("CpGscore_islands_violin.svg")
ggsave("CpGscore_islands_boxplot.svg")
```

\subsection{Functional Regions (Fig 3C)}
Table comparing different groups of CpG sites based on functional regions: 5'UTR, TSS1500, TSS200, 1stExon, Body, ExonBnd, 3'UTR
```{r FunctionalRegions, echo=FALSE, warning=FALSE}
# stats for functional regions
summarize_conservation <- function(data,str) {
  indsin <- str_detect(data$UCSC_RefGene_Group, str) & !is.na(data$UCSC_RefGene_Group)
  indsout <- !str_detect(data$UCSC_RefGene_Group, str) | is.na(data$UCSC_RefGene_Group)
  if (str=="any") {
    indsin <- !is.na(data$UCSC_RefGene_Group)
    indsout <- is.na(data$UCSC_RefGene_Group)
  }
  data.frame(cons_in=mean(data$score_median[indsin], na.rm=TRUE),
         cons_out=mean(data$score_median[indsout],na.rm=TRUE),
         nsites=sum(indsin),
         nout=sum(indsout),
         pval=wilcox.test(data$score_median[indsin],data$score_median[indsout])$p.value)
}

temp <- bind_rows({prepped_data %>% summarize_conservation("any") %>% mutate(region="Gene-Associated")},
          {prepped_data %>% summarize_conservation("TSS1500") %>% mutate(region="TSS1500")},
          {prepped_data %>% summarize_conservation("TSS200") %>% mutate(region="TSS200")},
          {prepped_data %>% summarize_conservation("5'UTR|5UTR|5URT") %>% mutate(region="5'UTR")},
          {prepped_data %>% summarize_conservation("1stExon") %>% mutate(region="1stExon")},
          {prepped_data %>% summarize_conservation("ExonBnd") %>% mutate(region="ExonBnd")},
          {prepped_data %>% summarize_conservation("Body") %>% mutate(region="Body")},
          {prepped_data %>% summarize_conservation("3'UTR|3UTR") %>% mutate(region="3'UTR")}) %>%
  select(c(region,cons_in,cons_out,nsites,nout,pval)) %>%
  mutate(fold_change = 2^cons_in, fold_diff = 2^(cons_in - cons_out)) # calculating fold difference (2^ (log2 difference))
temp %>% gt() %>% fmt_number(columns = vars(cons_in,cons_out,pval), decimals = 3)

# multiple regression stats
FRdata = data.frame(CpGscore = prepped_data$score_median ) %>%
  mutate(FR = pull(prepped_data,UCSC_RefGene_Group)) %>%
  mutate(FR=if_else(is.na(FR),"NA",FR)) %>%
  mutate(iTSS200 = str_detect(FR,"TSS200"),
         i5UTR = str_detect(FR,"5'UTR|5UTR|5URT"),
         iBody = str_detect(FR,"Body"),
         i3UTR = str_detect(FR,"3'UTR|3UTR"),
         iTSS1500 = str_detect(FR,"TSS1500"),
         i1stEx = str_detect(FR,"1stExon"),
         iExBnd = str_detect(FR,"ExonBnd"))

# lm model
fr.lm = lm(formula = CpGscore ~ iTSS200 + i5UTR + iBody + i3UTR + iTSS1500 + i1stEx + iExBnd, data = FRdata)
summary(fr.lm)

# Fig 3C: plot functional region score distributions
temp <- prepped_data %>% select(Name, UCSC_RefGene_Group, score_median) %>%
  separate_rows(UCSC_RefGene_Group, sep=";") %>%
  distinct() %>%
  dplyr::rename(fr = UCSC_RefGene_Group, score=score_median) %>%
  mutate(fr = case_when(fr=="" ~ "NA",
                        str_detect(fr, "5URT") ~ "5'UTR",
                        str_detect(fr, "3UTR") ~ "3'UTR",
                        TRUE ~ fr)) %>%
  distinct()
#regions = c("TSS1500","Body", "3'UTR","1stExon","TSS200","5'UTR","ExonBnd")
regions = rev(c("TSS200","1stExon", "5'UTR","TSS1500","Body","3'UTR","ExonBnd"))
temp %>%
  mutate(fr = factor(fr, levels = regions)) %>%
  filter(fr %in% regions) %>%
  #ggplot() + geom_violin(aes(x=fr,y=score,fill=fr), draw_quantiles = 0.5) + theme_classic() +
  ggplot() + geom_boxplot(aes(y=fr,x=score, fill=fr), outlier.shape = NA) + theme_classic() +
  scale_x_continuous(limits=c(-5,5),breaks=c(-4,-2,0,2,4)) + theme(aspect.ratio=1/4,legend.position = "none") + geom_vline(xintercept = 0, color="black", alpha=0.2)
# ggsave("CpGscore_regions_violin.svg")
ggsave("CpGscore_regions_boxplot.svg")
#rm(temp)

```

\newpage
\subsection{Figure 3D: CpG Distance Analysis}
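CpG distances are measured from the gene start and corrected for gene direction (negative = upstream of the gene start), using gene coordinates from the Ensembl gene database (v75, hg19). \newline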
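
The direction correction can be sketched on a toy gene. The helper `signed_gene_distance` is hypothetical and assumes the strand is coded as 1 or -1, as in the Ensembl SEQSTRAND column used in the chunk below: the gene start is the lower coordinate on the + strand and the higher coordinate on the - strand, and distances are positive in the direction of transcription.

```r
# Hypothetical helper mirroring the strand logic used for the figure.
signed_gene_distance <- function(cpg_pos, gene_seq_start, gene_seq_end, strand) {
  # gene start: lower coordinate on the + strand, higher coordinate on the - strand
  gene_start <- ifelse(strand == 1, gene_seq_start, gene_seq_end)
  # multiplying by the strand flips the sign for genes on the - strand
  strand * (cpg_pos - gene_start)
}

# toy gene spanning 1000-2000 with CpGs at positions 900 and 2100
d_plus  <- signed_gene_distance(c(900, 2100), 1000, 2000, strand = 1)
d_minus <- signed_gene_distance(c(900, 2100), 1000, 2000, strand = -1)

# bin into 100 bp windows, as done for the figure
bins <- cut(d_plus, seq(-1500, 9000, by = 100), include.lowest = TRUE)
```

On the + strand the CpG at 900 lies 100 bp upstream of the gene start; on the - strand the same CpG lies 1100 bp downstream of it, so the same genomic position can fall on opposite sides of the gene start depending on direction.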
```{r Fig3D, echo=FALSE, warning=FALSE, message=FALSE, fig.width=7, fig.height=3, comment=NA}
# load ensembl db
library(EnsDb.Hsapiens.v75) # ensembl annotation with gene position (hg19 version)
edb <- EnsDb.Hsapiens.v75
geneKeys <- keys(edb, keytype="GENENAME") %>% intersect(unique(gened_data$UCSC_RefGene_Name))
edbData<- AnnotationDbi::select(edb, keys=geneKeys, keytype="GENENAME",
                                columns=c("GENEBIOTYPE", "GENESEQSTART", "GENESEQEND", "SEQNAME", "SEQSTRAND")) %>%
  filter(SEQNAME %in% c(as.character(1:22),"X","Y")) %>%
  mutate(GENESTART = GENESEQSTART*(SEQSTRAND==1) + GENESEQEND*(SEQSTRAND==-1)) %>%
  select(-c(GENESEQSTART,GENESEQEND))

# calculate CpG distance based on Ensembl gene start position and direction
distanced_data <- gened_data %>%
  filter(!is.na(UCSC_RefGene_Name)) %>%
  filter(UCSC_RefGene_Name %in% geneKeys) %>%
  mutate(EDBCHR=edbData$SEQNAME[match(UCSC_RefGene_Name,edbData$GENENAME)],
         EDBSTART=edbData$GENESTART[match(UCSC_RefGene_Name,edbData$GENENAME)],
         EDBDIR=edbData$SEQSTRAND[match(UCSC_RefGene_Name,edbData$GENENAME)],
         EDBTYPE=edbData$GENEBIOTYPE[match(UCSC_RefGene_Name,edbData$GENENAME)]
         ) %>%
  filter(CHR==EDBCHR) %>%
  mutate(cpgdist = if_else(EDBDIR==1, MAPINFO-EDBSTART, EDBSTART-MAPINFO))

distance_length <- distanced_data %>%
  filter(!is.na(UCSC_RefGene_Name)) %>%
  select(CHR, UCSC_RefGene_Name, cpgdist, c(score_median)) %>%
  mutate(location_bin = cut(cpgdist, c(seq(-1500, 9000, by = 100)), include.lowest = TRUE)) %>%
  group_by(location_bin) %>%
  summarise(mean_score_median = mean(score_median), n_obs = n(), .groups='drop') %>%
  mutate(location_bin = as.numeric(str_extract(location_bin, "(?<=,)(.*)(?=])")))

# plot for conservation scores
distance_length %>% drop_na() %>%
  ggplot(aes(x = location_bin, y = mean_score_median)) +
  geom_point(color="black") +
  theme_bw() + theme(legend.position = "none", aspect.ratio = 3/8) +
  xlab("Distance from gene start site (binned to 100 bp)") + ylab("Average Conservation Score") +
  scale_x_continuous(breaks = seq(-1500,10000,1500))
cat("\n")
ggsave("CpGdist.svg")
```

\newpage
\section{Gene Conservation Scores}
Each gene is scored by the average conservation score of all CpG sites within the gene.
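
A minimal sketch of the gene-level scoring on toy data, combined with the gene filter of at least 5 CpG sites mentioned earlier. The data frame `toy`, its gene names and its scores are made up; the actual gene scores are read from the pre-processed files.

```r
# toy CpG-level scores; gene names and values are made up for illustration
toy <- data.frame(gene  = c(rep("G1", 5), rep("G2", 2)),
                  score = c(1, 2, 3, 4, 5, -1, 1),
                  stringsAsFactors = FALSE)

n_sites <- table(toy$gene)                    # number of CpG sites per gene
keep    <- names(n_sites)[n_sites >= 5]       # gene filter: at least 5 sites
kept    <- toy[toy$gene %in% keep, ]

# gene score = mean CpG conservation score over the gene's sites
gene_scores <- tapply(kept$score, kept$gene, mean)
gene_scores
```

Here G2 is dropped because it has only 2 sites, and G1 receives the mean of its five site scores.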
\subsection{Fig 4A: Example Genes: TUBA1A, TTN, HLA-A}
```{r examplegenes, echo=FALSE, warning=FALSE, message=FALSE, fig.width=7, fig.height=2, comment=NA}
# for these genes, show the distribution of CpG scores over all sites (gray), overlay a histogram of the CpG sites in the gene (blue, rescaled by norm), and mark the gene mean with a dashed blue vertical line

plotGeneScore <- function(geneStr,norm) {
  CpGscores <- gened_data %>% filter(str_detect(UCSC_RefGene_Name,paste0("^",geneStr,"$"))) %>% pull(score_median)
  ggplot(prepped_data) + geom_histogram(aes(x=score_median, y=..count../nrow(prepped_data)), binwidth=0.05,fill="lightgray") + ggtitle(sprintf("%s: CpG scores (%i sites)",geneStr, length(CpGscores))) +
    scale_x_continuous(limits=c(-5,5), breaks=c(-4,-2,0,2,4)) +
    scale_y_continuous(limits=c(0,0.03),breaks=c(0,0.01,0.02,0.03)) +
    xlab("CpG conservation score") + ylab("Freq.") + theme_classic() + theme(legend.position = "none") +
    #geom_vline(xintercept=CpGscores, color="red", alpha=0.5, size=0.2) +
    geom_histogram(data=data.frame(score=CpGscores),aes(x=score,y=norm*..count../length(CpGscores)),binwidth=0.05, fill="blue",alpha=0.5) +
    geom_vline(xintercept=mean(CpGscores), color="blue",linetype="dashed", size=1)
}

cat("\n")
# Fig 4A: Example genes CpG site distributions
plotGeneScore("TUBA1A",0.2)
#ggsave("genescore_TUBA1A.svg")
cat("\n")
plotGeneScore("TTN",0.26)
#ggsave("genescore_TTN.svg")
cat("\n")
plotGeneScore("HLA-A",0.18)
#ggsave("genescore_HLA-A.svg")
cat("\n")
plotGeneScore("APC",0.12) + theme(aspect.ratio = 1)
#ggsave("genescore_APC.svg")
cat("\n")
plotGeneScore("KRAS",0.18) + theme(aspect.ratio = 1)
#ggsave("genescore_KRAS.svg")

# stats for each gene
# one-sample Wilcoxon signed-rank test against a location of 0
# TUBA1A
gened_data %>% filter(str_detect(UCSC_RefGene_Name,paste0("^TUBA1A$"))) %>%
  summarize(n = n(), mean=mean(score_median)) %>% mutate(fc = 2^mean)
gened_data %>% filter(str_detect(UCSC_RefGene_Name,paste0("^TUBA1A$"))) %>% pull(score_median) %>% wilcox.test(mu = 0, alternative = "two.sided")
# TTN
gened_data %>% filter(str_detect(UCSC_RefGene_Name,paste0("^TTN$"))) %>%
  summarize(n = n(), mean=mean(score_median)) %>% mutate(fc = 2^mean)
gened_data %>% filter(str_detect(UCSC_RefGene_Name,paste0("^TTN$"))) %>% pull(score_median) %>% wilcox.test(mu = 0, alternative = "two.sided")
# HLA-A
gened_data %>% filter(str_detect(UCSC_RefGene_Name,paste0("^HLA-A$"))) %>%
  summarize(n = n(), mean=mean(score_median)) %>% mutate(fc = 2^mean)
gened_data %>% filter(str_detect(UCSC_RefGene_Name,paste0("^HLA-A$"))) %>% pull(score_median) %>% wilcox.test(mu = 0, alternative = "two.sided")
```

\subsection{Fig 4B: Gene-average Conservation Score}
Examine the distribution of gene-average conservation scores, as well as the distribution of adjusted-bootstrap p-values.
```{r Gene_Score, echo=FALSE, fig.fullwidth=TRUE, warning=FALSE, comment=NA}

permed_data %>% ggplot() + geom_histogram(aes(x=score_median, y=..density..), binwidth=0.2, color="gray",fill="lightgray") + theme_classic()

permed_data %>% ggplot() + geom_histogram(aes(x=score_median_v3, y=..count..), binwidth=0.02, color="gray",fill="lightgray") + theme_classic() + geom_vline(aes(xintercept = 0.95), color="red", linetype="dashed")
```

\subsection{Fig 4C: CRC Cancer Genes Databases}
Gene conservation scores are compared across sets of genes implicated in CRC, based on common mutations from the COSMIC database and the Atlas of Genetics in Oncology (AGO), as well as a cancer-essential gene list based on an experimental CRISPRi knockout library (labeled DEPMAP in the figure). For comparison, we also include two gene sets from MSigDB which should be irrelevant to cancer: a set of cardiac progenitor differentiation genes (MSigDB: C2: WP_CARDIAC_PROGENITOR_DIFFERENTIATION) and a set of neuron marker genes (MSigDB: C2: LEIN_NEURON_MARKERS).
```{r cancergenes, echo=FALSE, warning=FALSE, message=FALSE, fig.width=7, fig.height=2, comment=NA}

# CRC gene database 1: COSMIC
cancergenes1 <- read_csv("/Users/kevinmurgas/Documents/Data+ project/Cancer gene lists/SangerCensus_COSMIC_CRC.csv")
# CRC gene database 2: AGO
cancergenes2 <- read_csv("/Users/kevinmurgas/Documents/Data+ project/Cancer gene lists/CRCgenes_atlasgeneticsoncology.csv")
essentialgenes <- read_delim("/Users/kevinmurgas/Documents/Data+ project/Manuscript/Bioinformatics Revision/core-essential-genes-sym_HGNCID.txt", delim = "\t", col_names = F)

# additional non-relevant gene sets from MSigDB (LEIN_NEURON, CARDIAC_PROGENITOR)
# MSigDB: C2: LEIN_NEURON_MARKERS
neuron <- read_delim("/Users/kevinmurgas/Documents/Data+ project/geneset_neuron.gmt", delim="\t") %>% colnames()
# MSigDB: C2: WP_CARDIAC_PROGENITOR_DIFFERENTIATION
cardprog <- read_delim("/Users/kevinmurgas/Documents/Data+ project/geneset_cardprog.gmt", delim="\t") %>% colnames()

# compute % overlap of database genes and significant genes (By adj bootstrap)
# n1 = sum(permed_data %>% filter(score_median_v3>0.95) %>% pull(UCSC_RefGene_Name) %in% (cancergenes1 %>% pull(`Gene Symbol`)))
# n2 = sum(permed_data %>% filter(score_median_v3>0.95) %>% pull(UCSC_RefGene_Name) %in% (cancergenes2 %>% pull(CancerGenes)))

# build one data frame per gene set for plotting: all genes, COSMIC, AGO, DEPMAP, neuron, cardiac
perm_temp <- permed_data %>% select(c(UCSC_RefGene_Name, score_median,score_median_v3,n)) %>% mutate(score_median = 2*score_median)
genes_dbs <- rbind(perm_temp %>% mutate(group="all genes"),
              perm_temp %>% filter(UCSC_RefGene_Name %in% (cancergenes1 %>% pull(`Gene Symbol`))) %>% mutate(group="COSMIC"),
              perm_temp %>% filter(UCSC_RefGene_Name %in% (cancergenes2 %>% pull(CancerGenes))) %>% mutate(group="AGO"),
              perm_temp %>% filter(UCSC_RefGene_Name %in% (essentialgenes %>% pull(X1))) %>% mutate(group="DEPMAP"),
              perm_temp %>% filter(UCSC_RefGene_Name %in% neuron) %>% mutate(group="neuron"),
              perm_temp %>% filter(UCSC_RefGene_Name %in% cardprog) %>% mutate(group="cardiac")
              )
# Fig 4C: plot distribution
pDB <- genes_dbs %>% mutate(group = factor(group, levels = c("all genes", "neuron", "cardiac", "AGO", "COSMIC", "DEPMAP"))) %>%
  #ggplot() + geom_violin(aes(x=score_median,y=group,fill=group), draw_quantiles = 0.5) +
  ggplot() + geom_boxplot(aes(x=score_median,y=group,fill=group), outlier.shape = NA) +
  scale_x_continuous(limits=c(-5,7), breaks = c(-4, -2,0, 2, 4, 6)) + theme_classic() + ylab("Gene set") + xlab("Gene Conservation Score") + geom_vline(xintercept=0, color="black", alpha=0.2) + theme(aspect.ratio=60/156.3,legend.position = "none") + scale_fill_manual(values=c("gray","red", "red", "deepskyblue", "deepskyblue", "deepskyblue"))
# # add hashmarks for each gene (doesn't look great)
# marks <- genes_dbs %>% mutate(x=score_median, y=if_else(group=="COSMIC",3,if_else(group=="AGO",2,1))) %>% select(x,y)
# marks2 <- genes_dbs %>% filter(score_median_v3 > 0.95) %>% mutate(x=score_median, y=if_else(group=="COSMIC",3,if_else(group=="AGO",2,1))) %>% select(x,y)
# pDB + geom_segment(data=marks2, aes(x=x, xend=x, y=y, yend=y+0.2), color="red") + geom_segment(data=marks, aes(x=x, xend=x, y=y, yend=y+0.1), color="black", alpha=0.5)
pDB
# ggsave("database_violins.svg")
ggsave("database_boxplot_depmap.svg")

genes_dbs %>% group_by(group) %>% summarize(mean = mean(score_median))

# stats
wilcox.test(x = genes_dbs$score_median[genes_dbs$group=="COSMIC"], y = genes_dbs$score_median[genes_dbs$group=="all genes"], alternative = "greater")
wilcox.test(x = genes_dbs$score_median[genes_dbs$group=="AGO"], y = genes_dbs$score_median[genes_dbs$group=="all genes"], alternative = "greater")
wilcox.test(x = genes_dbs$score_median[genes_dbs$group=="DEPMAP"], y = genes_dbs$score_median[genes_dbs$group=="all genes"], alternative = "greater")
wilcox.test(x = genes_dbs$score_median[genes_dbs$group=="cardiac"], y = genes_dbs$score_median[genes_dbs$group=="all genes"], alternative = "less")
wilcox.test(x = genes_dbs$score_median[genes_dbs$group=="neuron"], y = genes_dbs$score_median[genes_dbs$group=="all genes"], alternative = "less")
```


```{r Exit, include=FALSE}
knit_exit()
```