AlexsLemonade · jaclyn-taroni · Nov 4, 2019 · Nov 3, 2019 · Nov 3, 2019 · Nov 3, 2019
diff --git a/Dockerfile b/Dockerfile
@@ -15,7 +15,7 @@ RUN apt-get update -qq && apt-get -y --no-install-recommends install \
 
 # Required forinteractive sample distribution plots
 # map view is needed to create HTML outputs of the interactive plots
-RUN apt-get update -qq && apt-get -y --no-install-recommends install \    
+RUN apt-get update -qq && apt-get -y --no-install-recommends install \
     && install2.r --error \
     --deps TRUE \
     gdalUtils \
@@ -28,9 +28,9 @@ RUN apt-get update -qq && apt-get -y --no-install-recommends install \
     mapview
 
 # Installs packages needed for still treemap, interactive plots, and hex plots
-# Rtsne and umap are required for dimension reduction analyses 
+# Rtsne and umap are required for dimension reduction analyses
 # optparse is needed for passing arguments from the command line to R script
-RUN apt-get update -qq && apt-get -y --no-install-recommends install \    
+RUN apt-get update -qq && apt-get -y --no-install-recommends install \
     && install2.r --error \
     --deps TRUE \
     R.utils \
@@ -72,7 +72,7 @@ RUN R -e "devtools::install_github('timelyportfolio/d3treeR', ref = '0eaba7f1c64
 RUN R -e "devtools::install_github('clauswilke/colorblindr', ref = '1ac3d4d62dad047b68bb66c06cee927a4517d678', dependencies = TRUE)"
 
 # Required for sex prediction from RNA-seq data
-RUN apt-get update -qq && apt-get -y --no-install-recommends install \    
+RUN apt-get update -qq && apt-get -y --no-install-recommends install \
     && install2.r --error \
     --deps TRUE \
     glmnet \
@@ -86,21 +86,26 @@ RUN apt-get -y update && apt-get install -y \
    && rm -rf /var/lib/apt/lists/
 
 # Install for SNV comparison plots
-RUN apt-get update -qq && apt-get -y --no-install-recommends install \    
+RUN apt-get update -qq && apt-get -y --no-install-recommends install \
     && install2.r --error \
     --deps TRUE \
     UpSetR
 
 RUN R -e "devtools::install_github('const-ae/ggupset', ref = '7a33263cc5fafdd72a5bfcbebe5185fafe050c73', dependencies = TRUE)"
 
 # GGally and its required packages
-RUN apt-get update -qq && apt-get -y --no-install-recommends install \    
+RUN apt-get update -qq && apt-get -y --no-install-recommends install \
     && install2.r --error \
     lattice \
     rpart \
     class \
     MASS \
     GGally
 
+# Help display tables in R Notebooks
+RUN apt-get update -qq && apt-get -y --no-install-recommends install \
+    && install2.r --error \
+    flextable
+
 #### Please install your dependencies here
 #### Add a comment to indicate what analysis it is required for
diff --git a/analyses/sample-distribution-analysis/01-filter-across-types.R b/analyses/sample-distribution-analysis/01-filter-across-types.R
@@ -116,6 +116,9 @@ ggplot2::ggsave(
   height = 10
 )
 
+# TODO: rerun this when the primary_site column gets fixed
+# https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/214
+
 # data.frame with the location where each cancer type in the dataset is
 # expressed, sorted to show highest expression
 brain_location <- histologies_df %>%

diff --git a/analyses/sample-distribution-analysis/03-tumor-descriptor-and-assay-count.Rmd b/analyses/sample-distribution-analysis/03-tumor-descriptor-and-assay-count.Rmd
@@ -0,0 +1,262 @@
+---
+title: "Examine `tumor_descriptor` and `experimental_strategy` distributions"
+output: 
+  html_notebook:
+    toc: TRUE
+    toc_float: TRUE
+author: J. Taroni for ALSF CCDL
+date: 2019
+---
+
+In this notebook, we will explore the distribution of primary vs. other samples, as there are multiple samples from the same individual ([#155](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/155)).
+Here, "other" can refer to either samples/biospecimens from progressive disease or recurrence.
+
+We have independent sets of samples for genomic assays (e.g., WGS, WXS; described [here](https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/README.md#data-formats)) generated by @jashapiro for use in downstream analyses.
+(You can see more background in [this notebook](https://alexslemonade.github.io/OpenPBTA-analysis/analyses/independent-samples/00-repeated-samples.nb.html).)
+
+We can imagine that there are some analyses that may compare primary vs. recurrence that expect both WGS/WXS _and_ RNA-seq data.
+
+We have not yet looked at:
+
+* How many pairs of genomic and transcriptomic assays are there, i.e., RNA-Seq and WGS from the same timepoint for a participant?
+* What is breakdown by **histology** for cases where there are multiple samples from the same individual?
+
+```{r}
+library(dplyr)
+# this library will help display tables with smaller font for easy viewing
+# in the HTML setting
+library(flextable)
+```
+
+## Read in histologies file
+
+`pbta-histologies.tsv` contains all the clinical information.
+
+```{r}
+histology_file <- file.path("..", "..", "data", "pbta-histologies.tsv")
+histology_df <- readr::read_tsv(histology_file)
+```
+
+For WGS and WXS samples, we'll have tumor and normal to get the somatic calls.
+We'll limit this to tumor samples to avoid double-counting participants and we'll remove derived cell lines.
+
+```{r}
+tumor_df <- histology_df %>%
+  filter(sample_type == "Tumor",
+         composition == "Solid Tissue") 
+```
+
+## Number of each assay
+
+First, we'll examine how many of each type of assay we have (tumors only).
+This information is stored in the `experimental_strategy` column.
+
+```{r}
+tumor_df %>%
+  group_by(experimental_strategy) %>%
+  tally() %>%
+  arrange(desc(n)) %>% 
+  regulartable() %>%
+  fontsize(size = 12, part = "all")
+```
+
+## Primary vs. other (recurrence, progressive disease)
+
+We'll look at the breakdown of the `tumor_descriptor` column, separating the genomic assays from transcriptomic assays.
+
+```{r}
+tumor_descriptor_df <- tumor_df %>%
+  select(Kids_First_Participant_ID, 
+         experimental_strategy, 
+         tumor_descriptor) %>%
+  arrange(Kids_First_Participant_ID)
+```
+
+### Genomic assays
+
+Setting aside the `Panel` sample for the moment and only looking at WXS and WGS assays.
+We're collapsing the different values in `tumor_descriptor` to form a single descriptor when there are multiple types of tumors from the same individual.
+
+```{r}
+genomic_df <- tumor_descriptor_df %>%
+  filter(experimental_strategy %in% c("WGS", "WXS")) %>%
+  group_by(Kids_First_Participant_ID) %>%
+  summarize(descriptors = paste(sort(unique(tumor_descriptor)),  
+                                collapse = ", "),
+            experimental_strategy = paste(sort(unique(experimental_strategy)),
+                                          collapse = ", "))
+```
+
+```{r}
+genomic_df %>%
+  group_by(descriptors) %>%
+  tally() %>%
+  arrange(desc(n)) %>% 
+  regulartable() %>%
+  fontsize(size = 12, part = "all")
+```
+
+Only primary samples is the most common case for genomic assays.
+
+### Transcriptomic assays
+
+Looking only at RNA-seq samples and performing the same collapsing of the `tumor_descriptor` column.
+
+```{r}
+transcriptomic_df <- tumor_descriptor_df %>%
+  filter(experimental_strategy == "RNA-Seq") %>%
+  group_by(Kids_First_Participant_ID) %>%
+  summarize(descriptors = paste(sort(unique(tumor_descriptor)),  
+                                collapse = ", "),
+            experimental_strategy = unique(experimental_strategy))
+```
+
+```{r}
+transcriptomic_df %>%
+  group_by(descriptors) %>%
+  tally() %>%
+  arrange(desc(n)) %>% 
+  regulartable() %>%
+  fontsize(size = 12, part = "all")
+```
+
+The same holds for RNA-seq.
+
+### Paired genomic, transcriptomic assays
+
+How many participants have paired genomic and transcriptomic samples for the same `tumor_descriptor` values?
+
+```{r}
+# if we perform a full join here using the pariticpant ID and descriptors, we 
+# will get NAs in columns where pairs don't exist and can use this information 
+# to count
+paired_df <-
+  full_join(genomic_df, transcriptomic_df,
+                   by = c("Kids_First_Participant_ID", "descriptors"))
+```
+
+#### All paired
+
+Count the examples where all time points have both kinds of assays.
+
+```{r}
+# no NAs = both kinds of assays are present
+paired_df %>%
+  filter(complete.cases(.)) %>%
+  group_by(descriptors) %>%
+  tally() %>%
+  arrange(desc(n)) %>% 
+  regulartable() %>%
+  fontsize(size = 12, part="all")
+```
+
+#### No RNA-seq
+
+If the `experimental_strategy.y` column has an `NA`, RNA-seq is missing for that participant-descriptors pair.
+
+```{r}
+paired_df %>%
+  filter(is.na(experimental_strategy.y)) %>%
+  group_by(descriptors) %>%
+  tally() %>% 
+  regulartable() %>%
+  fontsize(size = 12, part = "all")
+```
+
+#### No WXS or WGS
+
+If the `experimental_strategy.x` column has an `NA`, there is no WGS or WXS for that participant-descriptors pair.
+
+```{r}
+paired_df %>%
+  filter(is.na(experimental_strategy.x)) %>%
+  group_by(descriptors) %>%
+  tally() %>% 
+  regulartable() %>%
+  fontsize(size = 12, part = "all")
+```
+
+#### Duplicate participant IDs
+
+There are cases where one timepoint (e.g., `tumor_descriptor` value) is missing from one kind of assay but not the other.
+This will show up as duplicated values in the `Kids_First_Participant_ID` column.
+
+```{r}
+ids_with_dups <- paired_df %>%
+  filter(duplicated(Kids_First_Participant_ID)) %>%
+  pull(Kids_First_Participant_ID)
+```
+
+There are `r length(ids_with_dups)` cases of this.
+Let's see what this looks like.
+
+```{r}
+paired_df %>%
+  filter(Kids_First_Participant_ID %in% ids_with_dups) %>%
+  arrange(Kids_First_Participant_ID)  %>% 
+  regulartable() %>%
+  fontsize(size = 12, part = "all")
+```
+
+*Two examples for interpretation:*
+
+So for the `PT_3KM9W8S8` participant ID, there is primary WGS and RNA-seq data, but only RNA-seq data for progressive disease.
+For `PT_HT4HJXY6`, there is WGS data for both the primary CNS tumor and the second malignancy, but RNA-seq data only for the second malignancy.
+
+## By histology
+
+We're going to use the `disease_type_new` column here.
+
+### Primary only
+
+We're including *any* assay type.
+This table is different than what is plotted upstream in this module because we didn't restrict that to `Initial CNS Tumor` only.
+
+```{r}
+tumor_df %>%
+  filter(tumor_descriptor == "Initial CNS Tumor") %>%
+  distinct(Kids_First_Participant_ID, disease_type_new) %>%
+  group_by(disease_type_new) %>%
+  tally() %>%
+  arrange(desc(n)) %>% 
+  regulartable() %>%
+  fontsize(size = 12, part = "all")
+```
+
+### Disease type - descriptors pairs
+
+```{r}
+disease_types_descriptors <- tumor_df %>%
+  group_by(Kids_First_Participant_ID) %>%
+  summarize(disease_types = paste(sort(unique(disease_type_new)),
+                                  collapse = ", "),
+            descriptors = paste(sort(unique(tumor_descriptor)),
+                                collapse = ", ")) %>%
+  group_by(disease_types, descriptors) %>%
+  tally() %>%
+  arrange(desc(n))
+
+disease_types_descriptors %>% 
+  regulartable() %>%
+  fontsize(size = 12, part = "all")
+```
+
+#### Remove primary only
+
+```{r}
+disease_types_descriptors %>%
+  filter(descriptors != "Initial CNS Tumor")  %>% 
+  regulartable() %>%
+  fontsize(size = 12, part = "all")
+```
+
+What about when the primary tumor is present and paired with another point in time?
+
+```{r}
+disease_types_descriptors %>%
+  filter(descriptors != "Initial CNS Tumor")  %>% 
+  filter(grepl("Initial CNS Tumor", descriptors)) %>%
+  regulartable() %>%
+  fontsize(size = 12, part = "all")
+```
+
diff --git a/analyses/sample-distribution-analysis/03-tumor-descriptor-and-assay-count.nb.html b/analyses/sample-distribution-analysis/03-tumor-descriptor-and-assay-count.nb.html
diff --git a/analyses/sample-distribution-analysis/README.md b/analyses/sample-distribution-analysis/README.md
@@ -18,26 +18,33 @@ This folder contains scripts tasked to analyze the distribution of samples acros
 This script produces TSV files containing the counts and percentages of samples across each unique cancer type, and the types sorted in order of descending expression at each unique brain location.   
 
 `02-multilayer-pie.R` is a script written to produce an interactive treemap and multilayer pie chart representing the distribution of samples across broad histologies, short histologies, and tumor types.  
+
+ The interactive treemap can be found [here](https://alexslemonade.github.io/OpenPBTA-analysis/analyses/sample-distribution-analysis/plots/histology-treemap.html).  
+ The interactive, multilayer pie chart can be found [here](https://alexslemonade.github.io/OpenPBTA-analysis/analyses/plots/sample-distribution-analysis/histology-pie.html).
+
+
+The `03-tumor-descriptor-and-assay-count` notebook contains a series of tables that count the number of each assay type and example primary vs. recurrence broken down by histology. 
+View the notebook [here](https://alexslemonade.github.io/OpenPBTA-analysis/analyses/sample-distribution-analysis/03-tumor-descriptor-and-assay-count.nb.html).
+
 
- The interactive treemap can be found [here](alexslemonade.github.io/OpenPBTA-analysis/histology-treemap.html).  
- The interactive, multilayer pie chart can be found [here](alexslemonade.github.io/OpenPBTA-analysis/histology-pie.html).
 
 ## Folder structure 
 
 The structure of this folder is as follows:
 
 ```
-sample-distribution-analysis
-├── README.md
 ├── 01-filter-across-types.R
 ├── 02-multilayer-plots.R
+├── 03-tumor-descriptor-and-assay-count.Rmd
+├── 03-tumor-descriptor-and-assay-count.nb.html
+├── README.md
 ├── plots
 │   ├── distribution_across_cancer_types.pdf
-│   ├── histology-pie.html
+│   ├── histology-pie.html
 │   └── histology-treemap.html
 ├── results
 │   ├── disease_expression.tsv
-│   ├── primary_sites.tsv
-│   └── sunburst_plot_df.tsv
-
+│   ├── plots_df.tsv
+│   └── primary_sites_counts.tsv
+└── run-sample-distribution.sh
 ```
diff --git a/analyses/sample-distribution-analysis/plots/distribution_across_cancer_types.pdf b/analyses/sample-distribution-analysis/plots/distribution_across_cancer_types.pdf