This repository has been archived by the owner on Jun 21, 2023. It is now read-only.
Additional tables for sample distribution: breakdown by tumor descriptor (#213)

* Add flextable to Docker
* Add notebook looking at tumor_descriptor breakdown
* Add notebook to shell script; rerun Using v7 data
* Add table examining more than one timepoint per histology And rerun
* Update module-specific README
* Add TODO re: primary_site column
* Response to @jashapiro comments
1 parent 59f8fda, commit 615fdf5
Showing 12 changed files with 3,781 additions and 131 deletions.
262 changes: 262 additions & 0 deletions
analyses/sample-distribution-analysis/03-tumor-descriptor-and-assay-count.Rmd
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,262 @@ | ||
--- | ||
title: "Examine `tumor_descriptor` and `experimental_strategy` distributions" | ||
output: | ||
html_notebook: | ||
toc: TRUE | ||
toc_float: TRUE | ||
author: J. Taroni for ALSF CCDL | ||
date: 2019 | ||
--- | ||
|
||
In this notebook, we will explore the distribution of primary vs. other samples, as there are multiple samples from the same individual ([#155](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/155)). | ||
Here, "other" refers to samples/biospecimens from either progressive disease or recurrence. | ||
|
||
We have independent sets of samples for genomic assays (e.g., WGS, WXS; described [here](https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/README.md#data-formats)) generated by @jashapiro for use in downstream analyses. | ||
(You can see more background in [this notebook](https://alexslemonade.github.io/OpenPBTA-analysis/analyses/independent-samples/00-repeated-samples.nb.html).) | ||
|
||
We can imagine that there are some analyses that may compare primary vs. recurrence that expect both WGS/WXS _and_ RNA-seq data. | ||
|
||
We have not yet looked at: | ||
|
||
* How many pairs of genomic and transcriptomic assays are there, i.e., RNA-Seq and WGS from the same timepoint for a participant? | ||
* What is the breakdown by **histology** for cases where there are multiple samples from the same individual? | ||
|
||
```{r} | ||
library(dplyr) | ||
# this library will help display tables with smaller font for easy viewing | ||
# in the HTML setting | ||
library(flextable) | ||
``` | ||
|
||
## Read in histologies file | ||
|
||
`pbta-histologies.tsv` contains all the clinical information. | ||
|
||
```{r} | ||
histology_file <- file.path("..", "..", "data", "pbta-histologies.tsv") | ||
histology_df <- readr::read_tsv(histology_file) | ||
``` | ||
|
||
For WGS and WXS samples, we'll have tumor and normal to get the somatic calls. | ||
We'll limit this to tumor samples to avoid double-counting participants and we'll remove derived cell lines. | ||
|
||
```{r} | ||
tumor_df <- histology_df %>% | ||
filter(sample_type == "Tumor", | ||
composition == "Solid Tissue") | ||
``` | ||
|
||
## Number of each assay | ||
|
||
First, we'll examine how many of each type of assay we have (tumors only). | ||
This information is stored in the `experimental_strategy` column. | ||
|
||
```{r} | ||
tumor_df %>% | ||
group_by(experimental_strategy) %>% | ||
tally() %>% | ||
arrange(desc(n)) %>% | ||
regulartable() %>% | ||
fontsize(size = 12, part = "all") | ||
``` | ||
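As a side note, the `group_by()` + `tally()` + `arrange(desc(n))` pattern used throughout this notebook can also be written with `dplyr::count()`. The sketch below uses a made-up data frame (`toy_df` is hypothetical, not the histologies file) to show that both forms give the same counts in the same order.

```r
library(dplyr)

# Hypothetical toy data for illustration only
toy_df <- data.frame(
  experimental_strategy = c("WGS", "RNA-Seq", "WGS", "WXS", "WGS", "RNA-Seq"),
  stringsAsFactors = FALSE
)

# The pattern used in this notebook
via_tally <- toy_df %>%
  group_by(experimental_strategy) %>%
  tally() %>%
  arrange(desc(n))

# A shorter form: count() groups, tallies, and (with sort = TRUE) orders
# the result by descending n
via_count <- toy_df %>%
  count(experimental_strategy, sort = TRUE)
```

Either form can be piped into `regulartable()` as above; the longer form is kept in the notebook as written.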
|
||
## Primary vs. other (recurrence, progressive disease) | ||
|
||
We'll look at the breakdown of the `tumor_descriptor` column, separating the genomic assays from transcriptomic assays. | ||
|
||
```{r} | ||
tumor_descriptor_df <- tumor_df %>% | ||
select(Kids_First_Participant_ID, | ||
experimental_strategy, | ||
tumor_descriptor) %>% | ||
arrange(Kids_First_Participant_ID) | ||
``` | ||
|
||
### Genomic assays | ||
|
||
We set aside the `Panel` sample for the moment and look only at WXS and WGS assays. | ||
We're collapsing the different values in `tumor_descriptor` to form a single descriptor when there are multiple types of tumors from the same individual. | ||
|
||
```{r} | ||
genomic_df <- tumor_descriptor_df %>% | ||
filter(experimental_strategy %in% c("WGS", "WXS")) %>% | ||
group_by(Kids_First_Participant_ID) %>% | ||
summarize(descriptors = paste(sort(unique(tumor_descriptor)), | ||
collapse = ", "), | ||
experimental_strategy = paste(sort(unique(experimental_strategy)), | ||
collapse = ", ")) | ||
``` | ||
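To make the collapsing step concrete, here is a minimal sketch on made-up data. The participant IDs are hypothetical, and `"Recurrence"` is assumed as a `tumor_descriptor` value for illustration; only `"Initial CNS Tumor"` is taken from this notebook.

```r
library(dplyr)

# Hypothetical toy data: PT_A has three biospecimens, PT_B has one
toy_df <- data.frame(
  Kids_First_Participant_ID = c("PT_A", "PT_A", "PT_A", "PT_B"),
  experimental_strategy = c("WGS", "WGS", "WXS", "WGS"),
  tumor_descriptor = c("Recurrence", "Initial CNS Tumor",
                       "Initial CNS Tumor", "Initial CNS Tumor"),
  stringsAsFactors = FALSE
)

# One row per participant; unique values are sorted before pasting so the
# collapsed string does not depend on the row order of the input
toy_collapsed <- toy_df %>%
  group_by(Kids_First_Participant_ID) %>%
  summarize(descriptors = paste(sort(unique(tumor_descriptor)),
                                collapse = ", "),
            experimental_strategy = paste(sort(unique(experimental_strategy)),
                                          collapse = ", "))
```

In this toy case, `PT_A` ends up with the single descriptor `"Initial CNS Tumor, Recurrence"` and the strategy `"WGS, WXS"`, while `PT_B` keeps `"Initial CNS Tumor"` and `"WGS"`.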
|
||
```{r} | ||
genomic_df %>% | ||
group_by(descriptors) %>% | ||
tally() %>% | ||
arrange(desc(n)) %>% | ||
regulartable() %>% | ||
fontsize(size = 12, part = "all") | ||
``` | ||
|
||
Having only primary samples is the most common case for genomic assays. | ||
|
||
### Transcriptomic assays | ||
|
||
We look only at RNA-seq samples and perform the same collapsing of the `tumor_descriptor` column. | ||
|
||
```{r} | ||
transcriptomic_df <- tumor_descriptor_df %>% | ||
filter(experimental_strategy == "RNA-Seq") %>% | ||
group_by(Kids_First_Participant_ID) %>% | ||
summarize(descriptors = paste(sort(unique(tumor_descriptor)), | ||
collapse = ", "), | ||
experimental_strategy = unique(experimental_strategy)) | ||
``` | ||
|
||
```{r} | ||
transcriptomic_df %>% | ||
group_by(descriptors) %>% | ||
tally() %>% | ||
arrange(desc(n)) %>% | ||
regulartable() %>% | ||
fontsize(size = 12, part = "all") | ||
``` | ||
|
||
The same holds for RNA-seq. | ||
|
||
### Paired genomic, transcriptomic assays | ||
|
||
How many participants have paired genomic and transcriptomic samples for the same `tumor_descriptor` values? | ||
|
||
```{r} | ||
# if we perform a full join here using the participant ID and descriptors, we | ||
# will get NAs in columns where pairs don't exist and can use this information | ||
# to count | ||
paired_df <- | ||
full_join(genomic_df, transcriptomic_df, | ||
by = c("Kids_First_Participant_ID", "descriptors")) | ||
``` | ||
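The following sketch illustrates how the full join produces `NA` values that can then be counted. All data are made up (`toy_genomic` and `toy_transcriptomic` are hypothetical stand-ins for `genomic_df` and `transcriptomic_df`), and `"Progressive"` is an assumed descriptor value. Because both tables have an `experimental_strategy` column that is not part of the join key, dplyr adds the `.x` (genomic) and `.y` (transcriptomic) suffixes.

```r
library(dplyr)

# Hypothetical collapsed tables, one row per participant
toy_genomic <- data.frame(
  Kids_First_Participant_ID = c("PT_A", "PT_B", "PT_C"),
  descriptors = c("Initial CNS Tumor", "Initial CNS Tumor",
                  "Initial CNS Tumor"),
  experimental_strategy = c("WGS", "WGS", "WXS"),
  stringsAsFactors = FALSE
)
toy_transcriptomic <- data.frame(
  Kids_First_Participant_ID = c("PT_A", "PT_B", "PT_D"),
  descriptors = c("Initial CNS Tumor", "Initial CNS Tumor, Progressive",
                  "Initial CNS Tumor"),
  experimental_strategy = c("RNA-Seq", "RNA-Seq", "RNA-Seq"),
  stringsAsFactors = FALSE
)

toy_paired <- full_join(toy_genomic, toy_transcriptomic,
                        by = c("Kids_First_Participant_ID", "descriptors"))

# Fully paired: no NA in either experimental_strategy column
n_paired <- sum(complete.cases(toy_paired))
# Genomic data without matching RNA-seq
n_no_rnaseq <- sum(is.na(toy_paired$experimental_strategy.y))
# RNA-seq without matching WGS/WXS
n_no_genomic <- sum(is.na(toy_paired$experimental_strategy.x))
```

Here only `PT_A` is fully paired. `PT_B` contributes two rows because its collapsed descriptors differ between the two tables, so it counts once as missing RNA-seq and once as missing genomic data.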
|
||
#### All paired | ||
|
||
Count the examples where all time points have both kinds of assays. | ||
|
||
```{r} | ||
# no NAs = both kinds of assays are present | ||
paired_df %>% | ||
filter(complete.cases(.)) %>% | ||
group_by(descriptors) %>% | ||
tally() %>% | ||
arrange(desc(n)) %>% | ||
regulartable() %>% | ||
fontsize(size = 12, part = "all") | ||
``` | ||
|
||
#### No RNA-seq | ||
|
||
If the `experimental_strategy.y` column has an `NA`, RNA-seq is missing for that participant-descriptors pair. | ||
|
||
```{r} | ||
paired_df %>% | ||
filter(is.na(experimental_strategy.y)) %>% | ||
group_by(descriptors) %>% | ||
tally() %>% | ||
regulartable() %>% | ||
fontsize(size = 12, part = "all") | ||
``` | ||
|
||
#### No WXS or WGS | ||
|
||
If the `experimental_strategy.x` column has an `NA`, there is no WGS or WXS for that participant-descriptors pair. | ||
|
||
```{r} | ||
paired_df %>% | ||
filter(is.na(experimental_strategy.x)) %>% | ||
group_by(descriptors) %>% | ||
tally() %>% | ||
regulartable() %>% | ||
fontsize(size = 12, part = "all") | ||
``` | ||
|
||
#### Duplicate participant IDs | ||
|
||
There are cases where one timepoint (i.e., a `tumor_descriptor` value) is missing from one kind of assay but not the other. | ||
This will show up as duplicated values in the `Kids_First_Participant_ID` column. | ||
|
||
```{r} | ||
ids_with_dups <- paired_df %>% | ||
filter(duplicated(Kids_First_Participant_ID)) %>% | ||
pull(Kids_First_Participant_ID) | ||
``` | ||
|
||
There are `r length(ids_with_dups)` cases of this. | ||
Let's see what this looks like. | ||
|
||
```{r} | ||
paired_df %>% | ||
filter(Kids_First_Participant_ID %in% ids_with_dups) %>% | ||
arrange(Kids_First_Participant_ID) %>% | ||
regulartable() %>% | ||
fontsize(size = 12, part = "all") | ||
``` | ||
|
||
*Two examples for interpretation:* | ||
|
||
For the `PT_3KM9W8S8` participant ID, there is primary WGS and RNA-seq data, but only RNA-seq data for progressive disease. | ||
For `PT_HT4HJXY6`, there is WGS data for both the primary CNS tumor and the second malignancy, but RNA-seq data only for the second malignancy. | ||
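The same pattern can be reproduced on a small made-up table. In this sketch, `toy_paired` is a hypothetical stand-in for `paired_df`, the IDs are invented, and `"Progressive"` is an assumed descriptor value.

```r
library(dplyr)

# Hypothetical joined table: PT_B has WGS for the primary tumor only, but
# RNA-seq for both the primary tumor and progressive disease, so its
# collapsed descriptors do not match and it appears in two rows
toy_paired <- data.frame(
  Kids_First_Participant_ID = c("PT_A", "PT_B", "PT_B"),
  descriptors = c("Initial CNS Tumor", "Initial CNS Tumor",
                  "Initial CNS Tumor, Progressive"),
  experimental_strategy.x = c("WGS", "WGS", NA),
  experimental_strategy.y = c("RNA-Seq", NA, "RNA-Seq"),
  stringsAsFactors = FALSE
)

# duplicated() flags the second and later occurrences of each ID
toy_ids_with_dups <- toy_paired %>%
  filter(duplicated(Kids_First_Participant_ID)) %>%
  pull(Kids_First_Participant_ID)

# Keep every row for the flagged participants, not just the repeats
toy_dup_rows <- toy_paired %>%
  filter(Kids_First_Participant_ID %in% toy_ids_with_dups) %>%
  arrange(Kids_First_Participant_ID)
```

Reading the two `PT_B` rows together gives the same kind of interpretation as above: genomic data exist for the primary tumor only, while RNA-seq covers both timepoints.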
|
||
## By histology | ||
|
||
We're going to use the `disease_type_new` column here. | ||
|
||
### Primary only | ||
|
||
We're including *any* assay type. | ||
This table differs from what is plotted upstream in this module because the upstream plot is not restricted to `Initial CNS Tumor` samples. | ||
|
||
```{r} | ||
tumor_df %>% | ||
filter(tumor_descriptor == "Initial CNS Tumor") %>% | ||
distinct(Kids_First_Participant_ID, disease_type_new) %>% | ||
group_by(disease_type_new) %>% | ||
tally() %>% | ||
arrange(desc(n)) %>% | ||
regulartable() %>% | ||
fontsize(size = 12, part = "all") | ||
``` | ||
|
||
### Disease type - descriptors pairs | ||
|
||
```{r} | ||
disease_types_descriptors <- tumor_df %>% | ||
group_by(Kids_First_Participant_ID) %>% | ||
summarize(disease_types = paste(sort(unique(disease_type_new)), | ||
collapse = ", "), | ||
descriptors = paste(sort(unique(tumor_descriptor)), | ||
collapse = ", ")) %>% | ||
group_by(disease_types, descriptors) %>% | ||
tally() %>% | ||
arrange(desc(n)) | ||
disease_types_descriptors %>% | ||
regulartable() %>% | ||
fontsize(size = 12, part = "all") | ||
``` | ||
|
||
#### Remove primary only | ||
|
||
```{r} | ||
disease_types_descriptors %>% | ||
filter(descriptors != "Initial CNS Tumor") %>% | ||
regulartable() %>% | ||
fontsize(size = 12, part = "all") | ||
``` | ||
|
||
What about when the primary tumor is present and paired with another point in time? | ||
|
||
```{r} | ||
disease_types_descriptors %>% | ||
filter(descriptors != "Initial CNS Tumor") %>% | ||
filter(grepl("Initial CNS Tumor", descriptors)) %>% | ||
regulartable() %>% | ||
fontsize(size = 12, part = "all") | ||
``` | ||
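The two filters above can be checked on a toy version of the summary table. The data below are invented (`toy_counts` is a hypothetical stand-in for `disease_types_descriptors`, and the disease type labels and `"Recurrence"` value are assumptions). This sketch also passes `fixed = TRUE` to `grepl()` so the descriptor is matched as a literal string; the chunk above uses the default regular-expression matching, which gives the same result for this pattern.

```r
library(dplyr)

# Hypothetical summary table of disease type and collapsed descriptors
toy_counts <- data.frame(
  disease_types = c("Type 1", "Type 1", "Type 2"),
  descriptors = c("Initial CNS Tumor", "Initial CNS Tumor, Recurrence",
                  "Recurrence"),
  n = c(10L, 3L, 1L),
  stringsAsFactors = FALSE
)

# Drop participants that only have a primary tumor
toy_not_primary_only <- toy_counts %>%
  filter(descriptors != "Initial CNS Tumor")

# Of those, keep cases where the primary tumor is present alongside
# at least one other timepoint
toy_primary_plus_other <- toy_not_primary_only %>%
  filter(grepl("Initial CNS Tumor", descriptors, fixed = TRUE))
```

The first filter keeps the `"Initial CNS Tumor, Recurrence"` and `"Recurrence"` rows; the second keeps only the row where the primary tumor is paired with another timepoint.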
|
3,371 changes: 3,371 additions & 0 deletions
analyses/sample-distribution-analysis/03-tumor-descriptor-and-assay-count.nb.html
Large diffs are not rendered by default.
Binary file modified
BIN
-1 Byte
(100%)
analyses/sample-distribution-analysis/plots/distribution_across_cancer_types.pdf
Binary file not shown.