-
Notifications
You must be signed in to change notification settings - Fork 67
Additional tables for sample distribution: breakdown by tumor descriptor #213
Changes from all commits
521b61d
5a3d83e
228b704
c1124ca
1f26222
cbef8e3
e28b497
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,262 @@ | ||
--- | ||
title: "Examine `tumor_descriptor` and `experimental_strategy` distributions" | ||
output: | ||
html_notebook: | ||
toc: TRUE | ||
toc_float: TRUE | ||
author: J. Taroni for ALSF CCDL | ||
date: 2019 | ||
--- | ||
|
||
In this notebook, we will explore the distribution of primary vs. other samples, as there are multiple samples from the same individual ([#155](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/155)). | ||
Here, "other" can refer to either samples/biospecimens from progressive disease or recurrence. | ||
|
||
We have independent sets of samples for genomic assays (e.g., WGS, WXS; described [here](https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/README.md#data-formats)) generated by @jashapiro for use in downstream analyses. | ||
(You can see more background in [this notebook](https://alexslemonade.github.io/OpenPBTA-analysis/analyses/independent-samples/00-repeated-samples.nb.html).) | ||
|
||
We can imagine that there are some analyses that may compare primary vs. recurrence that expect both WGS/WXS _and_ RNA-seq data. | ||
|
||
We have not yet looked at: | ||
|
||
* How many pairs of genomic and transcriptomic assays are there, i.e., RNA-Seq and WGS from the same timepoint for a participant? | ||
* What is breakdown by **histology** for cases where there are multiple samples from the same individual? | ||
|
||
```{r} | ||
library(dplyr) | ||
# this library will help display tables with smaller font for easy viewing | ||
# in the HTML setting | ||
library(flextable) | ||
``` | ||
|
||
## Read in histologies file | ||
|
||
`pbta-histologies.tsv` contains all the clinical information. | ||
|
||
```{r} | ||
histology_file <- file.path("..", "..", "data", "pbta-histologies.tsv") | ||
histology_df <- readr::read_tsv(histology_file) | ||
``` | ||
|
||
For WGS and WXS samples, we'll have tumor and normal to get the somatic calls. | ||
We'll limit this to tumor samples to avoid double-counting participants and we'll remove derived cell lines. | ||
|
||
```{r} | ||
tumor_df <- histology_df %>% | ||
filter(sample_type == "Tumor", | ||
composition == "Solid Tissue") | ||
``` | ||
|
||
## Number of each assay | ||
|
||
First, we'll examine how many of each type of assay we have (tumors only). | ||
This information is stored in the `experimental_strategy` column. | ||
|
||
```{r} | ||
tumor_df %>% | ||
group_by(experimental_strategy) %>% | ||
tally() %>% | ||
arrange(desc(n)) %>% | ||
regulartable() %>% | ||
fontsize(size = 12, part = "all") | ||
``` | ||
|
||
## Primary vs. other (recurrence, progressive disease) | ||
|
||
We'll look at the breakdown of the `tumor_descriptor` column, separating the genomic assays from transcriptomic assays. | ||
|
||
```{r} | ||
tumor_descriptor_df <- tumor_df %>% | ||
select(Kids_First_Participant_ID, | ||
experimental_strategy, | ||
tumor_descriptor) %>% | ||
arrange(Kids_First_Participant_ID) | ||
``` | ||
|
||
### Genomic assays | ||
|
||
Setting aside the `Panel` sample for the moment and only looking at WXS and WGS assays. | ||
We're collapsing the different values in `tumor_descriptor` to form a single descriptor when there are multiple types of tumors from the same individual. | ||
|
||
```{r} | ||
genomic_df <- tumor_descriptor_df %>% | ||
filter(experimental_strategy %in% c("WGS", "WXS")) %>% | ||
group_by(Kids_First_Participant_ID) %>% | ||
summarize(descriptors = paste(sort(unique(tumor_descriptor)), | ||
collapse = ", "), | ||
experimental_strategy = paste(sort(unique(experimental_strategy)), | ||
collapse = ", ")) | ||
``` | ||
|
||
```{r} | ||
genomic_df %>% | ||
group_by(descriptors) %>% | ||
tally() %>% | ||
arrange(desc(n)) %>% | ||
regulartable() %>% | ||
fontsize(size = 12, part = "all") | ||
``` | ||
|
||
Only primary samples is the most common case for genomic assays. | ||
|
||
### Transcriptomic assays | ||
|
||
Looking only at RNA-seq samples and performing the same collapsing of the `tumor_descriptor` column. | ||
|
||
```{r} | ||
transcriptomic_df <- tumor_descriptor_df %>% | ||
filter(experimental_strategy == "RNA-Seq") %>% | ||
group_by(Kids_First_Participant_ID) %>% | ||
summarize(descriptors = paste(sort(unique(tumor_descriptor)), | ||
collapse = ", "), | ||
experimental_strategy = unique(experimental_strategy)) | ||
``` | ||
|
||
```{r} | ||
transcriptomic_df %>% | ||
group_by(descriptors) %>% | ||
tally() %>% | ||
arrange(desc(n)) %>% | ||
regulartable() %>% | ||
fontsize(size = 12, part = "all") | ||
``` | ||
|
||
The same holds for RNA-seq. | ||
|
||
### Paired genomic, transcriptomic assays | ||
|
||
How many participants have paired genomic and transcriptomic samples for the same `tumor_descriptor` values? | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The code here seems to require all descriptors to be paired for each participant? So if a participant had genomic and transcriptomic samples for There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Okay, another few lines down, I see that this is the case, and you have addressed it. But I do think it leaves the tables up here somewhat hard to interpret. At first glance, it appears that there may be many missing WGS samples, but in fact a portion of those are where there is actually missing RNAseq data making the descriptors not match. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I can take out looking at |
||
|
||
```{r} | ||
# if we perform a full join here using the pariticpant ID and descriptors, we | ||
# will get NAs in columns where pairs don't exist and can use this information | ||
# to count | ||
paired_df <- | ||
full_join(genomic_df, transcriptomic_df, | ||
by = c("Kids_First_Participant_ID", "descriptors")) | ||
``` | ||
|
||
#### All paired | ||
|
||
Count the examples where all time points have both kinds of assays. | ||
|
||
```{r} | ||
# no NAs = both kinds of assays are present | ||
paired_df %>% | ||
filter(complete.cases(.)) %>% | ||
group_by(descriptors) %>% | ||
tally() %>% | ||
arrange(desc(n)) %>% | ||
regulartable() %>% | ||
fontsize(size = 12, part="all") | ||
``` | ||
|
||
#### No RNA-seq | ||
|
||
If the `experimental_strategy.y` column has an `NA`, RNA-seq is missing for that participant-descriptors pair. | ||
|
||
```{r} | ||
paired_df %>% | ||
filter(is.na(experimental_strategy.y)) %>% | ||
group_by(descriptors) %>% | ||
tally() %>% | ||
regulartable() %>% | ||
fontsize(size = 12, part = "all") | ||
``` | ||
|
||
#### No WXS or WGS | ||
|
||
If the `experimental_strategy.x` column has an `NA`, there is no WGS or WXS for that participant-descriptors pair. | ||
|
||
```{r} | ||
paired_df %>% | ||
filter(is.na(experimental_strategy.x)) %>% | ||
group_by(descriptors) %>% | ||
tally() %>% | ||
regulartable() %>% | ||
fontsize(size = 12, part = "all") | ||
``` | ||
|
||
#### Duplicate participant IDs | ||
|
||
There are cases where one timepoint (e.g., `tumor_descriptor` value) is missing from one kind of assay but not the other. | ||
This will show up as duplicated values in the `Kids_First_Participant_ID` column. | ||
|
||
```{r} | ||
ids_with_dups <- paired_df %>% | ||
filter(duplicated(Kids_First_Participant_ID)) %>% | ||
pull(Kids_First_Participant_ID) | ||
``` | ||
|
||
There are `r length(ids_with_dups)` cases of this. | ||
Let's see what this looks like. | ||
|
||
```{r} | ||
paired_df %>% | ||
filter(Kids_First_Participant_ID %in% ids_with_dups) %>% | ||
arrange(Kids_First_Participant_ID) %>% | ||
regulartable() %>% | ||
fontsize(size = 12, part = "all") | ||
``` | ||
|
||
*Two examples for interpretation:* | ||
|
||
So for the `PT_3KM9W8S8` participant ID, there is primary WGS and RNA-seq data, but only RNA-seq data for progressive disease. | ||
For `PT_HT4HJXY6`, there is WGS data for both the primary CNS tumor and the second malignancy, but RNA-seq data only for the second malignancy. | ||
|
||
## By histology | ||
|
||
We're going to use the `disease_type_new` column here. | ||
|
||
### Primary only | ||
|
||
We're including *any* assay type. | ||
This table is different than what is plotted upstream in this module because we didn't restrict that to `Initial CNS Tumor` only. | ||
|
||
```{r} | ||
tumor_df %>% | ||
filter(tumor_descriptor == "Initial CNS Tumor") %>% | ||
distinct(Kids_First_Participant_ID, disease_type_new) %>% | ||
group_by(disease_type_new) %>% | ||
tally() %>% | ||
arrange(desc(n)) %>% | ||
regulartable() %>% | ||
fontsize(size = 12, part = "all") | ||
``` | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I have questions about some of these descriptors, and the meaning of delimiters. For example, we have: There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hmm, these are all derived from the way they were added to the database - seems the |
||
### Disease type - descriptors pairs | ||
|
||
```{r} | ||
disease_types_descriptors <- tumor_df %>% | ||
group_by(Kids_First_Participant_ID) %>% | ||
summarize(disease_types = paste(sort(unique(disease_type_new)), | ||
collapse = ", "), | ||
descriptors = paste(sort(unique(tumor_descriptor)), | ||
collapse = ", ")) %>% | ||
group_by(disease_types, descriptors) %>% | ||
tally() %>% | ||
arrange(desc(n)) | ||
|
||
disease_types_descriptors %>% | ||
regulartable() %>% | ||
fontsize(size = 12, part = "all") | ||
``` | ||
|
||
#### Remove primary only | ||
|
||
```{r} | ||
disease_types_descriptors %>% | ||
filter(descriptors != "Initial CNS Tumor") %>% | ||
regulartable() %>% | ||
fontsize(size = 12, part = "all") | ||
``` | ||
|
||
What about when the primary tumor is present and paired with another point in time? | ||
|
||
```{r} | ||
disease_types_descriptors %>% | ||
filter(descriptors != "Initial CNS Tumor") %>% | ||
filter(grepl("Initial CNS Tumor", descriptors)) %>% | ||
regulartable() %>% | ||
fontsize(size = 12, part = "all") | ||
``` | ||
|
Large diffs are not rendered by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a description of what you are doing here? It is seems from the code that you are collapsing multiple samples to create a single descriptor when there is more than one kind of tumor, but it seems worth noting in text.