Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Additional tables for sample distribution: breakdown by tumor descriptor #213

Merged
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
17 changes: 11 additions & 6 deletions Dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,7 @@ RUN apt-get update -qq && apt-get -y --no-install-recommends install \

# Required forinteractive sample distribution plots
# map view is needed to create HTML outputs of the interactive plots
RUN apt-get update -qq && apt-get -y --no-install-recommends install \
RUN apt-get update -qq && apt-get -y --no-install-recommends install \
&& install2.r --error \
--deps TRUE \
gdalUtils \
Expand All @@ -28,9 +28,9 @@ RUN apt-get update -qq && apt-get -y --no-install-recommends install \
mapview

# Installs packages needed for still treemap, interactive plots, and hex plots
# Rtsne and umap are required for dimension reduction analyses
# Rtsne and umap are required for dimension reduction analyses
# optparse is needed for passing arguments from the command line to R script
RUN apt-get update -qq && apt-get -y --no-install-recommends install \
RUN apt-get update -qq && apt-get -y --no-install-recommends install \
&& install2.r --error \
--deps TRUE \
R.utils \
Expand Down Expand Up @@ -72,7 +72,7 @@ RUN R -e "devtools::install_github('timelyportfolio/d3treeR', ref = '0eaba7f1c64
RUN R -e "devtools::install_github('clauswilke/colorblindr', ref = '1ac3d4d62dad047b68bb66c06cee927a4517d678', dependencies = TRUE)"

# Required for sex prediction from RNA-seq data
RUN apt-get update -qq && apt-get -y --no-install-recommends install \
RUN apt-get update -qq && apt-get -y --no-install-recommends install \
&& install2.r --error \
--deps TRUE \
glmnet \
Expand All @@ -86,21 +86,26 @@ RUN apt-get -y update && apt-get install -y \
&& rm -rf /var/lib/apt/lists/

# Install for SNV comparison plots
RUN apt-get update -qq && apt-get -y --no-install-recommends install \
RUN apt-get update -qq && apt-get -y --no-install-recommends install \
&& install2.r --error \
--deps TRUE \
UpSetR

RUN R -e "devtools::install_github('const-ae/ggupset', ref = '7a33263cc5fafdd72a5bfcbebe5185fafe050c73', dependencies = TRUE)"

# GGally and its required packages
RUN apt-get update -qq && apt-get -y --no-install-recommends install \
RUN apt-get update -qq && apt-get -y --no-install-recommends install \
&& install2.r --error \
lattice \
rpart \
class \
MASS \
GGally

# Help display tables in R Notebooks
RUN apt-get update -qq && apt-get -y --no-install-recommends install \
&& install2.r --error \
flextable

#### Please install your dependencies here
#### Add a comment to indicate what analysis it is required for
Original file line number Diff line number Diff line change
Expand Up @@ -116,6 +116,9 @@ ggplot2::ggsave(
height = 10
)

# TODO: rerun this when the primary_site column gets fixed
# https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/214

# data.frame with the location where each cancer type in the dataset is
# expressed, sorted to show highest expression
brain_location <- histologies_df %>%
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,262 @@
---
title: "Examine `tumor_descriptor` and `experimental_strategy` distributions"
output:
html_notebook:
toc: TRUE
toc_float: TRUE
author: J. Taroni for ALSF CCDL
date: 2019
---

In this notebook, we will explore the distribution of primary vs. other samples, as there are multiple samples from the same individual ([#155](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/155)).
Here, "other" can refer to either samples/biospecimens from progressive disease or recurrence.

We have independent sets of samples for genomic assays (e.g., WGS, WXS; described [here](https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/README.md#data-formats)) generated by @jashapiro for use in downstream analyses.
(You can see more background in [this notebook](https://alexslemonade.github.io/OpenPBTA-analysis/analyses/independent-samples/00-repeated-samples.nb.html).)

We can imagine that there are some analyses that may compare primary vs. recurrence that expect both WGS/WXS _and_ RNA-seq data.

We have not yet looked at:

* How many pairs of genomic and transcriptomic assays are there, i.e., RNA-Seq and WGS from the same timepoint for a participant?
* What is breakdown by **histology** for cases where there are multiple samples from the same individual?

```{r}
library(dplyr)
# this library will help display tables with smaller font for easy viewing
# in the HTML setting
library(flextable)
```

## Read in histologies file

`pbta-histologies.tsv` contains all the clinical information.

```{r}
histology_file <- file.path("..", "..", "data", "pbta-histologies.tsv")
histology_df <- readr::read_tsv(histology_file)
```

For WGS and WXS samples, we'll have tumor and normal to get the somatic calls.
We'll limit this to tumor samples to avoid double-counting participants and we'll remove derived cell lines.

```{r}
tumor_df <- histology_df %>%
filter(sample_type == "Tumor",
composition == "Solid Tissue")
```

## Number of each assay

First, we'll examine how many of each type of assay we have (tumors only).
This information is stored in the `experimental_strategy` column.

```{r}
tumor_df %>%
group_by(experimental_strategy) %>%
tally() %>%
arrange(desc(n)) %>%
regulartable() %>%
fontsize(size = 12, part = "all")
```

## Primary vs. other (recurrence, progressive disease)

We'll look at the breakdown of the `tumor_descriptor` column, separating the genomic assays from transcriptomic assays.

```{r}
tumor_descriptor_df <- tumor_df %>%
select(Kids_First_Participant_ID,
experimental_strategy,
tumor_descriptor) %>%
arrange(Kids_First_Participant_ID)
```

### Genomic assays

Setting aside the `Panel` sample for the moment and only looking at WXS and WGS assays.
We're collapsing the different values in `tumor_descriptor` to form a single descriptor when there are multiple types of tumors from the same individual.

```{r}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a description of what you are doing here? It is seems from the code that you are collapsing multiple samples to create a single descriptor when there is more than one kind of tumor, but it seems worth noting in text.

genomic_df <- tumor_descriptor_df %>%
filter(experimental_strategy %in% c("WGS", "WXS")) %>%
group_by(Kids_First_Participant_ID) %>%
summarize(descriptors = paste(sort(unique(tumor_descriptor)),
collapse = ", "),
experimental_strategy = paste(sort(unique(experimental_strategy)),
collapse = ", "))
```

```{r}
genomic_df %>%
group_by(descriptors) %>%
tally() %>%
arrange(desc(n)) %>%
regulartable() %>%
fontsize(size = 12, part = "all")
```

Only primary samples is the most common case for genomic assays.

### Transcriptomic assays

Looking only at RNA-seq samples and performing the same collapsing of the `tumor_descriptor` column.

```{r}
transcriptomic_df <- tumor_descriptor_df %>%
filter(experimental_strategy == "RNA-Seq") %>%
group_by(Kids_First_Participant_ID) %>%
summarize(descriptors = paste(sort(unique(tumor_descriptor)),
collapse = ", "),
experimental_strategy = unique(experimental_strategy))
```

```{r}
transcriptomic_df %>%
group_by(descriptors) %>%
tally() %>%
arrange(desc(n)) %>%
regulartable() %>%
fontsize(size = 12, part = "all")
```

The same holds for RNA-seq.

### Paired genomic, transcriptomic assays

How many participants have paired genomic and transcriptomic samples for the same `tumor_descriptor` values?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code here seems to require all descriptors to be paired for each participant? So if a participant had genomic and transcriptomic samples for Initial CNS tumor, but not both for Progressive, say, then it wouldn't be counted here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, another few lines down, I see that this is the case, and you have addressed it. But I do think it leaves the tables up here somewhat hard to interpret. At first glance, it appears that there may be many missing WGS samples, but in fact a portion of those are where there is actually missing RNAseq data making the descriptors not match.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can take out looking at paired_df in line 134. I think it's more confusing than helpful.


```{r}
# if we perform a full join here using the pariticpant ID and descriptors, we
# will get NAs in columns where pairs don't exist and can use this information
# to count
paired_df <-
full_join(genomic_df, transcriptomic_df,
by = c("Kids_First_Participant_ID", "descriptors"))
```

#### All paired

Count the examples where all time points have both kinds of assays.

```{r}
# no NAs = both kinds of assays are present
paired_df %>%
filter(complete.cases(.)) %>%
group_by(descriptors) %>%
tally() %>%
arrange(desc(n)) %>%
regulartable() %>%
fontsize(size = 12, part="all")
```

#### No RNA-seq

If the `experimental_strategy.y` column has an `NA`, RNA-seq is missing for that participant-descriptors pair.

```{r}
paired_df %>%
filter(is.na(experimental_strategy.y)) %>%
group_by(descriptors) %>%
tally() %>%
regulartable() %>%
fontsize(size = 12, part = "all")
```

#### No WXS or WGS

If the `experimental_strategy.x` column has an `NA`, there is no WGS or WXS for that participant-descriptors pair.

```{r}
paired_df %>%
filter(is.na(experimental_strategy.x)) %>%
group_by(descriptors) %>%
tally() %>%
regulartable() %>%
fontsize(size = 12, part = "all")
```

#### Duplicate participant IDs

There are cases where one timepoint (e.g., `tumor_descriptor` value) is missing from one kind of assay but not the other.
This will show up as duplicated values in the `Kids_First_Participant_ID` column.

```{r}
ids_with_dups <- paired_df %>%
filter(duplicated(Kids_First_Participant_ID)) %>%
pull(Kids_First_Participant_ID)
```

There are `r length(ids_with_dups)` cases of this.
Let's see what this looks like.

```{r}
paired_df %>%
filter(Kids_First_Participant_ID %in% ids_with_dups) %>%
arrange(Kids_First_Participant_ID) %>%
regulartable() %>%
fontsize(size = 12, part = "all")
```

*Two examples for interpretation:*

So for the `PT_3KM9W8S8` participant ID, there is primary WGS and RNA-seq data, but only RNA-seq data for progressive disease.
For `PT_HT4HJXY6`, there is WGS data for both the primary CNS tumor and the second malignancy, but RNA-seq data only for the second malignancy.

## By histology

We're going to use the `disease_type_new` column here.

### Primary only

We're including *any* assay type.
This table is different than what is plotted upstream in this module because we didn't restrict that to `Initial CNS Tumor` only.

```{r}
tumor_df %>%
filter(tumor_descriptor == "Initial CNS Tumor") %>%
distinct(Kids_First_Participant_ID, disease_type_new) %>%
group_by(disease_type_new) %>%
tally() %>%
arrange(desc(n)) %>%
regulartable() %>%
fontsize(size = 12, part = "all")
```

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have questions about some of these descriptors, and the meaning of delimiters. For example, we have: Ganglioglioma;Low-grade glioma/astrocytoma (WHO grade I/II), as well as Ganglioglioma on its own, and Low-grade glioma;astrocytoma (WHO grade I/II) on its own. Is the first some kind of combination of the latter two? Is there a semantic difference between the use of ; and / in these cases? This PR is probably not the right place for this discussion...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, these are all derived from the way they were added to the database - seems the \ should not be between LGG/astrocytoma, but that should be a ;. Ganglioglioma and LGG/astrocytoma are different and only shown together since they were both on the pathology report. If possible, through genomic analyses, we could further dial into the diagnoses, we can possibly replace with one or the other.

### Disease type - descriptors pairs

```{r}
disease_types_descriptors <- tumor_df %>%
group_by(Kids_First_Participant_ID) %>%
summarize(disease_types = paste(sort(unique(disease_type_new)),
collapse = ", "),
descriptors = paste(sort(unique(tumor_descriptor)),
collapse = ", ")) %>%
group_by(disease_types, descriptors) %>%
tally() %>%
arrange(desc(n))

disease_types_descriptors %>%
regulartable() %>%
fontsize(size = 12, part = "all")
```

#### Remove primary only

```{r}
disease_types_descriptors %>%
filter(descriptors != "Initial CNS Tumor") %>%
regulartable() %>%
fontsize(size = 12, part = "all")
```

What about when the primary tumor is present and paired with another point in time?

```{r}
disease_types_descriptors %>%
filter(descriptors != "Initial CNS Tumor") %>%
filter(grepl("Initial CNS Tumor", descriptors)) %>%
regulartable() %>%
fontsize(size = 12, part = "all")
```

Large diffs are not rendered by default.

23 changes: 15 additions & 8 deletions analyses/sample-distribution-analysis/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -18,26 +18,33 @@ This folder contains scripts tasked to analyze the distribution of samples acros
This script produces TSV files containing the counts and percentages of samples across each unique cancer type, and the types sorted in order of descending expression at each unique brain location.

`02-multilayer-pie.R` is a script written to produce an interactive treemap and multilayer pie chart representing the distribution of samples across broad histologies, short histologies, and tumor types.

The interactive treemap can be found [here](https://alexslemonade.github.io/OpenPBTA-analysis/analyses/sample-distribution-analysis/plots/histology-treemap.html).
The interactive, multilayer pie chart can be found [here](https://alexslemonade.github.io/OpenPBTA-analysis/analyses/plots/sample-distribution-analysis/histology-pie.html).


The `03-tumor-descriptor-and-assay-count` notebook contains a series of tables that count the number of each assay type and example primary vs. recurrence broken down by histology.
View the notebook [here](https://alexslemonade.github.io/OpenPBTA-analysis/analyses/sample-distribution-analysis/03-tumor-descriptor-and-assay-count.nb.html).


The interactive treemap can be found [here](alexslemonade.github.io/OpenPBTA-analysis/histology-treemap.html).
The interactive, multilayer pie chart can be found [here](alexslemonade.github.io/OpenPBTA-analysis/histology-pie.html).

## Folder structure

The structure of this folder is as follows:

```
sample-distribution-analysis
├── README.md
├── 01-filter-across-types.R
├── 02-multilayer-plots.R
├── 03-tumor-descriptor-and-assay-count.Rmd
├── 03-tumor-descriptor-and-assay-count.nb.html
├── README.md
├── plots
│   ├── distribution_across_cancer_types.pdf
│  ├── histology-pie.html
│   ├── histology-pie.html
│   └── histology-treemap.html
├── results
│   ├── disease_expression.tsv
│  ├── primary_sites.tsv
│   └── sunburst_plot_df.tsv

│   ├── plots_df.tsv
│   └── primary_sites_counts.tsv
└── run-sample-distribution.sh
```
Binary file not shown.
Loading