Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Commit

Permalink
Additional tables for sample distribution: breakdown by tumor descrip…
Browse files Browse the repository at this point in the history
…tor (#213)

* Add flextable to Docker

* Add notebook looking at tumor_descriptor breakdown

* Add notebook to shell script; rerun

Using v7 data

* Add table examining more than one timepoint per histology

And rerun

* Update module-specific README

* Add TODO re: primary_site column

* Response to @jashapiro comments
  • Loading branch information
jaclyn-taroni authored Nov 4, 2019
1 parent 59f8fda commit 615fdf5
Show file tree
Hide file tree
Showing 12 changed files with 3,781 additions and 131 deletions.
17 changes: 11 additions & 6 deletions Dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,7 @@ RUN apt-get update -qq && apt-get -y --no-install-recommends install \

# Required forinteractive sample distribution plots
# map view is needed to create HTML outputs of the interactive plots
RUN apt-get update -qq && apt-get -y --no-install-recommends install \
RUN apt-get update -qq && apt-get -y --no-install-recommends install \
&& install2.r --error \
--deps TRUE \
gdalUtils \
Expand All @@ -28,9 +28,9 @@ RUN apt-get update -qq && apt-get -y --no-install-recommends install \
mapview

# Installs packages needed for still treemap, interactive plots, and hex plots
# Rtsne and umap are required for dimension reduction analyses
# Rtsne and umap are required for dimension reduction analyses
# optparse is needed for passing arguments from the command line to R script
RUN apt-get update -qq && apt-get -y --no-install-recommends install \
RUN apt-get update -qq && apt-get -y --no-install-recommends install \
&& install2.r --error \
--deps TRUE \
R.utils \
Expand Down Expand Up @@ -72,7 +72,7 @@ RUN R -e "devtools::install_github('timelyportfolio/d3treeR', ref = '0eaba7f1c64
RUN R -e "devtools::install_github('clauswilke/colorblindr', ref = '1ac3d4d62dad047b68bb66c06cee927a4517d678', dependencies = TRUE)"

# Required for sex prediction from RNA-seq data
RUN apt-get update -qq && apt-get -y --no-install-recommends install \
RUN apt-get update -qq && apt-get -y --no-install-recommends install \
&& install2.r --error \
--deps TRUE \
glmnet \
Expand All @@ -86,21 +86,26 @@ RUN apt-get -y update && apt-get install -y \
&& rm -rf /var/lib/apt/lists/

# Install for SNV comparison plots
RUN apt-get update -qq && apt-get -y --no-install-recommends install \
RUN apt-get update -qq && apt-get -y --no-install-recommends install \
&& install2.r --error \
--deps TRUE \
UpSetR

RUN R -e "devtools::install_github('const-ae/ggupset', ref = '7a33263cc5fafdd72a5bfcbebe5185fafe050c73', dependencies = TRUE)"

# GGally and its required packages
RUN apt-get update -qq && apt-get -y --no-install-recommends install \
RUN apt-get update -qq && apt-get -y --no-install-recommends install \
&& install2.r --error \
lattice \
rpart \
class \
MASS \
GGally

# Help display tables in R Notebooks
RUN apt-get update -qq && apt-get -y --no-install-recommends install \
&& install2.r --error \
flextable

#### Please install your dependencies here
#### Add a comment to indicate what analysis it is required for
Original file line number Diff line number Diff line change
Expand Up @@ -116,6 +116,9 @@ ggplot2::ggsave(
height = 10
)

# TODO: rerun this when the primary_site column gets fixed
# https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/214

# data.frame with the location where each cancer type in the dataset is
# expressed, sorted to show highest expression
brain_location <- histologies_df %>%
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,262 @@
---
title: "Examine `tumor_descriptor` and `experimental_strategy` distributions"
output:
html_notebook:
toc: TRUE
toc_float: TRUE
author: J. Taroni for ALSF CCDL
date: 2019
---

In this notebook, we will explore the distribution of primary vs. other samples, as there are multiple samples from the same individual ([#155](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/155)).
Here, "other" can refer to either samples/biospecimens from progressive disease or recurrence.

We have independent sets of samples for genomic assays (e.g., WGS, WXS; described [here](https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/README.md#data-formats)) generated by @jashapiro for use in downstream analyses.
(You can see more background in [this notebook](https://alexslemonade.github.io/OpenPBTA-analysis/analyses/independent-samples/00-repeated-samples.nb.html).)

We can imagine that there are some analyses that may compare primary vs. recurrence that expect both WGS/WXS _and_ RNA-seq data.

We have not yet looked at:

* How many pairs of genomic and transcriptomic assays are there, i.e., RNA-Seq and WGS from the same timepoint for a participant?
* What is breakdown by **histology** for cases where there are multiple samples from the same individual?

```{r}
library(dplyr)
# this library will help display tables with smaller font for easy viewing
# in the HTML setting
library(flextable)
```

## Read in histologies file

`pbta-histologies.tsv` contains all the clinical information.

```{r}
histology_file <- file.path("..", "..", "data", "pbta-histologies.tsv")
histology_df <- readr::read_tsv(histology_file)
```

For WGS and WXS samples, we'll have tumor and normal to get the somatic calls.
We'll limit this to tumor samples to avoid double-counting participants and we'll remove derived cell lines.

```{r}
tumor_df <- histology_df %>%
filter(sample_type == "Tumor",
composition == "Solid Tissue")
```

## Number of each assay

First, we'll examine how many of each type of assay we have (tumors only).
This information is stored in the `experimental_strategy` column.

```{r}
tumor_df %>%
group_by(experimental_strategy) %>%
tally() %>%
arrange(desc(n)) %>%
regulartable() %>%
fontsize(size = 12, part = "all")
```

## Primary vs. other (recurrence, progressive disease)

We'll look at the breakdown of the `tumor_descriptor` column, separating the genomic assays from transcriptomic assays.

```{r}
tumor_descriptor_df <- tumor_df %>%
select(Kids_First_Participant_ID,
experimental_strategy,
tumor_descriptor) %>%
arrange(Kids_First_Participant_ID)
```

### Genomic assays

Setting aside the `Panel` sample for the moment and only looking at WXS and WGS assays.
We're collapsing the different values in `tumor_descriptor` to form a single descriptor when there are multiple types of tumors from the same individual.

```{r}
genomic_df <- tumor_descriptor_df %>%
filter(experimental_strategy %in% c("WGS", "WXS")) %>%
group_by(Kids_First_Participant_ID) %>%
summarize(descriptors = paste(sort(unique(tumor_descriptor)),
collapse = ", "),
experimental_strategy = paste(sort(unique(experimental_strategy)),
collapse = ", "))
```

```{r}
genomic_df %>%
group_by(descriptors) %>%
tally() %>%
arrange(desc(n)) %>%
regulartable() %>%
fontsize(size = 12, part = "all")
```

Only primary samples is the most common case for genomic assays.

### Transcriptomic assays

Looking only at RNA-seq samples and performing the same collapsing of the `tumor_descriptor` column.

```{r}
transcriptomic_df <- tumor_descriptor_df %>%
filter(experimental_strategy == "RNA-Seq") %>%
group_by(Kids_First_Participant_ID) %>%
summarize(descriptors = paste(sort(unique(tumor_descriptor)),
collapse = ", "),
experimental_strategy = unique(experimental_strategy))
```

```{r}
transcriptomic_df %>%
group_by(descriptors) %>%
tally() %>%
arrange(desc(n)) %>%
regulartable() %>%
fontsize(size = 12, part = "all")
```

The same holds for RNA-seq.

### Paired genomic, transcriptomic assays

How many participants have paired genomic and transcriptomic samples for the same `tumor_descriptor` values?

```{r}
# if we perform a full join here using the pariticpant ID and descriptors, we
# will get NAs in columns where pairs don't exist and can use this information
# to count
paired_df <-
full_join(genomic_df, transcriptomic_df,
by = c("Kids_First_Participant_ID", "descriptors"))
```

#### All paired

Count the examples where all time points have both kinds of assays.

```{r}
# no NAs = both kinds of assays are present
paired_df %>%
filter(complete.cases(.)) %>%
group_by(descriptors) %>%
tally() %>%
arrange(desc(n)) %>%
regulartable() %>%
fontsize(size = 12, part="all")
```

#### No RNA-seq

If the `experimental_strategy.y` column has an `NA`, RNA-seq is missing for that participant-descriptors pair.

```{r}
paired_df %>%
filter(is.na(experimental_strategy.y)) %>%
group_by(descriptors) %>%
tally() %>%
regulartable() %>%
fontsize(size = 12, part = "all")
```

#### No WXS or WGS

If the `experimental_strategy.x` column has an `NA`, there is no WGS or WXS for that participant-descriptors pair.

```{r}
paired_df %>%
filter(is.na(experimental_strategy.x)) %>%
group_by(descriptors) %>%
tally() %>%
regulartable() %>%
fontsize(size = 12, part = "all")
```

#### Duplicate participant IDs

There are cases where one timepoint (e.g., `tumor_descriptor` value) is missing from one kind of assay but not the other.
This will show up as duplicated values in the `Kids_First_Participant_ID` column.

```{r}
ids_with_dups <- paired_df %>%
filter(duplicated(Kids_First_Participant_ID)) %>%
pull(Kids_First_Participant_ID)
```

There are `r length(ids_with_dups)` cases of this.
Let's see what this looks like.

```{r}
paired_df %>%
filter(Kids_First_Participant_ID %in% ids_with_dups) %>%
arrange(Kids_First_Participant_ID) %>%
regulartable() %>%
fontsize(size = 12, part = "all")
```

*Two examples for interpretation:*

So for the `PT_3KM9W8S8` participant ID, there is primary WGS and RNA-seq data, but only RNA-seq data for progressive disease.
For `PT_HT4HJXY6`, there is WGS data for both the primary CNS tumor and the second malignancy, but RNA-seq data only for the second malignancy.

## By histology

We're going to use the `disease_type_new` column here.

### Primary only

We're including *any* assay type.
This table is different than what is plotted upstream in this module because we didn't restrict that to `Initial CNS Tumor` only.

```{r}
tumor_df %>%
filter(tumor_descriptor == "Initial CNS Tumor") %>%
distinct(Kids_First_Participant_ID, disease_type_new) %>%
group_by(disease_type_new) %>%
tally() %>%
arrange(desc(n)) %>%
regulartable() %>%
fontsize(size = 12, part = "all")
```

### Disease type - descriptors pairs

```{r}
disease_types_descriptors <- tumor_df %>%
group_by(Kids_First_Participant_ID) %>%
summarize(disease_types = paste(sort(unique(disease_type_new)),
collapse = ", "),
descriptors = paste(sort(unique(tumor_descriptor)),
collapse = ", ")) %>%
group_by(disease_types, descriptors) %>%
tally() %>%
arrange(desc(n))
disease_types_descriptors %>%
regulartable() %>%
fontsize(size = 12, part = "all")
```

#### Remove primary only

```{r}
disease_types_descriptors %>%
filter(descriptors != "Initial CNS Tumor") %>%
regulartable() %>%
fontsize(size = 12, part = "all")
```

What about when the primary tumor is present and paired with another point in time?

```{r}
disease_types_descriptors %>%
filter(descriptors != "Initial CNS Tumor") %>%
filter(grepl("Initial CNS Tumor", descriptors)) %>%
regulartable() %>%
fontsize(size = 12, part = "all")
```

Large diffs are not rendered by default.

23 changes: 15 additions & 8 deletions analyses/sample-distribution-analysis/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -18,26 +18,33 @@ This folder contains scripts tasked to analyze the distribution of samples acros
This script produces TSV files containing the counts and percentages of samples across each unique cancer type, and the types sorted in order of descending expression at each unique brain location.

`02-multilayer-pie.R` is a script written to produce an interactive treemap and multilayer pie chart representing the distribution of samples across broad histologies, short histologies, and tumor types.

The interactive treemap can be found [here](https://alexslemonade.github.io/OpenPBTA-analysis/analyses/sample-distribution-analysis/plots/histology-treemap.html).
The interactive, multilayer pie chart can be found [here](https://alexslemonade.github.io/OpenPBTA-analysis/analyses/plots/sample-distribution-analysis/histology-pie.html).


The `03-tumor-descriptor-and-assay-count` notebook contains a series of tables that count the number of each assay type and example primary vs. recurrence broken down by histology.
View the notebook [here](https://alexslemonade.github.io/OpenPBTA-analysis/analyses/sample-distribution-analysis/03-tumor-descriptor-and-assay-count.nb.html).


The interactive treemap can be found [here](alexslemonade.github.io/OpenPBTA-analysis/histology-treemap.html).
The interactive, multilayer pie chart can be found [here](alexslemonade.github.io/OpenPBTA-analysis/histology-pie.html).

## Folder structure

The structure of this folder is as follows:

```
sample-distribution-analysis
├── README.md
├── 01-filter-across-types.R
├── 02-multilayer-plots.R
├── 03-tumor-descriptor-and-assay-count.Rmd
├── 03-tumor-descriptor-and-assay-count.nb.html
├── README.md
├── plots
│   ├── distribution_across_cancer_types.pdf
│  ├── histology-pie.html
│   ├── histology-pie.html
│   └── histology-treemap.html
├── results
│   ├── disease_expression.tsv
│  ├── primary_sites.tsv
│   └── sunburst_plot_df.tsv
│   ├── plots_df.tsv
│   └── primary_sites_counts.tsv
└── run-sample-distribution.sh
```
Binary file not shown.
Loading

0 comments on commit 615fdf5

Please sign in to comment.