This repository has been archived by the owner on Jun 21, 2023. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 67
/
02-add-ploidy-consensus.Rmd
148 lines (118 loc) · 5.67 KB
/
02-add-ploidy-consensus.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
---
title: "Add ploidy column, status to consensus SEG file"
output:
html_notebook:
toc: TRUE
toc_float: TRUE
author: Chante Bethell and Jaclyn Taroni for ALSF CCDL
date: 2020
---
The `pbta-histologies.tsv` file contains a `tumor_ploidy` column, which is tumor ploidy as inferred by ControlFreeC.
The copy number information should be interpreted in the light of this information (see: [current version of Data Formats file](https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/0e642ef0602ad04b8d0fbd80be626d4374fee928/doc/data-formats.md#a-note-on-ploidy)).
This notebook adds ploidy information to the consensus SEG file and adds a status column that defines gain and loss broadly.
**Note** that the consensus SEG file does not have copy number information for sex chromosomes.
## Usage
This notebook is intended to be run from the command line with the following (assumes you are in the root directory of the repository):
```
Rscript -e "rmarkdown::render('analyses/focal-cn-file-preparation/02-add-ploidy-consensus.Rmd', clean = TRUE)"
```
## Set up
### Libraries and functions
```{r}
library(tidyverse)
```
### Files and directories
```{r}
scratch_dir <- file.path("..", "..", "scratch")
```
```{r}
# TODO: the consensus SEG file is not currently in the data download -- when it
# gets included we will have to change the file path here
consensus_seg_file <- file.path("..", "copy_number_consensus_call", "results",
"pbta-cnv-consensus.seg.gz")
histologies_file <- file.path("..", "..", "data", "pbta-histologies.tsv")
consensus_seg_df <- read_tsv(consensus_seg_file)
histologies_df <- read_tsv(histologies_file,
col_types = cols(
molecular_subtype = col_character()
))
```
## Add ploidy and status
As noted above, the sex chromosomes do not have copy number information included in the consensus SEG file, but other copy number values can be missing.
From the [consensus SEG file methods](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/0e642ef0602ad04b8d0fbd80be626d4374fee928/analyses/copy_number_consensus_call#consensus-cnv-creation):
> The `copy.num` column is the weighted median of CNVkit segment values where they exist, or Control-FREEC values in the absence of CNVkit data. Because some software (notably GISTIC) requires all samples to have the same regions called, the copy number variants from `cnv_consensus.tsv` are supplementented with "neutral" segments where no call was made. These include all non-variant regions present in `ref/cnv_callable.bed` The neutral regions are assigned copy.num 2, except on chrX and chrY, where the copy number is left NA.
For segments with missing `copy.num`, what chromosomes are they on?
```{r}
consensus_seg_df %>%
filter(is.na(copy.num)) %>%
group_by(chrom) %>%
tally()
```
The majority are on the sex chromosomes, which is expected.
We can remove these, as we can not add status using the ploidy information without the `copy.num` information.
We need to add in the `tumor_ploidy` column from the histologies file.
```{r}
add_ploidy_df <- consensus_seg_df %>%
filter(!is.na(copy.num)) %>%
inner_join(select(histologies_df,
Kids_First_Biospecimen_ID,
tumor_ploidy,
germline_sex_estimate),
by = c("ID" = "Kids_First_Biospecimen_ID")) %>%
select(-tumor_ploidy, -germline_sex_estimate, everything())
```
### Handle autosomes first
```{r}
add_autosomes_df <- add_ploidy_df %>%
# x and y chromosomes must be handled differently
filter(!(chrom %in% c("chrX", "chrY"))) %>%
mutate(status = case_when(
# when the copy number is less than inferred ploidy, mark this as a loss
copy.num < tumor_ploidy ~ "loss",
# if copy number is higher than ploidy, mark as a gain
copy.num > tumor_ploidy ~ "gain",
copy.num == tumor_ploidy ~ "neutral"
))
```
### Handle sex chromosomes
```{r}
# this logic is consistent with what is observed in the controlfreec file
# specifically, in samples where germline sex estimate = Female, X chromosomes
# appear to be treated the same way as autosomes
add_xy_df <- add_ploidy_df %>%
filter(chrom %in% c("chrX", "chrY")) %>%
mutate(status = case_when(
germline_sex_estimate == "Male" & copy.num < (0.5 * tumor_ploidy) ~ "loss",
germline_sex_estimate == "Male" & copy.num > (0.5 * tumor_ploidy) ~ "gain",
germline_sex_estimate == "Male" & copy.num == (0.5 * tumor_ploidy) ~ "neutral",
# there are some instances where there are chromosome Y segments are in
# samples where the germline_sex_estimate is Female
chrom == "chrY" & germline_sex_estimate == "Female" & copy.num > 0 ~ "unknown",
chrom == "chrY" & germline_sex_estimate == "Female" & copy.num == 0 ~ "neutral",
# mirroring controlfreec, X treated same as autosomes
chrom == "chrX" & germline_sex_estimate == "Female" & copy.num < tumor_ploidy ~ "loss",
chrom == "chrX" & germline_sex_estimate == "Female" & copy.num > tumor_ploidy ~ "gain",
chrom == "chrX" & germline_sex_estimate == "Female" & copy.num == tumor_ploidy ~ "neutral"
))
add_status_df <- bind_rows(add_autosomes_df, add_xy_df) %>%
select(-germline_sex_estimate) %>%
rename(Kids_First_Biospecimen_ID = ID)
```
### Does `seg.mean` agree with `status`?
```{r}
add_status_df %>%
filter(!is.na(seg.mean)) %>%
ggplot(aes(x = status, y = seg.mean, group = status)) +
geom_jitter()
```
```{r}
add_status_df %>%
filter(!is.na(seg.mean)) %>%
group_by(status) %>%
summarize(mean = mean(seg.mean), sd = sd(seg.mean))
```
### Write to file
```{r}
output_file <- file.path("..", "..", "scratch", "consensus_seg_with_status.tsv")
write_tsv(add_status_df, output_file)
```