This repository has been archived by the owner on Jun 21, 2023. It is now read-only.
Update CNV segment to gene mapping: support both formats, use GTF, etc. #253
Merged: jaclyn-taroni merged 14 commits into AlexsLemonade:master from jaclyn-taroni:186-both-formats on Nov 9, 2019
Changes from 7 commits
Commits (14, all by jaclyn-taroni):
c764edc Add chromosome 1:22 filtering step
05b0a90 Add notebook for including status in CNVkit
2feea00 WIP update CN file prep
be0248e Remove outdated file
9cec863 Use GTF file + exons; add cytoband; support both methods
9bfe28e Update module shell script and rerun
a5f8a6b Add TODO notes
4f827ca Remove chromosome filter; fixes to shell script
f452653 Add -f to gzip step
36cbb2b Add steps for saving annotation db
741d55f Fix how results are compressed
eb183ae Add chromosome filtering option
b0f3615 Revert "Add steps for saving annotation db"
02e3e70 Revert "Revert "Add steps for saving annotation db""
60 changes: 60 additions & 0 deletions
analyses/focal-cn-file-preparation/00-add-ploidy-cnvkit.Rmd
@@ -0,0 +1,60 @@
---
title: "Add ploidy column, status to CNVkit output"
output: html_notebook
author: J. Taroni for ALSF CCDL
date: 2019
---

The `pbta-histologies.tsv` file contains a `tumor_ploidy` column, which is tumor ploidy as inferred by ControlFreeC.
The copy number information should be interpreted in the light of this information (see: [current version of Data Formats section of README](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/390f1e08e481da5ec0b2c62d886d5fd298bbf017#data-formats)).

This notebook adds ploidy information to the CNVkit results and adds a status column that defines gain and loss broadly.

```{r}
library(dplyr)
```

### Read in data

```{r}
cnvkit_file <- file.path("..", "..", "data", "pbta-cnv-cnvkit.seg.gz")
cnvkit_df <- readr::read_tsv(cnvkit_file)
```

```{r}
histologies_file <- file.path("..", "..", "data", "pbta-histologies.tsv")
histologies_df <- readr::read_tsv(histologies_file)
```

### Add inferred ploidy information to CNVkit results

```{r}
add_ploidy_df <- histologies_df %>%
  select(Kids_First_Biospecimen_ID, tumor_ploidy) %>%
  inner_join(cnvkit_df, by = c("Kids_First_Biospecimen_ID" = "ID")) %>%
  select(-tumor_ploidy, everything())
```

### Add status column

This is intended to mirror the information contained in the ControlFreeC output.

```{r}
add_ploidy_df <- add_ploidy_df %>%
  mutate(status = case_when(
    # when the copy number is less than inferred ploidy, mark this as a loss
    copy.num < tumor_ploidy ~ "loss",
    # if copy number is higher than ploidy, mark as a gain
    copy.num > tumor_ploidy ~ "gain",
    copy.num == tumor_ploidy ~ "neutral"
  ))

head(add_ploidy_df, 10)
```

### Write to `scratch`

```{r}
output_file <- file.path("..", "..", "scratch", "cnvkit_with_status.tsv")
readr::write_tsv(add_ploidy_df, output_file)
```
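
The join and status logic in `00-add-ploidy-cnvkit.Rmd` can be tried on toy data before running it on the full files. The following is a minimal sketch, not the notebook itself: the column names `ID`, `copy.num`, `Kids_First_Biospecimen_ID`, and `tumor_ploidy` come from the notebook, while the `toy_*` data frames, the sample identifiers, and all values are invented for illustration.

```r
library(dplyr)

# Invented stand-in for the CNVkit seg file (only the columns the logic needs).
toy_cnvkit_df <- data.frame(
  ID = c("BS_A", "BS_A", "BS_B", "BS_C", "BS_E"),
  copy.num = c(1, 2, 4, 2, 2),
  stringsAsFactors = FALSE
)

# Invented stand-in for pbta-histologies.tsv; BS_E has no inferred ploidy.
toy_histologies_df <- data.frame(
  Kids_First_Biospecimen_ID = c("BS_A", "BS_B", "BS_D", "BS_E"),
  tumor_ploidy = c(2, 3, 2, NA),
  stringsAsFactors = FALSE
)

toy_status_df <- toy_histologies_df %>%
  select(Kids_First_Biospecimen_ID, tumor_ploidy) %>%
  # inner_join keeps only samples present in both tables:
  # BS_C (no histology row) and BS_D (no segments) are dropped
  inner_join(toy_cnvkit_df, by = c("Kids_First_Biospecimen_ID" = "ID")) %>%
  mutate(status = case_when(
    copy.num < tumor_ploidy ~ "loss",
    copy.num > tumor_ploidy ~ "gain",
    copy.num == tumor_ploidy ~ "neutral"
  ))

print(toy_status_df)
```

Two behaviors are worth noting in this sketch: segments whose sample is missing from the histologies table disappear at the join, and a segment whose `tumor_ploidy` is `NA` gets an `NA` status because none of the `case_when` conditions evaluates to `TRUE`.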
1,932 changes: 1,932 additions & 0 deletions
analyses/focal-cn-file-preparation/00-add-ploidy-cnvkit.nb.html
Large diffs are not rendered by default.
Binary file renamed
BIN
+81.9 MB
...file-preparation/results/annotated_cn.tsv → ...ults/cnvkit_annotated_cn_autosomes.tsv.gz
Binary file not shown.
Binary file added
BIN
+28 MB
analyses/focal-cn-file-preparation/results/controlfreec_annotated_cn_autosomes.tsv.gz
Binary file not shown.
With `GenomicFeatures::exons()`, this should not be necessary, as no exon is mapped to multiple/alternate chromosomes. I don't know if any of the calls fall on non-canonical chromosomes, but we might not want to exclude them at this point.
My thought was it might be more efficient if we drop anything outside this filter, but I have no evidence whatsoever to suggest that I am right about that. I will take it out.
Okay - this step in CI takes much longer (granted I made other changes), but I'm thinking of implementing a filter for CI only per https://github.com/AlexsLemonade/OpenPBTA-analysis#passing-variables-only-in-ci.
To close the loop - that wasn't the issue and I was looking at the wrong branch 🙃 I will leave in the filtering changes though
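
For readers following this thread, an optional autosome filter of the kind discussed here might look like the sketch below. It is an illustration under stated assumptions, not the code merged in this pull request: `keep_autosomes()` and its `filter_chromosomes` argument are hypothetical names, and the `chrom` column with `chr`-prefixed values is assumed rather than taken from the diff.

```r
library(dplyr)

# Hypothetical helper: optionally restrict copy number calls to chr1-chr22.
# The `chrom` column name and the "chr" prefix are assumptions for this sketch.
keep_autosomes <- function(cn_df, filter_chromosomes = TRUE) {
  if (!filter_chromosomes) {
    # leave sex chromosomes and non-canonical contigs in place
    return(cn_df)
  }
  autosomes <- paste0("chr", 1:22)
  cn_df %>%
    filter(chrom %in% autosomes)
}

# Invented example calls, including a sex chromosome and a non-canonical contig.
toy_cn_df <- data.frame(
  ID = c("BS_A", "BS_A", "BS_B", "BS_B"),
  chrom = c("chr1", "chrX", "chr22", "chrUn_gl000220"),
  stringsAsFactors = FALSE
)

filtered_df <- keep_autosomes(toy_cn_df)
unfiltered_df <- keep_autosomes(toy_cn_df, filter_chromosomes = FALSE)
```

Making the filter a switch rather than a fixed step keeps the reviewer's concern addressable: calls on non-canonical chromosomes are only dropped when the caller asks for that.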