Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Update/Revamp focal-cn-file-preparation module #452

Merged
merged 17 commits into from
Jan 22, 2020

Conversation

cbethell
Copy link
Contributor

Purpose/implementation Section

Per #186 (comment), update and revamp the focal-cn-file-preparation module.

What GitHub issue does your pull request address?

This PR addresses issue #186.

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

  • Are there any steps that can be refactored?
  • The cytoband data was removed from the final results' tables in order to minimize memory usage, is this a decision that you would agree with?
  • Is there any additional information that should be added to the final tables in this PR?

Is there anything that you want to discuss further?

Note: I’m summarizing the GISTIC broad_values_by_arm.txt calls into loss, gain, and neutral, and amplification in the broad_status column, the same coding used for the values in the status column.

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes, this is ready for review.

Results

The results in this PR include the updated tsv files found in this module's results directory as follows:

  • analyses/focal-cn-file-preparation/results/cnvkit_annotated_cn_autosomes.tsv.gz
  • analyses/focal-cn-file-preparation/results/cnvkit_annotated_cn_x_and_y.tsv.gz
  • analyses/focal-cn-file-preparation/results/controlfreec_annotated_cn_autosomes.tsv.gz
  • analyses/focal-cn-file-preparation/results/controlfreec_annotated_cn_x_and_y.tsv.gz

Reproducibility Checklist

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • This analysis has been added to continuous integration.

Documentation Checklist

  • This analysis module has a README and it is up to date.
  • This analysis is recorded in the table in analyses/README.md and the entry is up to date.
  • The analytical code is documented and contains comments.

cansavvy and others added 3 commits January 15, 2020 14:50
- compare cytoband calls
- update `README`
- update comment in shell script and add `exon_file` option
- Incorporate GISTIC calls in final cn tables
- rerun `01-prepare-cn-file.R`
Copy link
Collaborator

@cansavvy cansavvy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool beans! We are getting there!! 🚀

analyses/focal-cn-file-preparation/01-prepare-cn-file.R Outdated Show resolved Hide resolved
analyses/focal-cn-file-preparation/01-prepare-cn-file.R Outdated Show resolved Hide resolved
copy_number = cnv_gr$copy_number[overlaps_ucsc@from],
ploidy = cnv_gr$tumor_ploidy[overlaps_ucsc@from],
ensembl = cnv_gr$org_id[overlaps_ucsc@from],
cytoband = ucsc_cytoband_gr@elementMetadata@listData$cytoband[overlaps_ucsc@to],
stringsAsFactors = FALSE
) %>%
dplyr::distinct() %>%
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this distinct doing?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Were you able to remove this distinct or determine what it was doing, @cbethell ?

analyses/focal-cn-file-preparation/01-prepare-cn-file.R Outdated Show resolved Hide resolved
analyses/focal-cn-file-preparation/01-prepare-cn-file.R Outdated Show resolved Hide resolved
analyses/focal-cn-file-preparation/01-prepare-cn-file.R Outdated Show resolved Hide resolved
analyses/focal-cn-file-preparation/01-prepare-cn-file.R Outdated Show resolved Hide resolved
analyses/focal-cn-file-preparation/README.md Outdated Show resolved Hide resolved
analyses/focal-cn-file-preparation/run-prepare-cn.sh Outdated Show resolved Hide resolved
@jaclyn-taroni
Copy link
Member

jaclyn-taroni commented Jan 18, 2020

@cbethell @cansavvy - as you are working on this, keep in mind that there may be a period of time where we have a consensus SEG file that we want to annotate but no GISTIC file yet. (See #186 (comment) for more context.) Are there ways to make the GISTIC steps optional here? That change can be implemented in a later pull request, but it's a good to keep an eye on any design decisions here that may make that more difficult to implement later.

cbethell and others added 5 commits January 21, 2020 11:17
- rename instances of `exon` to `cds`
- update comments
- add large files (`gistic-results` and `results/cnvkit_annotated_cn_autosomes.tsv.gz` which is 101.2 MB) to `.gitignore`
- add inclusion of gistic results as an option
- update shell script to include gistic option
- remove `cytoband_org` variable
- update `README.md` 
- refactor section of script reading in and wrangling GISTIC data
- minor update to comment
Copy link
Collaborator

@cansavvy cansavvy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cbethell We are getting there! You did a good job trimming out some of the duplicate items! There's a few documentation adds I think we should make and I wanna take a look at your file that keeps changing sizes because I wanna get to the bottom of that before we call it a day.

analyses/focal-cn-file-preparation/01-prepare-cn-file.R Outdated Show resolved Hide resolved
analyses/focal-cn-file-preparation/01-prepare-cn-file.R Outdated Show resolved Hide resolved
analyses/focal-cn-file-preparation/01-prepare-cn-file.R Outdated Show resolved Hide resolved
analyses/focal-cn-file-preparation/01-prepare-cn-file.R Outdated Show resolved Hide resolved
#
# Usage: bash run-prepare-cn.sh

set -e
set -o pipefail

XYFLAG=${OPENPBTA_XY:-1}
cds_file=../../scratch/gencode.v27.primary_assembly.annotation.bed
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been meaning to do this where I have this in snv-callers but we should probably include an indicator in this file name that tells people it's been filtered to CDS sequences only. Because it is a scratch file and isn't meant to stick around, I haven't made it a priority to change this. My bad.

analyses/focal-cn-file-preparation/.gitignore Outdated Show resolved Hide resolved
cbethell and others added 9 commits January 22, 2020 12:14
- update comments 
- rename gene id column 
- update comment in shell script 
- update column documentation in README
- run styler
- remove code in `.gitignore` file re large `cnvkit_annotated_autosomes.tsv.gz` file that no longer exists
- remove `gzip` step in `run-prepare-cn.sh` as this is handled in the R script
- update `run-oncoprint.sh`, `00-subset-files-for-ATRT.R`, `02-HGG-molecular-subtyping-subset-files.R`
@cbethell
Copy link
Contributor Author

cbethell commented Jan 22, 2020

@cansavvy I made the above documentation changes and changed the format of the output compressed files to bz2.

Let me know if you find something that I may have missed or that may need some refactoring.

@cansavvy
Copy link
Collaborator

@cbethell has the file stopped changing sizes now? Were you able to get to the bottom of that?

@cbethell
Copy link
Contributor Author

cbethell commented Jan 22, 2020

@cbethell has the file stopped changing sizes now? Were you able to get to the bottom of that?

@cansavvy the file was always the same size, it seems that one of the columns just compressed differently. I realized that the smaller file that I thought was a new file at the end of the day yesterday was really just the old compressed file so it all makes sense now.

Copy link
Collaborator

@cansavvy cansavvy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool beans. This LGTM then! Assuming CircleCI passes.

jaclyn-taroni added a commit to jaclyn-taroni/OpenPBTA-analysis that referenced this pull request Jan 28, 2020
jaclyn-taroni added a commit that referenced this pull request Jan 28, 2020
…SEG case (#479)

* Back to exons; before GISTIC

Pre: #452

* Have expression script use gzipped; skip for now

* Renumber scripts and notebooks

* Add notebook for preparing consensus SEG file

Add ploidy!

* Make the cnvkit option name more general

* Rip out CDS steps

* Rip out GISTIC option

* Add back in the GTF option

* Remove the CDS + cytoband files

* Rename ID column in consensus prep

* Add consensus annotation to shell script; rerun

* Remove numbering from RNA expression step

* Update module README

* Update modules at a glance entry

* Add section summarizing seg.mean by status

* Fix downstream modules

* Update documentation to go back to gzipped file

* Get a bit smarter about paths and filenames

Also rerun + use $XYFLAG for the consensus step

* Use compressed consensus SEG file

* Rerun with new gzipped consensus SEG

* Skip consensus file annotation in CI

* Apply suggestions from code review

Co-Authored-By: Candace Savonen <cansav09@gmail.com>

* Update documentation

* Remove outdated comment

* Flesh out example usage

Co-authored-by: Candace Savonen <cansav09@gmail.com>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants