This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Proposed Analysis: Copy number consensus calls #128

Closed
jharenza opened this issue Sep 25, 2019 · 24 comments
Labels
cnv (Related to or requires CNV data), in progress (Someone is working on this issue, but feel free to propose an alternative approach!), proposed analysis

Comments

@jharenza
Collaborator

Scientific goals

What are the scientific goals of the analysis?
Create consensus calls from ControlFreeC and CNVKit

Proposed methods

What methods do you plan to use to accomplish the scientific goals?
Breakpoints will not perfectly overlap between algorithms, so the analyst will likely have to define a window for overlap of copy number alterations to deem consensus calls.

Required input data

What input data will you use for this analysis?
SEG files

Proposed timeline

What is the timeline for the analysis?
2 weeks

Relevant literature

If there is relevant scientific literature, put links to those items here.

@xhb1991

xhb1991 commented Sep 25, 2019

We will use additional methods to get CNV calls with a balanced sensitivity and specificity for the cohort.

@jaclyn-taroni
Member

Are you planning on tackling this @jharenza and @xiehongbo? If so, I will mark as in progress.

@fingerfen
Contributor

We have a pipeline, count me in as well!

@xhb1991

xhb1991 commented Sep 25, 2019

Yeah, we are tackling this.

@jaclyn-taroni added the "in progress" label on Sep 25, 2019
@jharenza
Collaborator Author

> We have a pipeline, count me in as well!

@fingerfen - great! What is the pipeline and what inputs do you need? We may have to set you up on CAVATICA to run this.

@xhb1991

xhb1991 commented Sep 25, 2019

Let's talk about it today during our meeting.

@jaclyn-taroni
Member

Hi @jharenza @xiehongbo @fingerfen,

Do you have an idea of when we should expect the first pull request for this issue? I am also wondering if we know what the format of the output of this analysis will be now that the two callers have different file formats. This information will help us in development for issues like #6 and #186.

@jharenza
Collaborator Author

jharenza commented Nov 1, 2019

Hi @fingerfen and @xiehongbo - we were able to finish the data releases to include the new CNVkit and ControlFreeC files, so now you are able to submit a pull request with your analysis. Is it possible to do this next week? @fingerfen can you also list here the columns you will have in your final file for @jaclyn-taroni ? Thanks!

@hongboxie

We assume that sample QC has been done based on the sample noise level.
Here is what we did to summarize consensus CNVs from three predictors of somatic CNs:

  1. We define two CNVs as the same event if they overlap each other by >50% reciprocally. We typically use a much higher threshold for germline CNVs (60-80%).
  2. We took any CNVs that are identified by >=2 approaches (CNVkit, FreeC, etc.). Currently we only summarize deletions (CN=0,1) or amplifications (CN>2).
  3. We listed each consensus CNV with its chromosome, start position, end position, and the original CNV identified by each CNV detection method.
  4. Currently we take the average of the breakpoints from the different methods as the breakpoints of each consensus CNV.
  5. We flag any CNV whose content mostly (>50%) overlaps centromere, telomere, IGLL regions, or segmental duplications. (* Note: we do not actually remove these CNVs; we mark the fraction of each CNV that overlaps those blacklist regions.)
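The steps listed above could be sketched roughly as follows. This is only an illustration, not the actual pipeline code: the function names, the tuple layout `(chrom, start, end)`, half-open coordinates, and the assumption that all input calls are of the same type (all deletions or all amplifications) are hypothetical choices made for the example.

```python
from statistics import mean


def reciprocal_overlap(a, b):
    """Smaller of the two overlap fractions for (chrom, start, end) intervals."""
    if a[0] != b[0]:
        return 0.0
    shared = min(a[2], b[2]) - max(a[1], b[1])
    if shared <= 0:
        return 0.0
    return min(shared / (a[2] - a[1]), shared / (b[2] - b[1]))


def consensus_call(calls, threshold=0.5, min_callers=2):
    """calls maps a caller name to its (chrom, start, end) call.

    A caller supports the event if its call reciprocally overlaps some
    other caller's call by more than `threshold`. Returns None unless at
    least `min_callers` callers support the event.
    """
    supporting = sorted(
        name
        for name, interval in calls.items()
        if any(
            reciprocal_overlap(interval, other) > threshold
            for other_name, other in calls.items()
            if other_name != name
        )
    )
    if len(supporting) < min_callers:
        return None
    # Consensus breakpoints are the averages of the supporting breakpoints.
    chrom = calls[supporting[0]][0]
    start = round(mean(calls[name][1] for name in supporting))
    end = round(mean(calls[name][2] for name in supporting))
    return chrom, start, end, supporting


def blacklist_fraction(call, blacklist):
    """Fraction of a call covered by blacklist intervals.

    Assumes the blacklist intervals do not overlap one another.
    """
    chrom, start, end = call
    covered = sum(
        max(0, min(end, b_end) - max(start, b_start))
        for b_chrom, b_start, b_end in blacklist
        if b_chrom == chrom
    )
    return covered / (end - start)
```

For example, with CNVkit calling `("chr7", 1000, 2000)` and FreeC calling `("chr7", 1100, 2100)`, the reciprocal overlap is 0.9, so `consensus_call` would report `chr7:1050-2050` supported by both callers. The fraction from `blacklist_fraction` is reported alongside the call rather than used to drop it, matching the note in step 5.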

@jashapiro
Member

> We assume that sample QC has been done based on the sample noise level.

I do not understand what this is referring to. Looking at the manuscript, I am not clear on what QC steps were performed on CNV calls, if any. Are there standard QC steps that should be added to the CNV results, and/or documented in the manuscript? Perhaps @jharenza or @yuankunzhu can provide some insight here?

@hongboxie

@jharenza I think we had this discussion before. Do you want to remove samples with extra high noise levels, or keep every sample regardless? If we do want to remove noisy samples, where a "noisy sample" means a sample with a high SD of depth of coverage, we can do so. Otherwise, we can report CNVs from ALL samples. It is up to you guys. I am fine either way.
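A minimal sketch of what such a noise filter might look like. The thread only says "high SD of depth of coverage"; dividing by the sample's mean depth first, the `max_sd` threshold of 0.5, and the function names are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def coverage_noise(depths):
    """SD of per-bin depth after scaling by the sample's mean depth.

    Scaling by the mean (an assumption, not stated in the thread) keeps
    samples sequenced to different depths comparable.
    """
    average = mean(depths)
    if average == 0:
        return float("inf")
    return pstdev(depth / average for depth in depths)


def flag_noisy_samples(depths_by_sample, max_sd=0.5):
    """Return the sample IDs whose scaled coverage SD exceeds max_sd."""
    return sorted(
        sample
        for sample, depths in depths_by_sample.items()
        if coverage_noise(depths) > max_sd
    )
```

Flagged samples could then either be dropped or simply annotated, depending on which of the two options above is chosen.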

@hongboxie

@jashapiro when you recover CNVs from given samples, do you remove "noisy" samples? If so, what is your practice for doing so?

@jashapiro
Member

@hongboxie That makes sense. As I do not have the raw data, I can't see the raw coverage metrics, but after you brought up such QC, I went to look in the manuscript for details, and didn't find them mentioned. This is not my area of expertise, so I do not know what the standard practices are; I am just trying to understand the data as we work on some of the downstream analysis.

@hongboxie

@jashapiro no worries! I am learning this topic from everyone as well. I am open to any suggestions.

@jharenza
Collaborator Author

jharenza commented Nov 8, 2019

@hongboxie sorry, just reading this now... We discussed with @fingerfen removing any samples that showed whole-genome gain, which was a sign of inaccurate CN calling. I think we also chose a cutoff of >2500 segments for noise, following what the arrays used, but I also seem to recall that ControlFreeC might not have smoothed its segments as CNVkit did (i.e., collapsed multiple segments into one), so its calls may have more segments than expected and this cutoff may not be appropriate. @yuankunzhu, do you remember whether the ControlFreeC TSV file was smoothed when we redid it?

No samples were removed when we provided the data. This was being done via #128. When I checked the CNVkit SEG file, the sample call quality all looked reasonable in IGV, so this may only apply to ControlFreeC.
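To illustrate why smoothing matters for the segment-count cutoff, here is a rough sketch. The tuple layout `(sample, chrom, start, end, copy_number)`, the function names, and the simple "adjacent segments with equal copy number" rule are assumptions for the example; this is not a description of how CNVkit actually smooths its segments.

```python
from collections import Counter


def samples_over_segment_cutoff(segments, max_segments=2500):
    """Return the sample IDs that have more than max_segments segments."""
    counts = Counter(sample for sample, *_ in segments)
    return sorted(s for s, n in counts.items() if n > max_segments)


def collapse_adjacent(segments):
    """Merge abutting segments that share sample, chromosome and copy number.

    Coordinates are assumed to be half-open, so two segments abut when
    the end of the first equals the start of the second.
    """
    merged = []
    for seg in sorted(segments):
        sample, chrom, start, end, cn = seg
        if merged:
            p_sample, p_chrom, p_start, p_end, p_cn = merged[-1]
            same_run = (p_sample, p_chrom, p_cn) == (sample, chrom, cn)
            if same_run and p_end == start:
                merged[-1] = (sample, chrom, p_start, end, cn)
                continue
        merged.append(seg)
    return merged
```

Applying the cutoff to unsmoothed segments can flag a sample that would pass once its runs of identical copy number are collapsed, which is the concern raised above about ControlFreeC.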

@hongboxie

We are perplexed by the output of Manta. There are multiple CNVs overlapping the same region. Currently we have decided to merge all of these CNVs into one single consensus event. There is something strange about Manta's somatic CNV output.

@hongboxie

We haven't had a chance to dig into the root of the problem.

@hongboxie

About Manta:

  1. We found many CNVs overlapping the same region in Manta's output.
  2. It seems those come from PE-based CNV detection (I think).
  3. Currently our consensus is to merge those CNVs into one CNV segment, hoping that the other methods (CNVkit/FreeC) will help to eliminate any false positives.
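Merging overlapping calls into single segments can be sketched as a standard sorted-interval merge. The function name and the `(chrom, start, end)` layout are hypothetical, and the sketch assumes half-open coordinates and that the input has already been restricted to one event type (for example, only deletions).

```python
def merge_overlapping(calls):
    """Merge overlapping (chrom, start, end) calls on the same chromosome."""
    merged = []
    for chrom, start, end in sorted(calls):
        if merged and merged[-1][0] == chrom and start < merged[-1][2]:
            # Overlaps the previous segment: extend it instead of adding a new one.
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged
```

The merged segments would then go through the same >=2-caller consensus check as any other call, so a merged Manta segment with no CNVkit or FreeC support is still dropped.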

@hongboxie

About consensus CNV:

  1. I think ploidy detection is subjective with current approaches. For instance, it is very hard to get a consensus when one predictor predicts 30 copies of a segment and another predictor reports a different ploidy.
  2. We will present two types of CNVs: amplification (ploidy>2) or deletion (ploidy<2). We created 3 additional columns that display the original output of each method overlapping this region as evidence. The original output of each predictor carries the ploidy information, if it has it.
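As a sketch of the output layout described above, one consensus record might be built like this. The column names, the `"NA"` placeholder, and the evidence string format are hypothetical; only the idea of one type column plus one evidence column per caller comes from the comment.

```python
def cnv_type(copy_number, neutral=2):
    """Classify a copy number as an amplification or deletion relative to neutral."""
    if copy_number > neutral:
        return "amplification"
    if copy_number < neutral:
        return "deletion"
    return None


def consensus_row(chrom, start, end, kind, evidence,
                  callers=("cnvkit", "freec", "manta")):
    """Build one output record with an evidence column per caller.

    `evidence` maps a caller name to that caller's original call as a
    string; callers without a call in this region get "NA".
    """
    row = {"chrom": chrom, "start": start, "end": end, "type": kind}
    for caller in callers:
        row[f"{caller}_call"] = evidence.get(caller, "NA")
    return row
```

Keeping each caller's original call as free text sidesteps the problem in point 1: the consensus only commits to the event type, while any disagreement about the exact copy number stays visible in the evidence columns.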

@fingerfen
Contributor

Pipeline_Visual_Example_on_chr7.pptx

@jharenza Attached is the ppt from Wednesday's presentation. Sorry for the delay.

@fingerfen mentioned this issue on Nov 21, 2019 (2 tasks)
@jaclyn-taroni
Member

Documenting the outcomes of the in person meeting this afternoon (@jaclyn-taroni @jashapiro @hongboxie @fingerfen):

  • The first pull request will consist of the Python script that parses the different callers' files into files for individual biospecimens and sets up the Snakemake file. A step running this script will also be added to .circleci/config.yml.
  • A second step running the Snakemake pipeline will need to be added to .circleci/config.yml, OR a shell script that (1) runs merged_to_individual_files.py and (2) runs the Snakemake pipeline will need to be added as the single entry in .circleci/config.yml for this analysis.
  • Subsequent pull requests will each contain a single Python script and the updates to the Snakemake file that run that script. That allows us to test each script in CI. It's also fine to include some of the bedtools steps alongside the Python script additions in these pull requests, so long as the pull requests do not become too large (400+ lines of code).
  • To make the final file that will be included in the data download, we probably want a single file that contains both DEL and DUP events for all biospecimens.

@jharenza mentioned this issue on Dec 23, 2019 (7 tasks)
@jharenza
Collaborator Author

Closed via #357, #349, #328, #313, #403, #288, #416
