This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Planned Analysis: Integrated CNV and SV analyses and chromothripsis #27

Closed
jharenza opened this issue Jul 14, 2019 · 34 comments
Labels
cnv: Related to or requires CNV data
in progress: Someone is working on this issue, but feel free to propose an alternative approach!
sv: Related to or requires SV data

Comments

@jharenza
Collaborator

jharenza commented Jul 14, 2019

We have generated CNV output from Control-FREEC and CNVkit, but are seeking individuals to determine consensus focal calls and/or identify additional algorithms we can run to instill high confidence in focal CNV calls from the WGS dataset.

@cgreene
Collaborator

cgreene commented Jul 14, 2019

After AlexsLemonade/OpenPBTA-manuscript#15 is approved and merged, can you write up the CNV methods and file a PR into that subsection so that we can link folks to the current version of the processing code?

It may change in the future, but then we'll have an accurate manuscript-ready description of what was done.

@jharenza
Collaborator Author

This machine learning publication may help us with CN true positives:

@jharenza
Collaborator Author

> After AlexsLemonade/OpenPBTA-manuscript#15 is approved and merged, can you write up the CNV methods and file a PR into that subsection so that we can link folks to the current version of the processing code?
>
> It may change in the future, but then we'll have an accurate manuscript-ready description of what was done.

Yes - will work on getting this filled in by the harmonization team.

jharenza added the "good first issue" (Good for newcomers) label Jul 26, 2019
@gonzolgarcia

gonzolgarcia commented Jul 30, 2019

Integrated CNV and SV analyses and chromothripsis.

The proposed analyses broadly address the prevalence and functional impact of structural variation across brain tumors. Copy number variations are essentially a subset of structural variants, so CNV and SV calls overlap heavily, complement each other, and should be studied together. I am effectively proposing to merge issues #27 and #28.

To integrate CNV calls and SV calls, we focus on breakpoint co-localization; more details are in the manuscript: https://www.biorxiv.org/content/10.1101/572248v3

Chromothripsis is a catastrophic one-time event involving multiple breakpoints and rearrangements of localized regions of the genome, as opposed to chromoplexy, which involves gradually acquired structural variations. Chromothripsis can be identified by a pattern of oscillating copy number states and concomitant structural variants that allow walking through the newly formed chromosome. In practical terms, it can be identified as regions with an abnormally high number of CNVs and SVs.
Several methods are available, all of which have limitations: ShatterSeek (https://github.com/parklab/ShatterSeek), Shatterproof (https://metacpan.org/release/SGOVIND/Shatterproof-0.13), and an unnamed method (https://www.biorxiv.org/content/10.1101/572248v3) that focuses on regions whose SV density is 2 standard deviations above the average of each sample.
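
The density rule of the last method can be sketched roughly as below. This is a minimal illustration and not the published implementation: the function name `dense_sv_windows`, the fixed 1 Mb window, and the `chrom_sizes` argument are all assumptions made for the example.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def dense_sv_windows(breakpoints, chrom_sizes, window=1_000_000, n_sd=2.0):
    """Flag windows whose SV breakpoint count exceeds mean + n_sd * SD.

    breakpoints: iterable of (chromosome, position) for ONE sample.
    chrom_sizes: dict of chromosome -> length in bp (hypothetical input).
    """
    counts = Counter((chrom, pos // window) for chrom, pos in breakpoints)
    # Enumerate every window, including empty ones, so the mean and SD
    # describe the whole sample rather than only windows with breakpoints.
    all_windows = [
        (chrom, i)
        for chrom, size in chrom_sizes.items()
        for i in range(math.ceil(size / window))
    ]
    values = [counts[w] for w in all_windows]
    if not values:
        return {}
    cutoff = mean(values) + n_sd * pstdev(values)
    return {w: counts[w] for w in all_windows if counts[w] > cutoff}
```

Breakpoints on chromosomes missing from `chrom_sizes` are ignored in this sketch, and the window size would need tuning against real data.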

The input formats for developing downstream analyses are:

CNV segmentation data:
- SampleId
- chromosome
- Start
- End
- num_probes (deprecated, from SNP array format)
- Segment_Mean (log T/N)

Allele-specific CNV (optional; defines regions of LOH and allelic imbalance):
- SampleId
- chromosome
- Start
- End
- BAF_mean
- Call (LOH or AI)

SV calls file content (already filtered by somatic score; no annotation needed):
- SampleId
- Chromosome-origin
- Start-origin
- End-origin
- Chromosome-destination
- Start-destination
- End-destination
- sv_type: DEL, DUP, TRA, and INV (often divided into head-to-head and tail-to-tail)
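
To make the breakpoint co-localization idea concrete with these two inputs, here is a minimal sketch. It is not the method from the manuscript: the function name, the tuple layouts (the columns listed above, with `num_probes` dropped), the use of each SV side's start coordinate as its breakpoint, and the 10 kb tolerance are assumptions for illustration only.

```python
from bisect import bisect_left
from collections import defaultdict


def colocalized_sv_breakpoints(cnv_segments, sv_calls, tolerance=10_000):
    """Return SV breakpoints lying within `tolerance` bp of a CNV boundary.

    cnv_segments: (sample, chrom, start, end, segment_mean) tuples.
    sv_calls: (sample, chrom_o, start_o, end_o, chrom_d, start_d, end_d,
               sv_type) tuples.
    """
    # Collect sorted CNV segment boundaries per (sample, chromosome).
    boundaries = defaultdict(list)
    for sample, chrom, start, end, _seg_mean in cnv_segments:
        boundaries[(sample, chrom)].extend((start, end))
    for positions in boundaries.values():
        positions.sort()

    hits = []
    for sample, chrom_o, start_o, _eo, chrom_d, start_d, _ed, _type in sv_calls:
        for chrom, pos in ((chrom_o, start_o), (chrom_d, start_d)):
            positions = boundaries.get((sample, chrom), [])
            i = bisect_left(positions, pos)
            # The nearest boundary is the one just before or just after pos.
            neighbours = positions[max(i - 1, 0): i + 1]
            if any(abs(pos - b) <= tolerance for b in neighbours):
                hits.append((sample, chrom, pos))
    return hits
```

The binary search keeps the lookup cheap when a sample has many segments; the tolerance would need to reflect the resolution of the CNV caller.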

Some proposed readouts and output analyses

Structural variation.

  1. A measure of chromosomal instability (CIN) burden (density of breakpoints per Mb; similar to tumor mutational burden, TMB) and a plot by tumor type representing CIN burden (this could be compared to TMB).

  2. Recurrently altered genes (perhaps integrated in an OncoPrint with SNVs?)
    For the OncoPrint categories:
    - Amplification/tandem duplication
    - Deep deletion/deletion
    - Other structural variation: inversions, translocations

  3. Focus on novel findings: if a new recurrently altered gene arises, we will analyze it in depth.
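
For readout 1, a breakpoint density per Mb could be computed along these lines. This is only a sketch under stated assumptions: the function name `cin_burden`, the SV tuple layout (the SV columns listed earlier), the choice to count each SV's two ends as unique breakpoints, and the `genome_mb` denominator supplied by the caller are all hypothetical.

```python
from collections import defaultdict


def cin_burden(sv_calls, genome_mb):
    """Breakpoints per Mb for each sample.

    sv_calls: (sample, chrom_o, start_o, end_o, chrom_d, start_d, end_d,
               sv_type) tuples; each SV contributes its two ends.
    genome_mb: size of the analyzed genome in Mb (assumed to be known).
    """
    breakpoints = defaultdict(set)
    for sample, chrom_o, start_o, _eo, chrom_d, start_d, _ed, _type in sv_calls:
        # A set avoids double counting a position shared by several SVs.
        breakpoints[sample].add((chrom_o, start_o))
        breakpoints[sample].add((chrom_d, start_d))
    return {sample: len(bps) / genome_mb for sample, bps in breakpoints.items()}
```

Samples with no SV calls do not appear in the result and would need to be added with a burden of zero before plotting by tumor type.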

Chromothripsis:
  4. A barplot of chromothripsis prevalence by tumor subtype
  5. A few Circos plots with examples of chromothripsis
  6. Association of chromothripsis with other somatic alterations (e.g., TP53 status)
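
One way the association in the last item could be tested is a Fisher's exact test on a 2x2 table of chromothripsis status against TP53 status. The proposal does not name a test, so this choice is an assumption, as is the function name; the sketch uses only the standard library.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    For example: a = chromothripsis and TP53 altered, b = chromothripsis
    and TP53 wild type, c and d = the same split without chromothripsis.
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables at least as unlikely as the observed one; the small
    # relative tolerance guards against floating-point ties.
    total = sum(
        p for p in (prob(x) for x in range(lo, hi + 1))
        if p <= observed * (1 + 1e-7)
    )
    return min(1.0, total)
```

With several alterations tested, the resulting p-values would also need a multiple-testing correction.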

Survival analyses (probably addressed in issue #18):
  7. Multivariate analyses including clinical variables, overall TMB, chromosomal instability burden, and chromothripsis.

jharenza changed the title from "Planned Analysis: WGS Copy Number Analysis" to "Planned Analysis: Integrated CNV and SV analyses and chromothripsis" Jul 30, 2019
@jharenza
Collaborator Author

Merged #27 and #28 here per @gonzolgarcia's request.

jharenza added the "in progress" label Jul 30, 2019
jaclyn-taroni removed the "good first issue" label Aug 15, 2019
@gonzolgarcia

Issue with LUMPY data

While trying to filter somatic SVs from the table, I realized that the evidence columns "Tumor" and "Normal" are switched.

In addition, there is no somatic score, and I haven't found many guidelines for somatic filtering of tumor/normal LUMPY results. I will be considering this: arq5x/lumpy-sv#268

@jharenza
Collaborator Author

Thanks, @gonzolgarcia! You are right, the T/N columns are swapped - we will fix this in the V5 release coming next week.

@guru-yang

The Yang Lab will perform analysis on chromothripsis.

@gonzolgarcia

> The Yang Lab will perform analysis on chromothripsis.

Note that there are two callers each for CNV (CNVkit and Control-FREEC) and SV (Manta and LUMPY).
This dataset still requires some further processing and filtering.

@jaclyn-taroni
Member

> Note that there are two callers each for CNV (CNVkit and Control-FREEC) and SV (Manta and LUMPY).
> This dataset still requires some further processing and filtering.

@gonzolgarcia are you planning to generate SV consensus calls?

@gonzolgarcia

> > Note that there are two callers each for CNV (CNVkit and Control-FREEC) and SV (Manta and LUMPY).
> > This dataset still requires some further processing and filtering.
>
> @gonzolgarcia are you planning to generate SV consensus calls?

Before getting a consensus, LUMPY requires somatic filtering. It would be nice to have this added to the next release.

@jharenza
Collaborator Author

jharenza commented Oct 7, 2019

> > > Note that there are two callers each for CNV (CNVkit and Control-FREEC) and SV (Manta and LUMPY).
> > > This dataset still requires some further processing and filtering.
> >
> > @gonzolgarcia are you planning to generate SV consensus calls?
>
> Before getting a consensus, LUMPY requires somatic filtering. It would be nice to have this added to the next release.

@guru-yang - do you have any experience with somatic filtering of LUMPY SVs? The comment referred to here suggests the following:

  1. Run SVTyper - docker
  2. Filter for somatic calls:
    a) keep non-reference SVs in the tumor;
    b) keep SVs which have no alternate depth (AO==0) in normal;
    c) keep SVs with sufficient depth in the normal (RO>~7)
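
The three filtering rules in step 2 could be expressed as a small predicate like the one below. This is a sketch under assumptions, not a tested pipeline: the function name and the per-sample dictionaries are hypothetical stand-ins for the GT, AO, and RO fields of a genotyped record, treating a missing tumor genotype ("./.") as not kept is my choice, and the threshold of 7 is only the approximate value quoted above.

```python
def is_somatic_sv(tumor, normal, min_normal_ro=7):
    """Apply the three somatic filters to one genotyped SV record.

    tumor, normal: dicts with 'GT' (genotype string), 'AO' (alternate
    observations) and 'RO' (reference observations) for each sample.
    """
    # a) keep non-reference SVs in the tumor
    tumor_non_ref = tumor["GT"] not in ("0/0", "./.")
    # b) keep SVs with no alternate depth in the normal
    normal_no_alt = normal["AO"] == 0
    # c) keep SVs with sufficient reference depth in the normal
    normal_covered = normal["RO"] > min_normal_ro
    return tumor_non_ref and normal_no_alt and normal_covered
```

For example, `is_somatic_sv({"GT": "0/1", "AO": 5, "RO": 20}, {"GT": "0/0", "AO": 0, "RO": 30})` passes all three rules, while any alternate read in the normal rejects the call.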

@guru-yang

> > > > Note that there are two callers each for CNV (CNVkit and Control-FREEC) and SV (Manta and LUMPY).
> > > > This dataset still requires some further processing and filtering.
> > >
> > > @gonzolgarcia are you planning to generate SV consensus calls?
> >
> > Before getting a consensus, LUMPY requires somatic filtering. It would be nice to have this added to the next release.
>
> @guru-yang - do you have any experience with somatic filtering of LUMPY SVs? The comment referred to here suggests the following:
>
> 1. Run SVTyper - docker
>
> 2. Filter for somatic calls:
>    a) keep non-reference SVs in the tumor;
>    b) keep SVs which have no alternate depth (AO==0) in normal;
>    c) keep SVs with sufficient depth in the normal (RO>~7)

We haven't used LUMPY at all. The filtering steps sound reasonable. Based on my experience, Manta alone might be good enough for SV calling.

@gonzolgarcia

> > > > > Note that there are two callers each for CNV (CNVkit and Control-FREEC) and SV (Manta and LUMPY).
> > > > > This dataset still requires some further processing and filtering.
> > > >
> > > > @gonzolgarcia are you planning to generate SV consensus calls?
> > >
> > > Before getting a consensus, LUMPY requires somatic filtering. It would be nice to have this added to the next release.
> >
> > @guru-yang - do you have any experience with somatic filtering of LUMPY SVs? The comment referred to here suggests the following:
> >
> > 1. Run SVTyper - docker
> >
> > 2. Filter for somatic calls:
> >    a) keep non-reference SVs in the tumor;
> >    b) keep SVs which have no alternate depth (AO==0) in normal;
> >    c) keep SVs with sufficient depth in the normal (RO>~7)
>
> We haven't used LUMPY at all. The filtering steps sound reasonable. Based on my experience, Manta alone might be good enough for SV calling.

You're probably right. Manta alone plus CNVkit should be enough for ShatterSeek?

@guru-yang

> You're probably right. Manta alone plus CNVkit should be enough for ShatterSeek?

Should be enough.

@jharenza
Collaborator Author

jharenza commented Oct 8, 2019

Great! @guru-yang and @gonzolgarcia - you can plan to use Manta + CNVkit for ShatterSeek, and then we can work on a filtered LUMPY data file for release in the next few weeks for general recurrent SV analysis.

@jharenza
Collaborator Author

@guru-yang and @gonzolgarcia as an update, we are going to remove LUMPY from the release. SVTyper processing takes very long per sample (>10 hours), and filtering will require some benchmarking, which we have de-prioritized in favor of benchmarking copy number. You have both said Manta is fine, so we will drop LUMPY. We will have a data release with new CN results coming next week (#146), so please let us know if you need help with creating PRs!

@guru-yang

@jharenza Thanks for the update. I am wondering how to get sample metadata. We are able to get gender, age at diagnosis, and tumor type from the Kids First data portal. To perform survival analysis, age at last follow-up would be needed. Do you know how to get that information? Is there any other information available for the patients or their parents, such as the parents' smoking or alcohol consumption?

@cgreene
Collaborator

cgreene commented Oct 25, 2019

@guru-yang : have you examined the metadata available in the files associated with this project? Once you do, could you file a new issue noting anything that's missing that you'd need for your analysis? Thanks!

@guru-yang

@cgreene I am able to find overall survival in PedcBioPortal. Thanks.

@jaclyn-taroni
Member

Hi @guru-yang - overall survival, gender, age at diagnosis, and tumor type are all available in the pbta-histologies.tsv file, which is part of the data files obtained by running the download-data.sh script.

We need people to use that file when putting together their analyses because that ensures that different contributors who are working independently use the same information across their analyses (e.g., the same overall survival values). If there are additional fields you would like to see in the pbta-histologies.tsv file, please file a new issue requesting that information. Thank you!

@jharenza
Collaborator Author

jharenza commented Oct 25, 2019

> @jharenza Thanks for the update. I am wondering how to get sample metadata. We are able to get gender, age at diagnosis, and tumor type from the Kids First data portal. To perform survival analysis, age at last follow-up would be needed. Do you know how to get that information? Is there any other information available for the patients or their parents, such as the parents' smoking or alcohol consumption?

@guru-yang as @jaclyn-taroni mentioned, the survival is in the provided histologies file in the data download. It is better to use this file, as we have further categorized tumors and provided additional data not in the KF portal. We do not have age at last followup in the file currently, but it can be added in the release due next week. Can you please file an issue for that? We have no parental information available, but if there are other things you would like to see from patients, you can also ask in an issue and I can check whether we have the info available.

@guru-yang

@jaclyn-taroni @jharenza I see. Thanks a lot. What about smoking and alcohol usage for the probands? I don't expect smokers in a pediatric cohort. Just curious.

@cgreene
Collaborator

cgreene commented Oct 25, 2019

@guru-yang : please file a new github issue with requests for metadata so that we can keep this issue, currently titled "Planned Analysis: Integrated CNV and SV analyses and chromothripsis" on that topic. Thanks!

jaclyn-taroni added the "cnv" and "sv" labels Oct 26, 2019
@jharenza
Collaborator Author

jharenza commented Nov 1, 2019

Hi @gonzolgarcia and @guru-yang! When do you think you will be able to file a pull request with either of your analyses? Thanks!

@guru-yang

@jharenza We have made some progress. Is there a regular conference call or similar to share results among the group? Or is everything done through GitHub?

@jaclyn-taroni
Member

Hi @guru-yang, great to hear! We encourage you to file pull requests adding the code used to generate results as you have them. The analysis does not need to be complete before getting added to the repository. We have a pull request template with a section for summarizing results to facilitate discussion. You can join the Cancer Data Science Slack #open-pbta channel (more information here) if you have questions about the pull request model that are better answered in real-time.

@cgreene
Collaborator

cgreene commented Nov 3, 2019

I will echo @jaclyn-taroni and @jharenza : please file pull requests adding code as you are writing it. It is much harder to integrate a large amount of code after it is entirely written. Thanks!

@guru-yang

@jharenza @jaclyn-taroni @cgreene Will try to do that soon. I am traveling this week. One quick question: we have seen quite a few patients with more than one tumor sequenced. When working on variants, is there a particular strategy to handle these tumors, such as randomly picking one?

@jashapiro
Member

As of the v7 release, we now provide lists of independent specimens (one tumor per individual) that we would like analyses to use. These are randomly selected, as you suggest, but this allows everyone to use consistent sets. See the bottom of the Data Formats section of the README for descriptions of those files.
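
For intuition only, choosing one tumor per individual in a reproducible way might look like the sketch below. It is not the project's selection code, and analyses should use the provided lists; the function name, the input pairs, and the seed value are all hypothetical.

```python
import random
from collections import defaultdict


def pick_independent_specimens(specimens, seed=2019):
    """Pick one specimen per participant, reproducibly.

    specimens: iterable of (participant_id, specimen_id) pairs.
    Returns a dict of participant_id -> chosen specimen_id.
    """
    by_participant = defaultdict(list)
    for participant, specimen in specimens:
        by_participant[participant].append(specimen)
    # A fixed seed plus sorted inputs makes the choice independent of the
    # input order, so every contributor gets the same set.
    rng = random.Random(seed)
    return {
        participant: rng.choice(sorted(ids))
        for participant, ids in sorted(by_participant.items())
    }
```

Fixing the seed and sorting before sampling is what lets independent contributors arrive at the same specimens.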

@guru-yang

I noticed that in some samples the CNV calls from the two algorithms are quite different. I wonder what the plan is going forward. It seems to me that generating a consensus CNV call is not easy.

@jaclyn-taroni
Member

Hi @guru-yang - have you taken a look at the copy number consensus issue: #128?

@gonzolgarcia

Hello everyone, I wanted to apologize for my lack of contribution to this issue, which I proposed initially. Unfortunately, the requirements of my new position at Mount Sinai have left me with very little bandwidth. For the time being, I cannot guarantee that I will be contributing steadily to this issue. However, I'd be happy to provide support if still needed, as I am working on developing new tools for the integrated analysis of CNVs and structural variations. Best regards to everyone.

@jaclyn-taroni
Member

I filed two more focused issues based on what analyses are in progress vs. those that are not currently accounted for: #393 and #394

Closing this.

6 participants