
Proposed Analysis: map from SEG file to genes (and broader segments) #186

Closed
jaclyn-taroni opened this issue Oct 30, 2019 · 22 comments

@jaclyn-taroni
Member

jaclyn-taroni commented Oct 30, 2019

Scientific goals

We currently have copy number information in the form of SEG files. The goal of this analysis is to generate focal copy number files for consumption in the oncoprint (#6) and co-occurrence/mutual exclusivity modules (#13). In essence, we want to know which genes are amplified or deleted in each sample for downstream analysis. This will also involve determining what thresholds are appropriate for making the focal copy number calls. See: #186 (comment)

The output of this analysis should be data in tabular format:

| gene_symbol | biospecimen_id | label        | seg.mean   |
|-------------|----------------|--------------|------------|
| SMARCB1     | BS_XXXXXXXX    | Hem_Deletion | some float |
| ...         | ...            | ...          | ...        |

As @jharenza pointed out here on #182, it is not known at this time if we can make homozygous/hemizygous calls.

Proposed methods

Here's how we think this work should be broken up into a series of steps (as discussed with @cbethell and @jashapiro in person):

1. Put together code, adapted from this code linked by @jharenza, that takes a SEG file as input and produces the tables outlined above; a minimal sketch follows this list. This will most likely make use of the GenomicFeatures package. It will use the same, somewhat arbitrarily chosen thresholds introduced in Addition of cnv data to oncoprint #182.
   - As part of this, an analysis that specifically examines SMARCB1 deletions in ATRTs. This will help with choosing thresholds.
2. An analysis that examines RNA-seq expression levels for genes that are called as homozygous deletions. This could also help with choosing cutoffs.
3. Moving from single genes to broader segments of the genome. We can use the tabular format above plus information about how genes are ordered on chromosomes to get an idea of which broader amplifications or deletions are recurrent. Whiteboard sketch capturing the overall idea: [whiteboard sketch: gene-to-broader]
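
A minimal sketch of what step 1 might look like, assuming a CNVkit-style SEG file with `ID`, `chrom`, `loc.start`, `loc.end`, and `seg.mean` columns, hg38 gene coordinates from a TxDb annotation package, and the same arbitrary +/- 0.5 seg.mean cutoffs mentioned above; none of these choices are settled:

```r
# Sketch of step 1: map SEG segments to genes. Column names, the TxDb
# package, and the seg.mean thresholds are assumptions, not final choices.
library(GenomicRanges)
library(GenomicFeatures)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(org.Hs.eg.db)

seg_df <- read.delim("data/pbta-cnv-cnvkit.seg.gz", stringsAsFactors = FALSE)

# Represent the segments as a GRanges object, carrying sample ID and seg.mean
# (seqname styles may need harmonizing, e.g. adding/removing the "chr" prefix)
seg_gr <- GRanges(
  seqnames = seg_df$chrom,
  ranges = IRanges(start = seg_df$loc.start, end = seg_df$loc.end),
  biospecimen_id = seg_df$ID,
  seg.mean = seg_df$seg.mean
)

# Gene coordinates; names of this GRanges are Entrez IDs for this TxDb
genes_gr <- genes(TxDb.Hsapiens.UCSC.hg38.knownGene)

# Overlap genes with segments and assemble the long-format table
hits <- findOverlaps(genes_gr, seg_gr)
annotated <- data.frame(
  gene_symbol = mapIds(org.Hs.eg.db,
                       keys = names(genes_gr)[queryHits(hits)],
                       column = "SYMBOL", keytype = "ENTREZID"),
  biospecimen_id = seg_gr$biospecimen_id[subjectHits(hits)],
  seg.mean = seg_gr$seg.mean[subjectHits(hits)],
  stringsAsFactors = FALSE
)

# Somewhat arbitrary thresholds, as in #182 (to be revisited)
annotated$label <- ifelse(annotated$seg.mean > 0.5, "Gain",
                          ifelse(annotated$seg.mean < -0.5, "Loss", "Neutral"))
```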

Required input data

For development: data/pbta-cnv-cnvkit.seg.gz
In the future, we will want to use consensus calls (#128)

Proposed timeline

I expect the first step outlined above to take a few days at most.

@jaclyn-taroni added the proposed analysis and cnv labels on Oct 30, 2019
@cbethell
Contributor

I am currently working on the first step of this analysis as mentioned above and plan to implement the subsequent steps sequentially.

@jaclyn-taroni added the in progress label on Oct 30, 2019
@jaclyn-taroni
Member Author

I wanted to note that the first step should go before oncoprint plotting and interaction plots in CI to allow folks to develop off of the output.

@jharenza
Collaborator

Hi! After not releasing the ControlFreeC seg file yesterday, we figured out that part of the problem was with the LRR values, and we are going to remove that information (or recalculate it properly) in favor of absolute copy number. If you look at the new CNVkit seg file, you can see that we added a CN column, so it can now be used instead of thresholding LRR (0 = homozygous loss, 1 = hemizygous loss, 2 = diploid, 3/4 = gain, and some cutoff can be used for amplification - 5, 6+ copies?). This will make life much easier, since only the segments have to be mapped to genes.

Will update with a new ControlFreeC bed file today, hopefully, and will create an issue about this/explain this new format in the README.

@jaclyn-taroni
Member Author

Okay, thank you -- looking forward to getting more information. Do you have an analysis that supports those calls that you can share publicly, @jharenza?

In that case @cbethell, let's have the tabular output in this format:

| gene_symbol | biospecimen_id | label           | copy_number |
|-------------|----------------|-----------------|-------------|
| YFG1        | BS_XXXXXXXX    | hemizygous_loss | 1           |
| ...         | ...            | ...             | ...         |
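
For illustration, a small helper that translates the absolute copy number values described above into these labels; the amplification cutoff (5 or more copies) is an assumption rather than a decided threshold:

```r
# Illustrative mapping from absolute copy number to a status label, using the
# categories described above; the amplification cutoff is an assumption.
label_copy_number <- function(copy_number) {
  dplyr::case_when(
    copy_number == 0 ~ "homozygous_loss",
    copy_number == 1 ~ "hemizygous_loss",
    copy_number == 2 ~ "neutral",
    copy_number %in% c(3, 4) ~ "gain",
    copy_number >= 5 ~ "amplification",
    TRUE ~ NA_character_
  )
}

# Example
label_copy_number(c(0, 1, 2, 3, 6))
#> "homozygous_loss" "hemizygous_loss" "neutral" "gain" "amplification"
```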

@jharenza
Collaborator

Do you mean the pipeline or explanation of the output? I think we pushed that to our github and I can update the link in the manuscript. If not, will do.

@jaclyn-taroni
Member Author

> I think we pushed that to our github and I can update the link in the manuscript. If not, will do.

Sounds good. I would expect the explanation of the output to be in the issue or README you mentioned above, is that correct?

@jharenza
Collaborator

Yep, I will add to the readme!

@jharenza
Collaborator

jharenza commented Nov 4, 2019

Re: missing genes, you can potentially try using an hg38 GTF file to create a gene-coordinates BED file and use the bedr R package to do intersections of genes with segments.

@jaclyn-taroni
Member Author

This is the Ensembl link for the hg38 GTF linked in Cavatica above: ftp://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.gtf.gz
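
A rough sketch of the GTF-based approach, using rtracklayer and GenomicRanges in place of bedr (a substitution for illustration, not necessarily what will be implemented); the local file name and the seqname-style harmonization are assumptions about the inputs:

```r
# Read gene coordinates from the Ensembl GRCh38 release-84 GTF linked above
# and intersect them with the segments. File name and seqname handling are
# assumptions about the inputs.
library(rtracklayer)
library(GenomicRanges)

gtf <- import("Homo_sapiens.GRCh38.84.gtf.gz")
genes_gr <- gtf[gtf$type == "gene"]

# Ensembl contigs are unprefixed ("1", "2", ...); if the SEG file uses
# "chr"-prefixed names, harmonize the styles before overlapping
seqlevelsStyle(genes_gr) <- "UCSC"

# seg_gr as constructed in the earlier sketch
hits <- findOverlaps(genes_gr, seg_gr)
annotated <- data.frame(
  gene_symbol = genes_gr$gene_name[queryHits(hits)],
  biospecimen_id = seg_gr$biospecimen_id[subjectHits(hits)],
  seg.mean = seg_gr$seg.mean[subjectHits(hits)],
  stringsAsFactors = FALSE
)
```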

@jaclyn-taroni
Member Author

@cbethell and I are going to split up the work on this, as there are two main areas that need attention.

Changes to file preparation

There is currently a draft PR open looking at SMARCB1 deletions in ATRT (#217); we want to understand how the new annotation strategy and the use of different methods affect these results.

The output of 01-prepare-cn-file.R will get updated to be:

| gene_symbol | biospecimen_id | label | copy_number | ploidy |
|-------------|----------------|-------|-------------|--------|
| YFG1        | BS_XXXXXXXX    | loss  | 1           | 3      |
| ...         | ...            | ...   | ...         | ...    |
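
For illustration only, one way the ploidy column might feed into the label call, comparing copy number to ploidy rather than to a fixed diploid baseline; the rules below are assumptions, not the project's settled definitions:

```r
# Illustrative ploidy-aware status call for the updated table; the rules and
# the example rows (YFG2, YFG3 are hypothetical genes) are assumptions.
library(dplyr)

cn_df <- data.frame(
  gene_symbol = c("YFG1", "YFG2", "YFG3"),
  biospecimen_id = "BS_XXXXXXXX",
  copy_number = c(1, 3, 7),
  ploidy = 3
)

cn_df <- cn_df %>%
  mutate(label = case_when(
    copy_number == 0 ~ "deep deletion",
    copy_number < ploidy ~ "loss",
    copy_number > 2 * ploidy ~ "amplification",
    copy_number > ploidy ~ "gain",
    TRUE ~ "neutral"
  ))
```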

Changes and additions downstream of file preparation

I will take the file preparation part and @cbethell will take the downstream steps.

@jashapiro
Member

I noticed in initial attempts to integrate CNV calls into #13 that there are a number of genes where the CNVkit calls include both gain and loss for the vast majority of samples (see the table below, noting that there are only 649 samples in the dataset used; a sketch of how such counts might be tallied follows the table). This seems worth a bit of investigation. My initial guess is that most of these are bad calls in repetitive regions, and that restricting calls to exons may reduce this. (That also neatly skirts the strange behavior of TxDb objects with the genes() selector.) I hope that #128 will also take care of many of these cases.

Note also that many of these are gene duplicates, which is probably making calling harder, but even so, being called in almost all samples as both gain and loss seems less than ideal.

| gene_cnv          | mutant_samples |
|-------------------|----------------|
| CBS_Gain          | 646 |
| CBS_Loss          | 643 |
| CBSL_Gain         | 646 |
| CBSL_Loss         | 643 |
| CRYAA_Gain        | 646 |
| CRYAA_Loss        | 643 |
| CRYAA2_Gain       | 646 |
| CRYAA2_Loss       | 643 |
| FAM243A_Gain      | 646 |
| FAM243A_Loss      | 642 |
| LOC101927050_Gain | 633 |
| LOC286297_Loss    | 634 |
| MIR8069-2_Gain    | 645 |
| MIR8069-2_Loss    | 643 |
| SMIM11B_Gain      | 646 |
| SMIM11B_Loss      | 642 |
| U2AF1_Gain        | 646 |
| U2AF1_Loss        | 643 |
| U2AF1L5_Gain      | 646 |
| U2AF1L5_Loss      | 643 |
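
A sketch of how counts like those above might be tallied from a long-format calls table; `cn_calls` and its `gene_symbol`, `biospecimen_id`, and `status` columns are hypothetical names for the intermediate data, not objects that exist in the module:

```r
# Tally, per gene and direction, how many samples carry a Gain or Loss call;
# cn_calls is a hypothetical long-format table of per-gene, per-sample calls.
library(dplyr)
library(tidyr)

gene_cnv_counts <- cn_calls %>%
  filter(status %in% c("Gain", "Loss")) %>%
  distinct(gene_symbol, biospecimen_id, status) %>%
  unite(gene_cnv, gene_symbol, status, sep = "_") %>%
  count(gene_cnv, name = "mutant_samples") %>%
  arrange(desc(mutant_samples))
```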

@cbethell mentioned this issue on Nov 6, 2019
@jaclyn-taroni
Member Author

@jashapiro I'll give GenomicFeatures::exons a shot and then let's take a look at these specific genes?

@jashapiro
Member

Sounds good. In light of #241, we may want to think about switching over to reading in annotations from a GTF file too.

@jaclyn-taroni
Member Author

Agreed 👍

@jaclyn-taroni
Member Author

While I'm working on the changes to file preparation, I think I'll also add in a step that maps from Entrez IDs to cytobands using the org.Hs.eg.db package (org.Hs.egMAP). I don't think I'm going to put this into the same tabular format outlined above.
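
A minimal sketch of that Entrez-to-cytoband lookup via org.Hs.eg.db; the Entrez IDs below are just examples (6598 is SMARCB1, 7157 is TP53):

```r
# Map Entrez IDs to cytogenetic bands using the MAP column of org.Hs.eg.db
# (the same information exposed by org.Hs.egMAP); example keys only.
library(org.Hs.eg.db)

cytoband_df <- AnnotationDbi::select(
  org.Hs.eg.db,
  keys = c("6598", "7157"),   # SMARCB1, TP53
  columns = c("SYMBOL", "MAP"),
  keytype = "ENTREZID"
)
cytoband_df
#>   ENTREZID  SYMBOL      MAP
#> 1     6598 SMARCB1 22q11.23
#> 2     7157    TP53  17p13.1
```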

@jaclyn-taroni
Member Author

@jashapiro I followed up on your analysis above by counting how many instances there are of a gene being called as both a gain and a loss within the same sample, now that we've made the changes in #253. You can see the notebook for that here: https://jaclyn-taroni.github.io/openpbta-notebook-concept/both-gain-and-loss.nb.html

It seems slightly better than before, but it appears to be much less of an issue in ControlFreeC.

@jashapiro
Member

> It seems slightly better than before, but it appears to be much less of an issue in ControlFreeC.

Oh, I think that looks a lot better. The secondary concern I had (which this notebook does not quite address) is the extent to which certain regions are prone to false calls and/or the reference sequences are non-representative of the population. The analysis above came up in my work on #13 because I was investigating the high number of very high-frequency calls that were stymying my filters; the multi-call genes started as a secondary observation, but morphed into the primary one...

@cbethell mentioned this issue on Nov 14, 2019
@jaclyn-taroni
Member Author

Discussed in person - now that we have GISTIC broad arm calls, we should come up with a strategy for reconciling the broad arm calls with the annotated results from CNVkit. That is to say that we don't want to report individual genes as gained or lost if it would be more accurate to report a broader event. @cbethell and @cansavvy are going to take a look at this together.

@cansavvy
Collaborator

> That is to say that we don't want to report individual genes as gained or lost if it would be more accurate to report a broader event.

Does this mean we would prefer to remove the genes from the annotated files if it is more accurate to say the whole arm is gained or lost? Or do we want to add an extra column to the annotated files that says whether each reported gene is in an arm that is gained or lost (I think this second suggestion is more likely)? Part of this question has to do with gaining a better understanding of what the downstream applications of the broad arm data would be.

@jaclyn-taroni
Member Author

I would say the second option makes it easy to achieve the first via filtering downstream.
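
A sketch of what that could look like; `gene_calls`, `broad_arm_calls`, and the `chromosome_arm`/`arm_status` columns are hypothetical names, not the actual GISTIC output format:

```r
# Annotate each gene-level call with the status of its chromosome arm (second
# option), then recover the first option by filtering. All object and column
# names here are hypothetical placeholders for the module's real data.
library(dplyr)

gene_calls_with_arm <- gene_calls %>%
  left_join(broad_arm_calls, by = c("biospecimen_id", "chromosome_arm"))

# First option (dropping genes better explained by a broad event) then becomes
# a downstream filter, e.g. keep genes whose call has no matching arm event:
focal_only <- gene_calls_with_arm %>%
  filter(is.na(arm_status) | label != arm_status)
```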

@jaclyn-taroni
Member Author

Once #452 lands, we want to make sure we run analyses/copy_number_consensus_call/results/pbta-cnv-consensus.seg through the annotation steps in focal-cn-file-preparation now that #441 has been merged. This requires the following changes:

  • Adding analyses/copy_number_consensus_call/results/pbta-cnv-consensus.seg to the processing in analyses/focal-cn-file-preparation/00-add-ploidy-cnvkit.Rmd
  • Including the annotation of this file in analyses/focal-cn-file-preparation/run-prepare-cn.sh (e.g., passing it to analyses/focal-cn-file-preparation/01-prepare-cn-file.R)
  • You may want to make GISTIC files optional in analyses/focal-cn-file-preparation/01-prepare-cn-file.R because we don't have the GISTIC consensus results yet (tracked here: Updated analysis: rerun GISTIC with consensus SEG file #453).

@jaclyn-taroni
Member Author

Closing this - what's left is now tracked in #497 and #485.
