Proposed Analysis: map from SEG file to genes (and broader segments) #186
I am currently working on the first step of this analysis as mentioned above and plan to implement the subsequent steps sequentially.
I wanted to note that the first step should go before oncoprint plotting and interaction plots in CI, to allow folks to develop off of the output.
Hi! After not releasing the ControlFreeC SEG file yesterday, we figured out that part of the problem was with the LRR values, and we are going to remove that information (or recalculate it properly) in favor of absolute CN. If you look at the new CNVkit SEG file, you can see we added CN, so this can now be used instead of thresholding LRR (0 = homozygous loss, 1 = hemizygous loss, 2 = diploid, 3/4 = gain, and some cutoff can be used for amplification - 5, 6+ copies?). This will make life much easier, such that only the segments have to be mapped to genes. Will update with a new ControlFreeC BED file today, hopefully, and will create an issue about this/explain the new format in the README.
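A minimal sketch of that absolute-CN thresholding (in Python for illustration; the `call_status` name and the 5-copy amplification cutoff are assumptions — the thread leaves the amplification cutoff as an open question):

```python
def call_status(cn: int, amp_cutoff: int = 5) -> str:
    """Map an absolute copy number to a status label, per the cutoffs above."""
    if cn == 0:
        return "homozygous loss"
    if cn == 1:
        return "hemizygous loss"
    if cn == 2:
        return "diploid"
    if cn < amp_cutoff:  # 3 or 4 copies
        return "gain"
    return "amplification"

print([call_status(cn) for cn in (0, 1, 2, 3, 5)])
```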
Okay, thank you -- looking forward to getting more information. Do you have an analysis that supports those calls that you can share publicly, @jharenza? In that case @cbethell, let's have the tabular output in this format:
Do you mean the pipeline or an explanation of the output? I think we pushed that to our GitHub and I can update the link in the manuscript. If not, will do.
Sounds good. I would expect the explanation of the output to be in the issue or README you mentioned above, is that correct?
Yep, I will add it to the README!
Re: missing genes, you can potentially try using an hg38 GTF file to create a gene coordinates BED file and use the
This is the Ensembl link for the hg38 GTF linked in Cavatica above: |
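A rough sketch of building a gene coordinates BED file from a gzipped GTF (Python for illustration; `gtf_genes_to_bed` and the attribute parsing are simplified assumptions — Ensembl GTFs do quote `gene_id` this way, but a real pipeline would use a proper GTF parser):

```python
import gzip

def gtf_genes_to_bed(gtf_path: str, bed_path: str) -> None:
    """Write gene-level records from a gzipped GTF to a BED file."""
    with gzip.open(gtf_path, "rt") as gtf, open(bed_path, "w") as bed:
        for line in gtf:
            if line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            if fields[2] != "gene":
                continue
            # GTF coordinates are 1-based inclusive; BED is 0-based half-open.
            chrom, start, end = fields[0], int(fields[3]) - 1, int(fields[4])
            gene_id = fields[8].split('gene_id "')[1].split('"')[0]
            bed.write(f"{chrom}\t{start}\t{end}\t{gene_id}\n")
```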
@cbethell and I are going to split up the work on this, as there are two main areas where it needs to be worked on.

Changes to file preparation

There is currently a draft PR open looking at SMARCB1 deletions in ATRT (#217) - we want to understand how the new annotation strategy and the use of different methods affect these results. The output of

Changes and additions downstream of file preparation

I will take the file preparation part and @cbethell will take the downstream steps.
I noticed in initial attempts to integrate CNV calls into #13 that there are a number of genes where CNVkit calls include both gain and loss for the vast majority of samples (see table below, noting that there are only 649 samples in the dataset used). This seems worth a bit of investigation. My initial guess is that most of these are bad calls in repetitive regions, and that restricting calls to exons may reduce this. (That also neatly skirts the strange behavior of

Note also that many of these are duplicate genes, which is probably making calls harder, but even so, a gene being called as both gain and loss in almost all samples seems less than ideal.
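One way to tabulate that, sketched in Python/pandas on a toy table (the column names `biospecimen_id`, `gene`, and `status` are assumptions about the annotated output, not the actual schema):

```python
import pandas as pd

# Hypothetical annotated calls: one row per (sample, gene, status).
calls = pd.DataFrame({
    "biospecimen_id": ["S1", "S1", "S1", "S2", "S2"],
    "gene":           ["GENE_A", "GENE_A", "GENE_B", "GENE_A", "GENE_B"],
    "status":         ["gain", "loss", "gain", "gain", "loss"],
})

# For each gene, count samples where it is called as BOTH gain and loss.
per_pair = (calls.drop_duplicates()
                 .pivot_table(index=["gene", "biospecimen_id"],
                              columns="status", aggfunc="size", fill_value=0))
both = (per_pair["gain"].gt(0) & per_pair["loss"].gt(0)).groupby(level="gene").sum()
print(both.to_dict())  # GENE_A is both gained and lost in one sample (S1)
```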
@jashapiro I'll give
Sounds good. In light of #241, we may want to think about switching over to reading in annotations from a GTF file too.
Agreed 👍
While I'm working on the changes to file preparation, I think I'll also add a step that maps from Entrez IDs to cytobands using the
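A toy sketch of the interval lookup involved in cytoband mapping (Python for illustration; the band boundaries and gene coordinates below are approximate, and the real module would use an annotation package rather than this hand-rolled overlap check):

```python
# UCSC-style cytoband records: (chrom, start, end, band name).
def gene_to_cytobands(chrom: str, start: int, end: int, bands) -> list[str]:
    """Return the cytoband names overlapping a gene's coordinates."""
    return [f"{c.lstrip('chr')}{name}"
            for c, b_start, b_end, name in bands
            if c == chrom and start < b_end and b_start < end]

bands = [("chr22", 0, 4300000, "p13"),
         ("chr22", 23800000, 25500000, "q11.23")]
# SMARCB1 sits on 22q11.23 (coordinates here are approximate, for illustration)
print(gene_to_cytobands("chr22", 23786963, 23838008, bands))
```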
@jashapiro I followed up on your analysis above, counting instances of a gene being called as both a gain and a loss within the same sample now that we've made the changes in #253. You can see the notebook for that here: https://jaclyn-taroni.github.io/openpbta-notebook-concept/both-gain-and-loss.nb.html It seems slightly better than before, and it appears to be much less of an issue in ControlFreeC.
Oh, I think that looks a lot better. The secondary concern I had (which this nb does not quite address) is the extent to which certain regions are prone to false calls and/or reference sequences are non-representative of the population. The analysis above came up in my work on #13 because I was investigating the high number of very high frequency calls that were stymying my filters; the multi-call genes started as a secondary observation, but morphed into primary...
Discussed in person - now that we have GISTIC broad arm calls, we should come up with a strategy for reconciling the broad arm calls with the annotated results from CNVkit. That is to say, we don't want to report individual genes as gained or lost if it would be more accurate to report a broader event. @cbethell and @cansavvy are going to take a look at this together.
Does this mean we would prefer to remove genes from the annotated files if it is more accurate to say the whole arm is gained or lost? OR do we want to add an extra column in the annotated files that says whether each gene reported falls in an arm that is gained or lost (I think this second suggestion is more likely)? Part of this question has to do with gaining a better understanding of what the downstream application of the broad arm data would be.
I would say the second option makes it easy to achieve the first via filtering downstream.
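That second option could look something like the following pandas sketch (all column names and the toy values are assumptions for illustration, not the module's actual schema):

```python
import pandas as pd

# Gene-level calls annotated with the arm each gene falls on.
genes = pd.DataFrame({
    "biospecimen_id": ["S1", "S1"],
    "gene":           ["SMARCB1", "MYCN"],
    "arm":            ["22q", "2p"],
    "status":         ["loss", "gain"],
})
# GISTIC-style broad arm calls per sample.
arm_calls = pd.DataFrame({
    "biospecimen_id": ["S1", "S1"],
    "arm":            ["22q", "2p"],
    "arm_status":     ["loss", "neutral"],
})

# Extra column: the broad arm status alongside each gene-level call.
annotated = genes.merge(arm_calls, on=["biospecimen_id", "arm"], how="left")
# Downstream filtering can then drop genes whose call matches a broad event:
focal_only = annotated[annotated["status"] != annotated["arm_status"]]
print(focal_only["gene"].tolist())  # only MYCN survives as a focal call
```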
Once #452 lands, we want to make sure we run
Scientific goals
We currently have copy number information in the form of SEG files. The goal of this analysis is to generate focal copy number files for consumption in the oncoprint (#6) and co-occurrence/mutual exclusivity modules (#13). In essence we want to know what genes are amplified or deleted in each sample for downstream analysis.
This will also have an element of determining what thresholds are appropriate for making the focal copy number calls. See: #186 (comment). The output of this analysis should be data in tabular format:
As @jharenza pointed out here on #182, it is not known at this time if we can make homozygous/hemizygous calls.
Proposed methods
Here's how we think this work should be broken up into a series of steps (as discussed with @cbethell and @jashapiro in person):
1. A step that maps from the SEG file to genes using the `GenomicFeatures` package. It will use the same, somewhat arbitrarily chosen thresholds introduced in Addition of cnv data to oncoprint (#182).
2. An analysis that specifically examines SMARCB1 deletions in ATRTs. This will help with choosing thresholds.

Required input data
For development:
data/pbta-cnv-cnvkit.seg.gz
In the future, we will want to use consensus calls (#128)
Proposed timeline
I expect the first step outlined above to take a few days at most.