
Proposed Analysis: map from SEG file to genes (and broader segments) #186

Closed
jaclyn-taroni opened this issue Oct 30, 2019 · 22 comments

@jaclyn-taroni
Member

jaclyn-taroni commented Oct 30, 2019

Scientific goals

We currently have copy number information in the form of SEG files. The goal of this analysis is to generate focal copy number files for consumption in the oncoprint (#6) and co-occurrence/mutual exclusivity modules (#13). In essence, we want to know which genes are amplified or deleted in each sample for downstream analysis. This will also involve determining what thresholds are appropriate for making the focal copy number calls. See: #186 (comment)

The output of this analysis should be data in tabular format:

| gene_symbol | biospecimen_id | label        | seg.mean   |
|-------------|----------------|--------------|------------|
| SMARCB1     | BS_XXXXXXXX    | Hem_Deletion | some float |
| ...         | ...            | ...          | ...        |

As @jharenza pointed out here on #182, it is not known at this time if we can make homozygous/hemizygous calls.

Proposed methods

Here's how we think this work should be broken up into a series of steps (as discussed with @cbethell and @jashapiro in person):

1. Put together code, adapted from this code linked by @jharenza, that takes a SEG file as input and produces the tables outlined above; a minimal sketch follows this list. This will most likely make use of the GenomicFeatures package. It will use the same, somewhat arbitrarily chosen thresholds introduced in Addition of cnv data to oncoprint #182.
   - As part of this, an analysis that specifically examines SMARCB1 deletions in ATRTs. This will help with choosing thresholds.
2. An analysis that examines RNA-seq expression levels for genes that are called as homozygous deletions. This could also help with choosing cutoffs.
3. Moving from single genes to broader segments of the genome. We can use the tabular format above plus information about how genes are ordered on chromosomes to get an idea of which broader amplifications or deletions are recurrent. Whiteboard sketch capturing the overall idea: [whiteboard sketch: gene-to-broader]
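
A minimal sketch of what step 1 might look like, assuming a CNVkit-style SEG file with `ID`, `chrom`, `loc.start`, `loc.end`, and `seg.mean` columns, hg38 gene coordinates from a TxDb annotation package, and the same arbitrary +/- 0.5 seg.mean cutoffs mentioned above; none of these choices are settled:

```r
# Sketch of step 1: map SEG segments to genes. Column names, the TxDb
# package, and the seg.mean thresholds are assumptions, not final choices.
library(GenomicRanges)
library(GenomicFeatures)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(org.Hs.eg.db)

seg_df <- read.delim("data/pbta-cnv-cnvkit.seg.gz", stringsAsFactors = FALSE)

# Represent the segments as a GRanges object, carrying sample ID and seg.mean
# (seqname styles may need harmonizing, e.g. adding/removing the "chr" prefix)
seg_gr <- GRanges(
  seqnames = seg_df$chrom,
  ranges = IRanges(start = seg_df$loc.start, end = seg_df$loc.end),
  biospecimen_id = seg_df$ID,
  seg.mean = seg_df$seg.mean
)

# Gene coordinates; names of this GRanges are Entrez IDs for this TxDb
genes_gr <- genes(TxDb.Hsapiens.UCSC.hg38.knownGene)

# Overlap genes with segments and assemble the long-format table
hits <- findOverlaps(genes_gr, seg_gr)
annotated <- data.frame(
  gene_symbol = mapIds(org.Hs.eg.db,
                       keys = names(genes_gr)[queryHits(hits)],
                       column = "SYMBOL", keytype = "ENTREZID"),
  biospecimen_id = seg_gr$biospecimen_id[subjectHits(hits)],
  seg.mean = seg_gr$seg.mean[subjectHits(hits)],
  stringsAsFactors = FALSE
)

# Somewhat arbitrary thresholds, as in #182 (to be revisited)
annotated$label <- ifelse(annotated$seg.mean > 0.5, "Gain",
                          ifelse(annotated$seg.mean < -0.5, "Loss", "Neutral"))
```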

Required input data

For development: data/pbta-cnv-cnvkit.seg.gz
In the future, we will want to use consensus calls (#128)

Proposed timeline

I expect the first step outlined above to take a few days at most.

@jaclyn-taroni added the proposed analysis and cnv labels on Oct 30, 2019
@cbethell
Contributor

I am currently working on the first step of this analysis as mentioned above and plan to implement the subsequent steps sequentially.

@jaclyn-taroni added the in progress label on Oct 30, 2019
@jaclyn-taroni
Member Author

I wanted to note that the first step should go before oncoprint plotting and interaction plots in CI to allow folks to develop off of the output.

@jharenza
Collaborator

Hi! After not releasing the ControlFreeC seg file yesterday, we figured out that part of the problem was with the LRR values, and we are going to remove that information (or recalculate it properly) in favor of absolute copy number. If you look at the new CNVkit seg file, you can see that we added a CN column, so it can now be used instead of thresholding LRR (0 = homozygous loss, 1 = hemizygous loss, 2 = diploid, 3/4 = gain, and some cutoff can be used for amplification - 5, 6+ copies?). This will make life much easier, since only the segments have to be mapped to genes.

Will update with a new ControlFreeC bed file today, hopefully, and will create an issue about this/explain this new format in the README.

@jaclyn-taroni
Member Author

Okay, thank you -- looking forward to getting more information. Do you have an analysis that supports those calls that you can share publicly, @jharenza?

In that case @cbethell, let's have the tabular output in this format:

| gene_symbol | biospecimen_id | label           | copy_number |
|-------------|----------------|-----------------|-------------|
| YFG1        | BS_XXXXXXXX    | hemizygous_loss | 1           |
| ...         | ...            | ...             | ...         |
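
For illustration, a small helper that translates the absolute copy number values described above into these labels; the amplification cutoff (5 or more copies) is an assumption rather than a decided threshold:

```r
# Illustrative mapping from absolute copy number to a status label, using the
# categories described above; the amplification cutoff is an assumption.
label_copy_number <- function(copy_number) {
  dplyr::case_when(
    copy_number == 0 ~ "homozygous_loss",
    copy_number == 1 ~ "hemizygous_loss",
    copy_number == 2 ~ "neutral",
    copy_number %in% c(3, 4) ~ "gain",
    copy_number >= 5 ~ "amplification",
    TRUE ~ NA_character_
  )
}

# Example
label_copy_number(c(0, 1, 2, 3, 6))
#> "homozygous_loss" "hemizygous_loss" "neutral" "gain" "amplification"
```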

@jharenza
Collaborator

Do you mean the pipeline or explanation of the output? I think we pushed that to our github and I can update the link in the manuscript. If not, will do.

@jaclyn-taroni
Member Author

> I think we pushed that to our github and I can update the link in the manuscript. If not, will do.

Sounds good. I would expect the explanation of the output to be in the issue or README you mentioned above, is that correct?

@jharenza
Collaborator

Yep, I will add to the readme!

@jharenza
Collaborator

jharenza commented Nov 4, 2019

Re: missing genes, you can potentially try using an hg38 GTF file to create a gene-coordinates BED file and use the bedr R package to do intersections of genes with segments.

@jaclyn-taroni
Member Author

This is the Ensembl link for the hg38 GTF linked in Cavatica above: ftp://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.gtf.gz
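
A rough sketch of the GTF-based approach, using rtracklayer and GenomicRanges in place of bedr (a substitution for illustration, not necessarily what will be implemented); the local file name and the seqname-style harmonization are assumptions about the inputs:

```r
# Read gene coordinates from the Ensembl GRCh38 release-84 GTF linked above
# and intersect them with the segments. File name and seqname handling are
# assumptions about the inputs.
library(rtracklayer)
library(GenomicRanges)

gtf <- import("Homo_sapiens.GRCh38.84.gtf.gz")
genes_gr <- gtf[gtf$type == "gene"]

# Ensembl contigs are unprefixed ("1", "2", ...); if the SEG file uses
# "chr"-prefixed names, harmonize the styles before overlapping
seqlevelsStyle(genes_gr) <- "UCSC"

# seg_gr as constructed in the earlier sketch
hits <- findOverlaps(genes_gr, seg_gr)
annotated <- data.frame(
  gene_symbol = genes_gr$gene_name[queryHits(hits)],
  biospecimen_id = seg_gr$biospecimen_id[subjectHits(hits)],
  seg.mean = seg_gr$seg.mean[subjectHits(hits)],
  stringsAsFactors = FALSE
)
```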

@jaclyn-taroni
Member Author

@cbethell and I are going to split up the work on this, as there are two main areas that need attention.

Changes to file preparation

There is currently a draft PR open looking at SMARCB1 deletions in ATRT (#217); we want to understand how the new annotation strategy and the use of different methods affect these results.

The output of 01-prepare-cn-file.R will get updated to be:

| gene_symbol | biospecimen_id | label | copy_number | ploidy |
|-------------|----------------|-------|-------------|--------|
| YFG1        | BS_XXXXXXXX    | loss  | 1           | 3      |
| ...         | ...            | ...   | ...         | ...    |
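
For illustration only, one way the ploidy column might feed into the label call, comparing copy number to ploidy rather than to a fixed diploid baseline; the rules below are assumptions, not the project's settled definitions:

```r
# Illustrative ploidy-aware status call for the updated table; the rules and
# the example rows (YFG2, YFG3 are hypothetical genes) are assumptions.
library(dplyr)

cn_df <- data.frame(
  gene_symbol = c("YFG1", "YFG2", "YFG3"),
  biospecimen_id = "BS_XXXXXXXX",
  copy_number = c(1, 3, 7),
  ploidy = 3
)

cn_df <- cn_df %>%
  mutate(label = case_when(
    copy_number == 0 ~ "deep deletion",
    copy_number < ploidy ~ "loss",
    copy_number > 2 * ploidy ~ "amplification",
    copy_number > ploidy ~ "gain",
    TRUE ~ "neutral"
  ))
```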

Changes and additions downstream of file preparation

I will take the file preparation part and @cbethell will take the downstream steps.

@jashapiro
Member

I noticed in initial attempts to integrate CNV calls into #13 that there are a number of genes where the CNVkit calls include both gain and loss for the vast majority of samples (see the table below, noting that there are only 649 samples in the dataset used; a sketch of how such counts might be tallied follows the table). This seems worth a bit of investigation. My initial guess is that most of these are bad calls in repetitive regions, and that restricting calls to exons may reduce this. (That also neatly skirts the strange behavior of TxDb objects with the genes() selector.) I hope that #128 will also take care of many of these cases.

Note also that many of these are gene duplicates, which is probably making calling harder, but even so, being called in almost all samples as both gain and loss seems less than ideal.

| gene_cnv          | mutant_samples |
|-------------------|----------------|
| CBS_Gain          | 646 |
| CBS_Loss          | 643 |
| CBSL_Gain         | 646 |
| CBSL_Loss         | 643 |
| CRYAA_Gain        | 646 |
| CRYAA_Loss        | 643 |
| CRYAA2_Gain       | 646 |
| CRYAA2_Loss       | 643 |
| FAM243A_Gain      | 646 |
| FAM243A_Loss      | 642 |
| LOC101927050_Gain | 633 |
| LOC286297_Loss    | 634 |
| MIR8069-2_Gain    | 645 |
| MIR8069-2_Loss    | 643 |
| SMIM11B_Gain      | 646 |
| SMIM11B_Loss      | 642 |
| U2AF1_Gain        | 646 |
| U2AF1_Loss        | 643 |
| U2AF1L5_Gain      | 646 |
| U2AF1L5_Loss      | 643 |
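
A sketch of how counts like those above might be tallied from a long-format calls table; `cn_calls` and its `gene_symbol`, `biospecimen_id`, and `status` columns are hypothetical names for the intermediate data, not objects that exist in the module:

```r
# Tally, per gene and direction, how many samples carry a Gain or Loss call;
# cn_calls is a hypothetical long-format table of per-gene, per-sample calls.
library(dplyr)
library(tidyr)

gene_cnv_counts <- cn_calls %>%
  filter(status %in% c("Gain", "Loss")) %>%
  distinct(gene_symbol, biospecimen_id, status) %>%
  unite(gene_cnv, gene_symbol, status, sep = "_") %>%
  count(gene_cnv, name = "mutant_samples") %>%
  arrange(desc(mutant_samples))
```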

@cbethell mentioned this issue on Nov 6, 2019
@jaclyn-taroni
Member Author

@jashapiro I'll give GenomicFeatures::exons a shot and then let's take a look at these specific genes?

@jashapiro
Member

Sounds good. In light of #241, we may want to think about switching over to reading in annotations from a GTF file too.

@jaclyn-taroni
Member Author

Agreed 👍

@jaclyn-taroni
Member Author

While I'm working on the changes to file preparation, I think I'll also add in a step that maps from Entrez IDs to cytobands using the org.Hs.eg.db package (org.Hs.egMAP). I don't think I'm going to put this into the same tabular format outlined above.
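
A minimal sketch of that Entrez-to-cytoband lookup via org.Hs.eg.db; the Entrez IDs below are just examples (6598 is SMARCB1, 7157 is TP53):

```r
# Map Entrez IDs to cytogenetic bands using the MAP column of org.Hs.eg.db
# (the same information exposed by org.Hs.egMAP); example keys only.
library(org.Hs.eg.db)

cytoband_df <- AnnotationDbi::select(
  org.Hs.eg.db,
  keys = c("6598", "7157"),   # SMARCB1, TP53
  columns = c("SYMBOL", "MAP"),
  keytype = "ENTREZID"
)
cytoband_df
#>   ENTREZID  SYMBOL      MAP
#> 1     6598 SMARCB1 22q11.23
#> 2     7157    TP53  17p13.1
```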

@jaclyn-taroni
Member Author

@jashapiro I followed up on your analysis above by counting how many instances there are of a gene being called as both a gain and a loss within the same sample, now that we've made the changes in #253. You can see the notebook for that here: https://jaclyn-taroni.github.io/openpbta-notebook-concept/both-gain-and-loss.nb.html

It seems slightly better than before, but it appears to be much less of an issue in ControlFreeC.

@jashapiro
Member

> It seems slightly better than before, but it appears to be much less of an issue in ControlFreeC.

Oh, I think that looks a lot better. The secondary concern I had (which this notebook does not quite address) is the extent to which certain regions are prone to false calls and/or the reference sequences are non-representative of the population. The analysis above came up in my work on #13 because I was investigating the high number of very high-frequency calls that were stymying my filters; the multi-call genes started as a secondary observation, but morphed into the primary one...

@cbethell mentioned this issue on Nov 14, 2019
@jaclyn-taroni
Member Author

Discussed in person - now that we have GISTIC broad arm calls, we should come up with a strategy for reconciling the broad arm calls with the annotated results from CNVkit. That is to say that we don't want to report individual genes as gained or lost if it would be more accurate to report a broader event. @cbethell and @cansavvy are going to take a look at this together.

@cansavvy
Collaborator

> That is to say that we don't want to report individual genes as gained or lost if it would be more accurate to report a broader event.

Does this mean we would prefer to remove the genes from the annotated files if it is more accurate to say the whole arm is gained or lost? Or do we want to add an extra column to the annotated files that says whether each reported gene is in an arm that is gained or lost (I think this second suggestion is more likely)? Part of this question has to do with gaining a better understanding of what the downstream applications of the broad arm data would be.

@jaclyn-taroni
Member Author

I would say the second option makes it easy to achieve the first via filtering downstream.
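
A sketch of what that could look like; `gene_calls`, `broad_arm_calls`, and the `chromosome_arm`/`arm_status` columns are hypothetical names, not the actual GISTIC output format:

```r
# Annotate each gene-level call with the status of its chromosome arm (second
# option), then recover the first option by filtering. All object and column
# names here are hypothetical placeholders for the module's real data.
library(dplyr)

gene_calls_with_arm <- gene_calls %>%
  left_join(broad_arm_calls, by = c("biospecimen_id", "chromosome_arm"))

# First option (dropping genes better explained by a broad event) then becomes
# a downstream filter, e.g. keep genes whose call has no matching arm event:
focal_only <- gene_calls_with_arm %>%
  filter(is.na(arm_status) | label != arm_status)
```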

@jaclyn-taroni
Member Author

Once #452 lands, we want to make sure we run analyses/copy_number_consensus_call/results/pbta-cnv-consensus.seg through the annotation steps in focal-cn-file-preparation now that #441 has been merged. This requires the following changes:

  • Adding analyses/copy_number_consensus_call/results/pbta-cnv-consensus.seg to the processing in analyses/focal-cn-file-preparation/00-add-ploidy-cnvkit.Rmd
  • Including the annotation of this file in analyses/focal-cn-file-preparation/run-prepare-cn.sh (e.g., passing it to analyses/focal-cn-file-preparation/01-prepare-cn-file.R)
  • You may want to make GISTIC files optional in analyses/focal-cn-file-preparation/01-prepare-cn-file.R because we don't have the GISTIC consensus results yet (tracked here: Updated analysis: rerun GISTIC with consensus SEG file #453).

@jaclyn-taroni
Member Author

Closing this - what's left is now tracked in #497 and #485.
