This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Planned Analysis: Filter and Annotate Fusions #39

Closed
jharenza opened this issue Aug 7, 2019 · 14 comments

Labels
- fusion: Related to or requires fusion data
- in progress: Someone is working on this issue, but feel free to propose an alternative approach!
- transcriptomic: Related to or requires transcriptomic data

Comments

@jharenza
Collaborator

jharenza commented Aug 7, 2019

Here, we will filter potential artifacts, filter fusions observed in normal tissue, retain high-confidence calls, and annotate with several databases to create a final list of putative driver fusions.

@jharenza jharenza added in progress Someone is working on this issue, but feel free to propose an alternative approach! good first issue Good for newcomers labels Aug 7, 2019
@jharenza
Collaborator Author

jharenza commented Aug 7, 2019

@kgaonkar6 and I have created a workflow for this and will update soon!

@jaclyn-taroni jaclyn-taroni removed the good first issue Good for newcomers label Aug 15, 2019
@jaclyn-taroni
Member

Is this pipeline described in AlexsLemonade/OpenPBTA-manuscript#21?

@jharenza
Collaborator Author

> Is this pipeline described in AlexsLemonade/OpenPBTA-manuscript#21?

It is not yet described there. Do you think we should add it to the methods section or to the analysis/results section? We created a workflow (and @kgaonkar6 created an R package) that annotates fusion gene partners as TSG, oncogene, kinase, TF, or receptor; adds the expression of each gene; and filters out artifacts, fusions found in normal tissues, etc., to come up with a high-confidence list of putative driver fusions. It is probably more of a method, but we weren't going to add it until this PR was finished; we are still making tweaks.
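A minimal sketch of the partner-annotation idea described above, in Python rather than the R package mentioned in the thread. The category lists here are tiny illustrative stand-ins (an assumption, not the databases the workflow actually uses), and the function names are hypothetical.

```python
# Illustrative category lists only; a real run would load curated databases.
GENE_CATEGORIES = {
    "oncogene": {"BRAF", "MYB"},
    "tumor_suppressor": {"NF1", "TP53"},
    "kinase": {"BRAF", "NTRK2"},
    "transcription_factor": {"MYB", "TP53"},
}


def annotate_gene(gene):
    """Return the sorted list of categories that contain this gene."""
    return sorted(cat for cat, genes in GENE_CATEGORIES.items() if gene in genes)


def annotate_fusion(gene_a, gene_b):
    """Annotate both partners of a fusion written as gene_a--gene_b."""
    return {
        "fusion": f"{gene_a}--{gene_b}",
        "gene_a_annotation": annotate_gene(gene_a),
        "gene_b_annotation": annotate_gene(gene_b),
    }


result = annotate_fusion("KIAA1549", "BRAF")
print(result)
```

A partner that appears in none of the lists simply gets an empty annotation, so unannotated genes are kept rather than dropped.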

@jaclyn-taroni
Member

> It is probably more of a method, but we weren't going to add it until this PR was finished; we are still making tweaks.

Ah, sounds good. This is the order I would expect. Are the fusion prioritization steps described in AlexsLemonade/OpenPBTA-manuscript#21 then upstream, i.e., have they already been performed on the fusion TSV files?

@jharenza
Collaborator Author

They are all performed after the TSV files are downloaded.

@jharenza
Collaborator Author

@jaclyn-taroni @cgreene - seeking advice on this PR. We plan to create a package to do the annotations and prioritization, but it currently has some bugs. We were thinking of creating code for the PR that would use the new tool and spit out the results (TXT file and figures). In the meantime, for this PR, would you rather we contribute the entirety of the code as it stands in this repo: https://github.com/d3b-center/fusion_filtering_pipeline? It has been a work in progress for several months, so there may be a lot to go through for the purposes of the PR. cc: @kgaonkar6

@jaclyn-taroni
Member

It would be great to have that as a reusable analysis workflow. Sounds like that's your goal with creating a package. If you're open to it, we could put that code through code review, as it is often helpful to have some fresh eyes on a piece of work when the goal is to make something more generalizable/reusable.

Before we figure out the mechanics of getting it through review, which repository it belongs in, etc., I have a few questions. The most important is: what is the broad idea of what this pipeline does?

Follow-up questions: What are the inputs to the pipeline? Can you make the files you are using as input public?

@jaclyn-taroni
Member

jaclyn-taroni commented Aug 28, 2019

To clarify, this is very helpful:

> Here, we will filter potential artifacts, filter fusions observed in normal tissue, retain high-confidence calls, and annotate with several databases to create a final list of putative driver fusions.

I'm wondering about things like where the information on fusions observed in normal tissue is coming from.

@jharenza
Collaborator Author

Good point - I think a code review would be helpful. The goals of the package would be:

  1. annotation
  2. prioritization of candidate fusions

Inputs are the fusion output files from Arriba and STAR-Fusion. We are currently using only these two algorithms, but we have run four others in the past and plan to add support for some of their output files as input to this package.
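To illustrate how outputs from two callers could feed one package, here is a rough Python sketch that maps each caller's records onto a shared schema and tracks which callers support each fusion. The field names (`gene1`, `gene2`, `FusionName`) and the helper names are assumptions for illustration and may not match the actual output columns of either tool.

```python
def standardize_arriba(row):
    """Map one Arriba-style record to the shared schema (field names assumed)."""
    return {"fusion": f"{row['gene1']}--{row['gene2']}", "caller": "arriba"}


def standardize_starfusion(row):
    """Map one STAR-Fusion-style record to the shared schema (field names assumed)."""
    return {"fusion": row["FusionName"], "caller": "star-fusion"}


def merge_calls(arriba_rows, starfusion_rows):
    """Return {fusion name: sorted list of callers that reported it}."""
    records = [standardize_arriba(r) for r in arriba_rows]
    records += [standardize_starfusion(r) for r in starfusion_rows]
    support = {}
    for rec in records:
        support.setdefault(rec["fusion"], set()).add(rec["caller"])
    return {fusion: sorted(callers) for fusion, callers in support.items()}


# Placeholder gene names, not real calls.
arriba_rows = [{"gene1": "GENE_A", "gene2": "GENE_B"}]
starfusion_rows = [{"FusionName": "GENE_A--GENE_B"}, {"FusionName": "GENE_C--GENE_D"}]
support = merge_calls(arriba_rows, starfusion_rows)
print(support)
```

Adding another caller later would then only need one more `standardize_*` function, which is one way the "other algorithms" mentioned above could be supported.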

A host of annotation tools and databases are used. For normal fusion removal, we are using FusionAnnotator, and Arriba has its own blacklist. Now that I am writing this, I think we should also remove from the STAR-Fusion calls any fusions present in the Arriba blacklist. There are a lot of pieces to this, so you will see. Hope to have the PR submitted today or tomorrow with what we have to date.
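The idea of removing blacklisted fusions from the STAR-Fusion calls could look roughly like the sketch below. It assumes the blacklist has already been reduced to a plain collection of fusion names, which is a simplification and not necessarily how the Arriba blacklist is structured; all names are hypothetical placeholders.

```python
def remove_blacklisted(calls, blacklist):
    """Drop every call whose fusion name appears in the blacklist."""
    blocked = set(blacklist)
    return [call for call in calls if call["fusion"] not in blocked]


# Placeholder records, not real calls or real blacklist entries.
starfusion_calls = [
    {"fusion": "GENE_A--GENE_B"},
    {"fusion": "GENE_C--GENE_D"},
]
blacklist = ["GENE_C--GENE_D"]

filtered = remove_blacklisted(starfusion_calls, blacklist)
print(filtered)
```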

@jaclyn-taroni
Member

> There are a lot of pieces to this, so you will see. Hope to have the PR submitted today or tomorrow with what we have to date.

What will the planned PR consist of? Will it be some wrapper script that calls the code in https://github.com/d3b-center/fusion_filtering_pipeline? As you state, there's quite a bit of code in that repository. It would be infeasible to review it well all at once. Is the plan to submit a draft pull request and that's where we'll discuss splitting it up (per https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/CONTRIBUTING.md#size-and-composition-of-pull-requests)?

@jharenza
Collaborator Author

Yes, the plan is to submit a bash script that calls all of those scripts in the correct order. We also just realized that we annotated the Arriba fusions so that they are annotated consistently with STAR-Fusion. Rather than having users reproduce that via this PR (it requires a 7GB database download), we will release the annotated fusions in V3. We will try to get this released today with @yuankunzhu.
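The thread describes a bash wrapper; the Python sketch below only illustrates the underlying idea of running filtering steps in a fixed order and recording how many calls survive each one. The step names, record fields, and their order are assumptions loosely based on the issue description, not the actual scripts.

```python
def remove_artifacts(records):
    return [r for r in records if not r["is_artifact"]]


def remove_normal_fusions(records):
    return [r for r in records if not r["seen_in_normal"]]


def keep_high_confidence(records):
    return [r for r in records if r["confidence"] == "high"]


# The list order defines the execution order (assumed, for illustration).
STEPS = [
    ("remove_artifacts", remove_artifacts),
    ("remove_normal_fusions", remove_normal_fusions),
    ("keep_high_confidence", keep_high_confidence),
]


def run_pipeline(records, steps=STEPS):
    """Apply each step in order; return final records and per-step counts."""
    counts = []
    for name, step in steps:
        records = step(records)
        counts.append((name, len(records)))
    return records, counts


# Placeholder records, not real calls.
calls = [
    {"fusion": "GENE_A--GENE_B", "is_artifact": False, "seen_in_normal": False, "confidence": "high"},
    {"fusion": "GENE_C--GENE_D", "is_artifact": True, "seen_in_normal": False, "confidence": "high"},
    {"fusion": "GENE_E--GENE_F", "is_artifact": False, "seen_in_normal": True, "confidence": "high"},
    {"fusion": "GENE_G--GENE_H", "is_artifact": False, "seen_in_normal": False, "confidence": "low"},
]
final, counts = run_pipeline(calls)
print(final, counts)
```

Keeping the order in a single list makes it easy to see, and review, what runs when.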

@jaclyn-taroni
Member

Okay. Having the order will be helpful in figuring out next steps. If more context is needed, we can discuss here or on the pull request.

@syzheng

syzheng commented Oct 4, 2019

> To clarify, this is very helpful:
>
> > Here, we will filter potential artifacts, filter fusions observed in normal tissue, retain high-confidence calls, and annotate with several databases to create a final list of putative driver fusions.
>
> I'm wondering about things like where the information on fusions observed in normal tissue is coming from.

It can come from multiple resources. For instance, TCGA normal samples have been analyzed for fusions, but the best source might be GTEx. One issue with using normal samples for filtering is that they should be analyzed with the same pipeline used for the cancer samples, so as to minimize tool-introduced artifacts.
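One way to respect the "same pipeline" caveat is to keep the normal-sample fusions separated by caller and only filter tumor calls against normals processed by the same caller. The sketch below is a hypothetical illustration of that bookkeeping; the data layout and function names are assumptions.

```python
def build_normal_panel(normal_calls):
    """Group fusions seen in normal samples by the caller that reported them."""
    panel = {}
    for caller, fusion in normal_calls:
        panel.setdefault(caller, set()).add(fusion)
    return panel


def filter_against_normals(tumor_calls, panel):
    """Keep tumor calls not seen in normals analyzed with the same caller."""
    kept = []
    for caller, fusion in tumor_calls:
        if caller not in panel:
            # Refuse to filter against normals from a different pipeline.
            raise ValueError(f"no normal samples were analyzed with {caller}")
        if fusion not in panel[caller]:
            kept.append((caller, fusion))
    return kept


# Placeholder calls, not real data.
normals = [("arriba", "GENE_A--GENE_B"), ("star-fusion", "GENE_C--GENE_D")]
tumors = [("arriba", "GENE_A--GENE_B"), ("arriba", "GENE_C--GENE_D")]

panel = build_normal_panel(normals)
kept = filter_against_normals(tumors, panel)
print(kept)
```

Here `GENE_C--GENE_D` is kept for the `arriba` tumor call because it was only seen in normals processed with a different caller.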

@jaclyn-taroni jaclyn-taroni added fusion Related to or requires fusion data transcriptomic Related to or requires transcriptomic data labels Oct 26, 2019
@jharenza
Collaborator Author

jharenza commented Dec 2, 2019

Closed with #294, #300, #277, and #267.

@jharenza jharenza closed this as completed Dec 2, 2019