Added new fusion summary module #410
updated analysis readme and ci config
Hi @dmiller15, thanks for getting started on this analysis. I see in the originating Issue #398, the analysis requested is to search all samples for either classification or reclassification (my emphasis added):
It looks like you've subset the data to only certain histologies before searching for fusions of interest. Can you update the code to search all samples, even those that have already been subtyped? Thanks!
Thanks for the input @sjspielman. I've added files for each set of fusions that don't filter the biospecimens beforehand.
Hi @dmiller15 - thanks for this contribution! This is in good shape and is well-documented. I had some comments about some overarching design decisions in addition to the line comments I left:
- I would remove the filtering by `short_histology` and `broad_histology` now that you include the files without this filtering.
- There are 924 samples in the `pbta-fusion-recurrently-fused-genes-bysample.tsv`, which is a binary matrix that contains information about the presence or absence of a recurrently fused gene in an RNA-seq sample. This file is very similar to what we want here. The main difference is that the columns of that file are data-driven (e.g., based on the number of samples they appeared in) and here we are specifying the fusions upfront. The files produced here have under 40 samples, and if I followed correctly, I believe this is due to the inclusion of only the samples (`Kids_First_Biospecimen_ID`) that have at least one of the fusions or genes that are being specified. We want to include all samples. Here's where @kgaonkar6 starts creating the `pbta-fusion-recurrently-fused-genes-bysample.tsv` matrix for your reference:
  `OpenPBTA-analysis/analyses/fusion_filtering/05-recurrent-fusions-per-histology.R`, line 121 in 7917a7f:
  `# binary matrix for recurrent fusions found in SAMPLE per broad_histology`
- In some cases, it's not clear that the 5'/3' ordering of the genes matters, so `MN1--BEND2` and `BEND2--MN1` may be equivalent for the purposes of these files. I've asked @jharenza to weigh in.
added RELA gene filtering no longer drop levels of samples
@sjspielman With regard to the above, I no longer drop the levels when making the table. You can inspect the newly generated outputs and see 787 samples, which is the number of unique biospecimens available in
👍 looks good to me - thank you for the changes @dmiller15 !
Realized after approving that we no longer needed the demographic file - so I removed that in bba4e78. I'll merge once CI finishes!
Purpose/implementation Section
What scientific question is your analysis addressing?
In biospecimens from ependymoma and non-ATRT embryonal tumors, which ones have fusions of interest?
What was your approach?
Using the biospecimens that were identified to be in one or the other population, we filtered the fusions file. That filtered fusion file was simplified using a list of explicit and generic filters for fusions that have been shown to have relevance in the cancer of that population. The results were summarized in a TSV for each population.
What GitHub issue does your pull request address?
#398
Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.
Which areas should receive a particularly close look?
The filters are of particular note, since those values were mostly gleaned from the ticket; there may be better ways of performing the filtering. The final summary tables also aren't set in stone, particularly in how we represent the generic fusions, where we accept any fusion containing a particular gene. I simply left all the generic fusions found as unique columns. The ticket chose to aggregate these values, but the utility of that is up for discussion.
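To make the two representations concrete, here is a minimal Python sketch (not the code in this pull request; the function names, the `_any` column suffix, the `BS_00x` sample IDs, and the `GENEX`/`GENEY` partners are all hypothetical). An explicit filter matches a full fusion name, while a generic filter accepts any fusion containing a given gene. Generic hits are then reported either as one column per distinct fusion or aggregated into one column per gene. Samples without hits are omitted here for brevity.

```python
def matches_generic(fusion, gene):
    """True if the gene is either partner of a 'GeneA--GeneB' fusion name."""
    return gene in fusion.split("--")


def summarize(calls, explicit_fusions, generic_genes, aggregate=False):
    """Return {sample: set of column labels} for fusions passing either filter."""
    summary = {}
    for sample, fusion in calls:
        labels = set()
        if fusion in explicit_fusions:
            labels.add(fusion)
        for gene in generic_genes:
            if matches_generic(fusion, gene):
                # Either one column per gene, or one column per distinct fusion.
                labels.add(f"{gene}_any" if aggregate else fusion)
        if labels:
            summary.setdefault(sample, set()).update(labels)
    return summary


# Hypothetical fusion calls as (sample, fusion) pairs.
calls = [
    ("BS_001", "C11orf95--RELA"),
    ("BS_002", "MN1--BEND2"),
    ("BS_002", "MN1--GENEX"),
    ("BS_003", "GENEX--GENEY"),
]
explicit_fusions = {"C11orf95--RELA"}
generic_genes = ["MN1"]

per_fusion = summarize(calls, explicit_fusions, generic_genes, aggregate=False)
per_gene = summarize(calls, explicit_fusions, generic_genes, aggregate=True)
```

Keeping unique columns preserves the partner gene for each hit at the cost of a wider table. Aggregating gives a fixed set of columns but loses which partner was observed.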
Is there anything that you want to discuss further?
No.
Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?
Yes. I'm confident the tables are presenting the information requested in the ticket.
Results
What types of results are included (e.g., table, figure)?
TSV tables.
What is your summary of the results?
C11orf95--RELA is nearly universal in ependymoma tumors. Very few biospecimens have the fusions of interest. Almost none of the samples have more than one fusion.