-
Notifications
You must be signed in to change notification settings - Fork 67
add fusion summary for LGAT #830
add fusion summary for LGAT #830
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, just a few questions/ suggestions I had are commented below.
I wonder if the CI is errors out with the test data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I can tell, it looks like this implements what's suggested in #808. 👍
My suggestions are mainly about where we might be able to shorten this up and add clarity in certain parts.
Gene2B %in% genes) | ||
} | ||
return(df %>% | ||
select(Sample, FusionName, Fusion_Type, Gene1A, Gene1B, Gene2A, Gene2B, Gene1A_anno, Gene1B_anno, LeftBreakpoint, RightBreakpoint, reciprocal_exists, DomainRetainedGene1A, DomainRetainedGene1B) %>% |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's unclear to me what the pattern is for what columns are being selected here. Can you tell me what the particular pattern or strategy for these columns being selected is?
Maybe we can come up with a more succinct way of selecting these if there's a pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this, since we are specifically looking for in-frame (Fusion_Type
), kinase (Gene1A_anno
, Gene1B_anno
), whether a reciprocal exists (reciprocal_exists
), and fusions with kinase domains retained (DomainRetainedGene1A
, DomainRetainedGene1B
), these columns required for LGAT fusion compilation. I added the left and right breakpoint columns to make sure that I was seeing unique fusions, although the rows may be unique enough with the combination of columns above. Let me check on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the number of unique fusions without breakpoints == 265 and with breakpoints == 300, but the results are identical. I will remove those two columns!
fix variable name subsetFuseLGAT
add comments from @cansavvy and @kgaonkar6 code review
consolidate filterFusion() and filterLGATFusion() functions
- select only BS_ID and fusion - add distinct
@cansavvy and @kgaonkar6 - just made all of the updates and requested re-review. I actually found a bug in the 5' in-frame fusion filtering and had missed 3 @cansavvy thank you SO much for your helpful tips - this was my first tidyverse code :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @jharenza, thanks for these changes, I think this is coming along nicely! I just have one more area that I think we should further evaluate as far as what the code is doing.
|
||
Next, filter all fusions for 5' kinase fusions which have lost the kinase domain for reciprocal fusions which have a kinase domain intact and are in-frame. | ||
Keep these for the final output file. | ||
```{r} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking this chunk can still be simplified. In particular, I'm noticing that five_prime_domain_lost
is from allFuseLGAT
and five_prime_reciprocals
is from five_prime_domain_lost
but then we are adding back info from allFuseLGAT
.
So basically both five_prime_domain_lost
and five_prime_reciprocals
are based on allFuseLGAT
but we have extra manipulation steps. I think we can do this more directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I'm trying to make this shorter, I'm a bit confused on these steps.
I tried to boil it down to what looks like the essential steps to get five_prime_kinase_keep
, but as of now it does not get the same result. I think part of this is I still don't really understand all that is happening in this chunk. Can you take a look at my attempt as well evaluating your original code to see how we can make these steps more efficient and clear?
In particular I know the differences have to do with the paired up "select() then distinct()" steps that are done a few timees, but its unclear to me why each of these "select() then distinct()" steps are necessary.
five_prime_kinase_keep <- allFuseLGAT %>%
# Make reciprocal variable
mutate(reciprocal = paste(Gene1B, Gene1A, sep ="--")) %>%
filter(# Keep the in-frame reciprocals which have the kinase domain in tact.
grepl("Kinase", Gene1A_anno) & reciprocal_exists == "TRUE" & DomainRetainedGene1A == "No" |
# Also keep select the in-frame reciprocals which have the kinase domain in tact.
Fusion_Type == "in-frame" & DomainRetainedGene1B == "Yes") %>%
# Select to which variables we want here
select(Sample, FusionName = reciprocal) %>%
distinct()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh
grepl("Kinase", Gene1A_anno) & reciprocal_exists == "TRUE" & DomainRetainedGene1A == "No" |
# Also keep select the in-frame reciprocals which have the kinase domain in tact.
Fusion_Type == "in-frame" & DomainRetainedGene1B == "Yes")
This filtering you have here is filtering on the frame-ness/domain retention of the 5' kinase fusions and not their reciprocals (3' kinase fusions), which is why I was making that variable, then re-joining five_prime_reciprocals
with the allFuseLGAT
.
What I am trying to do is find any 5' kinases without the domain retained, but then go back to that original dataframe allFuseLGAT
and collect the annotations of their reciprocal fusions (ie 3' kinase fusions' frameness and domain retention), assess whether those are in-frame and retain the kinase domain, then if there is a reciprocal that satisfies that, join both the 5' kinase fusion dataframe (domain lost) and 3' kinase fusion dataframe (inframe/domain retained).
Actually, I can remove:
# Add back the reciprocal of the reciprocal (the 5' kinase fusion) and keep this list
five_prime_kinase_keep <- five_prime_reciprocals %>%
mutate(reciprocal = paste(Gene1B, Gene1A, sep ="--")) %>%
select(Sample, FusionName = reciprocal) %>%
distinct()
and use:
# Then, select the in-frame reciprocals which have the kinase domain in tact. Retain 5' kinase fusion information and update 3' fusion column name to FusionName for merging with allFuseLGAT.
five_prime_kinase_keep <- five_prime_domain_lost %>%
select(Sample, five_prime_kinase = FusionName, FusionName = three_prime_kinase) %>%
left_join(allFuseLGAT, by = c("Sample", "FusionName")) %>%
filter(Fusion_Type == "in-frame" & DomainRetainedGene1B == "Yes") %>%
select(Sample, FusionName = five_prime_kinase) %>%
distinct()
distinct()
is needed because in some cases, there are multiple breakpoints for each fusion and sometimes multiple can have the same frame-ness, so this reduces them to one fusion call per sample.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay! Thanks for those updates! I think this looks good!
update code from @cansavvy code review
for annot file, update read.delim2 to read_tsv since [AlexsLemonade#836](AlexsLemonade#836) has gone in
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me! Thanks for the clarification!
|
||
Next, filter all fusions for 5' kinase fusions which have lost the kinase domain for reciprocal fusions which have a kinase domain intact and are in-frame. | ||
Keep these for the final output file. | ||
```{r} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay! Thanks for those updates! I think this looks good!
Purpose/implementation Section
What scientific question is your analysis addressing?
Adding fusion summary for LGAT tumors in preparation for #793
What was your approach?
BRAF
ALK
ROS1
NTRK1
NTRK2
NTRK3
PDGFRA
FGFR2
FGFR1
MYB
MYBL1
RAF1
What GitHub issue does your pull request address?
#808
Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.
Which areas should receive a particularly close look?
Double check the logic
Is there anything that you want to discuss further?
I kept all of this analysis in the same Rmd as the other fusion summary analyses because I am using / modifying those functions. I almost think some of this could be its own notebook since it is unique to LGAT, but could go either way. Thoughts on this?
Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?
Yes
Results
What types of results are included (e.g., table, figure)?
table
What is your summary of the results?
Numbers of unique patient fusions:
Non-kinase fusions: 11
3' kinase fusions which are in-frame and retain the kinase domain: 167
5' kinase fusions, added those which are in-frame and retain the kinase domain: 1
5' kinase fusions that don't meet 3., has a reciprocal fusion in-frame and the kinase domain is retained: 12
Combined these into fusion_summary_lgat_foi.tsv
Reproducibility Checklist
Documentation Checklist
README
and it is up to date.analyses/README.md
and the entry is up to date.