Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

add fusion summary for LGAT #830

Merged
merged 9 commits into from
Nov 12, 2020
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
159 changes: 154 additions & 5 deletions analyses/fusion-summary/01-fusion-summary.Rmd
Original file line number Diff line number Diff line change
@@ -1,8 +1,8 @@
---
title: "Generate Fusion Summary Files"
output: html_notebook
author: Daniel Miller (D3b) and Jaclyn Taroni (CCDL)
date: January 2020
author: Daniel Miller (D3b), Jaclyn Taroni (CCDL), Jo Lynne Rokita (D3b)
date: January 2020, November 2020
---

Generate fusion files specifically for consumption by molecular subtyping analyses
Expand Down Expand Up @@ -44,6 +44,30 @@ filterFusion <- function(df, bioid, fuses, genes) {
return(df %>% select(Sample, FusionName))
}

#create function specifically for lgat, since we want more information from the putative oncogenic file
filterLGATFusion <- function(df, bioid, fuses, genes) {
jharenza marked this conversation as resolved.
Show resolved Hide resolved
if (!missing(bioid)) {
df <- filter(df, Sample %in% bioid)
}
if (!missing(fuses) & !missing(genes)) {
df <- filter(df, FusionName %in% fuses |
Gene1A %in% genes |
Gene2A %in% genes |
Gene1B %in% genes |
Gene2B %in% genes)
} else if (!missing(fuses)) {
df <- filter(df, FusionName %in% fuses)
} else if (!missing(genes)) {
df <- filter(df,
Gene1A %in% genes |
Gene2A %in% genes |
Gene1B %in% genes |
Gene2B %in% genes)
}
return(df %>%
select(Sample, FusionName, Fusion_Type, Gene1A, Gene1B, Gene2A, Gene2B, Gene1A_anno, Gene1B_anno, LeftBreakpoint, RightBreakpoint, reciprocal_exists, DomainRetainedGene1A, DomainRetainedGene1B) %>%
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's unclear to me what the pattern is for what columns are being selected here. Can you tell me what the particular pattern or strategy for these columns being selected is?

Maybe we can come up with a more succinct way of selecting these if there's a pattern.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this, since we are specifically looking for in-frame (Fusion_Type), kinase (Gene1A_anno, Gene1B_anno), whether a reciprocal exists (reciprocal_exists ), and fusions with kinase domains retained (DomainRetainedGene1A, DomainRetainedGene1B), these columns required for LGAT fusion compilation. I added the left and right breakpoint columns to make sure that I was seeing unique fusions, although the rows may be unique enough with the combination of columns above. Let me check on this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the number of unique fusions without breakpoints == 265 and with breakpoints == 300, but the results are identical. I will remove those two columns!

distinct())
}

#' Generate matrix with fusion counts
#' @param fuseDF Filtered fusion data frame
Expand All @@ -67,10 +91,14 @@ prepareOutput <- function(fuseDF, bioid) {

```{r}
dataDir <- file.path("..", "..", "data")
fusDir <- file.path("..", "..", "analyses", "fusion_filtering", "results")
annotDir <- file.path("..", "..", "analyses", "fusion_filtering", "references")
#' The putative oncogenic fusion file is what we'll use to check for the
#' annotation file
annot <- read.delim2(file.path(annotDir, "genelistreference.txt"), sep = "\t", header = T)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't change to readr::read_tsv(), we will need to add one more argument here:

Suggested change
annot <- read.delim2(file.path(annotDir, "genelistreference.txt"), sep = "\t", header = T)
annot <- read.delim2(file.path(annotDir, "genelistreference.txt"), sep = "\t", header = TRUE, stringsAsFactors = FALSE)

#' presence or absence of the fusions.
putativeOncogenicDF <-
read_tsv(file.path(dataDir, "pbta-fusion-putative-oncogenic.tsv"))
read_tsv(file.path(fusDir, "pbta-fusion-putative-oncogenic.tsv"))
jharenza marked this conversation as resolved.
Show resolved Hide resolved
#' However, some biospecimens are not represented in this filtered, prioritized
#' file but *are* present in the original files -- this will cause them to be
#' "missing" in the final files for consumption which could mislead analysts.
Expand All @@ -88,11 +116,12 @@ if (!dir.exists(resultsDir)) {
ependFile <- file.path(resultsDir, "fusion_summary_ependymoma_foi.tsv")
embryFile <- file.path(resultsDir, "fusion_summary_embryonal_foi.tsv")
ewingsFile <- file.path(resultsDir, "fusion_summary_ewings_foi.tsv")
lgatFile <- file.path(resultsDir, "fusion_summary_lgat_foi.tsv")
```

## Fusions and genes of interest

Taken from [`AlexsLemonade/OpenPBTA-analysis#245`](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/245), [`AlexsLemonade/OpenPBTA-analysis#251`](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/251), and [`AlexsLemonade/OpenPBTA-analysis#623`](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/623) respectively.
Taken from [`AlexsLemonade/OpenPBTA-analysis#245`](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/245), [`AlexsLemonade/OpenPBTA-analysis#251`](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/251), [`AlexsLemonade/OpenPBTA-analysis#623`](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/623) respectively, and [`AlexsLemonade/OpenPBTA-analysis#808`](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/808)

```{r}
#' **Filters**
Expand Down Expand Up @@ -132,6 +161,28 @@ ewingsFuses<- c(
"FUS--ERG"
)

#' 4: Exact match a list of fusions common in low-grade astrocytic (LGAT) tumors
#' as well as fusions containing a particular gene with any other gene
lgatFuses <- c(
"KIAA1549--BRAF",
"FGFR1--TACC1",
"MYB--QKI"
)
lgatGenes <- c(
"BRAF",
"ALK",
"ROS1",
"NTRK1",
"NTRK2",
"NTRK3",
"PDGFRA",
"FGFR2",
"FGFR1",
"MYB",
"MYBL1",
"RAF1"
)

```

### Filter putative oncogenic fusions list
Expand All @@ -145,6 +196,9 @@ allFuseEmbry <- filterFusion(df = putativeOncogenicDF,
genes = embryGenes)
allFuseEwing <- filterFusion(df = putativeOncogenicDF,
fuses = ewingsFuses)
allFuseLGAT <- filterLGATFusion(df = putativeOncogenicDF,
fuses = lgatFuses,
genes = lgatGenes)

```

Expand Down Expand Up @@ -203,9 +257,104 @@ allFuseEwing %>%
write_tsv(ewingsFile)
```

#### Perform selection for LGAT fusions
jharenza marked this conversation as resolved.
Show resolved Hide resolved

```{r}
# Which genes/fusions are not kinases, but in the list?
# Separate LGAT fusions into genes, combine with gene list, check for not kinase
lgatFuses_df <- as.data.frame(lgatFuses) %>%
separate(lgatFuses, into = c("Gene1A", "Gene1B"), remove = F)
jharenza marked this conversation as resolved.
Show resolved Hide resolved
kinases <- annot %>%
filter(type == "Kinase")
jharenza marked this conversation as resolved.
Show resolved Hide resolved
lgatFuses_df$Gene1A_anno <- ifelse(lgatFuses_df$Gene1A %in% kinases$Gene_Symbol, "kinase", "")
jharenza marked this conversation as resolved.
Show resolved Hide resolved
lgatFuses_df$Gene1B_anno <- ifelse(lgatFuses_df$Gene1B %in% kinases$Gene_Symbol, "kinase", "")
# Only pull fusions that do not contain kinase genes, as ones with kinases will be dealt with sepaately later
jharenza marked this conversation as resolved.
Show resolved Hide resolved
lgatFuses_df <- lgatFuses_df %>%
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm interpreting the end result of this correctly, it looks like we want a vector of nonkinase fuses. If so, we can get to that slightly more directly:

Suggested change
lgatFuses_df <- lgatFuses_df %>%
nonkinase_lgatFuses <- lgatFuses_df %>%

filter(lgatFuses_df$Gene1A_anno != "kinase" & lgatFuses_df$Gene1B_anno != "kinase") %>%
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tidyverse doesn't need the lgatFuses_df$ bits.

Suggested change
filter(lgatFuses_df$Gene1A_anno != "kinase" & lgatFuses_df$Gene1B_anno != "kinase") %>%
filter(Gene1A_anno != "kinase" & Gene1B_anno != "kinase") %>%

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also good tip!

select(lgatFuses)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a vector of nonkinase fuses then we can use pull() and do this in one shot.

Suggested change
select(lgatFuses)
pull(lgatFuses)

nonkinase_lgatFuses <- lgatFuses_df$lgatFuses

# Identify non-kinase genes in LGAT goi list
nonkinase_lgatGenes <- setdiff(lgatGenes, kinases$Gene_Symbol)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you use the suggestions above, then this will have to change:

Suggested change
nonkinase_lgatGenes <- setdiff(lgatGenes, kinases$Gene_Symbol)
nonkinase_lgatGenes <- setdiff(lgatGenes, kinases)


# Pull LGAT non-kinase fusions
nonkinaseLGAT <- filterFusion(df = putativeOncogenicDF,
fuses = nonkinase_lgatFuses,
genes = nonkinase_lgatGenes) %>%
distinct()
cansavvy marked this conversation as resolved.
Show resolved Hide resolved

# First, filter all fusion dataframe for 3' kinases which are in-frame and retain the kinase domain - keep these
three_prime_kinase_inframe <- allFuseLGAT %>%
filter(grepl("Kinase", Gene1B_anno) & Fusion_Type == "in-frame" & DomainRetainedGene1B == "Yes") %>%
cansavvy marked this conversation as resolved.
Show resolved Hide resolved
select(Sample, FusionName, Gene1A, Gene1B) %>%
distinct()

# Which fusions are not in-frame?
three_prime_kinase_outframe <- allFuseLGAT %>%
filter(grepl("Kinase", Gene1B_anno) & Fusion_Type != "in-frame") %>%
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basic question: If we have an in-frame fusion but the domain is not retained, we aren't counting it as in frame? So are there fusions that are in-frame but not retaining the domain that will not be in either dataframes being set up here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just added some checks to check these and commented on the code. You are correct, there are two fusions which are in-frame, but do not retain the kinase domain. In case 1, there is another in-frame fusion with the kinase domain, so it is being captured in the three_prime_kinase_inframe list, but in case 2, the in-frame fusion does not retain the kinase domain and so we are not keeping.

distinct()
# Check they don't retain the kinase domain. They do not, so leave out.
table(three_prime_kinase_outframe$DomainRetainedGene1B)
cansavvy marked this conversation as resolved.
Show resolved Hide resolved

# Filter all fusions for 5' kinase fusions which have reciprocal fusions and have lost the kinase domain.
five_prime_domain_lost <- allFuseLGAT %>%
filter(grepl("Kinase", Gene1A_anno) & reciprocal_exists == "TRUE") %>%
select(Sample, FusionName, Gene1A, Gene1B, Fusion_Type, DomainRetainedGene1A) %>%
filter(DomainRetainedGene1A == "No") %>%
distinct()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
distinct()
distinct() %>%
mutate(reciprocal = paste(Gene1B, Gene1A, sep ="--"))


# Keep those with kinase domain retained and fusion in-frame - keep this list
five_prime_domain_intact <- allFuseLGAT %>%
filter(grepl("Kinase", Gene1A_anno) & reciprocal_exists == "TRUE") %>%
jharenza marked this conversation as resolved.
Show resolved Hide resolved
select(Sample, FusionName, Gene1A, Gene1B, Fusion_Type, DomainRetainedGene1A) %>%
filter(DomainRetainedGene1A == "Yes") %>%
distinct()

# Identify reciprocal matched fusions which are in-frame
five_prime_domain_lost$reciprocal <- paste(five_prime_domain_lost$Gene1B, five_prime_domain_lost$Gene1A, sep ="--")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can add this on to the set of tidyverse manipulations you have above.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll put a suggestion up there.

five_prime_reciprocals <- five_prime_domain_lost[,c("Sample", "reciprocal")] %>%
jharenza marked this conversation as resolved.
Show resolved Hide resolved
rename(FusionName = reciprocal)
five_prime_reciprocals_full <- five_prime_reciprocals %>%
jharenza marked this conversation as resolved.
Show resolved Hide resolved
left_join(allFuseLGAT, by = c("Sample", "FusionName")) %>%
filter(Fusion_Type == "in-frame")
# Keep the list of the original 5' kinase fusions with matched reciprocals
five_prime_kinase_keep <- five_prime_reciprocals_full[,c("Sample", "Gene1A", "Gene1B")] %>%
distinct()
five_prime_kinase_keep$FusionName <- paste(five_prime_kinase_keep$Gene1B, five_prime_kinase_keep$Gene1A, sep ="--")
jharenza marked this conversation as resolved.
Show resolved Hide resolved

# Which ones are not in-frame? These are mostly genes fused to self
out_frame_reciprocals <- setdiff(five_prime_reciprocals[,c("Sample", "FusionName")], five_prime_reciprocals_full[,c("Sample", "FusionName")])
jharenza marked this conversation as resolved.
Show resolved Hide resolved
out_frame_reciprocals

#Check these do not retain the kinase domain. They do not, so will leave out
out_frame_reciprocals_full <- out_frame_reciprocals %>%
left_join(allFuseLGAT, by = c("Sample", "FusionName")) %>%
select(Sample, FusionName, Gene1A, Gene1B, Fusion_Type, DomainRetainedGene1A, DomainRetainedGene1B)

# Rbind lists for final table of LGAT fusions of interest
subsetFuseLGAT <- bind_rows(nonkinaseLGAT, three_prime_kinase_inframe[,c("Sample", "FusionName")],
jharenza marked this conversation as resolved.
Show resolved Hide resolved
five_prime_domain_intact[,c("Sample", "FusionName")],
five_prime_kinase_keep[,c("Sample", "FusionName")]) %>%
distinct()
```

#### Write LGAT fusions to file

```{r}
subFuseLGAT <- subFuseLGAT %>%
prepareOutput(specimensUnion)

# Which fusions of interest are not present in any samples?
missingLgatFusion <- setdiff(lgatFuses, colnames(subFuseLGAT)[-1])
# For the fusions that are not present, fill those columns with 0
subFuseLGAT[, missingLgatFusion] <- 0

subFuseLGAT %>%
write_tsv(lgatFile)
```

## Session Info

```{r}
sessionInfo()
```

Loading