Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

add fusion summary for LGAT #830

Merged
merged 9 commits into from
Nov 12, 2020
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 4 additions & 4 deletions .circleci/config.yml
Original file line number Diff line number Diff line change
Expand Up @@ -92,6 +92,10 @@ jobs:
name: Collapse RSEM
command: ./scripts/run_in_ci.sh bash analyses/collapse-rnaseq/run-collapse-rnaseq.sh

- run:
name: Fusion Summary
command: ./scripts/run_in_ci.sh bash "analyses/fusion-summary/run-new-analysis.sh"

- run:
name: Immune deconvolution using xCell and MCP-Counter
command: OPENPBTA_DECONV_METHOD="mcp_counter" ./scripts/run_in_ci.sh bash analyses/immune-deconv/run-immune-deconv.sh
Expand Down Expand Up @@ -184,10 +188,6 @@ jobs:
name: Gene set enrichment analysis to generate GSVA scores
command: OPENPBTA_TESTING=1 ./scripts/run_in_ci.sh bash "analyses/gene-set-enrichment-analysis/run-gsea.sh"

- run:
name: Fusion Summary
command: OPENPBTA_TESTING=1 ./scripts/run_in_ci.sh bash "analyses/fusion-summary/run-new-analysis.sh"

- run:
name: Add Shatterseek
command: ./scripts/run_in_ci.sh Rscript analyses/sv-analysis/02-shatterseek.R
Expand Down
167 changes: 157 additions & 10 deletions analyses/fusion-summary/01-fusion-summary.Rmd
Original file line number Diff line number Diff line change
@@ -1,8 +1,8 @@
---
title: "Generate Fusion Summary Files"
output: html_notebook
author: Daniel Miller (D3b) and Jaclyn Taroni (CCDL)
date: January 2020
author: Daniel Miller (D3b), Jaclyn Taroni (CCDL), Jo Lynne Rokita (D3b)
date: January 2020, November 2020
---

Generate fusion files specifically for consumption by molecular subtyping analyses
Expand Down Expand Up @@ -41,14 +41,14 @@ filterFusion <- function(df, bioid, fuses, genes) {
Gene1B %in% genes |
Gene2B %in% genes)
}
return(df %>% select(Sample, FusionName))
return(df %>%
select(Sample, FusionName, Fusion_Type, Gene1A, Gene1B, Gene2A, Gene2B, Gene1A_anno, Gene1B_anno, reciprocal_exists, DomainRetainedGene1A, DomainRetainedGene1B) %>%
distinct())
}


#' Generate matrix with fusion counts
#' @param fuseDF Filtered fusion data frame
#' @param bioid List of biospecimen IDs that should be included in final table

#' @return Data frame that contains fusion counts
prepareOutput <- function(fuseDF, bioid) {
fuseDF %>%
Expand All @@ -67,10 +67,13 @@ prepareOutput <- function(fuseDF, bioid) {

```{r}
dataDir <- file.path("..", "..", "data")
#' The putative oncogenic fusion file is what we'll use to check for the
#' presence or absence of the fusions.
fusDir <- file.path("..", "..", "analyses", "fusion_filtering", "results")
annotDir <- file.path("..", "..", "analyses", "fusion_filtering", "references")
#' Annotation file to be used for identifying kinase genes
annot <- read_tsv(file.path(annotDir, "genelistreference.txt"))
#' The putative oncogenic fusion file is what we'll use to check for the #' presence or absence of the fusions.
putativeOncogenicDF <-
read_tsv(file.path(dataDir, "pbta-fusion-putative-oncogenic.tsv"))
read_tsv(file.path(fusDir, "pbta-fusion-putative-oncogenic.tsv"))
#' However, some biospecimens are not represented in this filtered, prioritized
#' file but *are* present in the original files -- this will cause them to be
#' "missing" in the final files for consumption which could mislead analysts.
Expand All @@ -88,11 +91,12 @@ if (!dir.exists(resultsDir)) {
ependFile <- file.path(resultsDir, "fusion_summary_ependymoma_foi.tsv")
embryFile <- file.path(resultsDir, "fusion_summary_embryonal_foi.tsv")
ewingsFile <- file.path(resultsDir, "fusion_summary_ewings_foi.tsv")
lgatFile <- file.path(resultsDir, "fusion_summary_lgat_foi.tsv")
```

## Fusions and genes of interest

Taken from [`AlexsLemonade/OpenPBTA-analysis#245`](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/245), [`AlexsLemonade/OpenPBTA-analysis#251`](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/251), and [`AlexsLemonade/OpenPBTA-analysis#623`](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/623) respectively.
Taken from [`AlexsLemonade/OpenPBTA-analysis#245`](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/245), [`AlexsLemonade/OpenPBTA-analysis#251`](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/251), [`AlexsLemonade/OpenPBTA-analysis#623`](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/623) respectively, and [`AlexsLemonade/OpenPBTA-analysis#808`](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/808)

```{r}
#' **Filters**
Expand Down Expand Up @@ -132,6 +136,28 @@ ewingsFuses<- c(
"FUS--ERG"
)

#' 4: Exact match a list of fusions common in low-grade astrocytic (LGAT) tumors
#' as well as fusions containing a particular gene with any other gene
lgatFuses <- c(
"KIAA1549--BRAF",
"FGFR1--TACC1",
"MYB--QKI"
)
lgatGenes <- c(
"BRAF",
"ALK",
"ROS1",
"NTRK1",
"NTRK2",
"NTRK3",
"PDGFRA",
"FGFR2",
"FGFR1",
"MYB",
"MYBL1",
"RAF1"
)

```

### Filter putative oncogenic fusions list
Expand All @@ -145,6 +171,9 @@ allFuseEmbry <- filterFusion(df = putativeOncogenicDF,
genes = embryGenes)
allFuseEwing <- filterFusion(df = putativeOncogenicDF,
fuses = ewingsFuses)
allFuseLGAT <- filterFusion(df = putativeOncogenicDF,
fuses = lgatFuses,
genes = lgatGenes)

```

Expand Down Expand Up @@ -203,9 +232,127 @@ allFuseEwing %>%
write_tsv(ewingsFile)
```

#### Perform selection for LGAT fusions
jharenza marked this conversation as resolved.
Show resolved Hide resolved
First pull the fusions or genes from the goi list which are not kinases for the final output file, since these will not need further interrogation.
```{r}
# Which genes/fusions are not kinases, but in the list?
# Separate LGAT fusions into genes, combine with gene list, check for not kinase
lgatFuses_df <- as.data.frame(lgatFuses) %>%
separate(lgatFuses, into = c("Gene1A", "Gene1B"), remove = F)
jharenza marked this conversation as resolved.
Show resolved Hide resolved

kinases <- annot %>%
filter(type == "Kinase") %>%
pull(Gene_Symbol)

lgatFuses_df <- lgatFuses_df %>%
dplyr::mutate(Gene1A_anno = case_when(Gene1A %in% kinases ~ "Kinase", TRUE ~ "Non-Kinase"),
Gene1B_anno = case_when(Gene1B %in% kinases ~ "Kinase", TRUE ~ "Non-Kinase"))

# Only pull fusions that do not contain kinase genes, as ones with kinases will be dealt with separately later
nonkinase_lgatFuses <- lgatFuses_df %>%
filter(Gene1A_anno == "Non-Kinase" & Gene1B_anno == "Non-Kinase") %>%
pull(lgatFuses)

# Identify non-kinase genes in LGAT goi list
nonkinase_lgatGenes <- setdiff(lgatGenes, kinases)

# Pull LGAT non-kinase fusions
nonkinaseLGAT <- filterFusion(df = putativeOncogenicDF,
fuses = nonkinase_lgatFuses,
genes = nonkinase_lgatGenes) %>%
distinct()
cansavvy marked this conversation as resolved.
Show resolved Hide resolved
```

Next, collect fusions which contain 3' kinases which are in-frame and retain the kinase domain.
Keep these for the final output file.
```{r}
three_prime_kinase_inframe <- allFuseLGAT %>%
filter(grepl("Kinase", Gene1B_anno) & Fusion_Type == "in-frame" & DomainRetainedGene1B == "Yes") %>%
cansavvy marked this conversation as resolved.
Show resolved Hide resolved
select(Sample, FusionName, Gene1A, Gene1B) %>%
distinct()

# Are there any fusions that are in-frame, but do not retain the kinase domain? Do they have in-frame fusions retaining the kinase domain in the same fusion?
three_prime_kinase_outframe <- allFuseLGAT %>%
filter(grepl("Kinase", Gene1B_anno) & Fusion_Type == "in-frame" & DomainRetainedGene1B == "No") %>%
select(Sample, FusionName) %>%
distinct()

three_prime_kinase_outframe
```

Let's look at these just to be sure the results are as expected.
```{r}
# `BS_KE56MMY0 ARHGEF2--NTRK1` one does and will be captured in the `three_prime_kinase_inframe` list, but `BS_B1C6GZ84 CHIC2--PDGFRA2` does not retain the kinase domain, so we do not want to add it.
intersect(three_prime_kinase_outframe[,c("Sample", "FusionName")], three_prime_kinase_inframe[,c("Sample", "FusionName")])
```

```{r}
# Which fusions are not in-frame?
three_prime_kinase_outframe <- allFuseLGAT %>%
filter(grepl("Kinase", Gene1B_anno) & Fusion_Type != "in-frame") %>%
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basic question: If we have an in-frame fusion but the domain is not retained, we aren't counting it as in frame? So are there fusions that are in-frame but not retaining the domain that will not be in either dataframes being set up here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just added some checks to check these and commented on the code. You are correct, there are two fusions which are in-frame, but do not retain the kinase domain. In case 1, there is another in-frame fusion with the kinase domain, so it is being captured in the three_prime_kinase_inframe list, but in case 2, the in-frame fusion does not retain the kinase domain and so we are not keeping.

distinct()
# Check they don't retain the kinase domain. They do not, so leave out.
table(three_prime_kinase_outframe$DomainRetainedGene1B)
cansavvy marked this conversation as resolved.
Show resolved Hide resolved
```

Next, filter all fusions for 5' kinase fusions which have reciprocal fusions and retain the kinase domain.
Keep these for the final output file.
```{r}
# Keep those with kinase domain retained and fusion in-frame - keep this list
five_prime_domain_intact <- allFuseLGAT %>%
filter(grepl("Kinase", Gene1A_anno) & DomainRetainedGene1A == "Yes" & Fusion_Type == "in-frame") %>%
select(Sample, FusionName) %>%
distinct()
```

Next, filter all fusions for 5' kinase fusions which have lost the kinase domain for reciprocal fusions which have a kinase domain intact and are in-frame.
Keep these for the final output file.
```{r}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking this chunk can still be simplified. In particular, I'm noticing that five_prime_domain_lost is from allFuseLGAT and five_prime_reciprocals is from five_prime_domain_lost but then we are adding back info from allFuseLGAT.

So basically both five_prime_domain_lost and five_prime_reciprocals are based on allFuseLGAT but we have extra manipulation steps. I think we can do this more directly.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I'm trying to make this shorter, I'm a bit confused on these steps.

I tried to boil it down to what looks like the essential steps to get five_prime_kinase_keep, but as of now it does not get the same result. I think part of this is I still don't really understand all that is happening in this chunk. Can you take a look at my attempt as well evaluating your original code to see how we can make these steps more efficient and clear?

In particular I know the differences have to do with the paired up "select() then distinct()" steps that are done a few timees, but its unclear to me why each of these "select() then distinct()" steps are necessary.

five_prime_kinase_keep <- allFuseLGAT %>%
  # Make reciprocal variable
  mutate(reciprocal = paste(Gene1B, Gene1A, sep ="--")) %>%
  filter(# Keep the in-frame reciprocals which have the kinase domain in tact.
        grepl("Kinase", Gene1A_anno) & reciprocal_exists == "TRUE" & DomainRetainedGene1A == "No" |
        # Also keep select the in-frame reciprocals which have the kinase domain in tact.
        Fusion_Type == "in-frame" & DomainRetainedGene1B == "Yes") %>% 
  # Select to which variables we want here
  select(Sample, FusionName = reciprocal) %>% 
  distinct()

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh

        grepl("Kinase", Gene1A_anno) & reciprocal_exists == "TRUE" & DomainRetainedGene1A == "No" |
        # Also keep select the in-frame reciprocals which have the kinase domain in tact.
        Fusion_Type == "in-frame" & DomainRetainedGene1B == "Yes")

This filtering you have here is filtering on the frame-ness/domain retention of the 5' kinase fusions and not their reciprocals (3' kinase fusions), which is why I was making that variable, then re-joining five_prime_reciprocals with the allFuseLGAT.

What I am trying to do is find any 5' kinases without the domain retained, but then go back to that original dataframe allFuseLGAT and collect the annotations of their reciprocal fusions (ie 3' kinase fusions' frameness and domain retention), assess whether those are in-frame and retain the kinase domain, then if there is a reciprocal that satisfies that, join both the 5' kinase fusion dataframe (domain lost) and 3' kinase fusion dataframe (inframe/domain retained).

Actually, I can remove:

# Add back the reciprocal of the reciprocal (the 5' kinase fusion) and keep this list
five_prime_kinase_keep <- five_prime_reciprocals %>%
  mutate(reciprocal = paste(Gene1B, Gene1A, sep ="--")) %>%
  select(Sample, FusionName = reciprocal) %>%
  distinct()

and use:

# Then, select the in-frame reciprocals which have the kinase domain in tact. Retain 5' kinase fusion information and update 3' fusion column name to FusionName for merging with allFuseLGAT.
five_prime_kinase_keep <- five_prime_domain_lost %>% 
  select(Sample, five_prime_kinase = FusionName, FusionName = three_prime_kinase) %>%
  left_join(allFuseLGAT, by = c("Sample", "FusionName")) %>%
  filter(Fusion_Type == "in-frame" & DomainRetainedGene1B == "Yes") %>%
  select(Sample, FusionName = five_prime_kinase) %>%
  distinct()

distinct() is needed because in some cases, there are multiple breakpoints for each fusion and sometimes multiple can have the same frame-ness, so this reduces them to one fusion call per sample.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated with commit - @cansavvy let me know if this clarifies!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay! Thanks for those updates! I think this looks good!

# First, get fusions with 5' kinases which lost the kinase domain and have a reciprocal, then add the reciprocal fusion.
five_prime_domain_lost <- allFuseLGAT %>%
filter(grepl("Kinase", Gene1A_anno) & reciprocal_exists == "TRUE") %>%
jharenza marked this conversation as resolved.
Show resolved Hide resolved
select(Sample, FusionName, Gene1A, Gene1B, Fusion_Type, DomainRetainedGene1A) %>%
filter(DomainRetainedGene1A == "No") %>%
distinct() %>%
mutate(three_prime_kinase = paste(Gene1B, Gene1A, sep ="--"))

# Then, select the in-frame reciprocals which have the kinase domain in tact. Retain 5' kinase fusion information and update 3' fusion column name to FusionName for merging with allFuseLGAT.
five_prime_kinase_keep <- five_prime_domain_lost %>%
select(Sample, five_prime_kinase = FusionName, FusionName = three_prime_kinase) %>%
left_join(allFuseLGAT, by = c("Sample", "FusionName")) %>%
filter(Fusion_Type == "in-frame" & DomainRetainedGene1B == "Yes") %>%
select(Sample, FusionName = five_prime_kinase) %>%
distinct()
```

Rbind lists for final table of LGAT fusions of interest
```{r}
# Rbind lists for final table of LGAT fusions of interest
subsetFuseLGAT <- bind_rows(nonkinaseLGAT,
three_prime_kinase_inframe,
five_prime_domain_intact,
five_prime_kinase_keep) %>%
select(Sample, FusionName) %>%
jharenza marked this conversation as resolved.
Show resolved Hide resolved
distinct()
```

#### Write LGAT fusions to file

```{r}
subsetFuseLGAT <- subsetFuseLGAT %>%
prepareOutput(specimensUnion)

# Which fusions of interest are not present in any samples?
missingLgatFusion <- setdiff(lgatFuses, colnames(subsetFuseLGAT)[-1])
# For the fusions that are not present, fill those columns with 0
subsetFuseLGAT[, missingLgatFusion] <- 0

subsetFuseLGAT %>%
write_tsv(lgatFile)
```

## Session Info

```{r}
sessionInfo()
```

Loading