AlexsLemonade · jaclyn-taroni · Jan 31, 2020 · Nov 21, 2019 · Nov 21, 2019 · Jan 24, 2020
diff --git a/.circleci/config.yml b/.circleci/config.yml
@@ -141,6 +141,11 @@ jobs:
           name: Molecular subtyping - Non-MB/Non-ATRT Embryonal tumors 
           command: OPENPBTA_SUBSET=0 ./scripts/run_in_ci.sh bash analyses/molecular-subtyping-embryonal/run-embryonal-subtyping.sh
 
+      - run:
+          name: Molecular subtyping - Chordoma
+          command: ./scripts/run_in_ci.sh Rscript -e "rmarkdown::render('analyses/molecular-subtyping-chordoma/01-Subtype-chordoma.Rmd', clean = TRUE)"
+
+
          ################################
          #### Add your analysis here ####
          ################################

diff --git a/analyses/README.md b/analyses/README.md
@@ -26,6 +26,7 @@ Note that _nearly all_ modules use the harmonized clinical data file (`pbta-hist
 | [`independent-samples`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/independent-samples) | `pbta-histologies.tsv` | Generates independent specimen lists for WGS/WXS samples | `results/independent-specimens.wgs.primary.tsv` <br> `results/independent-specimens.wgs.primary-plus.tsv` <br> `results/independent-specimens.wgswxs.primary.tsv` <br> `results/independent-specimens.wgswxs.primary-plus.tsv` (included in data download)
 | [`interaction-plots`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/interaction-plots) | `independent-specimens.wgs.primary-plus.tsv` <br> `pbta-snv-consensus-mutation.maf.tsv.gz` | Creates interaction plots for mutation mutual exclusivity/co-occurrence [#13](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/13); may be updated to include other data types (e.g., fusions) | N/A
 | [`molecular-subtyping-ATRT`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-ATRT) | `analyses/gene-set-enrichment-analysis/results/gsva_scores_stranded.tsv` <br> `pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds` <br> `analyses/focal-cn-file-preparation/results/cnvkit_annotated_cn_autosomes.tsv.bz2` <br> `pbta-snv-consensus-mutation-tmb-all.tsv`  <br>  `pbta-cnv-cnvkit-gistic.zip` | *In progress*; summarizing data into tabular format in order to molecularly subtype ATRT samples [#244](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/244) | N/A
+| [`molecular-subtyping-chordoma`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-chordoma) | `analyses/focal-cn-file-preparation/results/cnvkit_annotated_cn_autosomes.tsv.gz` ([`fa21429`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/fa214291713575be7fd20c92374b268870f4173f)) <br> `pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds` | *In progress*; identifying poorly-differentiated chordoma samples per [#250](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/250) | N/A
 | [`molecular-subtyping-embryonal`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-embryonal) | `fusion_summary_embryonal_foi.tsv` <br>  `pbta-histologies.tsv` <br> `analyses/focal-cn-file-preparation/cnvkit_annotated_cn_autosomes.tsv.bz2` <br> `analyses/focal-cn-file-preparation/cnvkit_annotated_cn_x_and_y.tsv.bz2` <br> `pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds` <br> `pbta-gene-expression-rsem-fpkm-collapsed.polya.rds` | *In progress*; molecular subtyping of non-medulloblastoma, non-ATRT embryonal tumors [#251](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/251) | N/A
 | [`molecular-subtyping-HGG`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-HGG) | `pbta-snv-consensus-mutation.maf.tsv.gz` <br> `analyses/focal-cn-preparation/results/cnvkit_annotated_cn_autosomes.tsv.bz2` <br> `pbta-fusion-putative-oncogenic.tsv` <br> `pbta-cnv-cnvkit-gistic.zip` <br> `pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds` <br> `pbta-gene-expression-rsem-fpkm-collapsed.polya.rds` | *In progress*; molecular subtyping of high-grade glioma samples [#249](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/249) | N/A
 | [`molecular-subtyping-SHH-tp53`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-SHH-tp53) | `pbta-histologies` <br> `pbta-snv-consensus-mutation.maf.tsv.gz` | Identify the SHH-classified medulloblastoma samples that have TP53 mutations | N/A

diff --git a/analyses/molecular-subtyping-chordoma/01-Subtype-chordoma.Rmd b/analyses/molecular-subtyping-chordoma/01-Subtype-chordoma.Rmd
@@ -0,0 +1,231 @@
+---
+title: "Subtyping chordoma"
+output: html_notebook
+author: Mateusz Koptyra 
+date: 20191121
+---
+
+This notebook prepares _SMARCB1_ copy number and expression data for chordoma samples for the purpose of identifying poorly-differentiated chordoma samples, which are characterized by loss of _SMARCB1_.
+
+## Set up
+
+```{r}
+library(dplyr)
+library(readr)
+library(ggplot2)
+```
+
+### Read in data
+
+```{r}
+histologies_df <- read_tsv(file.path("..", "..", "data", 
+                                     "pbta-histologies.tsv"))
+```
+```{r}
+chordoma_samples <- histologies_df %>%
+  filter(short_histology == "Chordoma") %>% 
+  pull(Kids_First_Biospecimen_ID)
+```
+
+```{r}
+# TODO: update to use consensus file and likely a more permissive version of 
+# the focal-cn-file-preparation annotation step
+# Here, we're using an older version of the annotated files that used exons
+focal_cn_df <- read_tsv("https://github.com/AlexsLemonade/OpenPBTA-analysis/raw/fa214291713575be7fd20c92374b268870f4173f/analyses/focal-cn-file-preparation/results/cnvkit_annotated_cn_autosomes.tsv.gz")
+```
+
+```{r}
+# The sample_id field from pbta-histologies.tsv must be included in the final table: it lets us map between RNA-seq (e.g., SMARCB1 expression values) and WGS data (e.g., SMARCB1 focal copy number status) from the same event for a given individual.
+# To reproduce the SMARCB1 jitter plot shown in #250, first read in the collapsed expression data
+expression_data <- read_rds(file.path("..", "..", "data", "pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds"))
+```
+
+### Output
+
+```{r}
+# scatterplot
+plot_dir <- "plots"
+if (!dir.exists(plot_dir)) {
+  dir.create(plot_dir)
+}
+plot_file <- file.path(plot_dir, "smarcb1_expression_copy_status.png")
+
+# tabular data
+results_dir <- "results"
+if (!dir.exists(results_dir)) {
+  dir.create(results_dir)
+}
+output_file <- file.path(results_dir, "chordoma_smarcb1_status.tsv")
+```
+
+## Prepare the data
+
+```{r}
+chordoma_loss <- focal_cn_df %>% 
+  filter(biospecimen_id %in% chordoma_samples, gene_symbol == "SMARCB1")
+
+chordoma_loss 
+```
+
+```{r}
+chordoma_id_df <- histologies_df %>% 
+  # only rows with chordoma samples
+  filter(short_histology == "Chordoma") %>%
+  # select only these columns that we'll need later
+  select(Kids_First_Biospecimen_ID, sample_id, Kids_First_Participant_ID,
+         experimental_strategy)
+chordoma_id_df
+```
+
+```{r}
+copy_neutral_df <- chordoma_id_df %>% 
+  # the copy events can only be taken from WGS data not RNA-seq data
+  # we also only want biospecimens where a loss was not recorded to avoid duplicates
+  filter(experimental_strategy == "WGS",
+         !(Kids_First_Biospecimen_ID %in% chordoma_loss$biospecimen_id)) %>%
+  # if there's no loss, let's assume status is copy neutral
+  mutate(status = "neutral") %>%
+  # let's get the columns to match chordoma_loss
+  rename(biospecimen_id = Kids_First_Biospecimen_ID) %>%
+  select(biospecimen_id, status)
+copy_neutral_df
+```
+
+```{r}
+# remove large copy number data frame
+rm(focal_cn_df)
+```
+
+```{r}
+chordoma_copy <- chordoma_loss %>% 
+  #join the losses with the neutrals to get a new data frame
+  select(biospecimen_id, status) %>%
+  bind_rows(copy_neutral_df)
+chordoma_copy
+```
+
+We need to add the `sample_id` that corresponds to each `biospecimen_id` to `chordoma_copy` so we can match WGS and RNA-seq biospecimens from the same event/sample:
+```{r}
+chordoma_copy <- chordoma_copy %>%
+  # get only the Kids_First_Biospecimen_ID, sample_id columns from our identifier data.frame
+  # then use biospecimen IDs to add the sample_id info
+  inner_join(select(chordoma_id_df,
+                    Kids_First_Biospecimen_ID,
+                    sample_id),
+             by = c("biospecimen_id" = "Kids_First_Biospecimen_ID"))
+chordoma_copy
+```
+
+Look at _SMARCB1_ expression values in chordoma samples only:
+
+```{r}
+# get the row that contains the SMARCB1 values
+# gene symbols are rownames
+smarcb1_expression <- expression_data[which(rownames(expression_data) == "SMARCB1"), ]
+
+# now keep only the columns that correspond to chordoma samples
+smarcb1_expression <- smarcb1_expression[, which(colnames(expression_data) %in% chordoma_samples) ]
+smarcb1_expression
+```
+
+```{r}
+# remove large expression matrix that's no longer needed
+rm(expression_data)
+```
+
+`smarcb1_expression` is not in a convenient form, so it needs to be transposed:
+
+```{r}
+# transpose such that samples are rows
+smarcb1_expression <- t(smarcb1_expression) %>%
+  # make a data.frame
+  as.data.frame() %>%
+  # we want the rownames that are biospecimen identifiers as their own column called Kids_First_Biospecimen_ID
+  tibble::rownames_to_column("Kids_First_Biospecimen_ID") %>%
+  # give SMARCB1 column a slightly better column name
+  rename(SMARCB1_expression = SMARCB1)
+smarcb1_expression
+```
+This data frame also needs `sample_id`, so we add it in:
+
+```{r}
+smarcb1_expression <- smarcb1_expression %>%
+  inner_join(select(chordoma_id_df,
+                    Kids_First_Biospecimen_ID,
+                    sample_id),
+             by = "Kids_First_Biospecimen_ID")
+smarcb1_expression
+```
+
+```{r}
+chordoma_smarcb1_df <- smarcb1_expression %>%
+  # any missing samples will get filled with NA when using a full join
+  full_join(chordoma_copy, by = "sample_id")
+chordoma_smarcb1_df
+```
+
+```{r}
+# this step adds in the participant identifier, using sample_id to match between the two data frames
+chordoma_smarcb1_df <- chordoma_smarcb1_df %>%
+  inner_join(distinct(select(chordoma_id_df, 
+                             sample_id, 
+                             Kids_First_Participant_ID)),
+             by = "sample_id")
+
+# combine the two biospecimen identifiers into a single column (all biospecimen IDs for a sample, separated by a comma)
+chordoma_smarcb1_df <- chordoma_smarcb1_df %>%
+  mutate(Kids_First_Biospecimen_ID = if_else(
+    # one sample is missing WGS data, so if that's true only include biospecimen ID from RNA-seq
+    is.na(biospecimen_id),
+    Kids_First_Biospecimen_ID,
+    paste(Kids_First_Biospecimen_ID, biospecimen_id, sep = ", ")
+  ))
+chordoma_smarcb1_df
+```
+
+```{r}
+chordoma_smarcb1_df <- chordoma_smarcb1_df %>%
+  select(Kids_First_Participant_ID, 
+         Kids_First_Biospecimen_ID, 
+         sample_id,
+         status,
+         SMARCB1_expression) %>%
+  # 'status' is replaced with a more descriptive name
+  rename(focal_SMARCB1_status = status)
+chordoma_smarcb1_df
+```
+
+### Plot _SMARCB1_ expression
+
+copy loss vs. copy neutral 
+
+```{r}
+# this specifies that this is the data we want to plot
+chordoma_smarcb1_df %>%
+  # drop the sample that doesn't have WGS data
+  tidyr::drop_na() %>%
+  # this step specifies what should go on the x- and y-axes
+  ggplot(aes(x = focal_SMARCB1_status,
+             y = SMARCB1_expression)) +
+  # we want a jitter plot where the points aren't too far
+  # apart; that's what width controls
+  geom_jitter(width = 0.1) +
+  # this is plotting the median as a blue diamond
+  stat_summary(fun.y = "median", 
+               geom = "point", 
+               size = 3, 
+               color = "blue", 
+               shape = 18)
+```
+
+```{r}
+ggsave(filename = plot_file)
+```
+
+Write the table to file.
+
+```{r}
+chordoma_smarcb1_df %>%
+  write_tsv(output_file)
+```
+
diff --git a/analyses/molecular-subtyping-chordoma/01-Subtype-chordoma.nb.html b/analyses/molecular-subtyping-chordoma/01-Subtype-chordoma.nb.html
diff --git a/analyses/molecular-subtyping-chordoma/README.md b/analyses/molecular-subtyping-chordoma/README.md
@@ -0,0 +1,15 @@
+## Molecular subtyping of chordomas
+
+**Module authors:** Mateusz Koptyra ([@mkoptyra](https://github.com/mkoptyra))
+
+This module consists of a single notebook that looks at _SMARCB1_ focal copy status and expression levels.
+It can be run via the command line with the following:
+
+```
+Rscript -e "rmarkdown::render('01-Subtype-chordoma.Rmd', clean = TRUE)"
+```
+
+### Notes on copy status
+
+This notebook uses an older version of the annotated CNVkit file from the `focal-cn-file-preparation` module ([`fa21429`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/fa214291713575be7fd20c92374b268870f4173f)), as the current version of the annotated file from CNVkit may be too restrictive (see: [#473](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/473)).
+This will need to be updated to use the copy number consensus file as well.
diff --git a/analyses/molecular-subtyping-chordoma/plots/smarcb1_expression_copy_status.png b/analyses/molecular-subtyping-chordoma/plots/smarcb1_expression_copy_status.png
diff --git a/analyses/molecular-subtyping-chordoma/results/chordoma_smarcb1_status.tsv b/analyses/molecular-subtyping-chordoma/results/chordoma_smarcb1_status.tsv
@@ -0,0 +1,11 @@
+Kids_First_Participant_ID	Kids_First_Biospecimen_ID	sample_id	focal_SMARCB1_status	SMARCB1_expression
+PT_3WA7SBQ6	BS_0VRSD9V3, BS_HN8DE43A	7316-2248	neutral	113.08
+PT_41BCPB7R	BS_64HCD9K3	7316-406	NA	33.73
+PT_WTMSD2WB	BS_67PX06P3, BS_JTBM5TSE	7316-3632	loss	15.68
+PT_HFQNKP5X	BS_DBDXCXT5, BS_5B6XZ7YP	7316-4062	loss	18.66
+PT_HFQNKP5X	BS_EJ9JKM1C, BS_6F49F7WH	7316-3295	loss	3.89
+PT_HFQNKP5X	BS_GFM6EA61, BS_FBJ516WW	7316-921	neutral	20.87
+PT_YZ8A8A36	BS_GSVCN2XC, BS_59FR1NC2	7316-723	neutral	43.79
+PT_F1086Z0A	BS_P4D8Y3S8, BS_BWZTMWTM	7316-1101	loss	7.1
+PT_7TRGHZBK	BS_W36RZSFA, BS_XEVMEYFS	7316-431	neutral	53.15
+PT_HFQNKP5X	BS_YB07VF1X, BS_9GN1QA3Q	7316-2935	loss	14.29
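
The `NA` in `focal_SMARCB1_status` for sample `7316-406` follows from the `full_join()` on `sample_id` in the notebook: a sample with RNA-seq data but no WGS biospecimen has no copy number row to match. Below is a minimal sketch of that behavior. The toy data frames, identifiers, and values are assumptions made up for illustration and are not project data.

```r
library(dplyr)

# Hypothetical RNA-seq table: one row per RNA-seq biospecimen
toy_expression <- tibble::tibble(
  Kids_First_Biospecimen_ID = c("BS_RNA_1", "BS_RNA_2"),
  sample_id = c("S-1", "S-2"),
  SMARCB1_expression = c(10.5, 42.0)
)

# Hypothetical WGS copy status table: sample S-2 has no WGS biospecimen
toy_copy <- tibble::tibble(
  biospecimen_id = "BS_WGS_1",
  sample_id = "S-1",
  status = "loss"
)

toy_joined <- toy_expression %>%
  # samples missing from either table are kept and filled with NA
  full_join(toy_copy, by = "sample_id") %>%
  # collapse the RNA-seq and WGS biospecimen IDs into one column,
  # keeping only the RNA-seq ID when there is no WGS biospecimen
  mutate(Kids_First_Biospecimen_ID = if_else(
    is.na(biospecimen_id),
    Kids_First_Biospecimen_ID,
    paste(Kids_First_Biospecimen_ID, biospecimen_id, sep = ", ")
  ))

toy_joined
```

In this sketch, `S-2` keeps its expression value but has a missing `status`, mirroring the row for `7316-406`; the plotting step in the notebook then drops such rows with `tidyr::drop_na()`.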