Proposed Analysis: Molecularly subtype chordomas #250

jharenza · 2019-11-08T17:31:35Z

Scientific goals

What are the scientific goals of the analysis?
Subtype chordomas into poorly differentiated if contain INI1/SMARCB1 loss.

Proposed methods

What methods do you plan to use to accomplish the scientific goals?

Review of mutation, copy number, and expression data.
Render results in tabular form in a notebook.

Add Poorly-differentiated subtype if the chordomas show SMARCB1 loss and loss of expression.

Required input data

What input data will you use for this analysis?
Histologies file, SNVs, copy number, RNA expression data.

Proposed timeline

What is the timeline for the analysis?
1-2 days

Relevant literature

If there is relevant scientific literature, put links to those items here.
Link to Loss of SMARCB1/INI1 expression in poorly differentiated chordomas.
Link to Poorly differentiated chordoma with loss of SMARCB1/INI1 expression in pediatric patients: A report of two cases and review of the literature.

jaclyn-taroni · 2019-11-10T22:14:51Z

This is what I think the table summarizing the results would look like based on the above:

Kids First Participant ID	Kids First Biospecimen ID	Focal SMARCB1 Status	SMARCB1 Gene expression abundance
PT_XXXXXXXX	BS_XXXXXXXX	loss	...
...	...	...	...

Focal CN information summarized to the gene level is available in aanalyses/focal-cn-file-preparation/results
For verifying the loss via gene expression data, an analyst could compare a Chordoma sample's SMARCB1 abundance to that of samples (of any disease type) that show no evidence of SMARCB1 loss. Some relevant work in progress over on Focal CN RNA expression #240.

jaclyn-taroni · 2019-11-13T00:35:54Z

@mkoptyra and I will start on this next week.

mkoptyra · 2019-11-21T14:12:32Z

Brachyury / T / TBXT protein expression is a standard test clinically to diagnose chordoma.
High gene expression of TBXT gene can be a general chordoma indicator.

jaclyn-taroni · 2019-11-21T16:30:09Z

Documenting a whiteboarding session with @mkoptyra from this morning:

I will post my thoughts on the SMARCB1 expression jitterplot at the bottom of this photo later today.

jaclyn-taroni · 2019-11-22T19:28:27Z

Figured out my source of confusion from yesterday morning @mkoptyra 🙂 #248 (comment)

We might want to wait until the next data release to proceed with the SMARCB1 jitter plot because then we can more easily work with the expression data for this purpose, but I'll give it a bit of a think.

mkoptyra · 2019-11-22T21:44:39Z

Thank you! That makes sense reading the #248 (imagine, that I even understood that conversation ;) )

jaclyn-taroni · 2019-12-05T14:28:01Z

Okay, I apologize for the delay @mkoptyra. We got v11 out this week so this is ready to be revisited. There's a greater level of wrangling required than I fully appreciated!

In addition to the fields outlined here #250 (comment), we need to include the sample_id field from pbta-histologies.tsv in the final table. This is because that field will allow us to map between RNA-seq (e.g., SMARCB1 expression values) and WGS data (e.g., SMARCB1 focal copy number status) from the same event for a given individual.

To get the SMARCB1 jitter plot in the photo here #250 (comment), you will first need to read in the collapsed expression data:

expression_data <- read_rds("../../data/pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds")

or, preferably, using file.path to handle the file path separators for us:

expression_data <- read_rds(file.path("..", "..", "data", "pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds"))

In what you have so far in your notebook, you have a data frame called chordoma_loss that contains information about SMARCB1 loss (see this chunk: https://github.com/mkoptyra/OpenPBTA-analysis/blob/83d08f755d8a5100df94bd980647f48fbdd4c518/analyses/subtype-chordoma/01-Subtype-chordoma.Rmd#L26).

Side note: We are in the process of getting copy number consensus calls (#128) through review so the copy number file input may change.

chordoma_loss looks like:

> chordoma_loss
# A tibble: 4 x 7
  biospecimen_id status copy_number ploidy ensembl         gene_symbol cytoband
  <chr>          <chr>        <dbl>  <dbl> <chr>           <chr>       <chr>   
1 BS_6F49F7WH    loss             1      3 ENSG00000099956 SMARCB1     22q11.23
2 BS_9GN1QA3Q    loss             1      2 ENSG00000099956 SMARCB1     22q11.23
3 BS_BWZTMWTM    loss             1      2 ENSG00000099956 SMARCB1     22q11.23
4 BS_JTBM5TSE    loss             3      4 ENSG00000099956 SMARCB1     22q11.23

Any copy neutral events are removed from the file that you're using. So to make the jitter plot:

We need to make a data frame that represents the copy neutral events.
We need to use the sample_id field to match the same sample's RNA-seq and WGS data.

So first thing, let's get a data frame that contains the identifier information we need but limited to chordoma samples:

chordoma_id_df <- histologies_df %>% 
  # only rows with chordoma samples
  filter(short_histology == "Chordoma") %>%
  # select only these columns that we'll need later
  select(Kids_First_Biospecimen_ID, sample_id, Kids_First_Participant_ID,
         experimental_strategy)

and then to get the copy neutral data frame described above:

copy_neutral_df <- chordoma_id_df %>% 
  # the copy events can only be taken from WGS data
  # not RNA-seq data
  # we also only want biospecimens where a loss was
  # not recorded to avoid duplicates
  filter(experimental_strategy == "WGS",
         !(Kids_First_Biospecimen_ID %in% chordoma_loss$biospecimen_id)) %>%
  # if there's no loss, let's assume status is copy
  # neutral
  mutate(status = "neutral") %>%
  # let's get the columns to match chordoma_loss
  rename(biospecimen_id = Kids_First_Biospecimen_ID) %>%
  select(biospecimen_id, status)

which looks like:

> copy_neutral_df
# A tibble: 5 x 2
  biospecimen_id status 
  <chr>          <chr>  
1 BS_59FR1NC2    neutral
2 BS_5B6XZ7YP    neutral
3 BS_FBJ516WW    neutral
4 BS_HN8DE43A    neutral
5 BS_XEVMEYFS    neutral

Now it's time to join the losses with the neutrals to get a new data frame:

chordoma_copy <- chordoma_loss %>% 
  select(biospecimen_id, status) %>%
  bind_rows(copy_neutral_df)

And we need to get the sample_id that corresponds to biospecimen_id into chordoma_copy so we can match WGS and RNA-seq biospecimens from the same event/sample:

chordoma_copy <- chordoma_copy %>%
  # get only the Kids_First_Biospecimen_ID, sample_id
  # columns from our identifier data.frame
  # then use biospecimen IDs to add the sample_id info
  inner_join(select(chordoma_id_df,
                    Kids_First_Biospecimen_ID,
                    sample_id),
             by = c("biospecimen_id" = "Kids_First_Biospecimen_ID"))

The result looks like:

> chordoma_copy
# A tibble: 9 x 3
  biospecimen_id status  sample_id
  <chr>          <chr>   <chr>    
1 BS_6F49F7WH    loss    7316-3295
2 BS_9GN1QA3Q    loss    7316-2935
3 BS_BWZTMWTM    loss    7316-1101
4 BS_JTBM5TSE    loss    7316-3632
5 BS_59FR1NC2    neutral 7316-723 
6 BS_5B6XZ7YP    neutral 7316-4062
7 BS_FBJ516WW    neutral 7316-921 
8 BS_HN8DE43A    neutral 7316-2248
9 BS_XEVMEYFS    neutral 7316-431

Okay, now we're ready to look at SMARCB1 expression values only in chordoma:

# get the row that contains the SMARCB1 values
# gene symbols are rownames
smarcb1_expression <- expression_data[which(rownames(expression_data) == "SMARCB1"), ]

# now only the columns correspond to chordoma samples
smarcb1_expression <- smarcb1_expression[, which(colnames(expression_data) %in% chordoma_samples) ]

And then smarcb1_expression is a form we don't want to work with (1 row, each column is a chordoma sample):

> str(smarcb1_expression)
'data.frame':	1 obs. of  10 variables:
 $ BS_0VRSD9V3: num 113
 $ BS_64HCD9K3: num 33.7
 $ BS_67PX06P3: num 15.7
 $ BS_DBDXCXT5: num 18.7
 $ BS_EJ9JKM1C: num 3.89
 $ BS_GFM6EA61: num 20.9
 $ BS_GSVCN2XC: num 43.8
 $ BS_P4D8Y3S8: num 7.1
 $ BS_W36RZSFA: num 53.1
 $ BS_YB07VF1X: num 14.3

So we can do the following to get it close to the final table we want and to get this ready for plotting:

# transpose such that samples are rows
smarcb1_expression <- t(smarcb1_expression) %>%
  # make a data.frame
  as.data.frame() %>%
  # we want the rownames that are biospecimen identifers
  # as their own column called Kids_First_Biospecimen_ID
  tibble::rownames_to_column("Kids_First_Biospecimen_ID") %>%
  # give SMARCB1 column a slightly better column name
  rename(SMARCB1_expression = SMARCB1)

That step gives us this:

> smarcb1_expression
   Kids_First_Biospecimen_ID SMARCB1_expression
1                BS_0VRSD9V3             113.08
2                BS_64HCD9K3              33.73
3                BS_67PX06P3              15.68
4                BS_DBDXCXT5              18.66
5                BS_EJ9JKM1C               3.89
6                BS_GFM6EA61              20.87
7                BS_GSVCN2XC              43.79
8                BS_P4D8Y3S8               7.10
9                BS_W36RZSFA              53.15
10               BS_YB07VF1X              14.29

This also needs sample_id, so similar to above we add it in:

smarcb1_expression <- smarcb1_expression %>%
  inner_join(select(chordoma_id_df,
                    Kids_First_Biospecimen_ID,
                    sample_id),
             by = "Kids_First_Biospecimen_ID")

Now we're ready to get the copy number information and the expression values into the same data frame using sample_id:

chordoma_smarcb1_df <- smarcb1_expression %>%
  # any missing samples will get filled with NA
  # when using a full join
  full_join(chordoma_copy, by = "sample_id")

To get:

> str(chordoma_smarcb1_df)
'data.frame':	10 obs. of  5 variables:
 $ Kids_First_Biospecimen_ID: chr  "BS_0VRSD9V3" "BS_64HCD9K3" "BS_67PX06P3" "BS_DBDXCXT5" ...
 $ SMARCB1_expression       : num  113.08 33.73 15.68 18.66 3.89 ...
 $ sample_id                : chr  "7316-2248" "7316-406" "7316-3632" "7316-4062" ...
 $ biospecimen_id           : chr  "BS_HN8DE43A" NA "BS_JTBM5TSE" "BS_5B6XZ7YP" ...
 $ status                   : chr  "neutral" NA "loss" "neutral" ...

All of the information we need to look at SMARCB1 is in a single data frame, where each row is a chordoma sample.
Notice that there are two columns that contain biospecimen identifiers: one from the WGS data and one from the RNA-seq data.

Now to get a bit closer to what is in this comment #250 (comment), we can add the participant ID in using the data frame that contains the identifiers.

# this step is what adds in the participant identifier
# we use the sample_id to match between the two data.frame
chordoma_smarcb1_df <- chordoma_smarcb1_df %>%
  inner_join(distinct(select(chordoma_id_df, 
                             sample_id, 
                             Kids_First_Participant_ID)),
             by = "sample_id")

# here we're combining the two biospecimen identifiers to
# a single column where all biospecimen IDs for a sample
# will be separated by a comma
chordoma_smarcb1_df <- chordoma_smarcb1_df %>%
  mutate(Kids_First_Biospecimen_ID = if_else(
    # one sample is missing WGS data, so if that's true
    # only include biospecimen ID from RNA-seq
    is.na(biospecimen_id),
    Kids_First_Biospecimen_ID,
    paste(Kids_First_Biospecimen_ID, biospecimen_id, sep = ", ")
  ))

And we can drop the redundant biospecimen ID column and rearrange the data frame such that the identifiers come first, followed by focal SMARCB1 status and SMARCB1 expression values:

chordoma_smarcb1_df <- chordoma_smarcb1_df %>%
  select(Kids_First_Participant_ID, 
         Kids_First_Biospecimen_ID, 
         sample_id,
         status,
         SMARCB1_expression) %>%
  # we'll give 'status' a more descriptive name
  rename(focal_SMARCB1_status = status)

chordoma_smarcb1_df is ready to get written to file!

Now for the plot -- first step is putting library(ggplot2) at the top with https://github.com/mkoptyra/OpenPBTA-analysis/blob/83d08f755d8a5100df94bd980647f48fbdd4c518/analyses/subtype-chordoma/01-Subtype-chordoma.Rmd#L11.

chordoma_smarcb1_df is in perfect shape for us to make the plot. Here's one way to do it:

# this specifies that this is the data we want to plot
chordoma_smarcb1_df %>%
  # drop the sample that doesn't have WGS data
  tidyr::drop_na() %>%
  # this step specifies what should go on the x- and y-axes
  ggplot(aes(x = focal_SMARCB1_status,
             y = SMARCB1_expression)) +
  # we want a jitter plot where the points aren't too far 
  # apart that's what width does
  geom_jitter(width = 0.1) +
  # this is plotting the median as a blue diamond
  stat_summary(fun.y = "median", 
               geom = "point", 
               size = 3, 
               color = "blue", 
               shape = 18)

There are lots of ways to customize the plot to your liking. One way is to change the theme (see ggtheme documentation), but you also can control the labels (check this resource out).

Let me know if you have any questions @mkoptyra :)

jaclyn-taroni · 2020-02-02T17:40:01Z

Per this discussion https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/475/files#r371877081 - what's left here is to evaluate the poorly differentiated calls in light of how the expression levels look overall (#387) and to have a domain expert weigh in.

jaclyn-taroni · 2020-03-09T18:28:51Z

I am going to call this subsumed by #608.

jharenza added the proposed analysis label Nov 8, 2019

jharenza mentioned this issue Nov 8, 2019

Planned Analysis: Molecularly subtype all tumors #19

Closed

7 tasks

jaclyn-taroni added cnv Related to or requires CNV data molecular subtyping Related to molecular subtyping of tumors snv Related to or requires SNV data transcriptomic Related to or requires transcriptomic data labels Nov 8, 2019

jaclyn-taroni added the in progress Someone is working on this issue, but feel free to propose an alternative approach! label Nov 13, 2019

jaclyn-taroni mentioned this issue Dec 19, 2019

Proposed Analysis: Add additional subtypes to medulloblastoma tumors #247

Closed

mkoptyra mentioned this issue Jan 24, 2020

Subtype chordoma #475

Merged

5 tasks

jaclyn-taroni closed this as completed Mar 9, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposed Analysis: Molecularly subtype chordomas #250

Proposed Analysis: Molecularly subtype chordomas #250

jharenza commented Nov 8, 2019

jaclyn-taroni commented Nov 10, 2019

jaclyn-taroni commented Nov 13, 2019

mkoptyra commented Nov 21, 2019

jaclyn-taroni commented Nov 21, 2019

jaclyn-taroni commented Nov 22, 2019

mkoptyra commented Nov 22, 2019

jaclyn-taroni commented Dec 5, 2019 •

edited

Loading

jaclyn-taroni commented Feb 2, 2020

jaclyn-taroni commented Mar 9, 2020

Proposed Analysis: Molecularly subtype chordomas #250

Proposed Analysis: Molecularly subtype chordomas #250

Comments

jharenza commented Nov 8, 2019

Scientific goals

Proposed methods

Required input data

Proposed timeline

Relevant literature

jaclyn-taroni commented Nov 10, 2019

jaclyn-taroni commented Nov 13, 2019

mkoptyra commented Nov 21, 2019

jaclyn-taroni commented Nov 21, 2019

jaclyn-taroni commented Nov 22, 2019

mkoptyra commented Nov 22, 2019

jaclyn-taroni commented Dec 5, 2019 • edited Loading

jaclyn-taroni commented Feb 2, 2020

jaclyn-taroni commented Mar 9, 2020

jaclyn-taroni commented Dec 5, 2019 •

edited

Loading