Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Proposed Analysis: Molecularly subtype chordomas #250

Closed
jharenza opened this issue Nov 8, 2019 · 9 comments
Closed

Proposed Analysis: Molecularly subtype chordomas #250

jharenza opened this issue Nov 8, 2019 · 9 comments
Labels
cnv Related to or requires CNV data in progress Someone is working on this issue, but feel free to propose an alternative approach! molecular subtyping Related to molecular subtyping of tumors proposed analysis snv Related to or requires SNV data transcriptomic Related to or requires transcriptomic data

Comments

@jharenza
Copy link
Collaborator

jharenza commented Nov 8, 2019

Scientific goals

What are the scientific goals of the analysis?
Subtype chordomas into poorly differentiated if contain INI1/SMARCB1 loss.

Proposed methods

What methods do you plan to use to accomplish the scientific goals?

  1. Review of mutation, copy number, and expression data.
  2. Render results in tabular form in a notebook.

Add Poorly-differentiated subtype if the chordomas show SMARCB1 loss and loss of expression.

Required input data

What input data will you use for this analysis?
Histologies file, SNVs, copy number, RNA expression data.

Proposed timeline

What is the timeline for the analysis?
1-2 days

Relevant literature

If there is relevant scientific literature, put links to those items here.
Link to Loss of SMARCB1/INI1 expression in poorly differentiated chordomas.
Link to Poorly differentiated chordoma with loss of SMARCB1/INI1 expression in pediatric patients: A report of two cases and review of the literature.

@jaclyn-taroni jaclyn-taroni added cnv Related to or requires CNV data molecular subtyping Related to molecular subtyping of tumors snv Related to or requires SNV data transcriptomic Related to or requires transcriptomic data labels Nov 8, 2019
@jaclyn-taroni
Copy link
Member

This is what I think the table summarizing the results would look like based on the above:

Kids First Participant ID Kids First Biospecimen ID Focal SMARCB1 Status SMARCB1 Gene expression abundance
PT_XXXXXXXX BS_XXXXXXXX loss ...
... ... ... ...
  • Focal CN information summarized to the gene level is available in aanalyses/focal-cn-file-preparation/results
  • For verifying the loss via gene expression data, an analyst could compare a Chordoma sample's SMARCB1 abundance to that of samples (of any disease type) that show no evidence of SMARCB1 loss. Some relevant work in progress over on Focal CN RNA expression  #240.

@jaclyn-taroni
Copy link
Member

@mkoptyra and I will start on this next week.

@jaclyn-taroni jaclyn-taroni added the in progress Someone is working on this issue, but feel free to propose an alternative approach! label Nov 13, 2019
@mkoptyra
Copy link
Contributor

Brachyury / T / TBXT protein expression is a standard test clinically to diagnose chordoma.
High gene expression of TBXT gene can be a general chordoma indicator.

@jaclyn-taroni
Copy link
Member

Documenting a whiteboarding session with @mkoptyra from this morning:

Image from iOS (15)

I will post my thoughts on the SMARCB1 expression jitterplot at the bottom of this photo later today.

@jaclyn-taroni
Copy link
Member

Figured out my source of confusion from yesterday morning @mkoptyra 🙂 #248 (comment)

We might want to wait until the next data release to proceed with the SMARCB1 jitter plot because then we can more easily work with the expression data for this purpose, but I'll give it a bit of a think.

@mkoptyra
Copy link
Contributor

Thank you! That makes sense reading the #248 (imagine, that I even understood that conversation ;) )

@jaclyn-taroni
Copy link
Member

jaclyn-taroni commented Dec 5, 2019

Okay, I apologize for the delay @mkoptyra. We got v11 out this week so this is ready to be revisited. There's a greater level of wrangling required than I fully appreciated!

In addition to the fields outlined here #250 (comment), we need to include the sample_id field from pbta-histologies.tsv in the final table. This is because that field will allow us to map between RNA-seq (e.g., SMARCB1 expression values) and WGS data (e.g., SMARCB1 focal copy number status) from the same event for a given individual.

To get the SMARCB1 jitter plot in the photo here #250 (comment), you will first need to read in the collapsed expression data:

expression_data <- read_rds("../../data/pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds")

or, preferably, using file.path to handle the file path separators for us:

expression_data <- read_rds(file.path("..", "..", "data", "pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds"))

In what you have so far in your notebook, you have a data frame called chordoma_loss that contains information about SMARCB1 loss (see this chunk: https://github.com/mkoptyra/OpenPBTA-analysis/blob/83d08f755d8a5100df94bd980647f48fbdd4c518/analyses/subtype-chordoma/01-Subtype-chordoma.Rmd#L26).

Side note: We are in the process of getting copy number consensus calls (#128) through review so the copy number file input may change.

chordoma_loss looks like:

> chordoma_loss
# A tibble: 4 x 7
  biospecimen_id status copy_number ploidy ensembl         gene_symbol cytoband
  <chr>          <chr>        <dbl>  <dbl> <chr>           <chr>       <chr>   
1 BS_6F49F7WH    loss             1      3 ENSG00000099956 SMARCB1     22q11.23
2 BS_9GN1QA3Q    loss             1      2 ENSG00000099956 SMARCB1     22q11.23
3 BS_BWZTMWTM    loss             1      2 ENSG00000099956 SMARCB1     22q11.23
4 BS_JTBM5TSE    loss             3      4 ENSG00000099956 SMARCB1     22q11.23

Any copy neutral events are removed from the file that you're using. So to make the jitter plot:

  • We need to make a data frame that represents the copy neutral events.
  • We need to use the sample_id field to match the same sample's RNA-seq and WGS data.

So first thing, let's get a data frame that contains the identifier information we need but limited to chordoma samples:

chordoma_id_df <- histologies_df %>% 
  # only rows with chordoma samples
  filter(short_histology == "Chordoma") %>%
  # select only these columns that we'll need later
  select(Kids_First_Biospecimen_ID, sample_id, Kids_First_Participant_ID,
         experimental_strategy)

and then to get the copy neutral data frame described above:

copy_neutral_df <- chordoma_id_df %>% 
  # the copy events can only be taken from WGS data
  # not RNA-seq data
  # we also only want biospecimens where a loss was
  # not recorded to avoid duplicates
  filter(experimental_strategy == "WGS",
         !(Kids_First_Biospecimen_ID %in% chordoma_loss$biospecimen_id)) %>%
  # if there's no loss, let's assume status is copy
  # neutral
  mutate(status = "neutral") %>%
  # let's get the columns to match chordoma_loss
  rename(biospecimen_id = Kids_First_Biospecimen_ID) %>%
  select(biospecimen_id, status)

which looks like:

> copy_neutral_df
# A tibble: 5 x 2
  biospecimen_id status 
  <chr>          <chr>  
1 BS_59FR1NC2    neutral
2 BS_5B6XZ7YP    neutral
3 BS_FBJ516WW    neutral
4 BS_HN8DE43A    neutral
5 BS_XEVMEYFS    neutral

Now it's time to join the losses with the neutrals to get a new data frame:

chordoma_copy <- chordoma_loss %>% 
  select(biospecimen_id, status) %>%
  bind_rows(copy_neutral_df)

And we need to get the sample_id that corresponds to biospecimen_id into chordoma_copy so we can match WGS and RNA-seq biospecimens from the same event/sample:

chordoma_copy <- chordoma_copy %>%
  # get only the Kids_First_Biospecimen_ID, sample_id
  # columns from our identifier data.frame
  # then use biospecimen IDs to add the sample_id info
  inner_join(select(chordoma_id_df,
                    Kids_First_Biospecimen_ID,
                    sample_id),
             by = c("biospecimen_id" = "Kids_First_Biospecimen_ID"))

The result looks like:

> chordoma_copy
# A tibble: 9 x 3
  biospecimen_id status  sample_id
  <chr>          <chr>   <chr>    
1 BS_6F49F7WH    loss    7316-3295
2 BS_9GN1QA3Q    loss    7316-2935
3 BS_BWZTMWTM    loss    7316-1101
4 BS_JTBM5TSE    loss    7316-3632
5 BS_59FR1NC2    neutral 7316-723 
6 BS_5B6XZ7YP    neutral 7316-4062
7 BS_FBJ516WW    neutral 7316-921 
8 BS_HN8DE43A    neutral 7316-2248
9 BS_XEVMEYFS    neutral 7316-431 

Okay, now we're ready to look at SMARCB1 expression values only in chordoma:

# get the row that contains the SMARCB1 values
# gene symbols are rownames
smarcb1_expression <- expression_data[which(rownames(expression_data) == "SMARCB1"), ]

# now only the columns correspond to chordoma samples
smarcb1_expression <- smarcb1_expression[, which(colnames(expression_data) %in% chordoma_samples) ]

And then smarcb1_expression is a form we don't want to work with (1 row, each column is a chordoma sample):

> str(smarcb1_expression)
'data.frame':	1 obs. of  10 variables:
 $ BS_0VRSD9V3: num 113
 $ BS_64HCD9K3: num 33.7
 $ BS_67PX06P3: num 15.7
 $ BS_DBDXCXT5: num 18.7
 $ BS_EJ9JKM1C: num 3.89
 $ BS_GFM6EA61: num 20.9
 $ BS_GSVCN2XC: num 43.8
 $ BS_P4D8Y3S8: num 7.1
 $ BS_W36RZSFA: num 53.1
 $ BS_YB07VF1X: num 14.3

So we can do the following to get it close to the final table we want and to get this ready for plotting:

# transpose such that samples are rows
smarcb1_expression <- t(smarcb1_expression) %>%
  # make a data.frame
  as.data.frame() %>%
  # we want the rownames that are biospecimen identifers
  # as their own column called Kids_First_Biospecimen_ID
  tibble::rownames_to_column("Kids_First_Biospecimen_ID") %>%
  # give SMARCB1 column a slightly better column name
  rename(SMARCB1_expression = SMARCB1)

That step gives us this:

> smarcb1_expression
   Kids_First_Biospecimen_ID SMARCB1_expression
1                BS_0VRSD9V3             113.08
2                BS_64HCD9K3              33.73
3                BS_67PX06P3              15.68
4                BS_DBDXCXT5              18.66
5                BS_EJ9JKM1C               3.89
6                BS_GFM6EA61              20.87
7                BS_GSVCN2XC              43.79
8                BS_P4D8Y3S8               7.10
9                BS_W36RZSFA              53.15
10               BS_YB07VF1X              14.29

This also needs sample_id, so similar to above we add it in:

smarcb1_expression <- smarcb1_expression %>%
  inner_join(select(chordoma_id_df,
                    Kids_First_Biospecimen_ID,
                    sample_id),
             by = "Kids_First_Biospecimen_ID")

Now we're ready to get the copy number information and the expression values into the same data frame using sample_id:

chordoma_smarcb1_df <- smarcb1_expression %>%
  # any missing samples will get filled with NA
  # when using a full join
  full_join(chordoma_copy, by = "sample_id")

To get:

> str(chordoma_smarcb1_df)
'data.frame':	10 obs. of  5 variables:
 $ Kids_First_Biospecimen_ID: chr  "BS_0VRSD9V3" "BS_64HCD9K3" "BS_67PX06P3" "BS_DBDXCXT5" ...
 $ SMARCB1_expression       : num  113.08 33.73 15.68 18.66 3.89 ...
 $ sample_id                : chr  "7316-2248" "7316-406" "7316-3632" "7316-4062" ...
 $ biospecimen_id           : chr  "BS_HN8DE43A" NA "BS_JTBM5TSE" "BS_5B6XZ7YP" ...
 $ status                   : chr  "neutral" NA "loss" "neutral" ...

All of the information we need to look at SMARCB1 is in a single data frame, where each row is a chordoma sample.
Notice that there are two columns that contain biospecimen identifiers: one from the WGS data and one from the RNA-seq data.

Now to get a bit closer to what is in this comment #250 (comment), we can add the participant ID in using the data frame that contains the identifiers.

# this step is what adds in the participant identifier
# we use the sample_id to match between the two data.frame
chordoma_smarcb1_df <- chordoma_smarcb1_df %>%
  inner_join(distinct(select(chordoma_id_df, 
                             sample_id, 
                             Kids_First_Participant_ID)),
             by = "sample_id")

# here we're combining the two biospecimen identifiers to
# a single column where all biospecimen IDs for a sample
# will be separated by a comma
chordoma_smarcb1_df <- chordoma_smarcb1_df %>%
  mutate(Kids_First_Biospecimen_ID = if_else(
    # one sample is missing WGS data, so if that's true
    # only include biospecimen ID from RNA-seq
    is.na(biospecimen_id),
    Kids_First_Biospecimen_ID,
    paste(Kids_First_Biospecimen_ID, biospecimen_id, sep = ", ")
  ))

And we can drop the redundant biospecimen ID column and rearrange the data frame such that the identifiers come first, followed by focal SMARCB1 status and SMARCB1 expression values:

chordoma_smarcb1_df <- chordoma_smarcb1_df %>%
  select(Kids_First_Participant_ID, 
         Kids_First_Biospecimen_ID, 
         sample_id,
         status,
         SMARCB1_expression) %>%
  # we'll give 'status' a more descriptive name
  rename(focal_SMARCB1_status = status)

chordoma_smarcb1_df is ready to get written to file!

Now for the plot -- first step is putting library(ggplot2) at the top with https://github.com/mkoptyra/OpenPBTA-analysis/blob/83d08f755d8a5100df94bd980647f48fbdd4c518/analyses/subtype-chordoma/01-Subtype-chordoma.Rmd#L11.

chordoma_smarcb1_df is in perfect shape for us to make the plot. Here's one way to do it:

# this specifies that this is the data we want to plot
chordoma_smarcb1_df %>%
  # drop the sample that doesn't have WGS data
  tidyr::drop_na() %>%
  # this step specifies what should go on the x- and y-axes
  ggplot(aes(x = focal_SMARCB1_status,
             y = SMARCB1_expression)) +
  # we want a jitter plot where the points aren't too far 
  # apart that's what width does
  geom_jitter(width = 0.1) +
  # this is plotting the median as a blue diamond
  stat_summary(fun.y = "median", 
               geom = "point", 
               size = 3, 
               color = "blue", 
               shape = 18)

There are lots of ways to customize the plot to your liking. One way is to change the theme (see ggtheme documentation), but you also can control the labels (check this resource out).

Let me know if you have any questions @mkoptyra :)

@jaclyn-taroni
Copy link
Member

Per this discussion https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/475/files#r371877081 - what's left here is to evaluate the poorly differentiated calls in light of how the expression levels look overall (#387) and to have a domain expert weigh in.

@jaclyn-taroni
Copy link
Member

I am going to call this subsumed by #608.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
cnv Related to or requires CNV data in progress Someone is working on this issue, but feel free to propose an alternative approach! molecular subtyping Related to molecular subtyping of tumors proposed analysis snv Related to or requires SNV data transcriptomic Related to or requires transcriptomic data
Projects
None yet
Development

No branches or pull requests

3 participants