This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

PR 1 of n: Molecular Subtyping - HGG (Defining Lesions) #352

Merged

Contributor

@cbethell cbethell commented Dec 18, 2019

Purpose/implementation Section

To molecularly subtype HGG samples.

What scientific question is your analysis addressing?

What are the samples in the OpenPBTA dataset that fit into the HGG molecular subtypes?

What was your approach?

I joined the data relevant to molecularly subtyping HGG samples, including the metadata, RNA expression, SNV, and CN data.

I began by following the plan in the comment here.

What GitHub issue does your pull request address?

This PR addresses issue #249.

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

  • Does this analysis appear to be correct?
  • Do the variables in the final data.frame seem sufficient to molecularly subtype the HGG samples?
  • Is there any obvious refactoring needed?

Is there anything that you want to discuss further?

Note: A heatmap displaying the data in this PR is upcoming.

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes, this PR is ready for review.

Results

What types of results are included (e.g., table, figure)?

A tsv file with the data in the final data.frame of the R notebook in this PR can be found in the results directory of this module at results/HGG_molecular_subtypes.tsv.

The table can also be viewed in the HTML output here.

What is your summary of the results?

I have not yet developed a summary of the current results beyond the final data.frame.

Reproducibility Checklist

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • This analysis has been added to continuous integration.

PR Checklist

  • Run a linter
  • Set the seed (NA)
  • Comments and/or documentation up to date
  • Double check your paths
  • Spell check any Rmd file or md file
  • Restart R and run all notebooks fresh and save

@jaclyn-taroni
Member

jaclyn-taroni commented Dec 19, 2019

@cbethell my read of #249 is that we want a column that tells us about the presence or absence of the following specific mutations:

H3F3A K28M, H3F3A G35R/V or HIST1H3B K28M

Where the first step is to check all samples for these and then the subsequent steps should include all samples already classified as HGG + any that would be reclassified on the basis of the presence of these mutations.

We also want to check for the presence or absence of

IDH1 R132H

and looks like IDH1 R172, too.

EDIT: Check BRAF V600E as well. It should only be present in LGG.

Member

@jaclyn-taroni jaclyn-taroni left a comment

Hi @cbethell,

I'm going to outline my interpretation of what #249 is asking for and you, @jharenza, and I can go back and forth as needed.

I would split this up into (at least) two stages: 1) check every sample for the defining lesions outlined in #249 and 2) wrangle the other information for all the HGG samples identified via disease_type_new, plus any samples that were not yet classified as HGG but should be based on a defining lesion.

So, the first notebook looks at the defining lesions and essentially produces the following table where the mutation columns are a binary outcome:

| Kids_First_Participant_ID | sample_id | Kids_First_Biospecimen_ID | H3F3A K28M | HIST1H3B K28M | H3F3A G35R | H3F3A G35V |

This notebook should note any inconsistencies, e.g., samples that would need to be reclassified.
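A minimal sketch of how such a binary table might be assembled, assuming a MAF-like SNV data.frame with `Tumor_Sample_Barcode`, `Hugo_Symbol`, and `HGVSp_Short` columns. The toy data, the `has_lesion()` helper, and the `all_samples` table are hypothetical and only illustrate the idea; this is not the notebook's actual code.

```r
library(dplyr)

# Toy SNV calls in a MAF-like layout (hypothetical data for illustration)
snv_df <- data.frame(
  Tumor_Sample_Barcode = c("BS_1", "BS_1", "BS_2", "BS_3"),
  Hugo_Symbol = c("H3F3A", "TP53", "HIST1H3B", "H3F3A"),
  HGVSp_Short = c("p.K28M", "p.R175H", "p.K28M", "p.G35R"),
  stringsAsFactors = FALSE
)

# Every sample that should get a row, including ones with no SNV calls at all
all_samples <- data.frame(
  Tumor_Sample_Barcode = c("BS_1", "BS_2", "BS_3", "BS_4"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE if any call in the group matches gene + protein change
has_lesion <- function(symbol, change, gene, protein_change) {
  any(symbol == gene & change == protein_change)
}

lesions_df <- snv_df %>%
  group_by(Tumor_Sample_Barcode) %>%
  summarize(
    H3F3A.K28M = has_lesion(Hugo_Symbol, HGVSp_Short, "H3F3A", "p.K28M"),
    HIST1H3B.K28M = has_lesion(Hugo_Symbol, HGVSp_Short, "HIST1H3B", "p.K28M"),
    H3F3A.G35R = has_lesion(Hugo_Symbol, HGVSp_Short, "H3F3A", "p.G35R"),
    H3F3A.G35V = has_lesion(Hugo_Symbol, HGVSp_Short, "H3F3A", "p.G35V")
  ) %>%
  # Samples missing from the SNV table come through as NA and become "No"
  right_join(all_samples, by = "Tumor_Sample_Barcode") %>%
  mutate(across(-Tumor_Sample_Barcode, ~ ifelse(.x %in% TRUE, "Yes", "No"))) %>%
  arrange(Tumor_Sample_Barcode)
```

Joining against the full sample list matters here: a sample with no qualifying calls would otherwise drop out of the table instead of being recorded as "No".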

I think what would come next is a script that subsets the HGG files, much like the approach you took with the ATRT subset files, that is not run in CI. The subset files should contain samples already labeled as HGG and those that were "picked up" because of the presence of a defining lesion.

You would then use those subset files to address part 2, where I think the final table will look like:

| Kids_First_Participant_ID | sample_id | Kids_First_Biospecimen_ID | age at diagnosis (days) | glioma brain region | H3F3A K28M | HIST1H3B K28M | H3F3A G35R/V | ACVR1 mutated | TP53 mutated | ATRX mutated | PDGFRA copy status | PTEN copy status | FGFR1 mutated or fused | SETD2 mutated | NTRK fused | FOXG1 z-score | OLIG2 z-score | chr7 status | chr10 status | IDH1 R132 | MYCN copy status | TERT mutated | ... |

Where the * mutated and * fused columns are binary outcomes. At first, I would limit the files that you look at for an individual gene to those that are explicitly mentioned in #249, e.g., only look at NTRK in the fusion file. The rationale is that this is going to be a lot of information to digest and we can always go back and look at additional data types if something is ambiguous or if it is requested. As for the chr7 and chr10 status, I think you can use the broad_values_by_arm.txt file from GISTIC (related: #344 (comment)).
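For the FOXG1 and OLIG2 z-score columns in the proposed table, a rough sketch might look like the following. It assumes a genes-by-samples expression matrix and that z-scores are computed across whatever set of samples is supplied; the toy values, the matrix layout, and the output column names are all assumptions for illustration.

```r
# Toy expression matrix: genes in rows, samples in columns (hypothetical values)
expr_mat <- matrix(
  c(1, 2, 3, 4,
    10, 10, 20, 40),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("FOXG1", "OLIG2"), c("BS_1", "BS_2", "BS_3", "BS_4"))
)

# scale() centers and scales each column, so transpose to put genes in columns.
# Assumption: the values are already on a suitable (e.g., log) scale.
z_mat <- scale(t(expr_mat[c("FOXG1", "OLIG2"), ]))

zscore_df <- data.frame(
  Kids_First_Biospecimen_ID = rownames(z_mat),
  FOXG1_zscore = z_mat[, "FOXG1"],
  OLIG2_zscore = z_mat[, "OLIG2"],
  row.names = NULL
)
```

Which samples go into the matrix changes the result, since each z-score is relative to the mean and standard deviation of the samples included.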

You might want to play with transposing this table. Alternatively, you may want to split this up into multiple tables, one for each of the named subtypes, which I think might accomplish something similar to your Cooccuring_lesions column. We'll have to figure out how the information is best presented.

"HIST1H3B",
"ACVR1",
"ATRX",
"PDGFRA",
Member

I interpret

PDGFRA amplification; PTEN loss

to mean copy number changes, not mutations

)

# Read in consensus mutation data
tmb_df <-
Member

This is not reading in the tumor mutation burden data, which is what I would expect based on the name tmb_df.

"IDH1",
"BRAF")

H3_G35 <- c("H3F3A", "SETD2", "NTRK", "IDH1", "ATRX", "DAXX")
Member

Similarly, I would look in the fusions file for the NTRK information, not the mutations file.

) %>%
dplyr::group_by(sample_id) %>%
dplyr::mutate(
OLIG2_expression = paste(sort(unique(OLIG2)), collapse = ", "),
Member

Why is this step necessary?

Contributor Author

The purpose of this step was to ensure that there was only one row per sample_id. However, this step was taken out of this particular PR and will be revisited in an upcoming PR (I believe there were duplicate rows for a reason other than the expression values).
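As a sketch of the one-row-per-sample idea on toy data: `mutate()` on its own keeps every input row, so collapsing to a single row per `sample_id` needs a later `distinct()` or a `summarize()`. The data.frame and the `n_rows` column below are hypothetical and only illustrate the pattern.

```r
library(dplyr)

# Toy data with a duplicated sample_id (hypothetical values)
dup_df <- data.frame(
  sample_id = c("S1", "S1", "S2"),
  OLIG2 = c(2.5, 1.5, 3.0),
  stringsAsFactors = FALSE
)

one_row_df <- dup_df %>%
  group_by(sample_id) %>%
  summarize(
    # How many input rows were collapsed, useful for tracing why duplicates exist
    n_rows = n(),
    # Same collapsing expression as in the hunk above
    OLIG2_expression = paste(sort(unique(OLIG2)), collapse = ", ")
  )
```

Keeping a row count alongside the collapsed value makes it easier to check whether duplicates come from the expression values or from some other join.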

```{r warning = FALSE}
# Filter manta SV data for the target lesions and join this data.frame with
# the selected variables of the metadata
sv_df_filtered <- sv_df %>%
Member

Based on what is on #249, I would expect that

Co-deletion of chr 1p and 19q (LOH, loss of heterozygosity of both) results in translocation t(1p;19q)

is what would be extracted from the Manta file. But I see that these are not necessarily required (#249 (comment)). I would add that you are using files that have already been annotated (controlfreec_annotated_cn_autosomes.tsv.gz), so I don't know that the SV files are more straightforward to use than the CNV files at this point. Because we expect to have consensus copy number files (#128) that will probably get used for all the subtyping, I would recommend sticking with the files from focal-cn-file-preparation. It's not clear to me that we will use AnnotSV, which I believe is what adds the gene name to the Manta output, on the consensus file.

- remove `results/HGG_molecular_subtypes.tsv`
- new output file `results/HGG_defining_lesions.tsv` contains binary columns for all samples distinguishing whether or not they contain any of the four HGG defining lesions
- rename `01` nb to better represent its purpose/content
- rename object `tmb_df` to `snv_df`
@cbethell
Contributor Author

cbethell commented Jan 3, 2020

Per @jaclyn-taroni's suggestions in this comment, this PR is now a notebook that looks only at the HGG defining lesions across all samples.

The upcoming PR will be a script that subsets HGG samples, and a third PR will then incorporate all other relevant data (e.g., fusion, CN, and RNA expression data).

@cbethell cbethell changed the title from "PR 1 of n: Molecular Subtyping - HGG (Data Prep)" to "PR 1 of n: Molecular Subtyping - HGG (Defining Lesions)" Jan 3, 2020
hgg_samples <- snv_lesions_df %>%
dplyr::filter(
disease_type_reclassified == "High-grade glioma" &
disease_type_new != "High-grade glioma;astrocytoma (WHO grade III/IV)"
Member

With the v12 data I think this should be

Suggested change
- disease_type_new != "High-grade glioma;astrocytoma (WHO grade III/IV)"
+ disease_type_new != "High-grade glioma"
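The filter under discussion can be sketched on toy data as follows. It assumes the v12 label is exactly "High-grade glioma"; the sample IDs and the other disease labels are hypothetical and only show which rows the filter keeps.

```r
library(dplyr)

# Toy table with original and reclassified disease labels (hypothetical values)
snv_lesions_df <- data.frame(
  sample_id = c("S1", "S2", "S3", "S4"),
  disease_type_new = c(
    "High-grade glioma", "Ependymoma", "Ganglioglioma", "Medulloblastoma"
  ),
  disease_type_reclassified = c(
    "High-grade glioma", "High-grade glioma", "High-grade glioma", "Medulloblastoma"
  ),
  stringsAsFactors = FALSE
)

# Keep only samples that a defining lesion moved into HGG from another label
inconsistent_df <- snv_lesions_df %>%
  filter(
    disease_type_reclassified == "High-grade glioma",
    disease_type_new != "High-grade glioma"
  )
```

Note that `filter()` drops rows where a condition evaluates to `NA`, so samples with a missing `disease_type_new` would not appear in this table.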

@jaclyn-taroni
Member

@cbethell can you provide a link to a rendered version of this updated notebook? I want @jharenza to have a look at the inconsistencies table.

@cbethell
Contributor Author

cbethell commented Jan 3, 2020

> @cbethell can you provide a link to a rendered version of this updated notebook? I want @jharenza to have a look at the inconsistencies table.

The rendered version of this updated notebook can be found here.

@jharenza jharenza self-requested a review January 4, 2020 01:57
Collaborator

@jharenza jharenza left a comment

@cbethell this looks great so far, and looks like you found an ependymoma and ganglioglioma that should be reclassified. I am requesting one change, based on the bolded subtypes I had described in #249. This is the detail we will want in the molecular_subtype column. Other than that, I think this is good to go as a first step.

Comment on lines 127 to 135
dplyr::mutate(
disease_type_reclassified = dplyr::case_when(
H3F3A.K28M == "Yes" |
HIST1H3B.K28M == "Yes" |
H3F3A.G35R == "Yes" |
H3F3A.G35V == "Yes" ~ "High-grade glioma",
TRUE ~ as.character(disease_type_new)
)
)
Collaborator

@jharenza jharenza Jan 4, 2020

Can you add an additional column here for the molecular_subtype - something like:
HGG, H3 K28 mutant or High-grade glioma, H3 K28 mutant
HGG, H3 G35 mutant or High-grade glioma, H3 G35 mutant
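One way the requested column might be added, shown as a sketch on toy data. The "Yes"/"No" lesion columns follow the hunk above; the exact label strings, the `NA` default, and the choice to test K28 before G35 are assumptions, and this is not necessarily how the change was implemented in the PR.

```r
library(dplyr)

# Toy defining-lesion table (hypothetical values)
lesions_df <- data.frame(
  sample_id = c("S1", "S2", "S3", "S4"),
  H3F3A.K28M = c("Yes", "No", "No", "No"),
  HIST1H3B.K28M = c("No", "Yes", "No", "No"),
  H3F3A.G35R = c("No", "No", "Yes", "No"),
  H3F3A.G35V = c("No", "No", "No", "No"),
  stringsAsFactors = FALSE
)

subtyped_df <- lesions_df %>%
  mutate(
    molecular_subtype = case_when(
      H3F3A.K28M == "Yes" | HIST1H3B.K28M == "Yes" ~ "HGG, H3 K28 mutant",
      H3F3A.G35R == "Yes" | H3F3A.G35V == "Yes" ~ "HGG, H3 G35 mutant",
      # No defining lesion found: leave the subtype unassigned
      TRUE ~ NA_character_
    )
  )
```

`case_when()` returns the first matching branch, so a sample carrying both a K28 and a G35 lesion would be labeled K28 here; such cases may be better flagged separately.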

Member

I've made that update in a10cded. Note that this table is not the final table from this module but an interim product (see #352 (review)).

Member

@jaclyn-taroni jaclyn-taroni left a comment

👍 LGTM, let's subset some files 💪

@jaclyn-taroni
Member

I'm going to dismiss @jharenza's review because those comments have now been addressed (a10cded) and I would like to reduce the backlog of pull requests we have from before the break.

@jaclyn-taroni jaclyn-taroni dismissed jharenza’s stale review January 4, 2020 20:57

Comments have been addressed a10cded

@jaclyn-taroni jaclyn-taroni merged commit d26866f into AlexsLemonade:master Jan 4, 2020
@jharenza jharenza mentioned this pull request Jan 5, 2020
@cbethell cbethell deleted the hgg-molecular-subtyping-data-prep branch February 6, 2020 20:40