Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Add chr22q loss variable to ATRT molecular subtyping #414

Merged
merged 21 commits into from
Jan 10, 2020

Conversation

cbethell
Copy link
Contributor

@cbethell cbethell commented Jan 8, 2020

Purpose/implementation Section

The purpose of this PR is to add chr 22 loss information to the ATRT subtype data.frame and tsv file.

What scientific question is your analysis addressing?

This analysis is addressing the molecular subtyping of ATRT samples.

What was your approach?

I used the broad_values_by_arm.txt file from GISTIC to obtain the chr22q data for the ATRT samples.

What GitHub issue does your pull request address?

This PR addresses issue #244.

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Does this additional information appear to be accurately represented?

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes, this is ready for review.

Results

What types of results are included (e.g., table, figure)?

This PR includes changes to the heatmap:

  • plots/atrt_heatmap.png

This plot can be viewed in the README here.

It also includes changes to the final output tsv file:

  • results/ATRT_molecular_subtypes.tsv.gz

Reproducibility Checklist

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • This analysis has been added to continuous integration.
  • This analysis is recorded in the table in analyses/README.md.

cbethell and others added 2 commits January 8, 2020 11:03
- subset the GISTIC file to only ATRT samples
- use `cnvkit` instead of `controlfreec` for focal CN data (to coincide with GISTIC) 
- rerun plots and include `chr22q_loss` on annotated heatmap
@cbethell
Copy link
Contributor Author

cbethell commented Jan 8, 2020

The html output with the final table displayed can be found here.

@jaclyn-taroni
Copy link
Member

jaclyn-taroni commented Jan 8, 2020

It could be useful to organize the table columns such that the relevant information for each of the subtypes on #244 is grouped/presented next to each other. I think we at least want the SMARCB1 status and chr22q columns to be grouped and come earlier (further to the left) in the table.

- rearrange columns in final table in order of relevance to each ATRT subgroup
@cbethell
Copy link
Contributor Author

cbethell commented Jan 8, 2020

It could be useful to organize the table columns such that the relevant information for each of the subtypes on #244 is grouped/presented next to each other. I think we at least want the SMARCB1 status and chr22q columns to be grouped and come earlier (further to the left) in the table.

View the table reflecting the suggestion above here.

- update expression subset file using V12 data
@jaclyn-taroni
Copy link
Member

@cbethell I noticed a couple instances where GISTIC data is missing but the focal data is not and vice versa (here are two examples):

sample_id Kids_First_Biospecimen_ID Kids_First_Participant_ID chr_22q_loss SMARCB1_focal_status
7316-376 BS_74A1TB03, BS_M4923M40 PT_MTE126WM NA loss, neutral
7316-2187 BS_53TV75NN, BS_850BAHH9 PT_HVZTF42R No NA

Do you know why that might be?

@cbethell
Copy link
Contributor Author

cbethell commented Jan 8, 2020

@cbethell I noticed a couple instances where GISTIC data is missing but the focal data is not and vice versa (here are two examples):

sample_id Kids_First_Biospecimen_ID Kids_First_Participant_ID chr_22q_loss SMARCB1_focal_status
7316-376 BS_74A1TB03, BS_M4923M40 PT_MTE126WM NA loss, neutral
7316-2187 BS_53TV75NN, BS_850BAHH9 PT_HVZTF42R No NA
Do you know why that might be?

@jaclyn-taroni the NA's are due to data missing for these samples from the gistic data and focal data, respectively. I am not exactly sure why this may be case, but my guess is the answer would be upstream in the preparation of said files.

@jaclyn-taroni
Copy link
Member

The GISTIC data and focal CN data you are using are both derived from the CNVkit data. That suggests to me that something else might be going on. Are all the samples that end up in that table (with WGS data) in the CNVkit file?

@cbethell
Copy link
Contributor Author

cbethell commented Jan 8, 2020

@jaclyn-taroni all of the samples are in the CNVkit file, but they do not all have data for SMARCB1. These samples are however not all in the GISTIC file.

@jaclyn-taroni
Copy link
Member

jaclyn-taroni commented Jan 8, 2020

all of the samples are in the CNVkit file, but they do not all have data for SMARCB1.

Ah, okay - so this suggests to me that these should be set as neutral or something else rather than NA because it's not that the data is missing but that we don't have evidence that it's a loss - do you agree?

@cbethell
Copy link
Contributor Author

cbethell commented Jan 8, 2020

Ah, okay - so this suggests to me that these should be set as neutral or something else rather than NA because it's not that the data is missing but that we don't have evidence that it's a loss - do you agree?

Yes, that is a good point. I will implement this change now.

@jaclyn-taroni
Copy link
Member

@cbethell can you generate a list of Kids_First_Biospecimen_ID that are in the CNVkit data but not in the GISTIC data and file a data issue please?

@cbethell
Copy link
Contributor Author

cbethell commented Jan 8, 2020

@cbethell can you generate a list of Kids_First_Biospecimen_ID that are in the CNVkit data but not in the GISTIC data and file a data issue please?

@jaclyn-taroni yes, I'm on it 👍

Copy link
Collaborator

@jharenza jharenza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @cbethell ! Thanks for doing this so quickly. I noticed in your TSV file, you have 3 columns which are duplicated (some have NAs): Kids_First_Participant_ID.x.x, biospecimen_id, Kids_First_Participant_ID.y.y, so I think those can be removed. I will start to dig into this and try to subtype these.

@jharenza
Copy link
Collaborator

jharenza commented Jan 9, 2020

@cbethell can you generate a list of Kids_First_Biospecimen_ID that are in the CNVkit data but not in the GISTIC data and file a data issue please?

@jaclyn-taroni yes, I'm on it 👍

Hi! Regarding this - I checked the sample_seg_counts.txt file that prints from the GISTIC run and found that the one sample that had NA in GISTIC but had CNVkit calls, BS_74A1TB03, actually had too many segments for GISTIC (>2500) for calls to be made and was excluded from GISTIC analyses. The program deems these too noisy. The sample_seg_counts.txt file has all sample segment counts and whether they were included or excluded, so that explains the discrepancies. Looks like 18 samples were excluded.

@jaclyn-taroni
Copy link
Member

@jharenza we figured that there was some GISTIC filtering step. We do not have much documentation around the GISTIC files at the moment: https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/doc/data-formats.md#copy-number-files

@jharenza
Copy link
Collaborator

jharenza commented Jan 9, 2020

OK - I think I will have to update that. #417

@cbethell
Copy link
Contributor Author

cbethell commented Jan 9, 2020

Hi! Regarding this - I checked the sample_seg_counts.txt file that prints from the GISTIC run and found that the one sample that had NA in GISTIC but had CNVkit calls, BS_74A1TB03, actually had too many segments for GISTIC (>2500) for calls to be made and was excluded from GISTIC analyses. The sample_seg_counts.txt file has all sample segment counts and whether they were included or excluded, so that explains the discrepancies.

@jharenza thank you for this clarification. It indeed does explain the discrepancies as I cross checked said samples in a separate notebook here.

I have also removed the duplicated columns in the final tsv file.

@jaclyn-taroni
Copy link
Member

My review of #410 made me think that a similar issue might be happening here with the CNVkit data. I have a few changes that I have not yet pushed because they conflict with the last commit and I should take another look. Because of how we filtered the subset files initially, we were inadvertently dropping a few samples without transcriptomic data so the table will have some additional samples but largely remain the same.

Copy link
Member

@jaclyn-taroni jaclyn-taroni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I went back through. I had a few questions about subsetting the GISTIC file, I left comments to explain the changes I made, and I pointed out a couple decision points that I think @jharenza should be aware of when looking at the table.

"broad_values_by_arm.txt"
),
exdir = file.path(root_dir, "scratch")
))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
))
), data.table = FALSE)

#### Filter GISTIC data --------------------------------------------------------

gistic_df <- gistic_df %>%
as.data.frame() %>%
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to call as.data.frame() here to get the tibble::column_to_rownames step to work? Asking because you set this as a data.frame after the transpose. If it's necessary to do this twice, I think it may be because gistic_df is a data.table (see my comment above).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are correct.

as.data.frame() %>%
tibble::rownames_to_column("Kids_First_Biospecimen_ID") %>%
dplyr::left_join(select_metadata, by = "Kids_First_Biospecimen_ID") %>%
dplyr::filter(sample_id %in% atrt_df$sample_id) %>%
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can move up the filtering step such that you are removing non-ATRT biospecimens from the GISTIC data and then join with the metadata. In general if you can filter before joining, that is a better design pattern because you end up working with smaller data.frame.

sample_id,
Kids_First_Participant_ID
) %>%
dplyr::summarize_all(function(x) paste(sort(unique(x)), collapse = ", "))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cbethell if we do this collapse biospecimen IDs step up front, we can drop the biospecimen ID columns as we go since we join on sample_id. This means we don't have to collapse at the end and run the risk of having a bunch of duplicated columns.


# Write to file
readr::write_tsv(atrt_df, file.path(results_dir, "atrt_histologies.tsv"))
sample_type == "Tumor",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filtering by experimental_strategy == "RNA-Seq" also effectively drops normal samples, but it also made it such that any ATRT samples missing RNA-seq data got dropped. We don't need to write the ATRT subset histologies file as a TSV because the CI histologies file is always the same file as a data release:

cp $FULL_DIRECTORY/pbta-histologies.tsv $SUBSET_DIRECTORY

The reason why we can include the full pbta-histologies.tsv for testing is two-fold: 1) it's pretty small and 2) having "extra" biospecimens/samples/participants in the histologies file helps test for brittle code.

readr::read_tsv(file.path(input_dir, "atrt_histologies.tsv"))
subset_metadata <- metadata %>%
dplyr::filter(short_histology == "ATRT",
sample_type == "Tumor",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting rid of the normal and cell line data here.

dplyr::left_join(expression_metadata,
by = "sample_id")

# Remove data we no longer need
rm(filtered_expression, long_stranded_expression, expression_metadata)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing from the workspace as we go makes it such that we don't have data.frame we no longer need in memory.

Copy link
Member

@jaclyn-taroni jaclyn-taroni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 LGTM

@jaclyn-taroni jaclyn-taroni merged commit 5626e43 into AlexsLemonade:master Jan 10, 2020
@cbethell cbethell deleted the add-chr-22-atrt branch February 6, 2020 20:40
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants