Add `chr22q` loss variable to ATRT molecular subtyping #414

cbethell · 2020-01-08T16:18:04Z

Purpose/implementation Section

The purpose of this PR is to add chr 22 loss information to the ATRT subtype data.frame and tsv file.

What scientific question is your analysis addressing?

This analysis is addressing the molecular subtyping of ATRT samples.

What was your approach?

I used the broad_values_by_arm.txt file from GISTIC to obtain the chr22q data for the ATRT samples.

What GitHub issue does your pull request address?

This PR addresses issue #244.

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Does this additional information appear to be accurately represented?

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes, this is ready for review.

Results

What types of results are included (e.g., table, figure)?

This PR includes changes to the heatmap:

plots/atrt_heatmap.png

This plot can be viewed in the README here.

It also includes changes to the final output tsv file:

results/ATRT_molecular_subtypes.tsv.gz

Reproducibility Checklist

The dependencies required to run the code in this pull request have been added to the project Dockerfile.
This analysis has been added to continuous integration.
This analysis is recorded in the table in analyses/README.md.

- subset the GISTIC file to only ATRT samples
- use `cnvkit` instead of `controlfreec` for focal CN data (to coincide with GISTIC)
- rerun plots and include `chr22q_loss` on annotated heatmap

cbethell · 2020-01-08T16:29:02Z

The html output with the final table displayed can be found here.

jaclyn-taroni · 2020-01-08T16:34:54Z

It could be useful to organize the table columns such that the relevant information for each of the subtypes on #244 is grouped/presented next to each other. I think we at least want the SMARCB1 status and chr22q columns to be grouped and come earlier (further to the left) in the table.

- rearrange columns in final table in order of relevance to each ATRT subgroup

cbethell · 2020-01-08T17:12:56Z

> It could be useful to organize the table columns such that the relevant information for each of the subtypes on #244 is grouped/presented next to each other. I think we at least want the SMARCB1 status and chr22q columns to be grouped and come earlier (further to the left) in the table.

View the table reflecting the suggestion above here.

- update expression subset file using V12 data

jaclyn-taroni · 2020-01-08T17:41:23Z

@cbethell I noticed a couple instances where GISTIC data is missing but the focal data is not and vice versa (here are two examples):

| sample_id | Kids_First_Biospecimen_ID | Kids_First_Participant_ID | chr_22q_loss | SMARCB1_focal_status |
|---|---|---|---|---|
| 7316-376 | BS_74A1TB03, BS_M4923M40 | PT_MTE126WM | NA | loss, neutral |
| 7316-2187 | BS_53TV75NN, BS_850BAHH9 | PT_HVZTF42R | No | NA |

Do you know why that might be?

cbethell · 2020-01-08T18:08:19Z

> @cbethell I noticed a couple instances where GISTIC data is missing but the focal data is not and vice versa (here are two examples):
>
> | sample_id | Kids_First_Biospecimen_ID | Kids_First_Participant_ID | chr_22q_loss | SMARCB1_focal_status |
> |---|---|---|---|---|
> | 7316-376 | BS_74A1TB03, BS_M4923M40 | PT_MTE126WM | NA | loss, neutral |
> | 7316-2187 | BS_53TV75NN, BS_850BAHH9 | PT_HVZTF42R | No | NA |
>
> Do you know why that might be?

@jaclyn-taroni the NAs are due to these samples being missing from the GISTIC data and the focal data, respectively. I am not exactly sure why this may be the case, but my guess is that the answer is upstream, in the preparation of those files.

jaclyn-taroni · 2020-01-08T18:12:35Z

The GISTIC data and focal CN data you are using are both derived from the CNVkit data. That suggests to me that something else might be going on. Are all the samples that end up in that table (with WGS data) in the CNVkit file?

cbethell · 2020-01-08T18:20:53Z

@jaclyn-taroni all of the samples are in the CNVkit file, but they do not all have data for SMARCB1. These samples are however not all in the GISTIC file.

jaclyn-taroni · 2020-01-08T18:23:20Z

> all of the samples are in the CNVkit file, but they do not all have data for SMARCB1.

Ah, okay - so this suggests to me that these should be set as neutral or something else rather than NA because it's not that the data is missing but that we don't have evidence that it's a loss - do you agree?

cbethell · 2020-01-08T18:24:45Z

> Ah, okay - so this suggests to me that these should be set as neutral or something else rather than NA because it's not that the data is missing but that we don't have evidence that it's a loss - do you agree?

Yes, that is a good point. I will implement this change now.
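
A minimal sketch of the recoding discussed above, in base R rather than the dplyr pipeline the module uses. The toy table and the `has_wgs` column are assumptions made for illustration; the real notebook may identify WGS samples differently.

```r
# Toy table; the values and the has_wgs column are hypothetical.
subtype_df <- data.frame(
  sample_id = c("S1", "S2", "S3"),
  has_wgs = c(TRUE, TRUE, FALSE),
  SMARCB1_focal_status = c("loss", NA, NA),
  stringsAsFactors = FALSE
)

# A sample with WGS data but no SMARCB1 record shows no evidence of a loss,
# so it is labeled "neutral". Samples without WGS data keep NA, because
# their copy number data really is missing.
no_call <- subtype_df$has_wgs & is.na(subtype_df$SMARCB1_focal_status)
subtype_df$SMARCB1_focal_status[no_call] <- "neutral"

print(subtype_df)
```

Keeping NA for samples without WGS data preserves the distinction between "no data" and "no evidence of loss".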

jaclyn-taroni · 2020-01-08T18:26:56Z

@cbethell can you generate a list of Kids_First_Biospecimen_ID that are in the CNVkit data but not in the GISTIC data and file a data issue please?

cbethell · 2020-01-08T18:30:35Z

> @cbethell can you generate a list of Kids_First_Biospecimen_ID that are in the CNVkit data but not in the GISTIC data and file a data issue please?

@jaclyn-taroni yes, I'm on it 👍
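
One way to build that list is a set difference on the biospecimen IDs. This is only a sketch: the ID vectors below are placeholders, and in practice they would be read from the CNVkit and GISTIC files.

```r
# Placeholder IDs; in the real analysis these come from the two files.
cnvkit_ids <- c("BS_A", "BS_B", "BS_C", "BS_D")
gistic_ids <- c("BS_A", "BS_C")

# IDs present in the CNVkit data but absent from the GISTIC data.
missing_from_gistic <- setdiff(cnvkit_ids, gistic_ids)

print(missing_from_gistic)
```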

jharenza

Hi @cbethell ! Thanks for doing this so quickly. I noticed in your TSV file, you have 3 columns which are duplicated (some have NAs): Kids_First_Participant_ID.x.x, biospecimen_id, Kids_First_Participant_ID.y.y, so I think those can be removed. I will start to dig into this and try to subtype these.

jharenza · 2020-01-09T00:44:54Z

> > @cbethell can you generate a list of Kids_First_Biospecimen_ID that are in the CNVkit data but not in the GISTIC data and file a data issue please?
>
> @jaclyn-taroni yes, I'm on it 👍

Hi! Regarding this - I checked the sample_seg_counts.txt file that prints from the GISTIC run and found that the one sample that had NA in GISTIC but had CNVkit calls, BS_74A1TB03, actually had too many segments for GISTIC (>2500) for calls to be made and was excluded from GISTIC analyses. The program deems these too noisy. The sample_seg_counts.txt file has all sample segment counts and whether they were included or excluded, so that explains the discrepancies. Looks like 18 samples were excluded.

jaclyn-taroni · 2020-01-09T00:47:13Z

@jharenza we figured that there was some GISTIC filtering step. We do not have much documentation around the GISTIC files at the moment: https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/doc/data-formats.md#copy-number-files

jharenza · 2020-01-09T00:48:35Z

OK - I think I will have to update that. #417

cbethell · 2020-01-09T02:34:42Z

> Hi! Regarding this - I checked the sample_seg_counts.txt file that prints from the GISTIC run and found that the one sample that had NA in GISTIC but had CNVkit calls, BS_74A1TB03, actually had too many segments for GISTIC (>2500) for calls to be made and was excluded from GISTIC analyses. The sample_seg_counts.txt file has all sample segment counts and whether they were included or excluded, so that explains the discrepancies.

@jharenza thank you for this clarification. It indeed does explain the discrepancies as I cross checked said samples in a separate notebook here.

I have also removed the duplicated columns in the final tsv file.

jaclyn-taroni · 2020-01-09T02:50:41Z

My review of #410 made me think that a similar issue might be happening here with the CNVkit data. I have a few changes that I have not yet pushed because they conflict with the last commit and I should take another look. Because of how we filtered the subset files initially, we were inadvertently dropping a few samples without transcriptomic data so the table will have some additional samples but largely remain the same.

jaclyn-taroni

Okay I went back through. I had a few questions about subsetting the GISTIC file, I left comments to explain the changes I made, and I pointed out a couple decision points that I think @jharenza should be aware of when looking at the table.

jaclyn-taroni · 2020-01-09T12:22:13Z

analyses/molecular-subtyping-ATRT/00-subset-files-for-ATRT.R

+      "broad_values_by_arm.txt"
+    ),
+    exdir = file.path(root_dir, "scratch")
+  ))


Suggested change

-  ))
+  ), data.table = FALSE)

jaclyn-taroni · 2020-01-09T12:25:32Z

analyses/molecular-subtyping-ATRT/00-subset-files-for-ATRT.R

+#### Filter GISTIC data --------------------------------------------------------
+
+gistic_df <- gistic_df %>%
+  as.data.frame() %>%


Do you need to call as.data.frame() here to get the tibble::column_to_rownames step to work? Asking because you set this as a data.frame after the transpose. If it's necessary to do this twice, I think it may be because gistic_df is a data.table (see my comment above).

cbethell · 2020-01-09

Yes, you are correct.

jaclyn-taroni · 2020-01-09T12:34:32Z

analyses/molecular-subtyping-ATRT/00-subset-files-for-ATRT.R

+  as.data.frame() %>%
+  tibble::rownames_to_column("Kids_First_Biospecimen_ID") %>%
+  dplyr::left_join(select_metadata, by = "Kids_First_Biospecimen_ID") %>%
+  dplyr::filter(sample_id %in% atrt_df$sample_id) %>%


I think you can move up the filtering step such that you remove non-ATRT biospecimens from the GISTIC data first and then join with the metadata. In general, if you can filter before joining, that is a better design pattern because you end up working with a smaller data.frame.
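
A rough illustration of the filter-then-join pattern, written in base R instead of dplyr. The toy data, the `arm_22q` column name, and the `atrt_biospecimens` vector are assumptions for the sketch, not objects from the script.

```r
# Toy GISTIC-like table keyed by biospecimen ID (values are made up).
gistic_df <- data.frame(
  Kids_First_Biospecimen_ID = c("BS_1", "BS_2", "BS_3", "BS_4"),
  arm_22q = c(-1, 0, 0, -1),
  stringsAsFactors = FALSE
)

# Toy metadata mapping biospecimen IDs to sample IDs.
select_metadata <- data.frame(
  Kids_First_Biospecimen_ID = c("BS_1", "BS_2", "BS_3", "BS_4"),
  sample_id = c("S1", "S2", "S3", "S4"),
  stringsAsFactors = FALSE
)

# Hypothetical vector of ATRT biospecimen IDs.
atrt_biospecimens <- c("BS_1", "BS_4")

# Filter first, so the join only has to handle the ATRT rows.
keep <- gistic_df$Kids_First_Biospecimen_ID %in% atrt_biospecimens
gistic_atrt <- gistic_df[keep, ]

# Left join the reduced table with the metadata.
gistic_atrt <- merge(
  gistic_atrt,
  select_metadata,
  by = "Kids_First_Biospecimen_ID",
  all.x = TRUE
)

print(gistic_atrt)
```

Filtering on the biospecimen ID before the join assumes the ATRT biospecimen IDs are already known; filtering on `sample_id` is only possible after the metadata has been joined.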

jaclyn-taroni · 2020-01-09T12:36:14Z

analyses/molecular-subtyping-ATRT/01-ATRT-molecular-subtyping-data-prep.Rmd

+    sample_id,
+    Kids_First_Participant_ID
+  ) %>%
+  dplyr::summarize_all(function(x) paste(sort(unique(x)), collapse = ", "))


@cbethell if we do this collapse biospecimen IDs step up front, we can drop the biospecimen ID columns as we go since we join on sample_id. This means we don't have to collapse at the end and run the risk of having a bunch of duplicated columns.
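
To make the collapse step concrete, here is a small base R sketch that uses the same `paste(sort(unique(x)), collapse = ", ")` idea as the diff above. The toy IDs are hypothetical, and `aggregate()` stands in for the dplyr `group_by()`/`summarize_all()` calls in the notebook.

```r
# Toy identifiers: one sample with two biospecimens, one with a single one.
ids_df <- data.frame(
  sample_id = c("7316-A", "7316-A", "7316-B"),
  Kids_First_Participant_ID = c("PT_1", "PT_1", "PT_2"),
  Kids_First_Biospecimen_ID = c("BS_2", "BS_1", "BS_3"),
  stringsAsFactors = FALSE
)

# Collapse the unique values of a vector into one comma-separated string.
collapse_ids <- function(x) paste(sort(unique(x)), collapse = ", ")

# One row per sample_id / participant pair, biospecimen IDs collapsed.
collapsed_df <- aggregate(
  Kids_First_Biospecimen_ID ~ sample_id + Kids_First_Participant_ID,
  data = ids_df,
  FUN = collapse_ids
)

print(collapsed_df)
```

With one row per `sample_id` up front, later joins on `sample_id` do not need to carry the biospecimen ID columns along.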

jaclyn-taroni · 2020-01-09T12:38:03Z

analyses/molecular-subtyping-ATRT/00-subset-files-for-ATRT.R

-
-# Write to file
-readr::write_tsv(atrt_df, file.path(results_dir, "atrt_histologies.tsv"))
+                sample_type == "Tumor",


Filtering by experimental_strategy == "RNA-Seq" effectively drops normal samples, but it also dropped any ATRT samples that are missing RNA-seq data. We don't need to write the ATRT subset histologies file as a TSV because the CI histologies file is always the same file as in a data release:

OpenPBTA-analysis/analyses/create-subset-files/create_subset_files.sh

Line 41 in 2cbf526

cp $FULL_DIRECTORY/pbta-histologies.tsv $SUBSET_DIRECTORY

The reason why we can include the full pbta-histologies.tsv for testing is two-fold: 1) it's pretty small and 2) having "extra" biospecimens/samples/participants in the histologies file helps test for brittle code.
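
The effect of the two filters can be seen on a toy histologies table. This is a sketch in base R with made-up rows; only the column names follow the ones used in this module.

```r
# Toy histologies table: S2 has WGS data only, and S3 is a normal sample.
histologies <- data.frame(
  Kids_First_Biospecimen_ID = c("BS_1", "BS_2", "BS_3", "BS_4"),
  sample_id = c("S1", "S1", "S2", "S3"),
  short_histology = c("ATRT", "ATRT", "ATRT", "ATRT"),
  sample_type = c("Tumor", "Tumor", "Tumor", "Normal"),
  experimental_strategy = c("WGS", "RNA-Seq", "WGS", "WGS"),
  stringsAsFactors = FALSE
)

is_atrt <- histologies$short_histology == "ATRT"

# Old approach: keep RNA-seq rows only.
rnaseq_only <- histologies[
  is_atrt & histologies$experimental_strategy == "RNA-Seq",
]

# New approach: keep all tumor rows, whatever the assay.
tumor_only <- histologies[is_atrt & histologies$sample_type == "Tumor", ]

# Tumor samples that the RNA-seq filter would have lost.
dropped <- setdiff(
  unique(tumor_only$sample_id),
  unique(rnaseq_only$sample_id)
)

print(dropped)
```

Both filters exclude the normal sample, but only the `sample_type` filter keeps the tumor sample that has no RNA-seq data.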

jaclyn-taroni · 2020-01-09T12:40:52Z

analyses/molecular-subtyping-ATRT/01-ATRT-molecular-subtyping-data-prep.Rmd

-  readr::read_tsv(file.path(input_dir, "atrt_histologies.tsv"))
+subset_metadata <- metadata %>%
+  dplyr::filter(short_histology == "ATRT",
+                sample_type == "Tumor",


Getting rid of the normal and cell line data here.

jaclyn-taroni · 2020-01-09T12:41:38Z

analyses/molecular-subtyping-ATRT/01-ATRT-molecular-subtyping-data-prep.Rmd

  dplyr::left_join(expression_metadata,
                   by = "sample_id")
+
+# Remove data we no longer need
+rm(filtered_expression, long_stranded_expression, expression_metadata)


Removing objects from the workspace as we go means we don't keep data.frames we no longer need in memory.

jaclyn-taroni

👍 LGTM

cbethell and others added 2 commits January 8, 2020 11:03

Add chr22q loss variable to ATRT subtypes data.frame

d3d27d5

- subset the GISTIC file to only ATRT samples
- use `cnvkit` instead of `controlfreec` for focal CN data (to coincide with GISTIC)
- rerun plots and include `chr22q_loss` on annotated heatmap

Merge branch 'master' into add-chr-22-atrt

728b6fa

@jaclyn-taroni suggestion to rearrange columns

5750c00

- rearrange columns in final table in order of relevance to each ATRT subgroup

Update README to reflect this PR's changes

6425ddb

- update expression subset file using V12 data

cbethell and others added 3 commits January 8, 2020 14:05

Change NA values to neutral

c3da786

Attempt to correctly assign neutral to WGS samples in focal status

8bef332

Merge branch 'master' into add-chr-22-atrt

76c5c1a

jharenza suggested changes Jan 9, 2020

jaclyn-taroni and others added 4 commits January 8, 2020 20:16

Don't subset to the RNA-seq samples only

f4e9a5c

Remove duplicated columns in final table

fedd788

Tumor samples only just in case

8411796

Handle 'missing' copy number samples

57768d3

jaclyn-taroni added 2 commits January 9, 2020 06:58

Clean up a few things

1ac4e91

Use full histologies file

846d4f2

jaclyn-taroni added 2 commits January 9, 2020 07:03

We recovered samples w/o RNA-seq data

3a46c7e

Rerun module

dc4c62e

jaclyn-taroni force-pushed the add-chr-22-atrt branch from fedd788 to dc4c62e January 9, 2020 12:04

Use different data.frame, save a few lines

7f01263

jaclyn-taroni requested a review from jharenza January 9, 2020 12:20

jaclyn-taroni reviewed Jan 9, 2020

Merge branch 'master' into add-chr-22-atrt

b7efffa

cbethell mentioned this pull request Jan 9, 2020

Update GISTIC data format information #417

Closed

cbethell and others added 2 commits January 9, 2020 10:53

Add data.table = FALSE argument and rerun subset script

c3a5447

Update README.md

40902a8

jaclyn-taroni approved these changes Jan 9, 2020

jharenza approved these changes Jan 9, 2020

jaclyn-taroni added 2 commits January 9, 2020 15:15

Merge branch 'master' into add-chr-22-atrt

504e89e

Merge branch 'master' into add-chr-22-atrt

5c65f37

jaclyn-taroni merged commit 5626e43 into AlexsLemonade:master Jan 10, 2020

jaclyn-taroni mentioned this pull request Jan 18, 2020

Update molecular-subtyping-ATRT modules at a glance entry #454

Merged

cbethell deleted the add-chr-22-atrt branch February 6, 2020 20:40
