-
Notifications
You must be signed in to change notification settings - Fork 67
Add chr22q
loss variable to ATRT molecular subtyping
#414
Add chr22q
loss variable to ATRT molecular subtyping
#414
Conversation
- subset the GISTIC file to only ATRT samples - use `cnvkit` instead of `controlfreec` for focal CN data (to coincide with GISTIC) - rerun plots and include `chr22q_loss` on annotated heatmap
The html output with the final table displayed can be found here. |
It could be useful to organize the table columns such that the relevant information for each of the subtypes on #244 is grouped/presented next to each other. I think we at least want the SMARCB1 status and chr22q columns to be grouped and come earlier (further to the left) in the table. |
- rearrange columns in final table in order of relevance to each ATRT subgroup
View the table reflecting the suggestion above here. |
- update expression subset file using V12 data
@cbethell I noticed a couple instances where GISTIC data is missing but the focal data is not and vice versa (here are two examples):
Do you know why that might be? |
@jaclyn-taroni the |
The GISTIC data and focal CN data you are using are both derived from the CNVkit data. That suggests to me that something else might be going on. Are all the samples that end up in that table (with WGS data) in the CNVkit file? |
@jaclyn-taroni all of the samples are in the CNVkit file, but they do not all have data for |
Ah, okay - so this suggests to me that these should be set as |
Yes, that is a good point. I will implement this change now. |
@cbethell can you generate a list of |
@jaclyn-taroni yes, I'm on it 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @cbethell ! Thanks for doing this so quickly. I noticed in your TSV file, you have 3 columns which are duplicated (some have NAs): Kids_First_Participant_ID.x.x, biospecimen_id, Kids_First_Participant_ID.y.y, so I think those can be removed. I will start to dig into this and try to subtype these.
Hi! Regarding this - I checked the |
@jharenza we figured that there was some GISTIC filtering step. We do not have much documentation around the GISTIC files at the moment: https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/doc/data-formats.md#copy-number-files |
OK - I think I will have to update that. #417 |
@jharenza thank you for this clarification. It indeed does explain the discrepancies as I cross checked said samples in a separate notebook here. I have also removed the duplicated columns in the final tsv file. |
My review of #410 made me think that a similar issue might be happening here with the CNVkit data. I have a few changes that I have not yet pushed because they conflict with the last commit and I should take another look. Because of how we filtered the subset files initially, we were inadvertently dropping a few samples without transcriptomic data so the table will have some additional samples but largely remain the same. |
fedd788
to
dc4c62e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay I went back through. I had a few questions about subsetting the GISTIC file, I left comments to explain the changes I made, and I pointed out a couple decision points that I think @jharenza should be aware of when looking at the table.
"broad_values_by_arm.txt" | ||
), | ||
exdir = file.path(root_dir, "scratch") | ||
)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
)) | |
), data.table = FALSE) |
#### Filter GISTIC data -------------------------------------------------------- | ||
|
||
gistic_df <- gistic_df %>% | ||
as.data.frame() %>% |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need to call as.data.frame()
here to get the tibble::column_to_rownames
step to work? Asking because you set this as a data.frame
after the transpose. If it's necessary to do this twice, I think it may be because gistic_df
is a data.table
(see my comment above).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you are correct.
as.data.frame() %>% | ||
tibble::rownames_to_column("Kids_First_Biospecimen_ID") %>% | ||
dplyr::left_join(select_metadata, by = "Kids_First_Biospecimen_ID") %>% | ||
dplyr::filter(sample_id %in% atrt_df$sample_id) %>% |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can move up the filtering step such that you are removing non-ATRT biospecimens from the GISTIC data and then join with the metadata. In general if you can filter before joining, that is a better design pattern because you end up working with smaller data.frame
.
sample_id, | ||
Kids_First_Participant_ID | ||
) %>% | ||
dplyr::summarize_all(function(x) paste(sort(unique(x)), collapse = ", ")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cbethell if we do this collapse biospecimen IDs step up front, we can drop the biospecimen ID columns as we go since we join on sample_id
. This means we don't have to collapse at the end and run the risk of having a bunch of duplicated columns.
|
||
# Write to file | ||
readr::write_tsv(atrt_df, file.path(results_dir, "atrt_histologies.tsv")) | ||
sample_type == "Tumor", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filtering by experimental_strategy == "RNA-Seq"
also effectively drops normal samples, but it also made it such that any ATRT samples missing RNA-seq data got dropped. We don't need to write the ATRT subset histologies file as a TSV because the CI histologies file is always the same file as a data release:
cp $FULL_DIRECTORY/pbta-histologies.tsv $SUBSET_DIRECTORY |
The reason why we can include the full pbta-histologies.tsv
for testing is two-fold: 1) it's pretty small and 2) having "extra" biospecimens/samples/participants in the histologies file helps test for brittle code.
readr::read_tsv(file.path(input_dir, "atrt_histologies.tsv")) | ||
subset_metadata <- metadata %>% | ||
dplyr::filter(short_histology == "ATRT", | ||
sample_type == "Tumor", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Getting rid of the normal and cell line data here.
dplyr::left_join(expression_metadata, | ||
by = "sample_id") | ||
|
||
# Remove data we no longer need | ||
rm(filtered_expression, long_stranded_expression, expression_metadata) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing from the workspace as we go makes it such that we don't have data.frame
we no longer need in memory.
analyses/molecular-subtyping-ATRT/01-ATRT-molecular-subtyping-data-prep.Rmd
Show resolved
Hide resolved
analyses/molecular-subtyping-ATRT/01-ATRT-molecular-subtyping-data-prep.Rmd
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 LGTM
Purpose/implementation Section
The purpose of this PR is to add chr 22 loss information to the ATRT subtype data.frame and tsv file.
What scientific question is your analysis addressing?
This analysis is addressing the molecular subtyping of ATRT samples.
What was your approach?
I used the
broad_values_by_arm.txt
file from GISTIC to obtain the chr22q data for the ATRT samples.What GitHub issue does your pull request address?
This PR addresses issue #244.
Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.
Does this additional information appear to be accurately represented?
Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?
Yes, this is ready for review.
Results
What types of results are included (e.g., table, figure)?
This PR includes changes to the heatmap:
This plot can be viewed in the
README
here.It also includes changes to the final output tsv file:
Reproducibility Checklist
analyses/README.md
.