-
Notifications
You must be signed in to change notification settings - Fork 67
Reconfigure how SNV-callers uses BED files for TMB calculations #671
Conversation
dplyr::select(Tumor_Sample_Barcode, target_bed, target_bed_path) %>% | ||
dplyr::distinct() | ||
|
||
# Pull out file paths as a vector |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There might be a shorter, less confusing way to set up this section?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if there is a shorter way, but it is not confusing as if there aren't any alternative suggestions it will be fine to leave it as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cansavvy this looks great!
Thank you for the detailed outline of your approach, it was very helpful in my case (seeing as I am not as familiar with this analysis).
I have a few questions and suggestions below re some points that can use clarification.
Beyond these, after going through the function and and main R script code, I do not have any structural changes to suggest at this time.
dplyr::select(Tumor_Sample_Barcode, target_bed, target_bed_path) %>% | ||
dplyr::distinct() | ||
|
||
# Pull out file paths as a vector |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if there is a shorter way, but it is not confusing as if there aren't any alternative suggestions it will be fine to leave it as is.
# Calculate Tumor Mutational Burden a given sample in `Tumor_Sample_Barcode` | ||
# given a target region BED frame. This function uses `snv_ranges_filter` to | ||
# filter out SNVs outside those target BED ranges and uses the total size in | ||
# bp of those BED ranges for the TMB denominator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the calculate_tmb function, I have it carry over the experimental_strategy and short_histology columns which is fine for the PBTA and TCGA data, but if we use this elsewhere, that will throw an error.
I like that this is a function so I would opt to make a note of this particular caveat in the function description here.
However, @jashapiro may have different feelings on this so you may want to wait to hear his thoughts on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My quote there was trying to ask whether I should leave out those column references from this function and tack on those columns outside this function. Is that what you are talking about @cbethell ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a problem with this, though personally I would probably just leave it to the user to tack on histology outside the function. Mostly because who knows what the next user will want (short_histology
vs integrated_diagnosis
for example). But the space used by including some things is small, so I don't think it maters. But I think including experimental_strategy
makes sense, since it is integral to interpretation of the results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me, with a few minor tweaks, mostly documentation and a few tweaks for clarity. I think the logic all works properly, though I admit I did not test it carefully.
@@ -242,63 +225,152 @@ if (opt$tcga) { | |||
# Add in metadata | |||
strelka_mutect_maf_df <- strelka_mutect_maf_df %>% | |||
dplyr::inner_join(metadata %>% |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does anything get lost where we don't have matching rows "Tumor_Sample_Barcode"s? (i.e. is the inner_join equivalent to left_join here?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would hope we don't have samples in the MAF files that are not in metadata, but in case that were to happen, an inner_join
makes sure this continues. But yes, this should essentially be a left_join
unless metadata
is messed up.
# Calculate Tumor Mutational Burden a given sample in `Tumor_Sample_Barcode` | ||
# given a target region BED frame. This function uses `snv_ranges_filter` to | ||
# filter out SNVs outside those target BED ranges and uses the total size in | ||
# bp of those BED ranges for the TMB denominator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a problem with this, though personally I would probably just leave it to the user to tack on histology outside the function. Mostly because who knows what the next user will want (short_histology
vs integrated_diagnosis
for example). But the space used by including some things is small, so I don't think it maters. But I think including experimental_strategy
makes sense, since it is integral to interpretation of the results.
I'm going to run this on AWS. In the meantime I want to leave this PR open in case running the whole thing brings up some points that need to be tweaked. |
This runs successfully and is ready to be merged. The only thing that needs to be updated for this is that part of the TMB figure (the part in the figure generation script) is not adherent to the unified color palette. I'm going to file a separate PR to address this. |
Purpose/implementation Section
What scientific question is your analysis addressing?
Same scientific question, but now setting up for the revised BED files that have been added to the v16 release for TCGA data.
Because these BED files aren't assigned to biospecimen's soley by their experimental strategy, this means I had to re-structure how
03-calculate_tmb.R
script processes target region BED files.What was your approach?
I reconfigured the
calculate_tmb
function to be set up for processing one biospecimen at a time (based on theTumor_Sample_Barcode
column. (Previously it calculated TMB for the whole MAF at once).Reconfigured the associated
03-calculate-tmb.R
script to set up a column that specifies what BED file goes to which sample as a part of the metadata data frame calledtarget_bed
. This also creates an associated column based on the root_directory that has the full file path (calledtarget_bed_path
).Used the
target_bed
column to get down to the unique BED files associated with the whole set of samples and then read in each BED file and convert to a named list of Genomic Ranges. There is a second list made from this list which takes the intersection of these BED files and the coding regions BED file which is a new optparse passed to the script (this is used for the coding_tmb calculations).For each sample's, the
calculate_tmb
function takes the tumor_sample_barcode, the original MAF file, and the name of the associated BED file. This function does the same filtering out of mutations outside of the BED range (from either the coding BED list or the original BED list), and then counts the number of remaining mutations / the total size of the BED ranges provided.What GitHub issue does your pull request address?
#668
Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.
Which areas should receive a particularly close look?
Are the BED ranges being handled as they should be? i.e. are there any places where something seems fishy? Perhaps a
reduce
needs to be used, or any parameters should be altered? (Keep in mind previously intersect with coding regions step was being completed by bedtools and now it is being done in R).In the
calculate_tmb
function, I have it carry over theexperimental_strategy
andshort_histology
columns which is fine for the PBTA and TCGA data, but if we use this elsewhere, that will throw an error. Should I take that out of the function and just do a join after the calculation? Or is this fine for now?Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?
No.
Results
What types of results are included (e.g., table, figure)?
No results yet. Once I get the code itself approved, I will need to run everything on AWS and get a new version of the TMB files as well as a new figure 2.
What is your summary of the results?
That will be in a subsequent PR when I run everything from soup to nuts and obtain a refreshed mutational landscape figure (fig2).
Reproducibility Checklist
These items were not affected by these changes.
Documentation Checklist
Need to do more updates the analysis README yet.
README
and it is up to date.analyses/README.md
and the entry is up to date.