Reconfigure how SNV-callers uses BED files for TMB calculations #671

cansavvy · 2020-04-03T17:57:21Z

Purpose/implementation Section

What scientific question is your analysis addressing?

Same scientific question, but now setting up for the revised BED files that have been added to the v16 release for TCGA data.
Because these BED files aren't assigned to biospecimen's soley by their experimental strategy, this means I had to re-structure how 03-calculate_tmb.R script processes target region BED files.

What was your approach?

I reconfigured the calculate_tmb function to be set up for processing one biospecimen at a time (based on the Tumor_Sample_Barcode column. (Previously it calculated TMB for the whole MAF at once).
Reconfigured the associated 03-calculate-tmb.R script to set up a column that specifies what BED file goes to which sample as a part of the metadata data frame called target_bed. This also creates an associated column based on the root_directory that has the full file path (called target_bed_path).
Used the target_bed column to get down to the unique BED files associated with the whole set of samples and then read in each BED file and convert to a named list of Genomic Ranges. There is a second list made from this list which takes the intersection of these BED files and the coding regions BED file which is a new optparse passed to the script (this is used for the coding_tmb calculations).
For each sample's, the calculate_tmb function takes the tumor_sample_barcode, the original MAF file, and the name of the associated BED file. This function does the same filtering out of mutations outside of the BED range (from either the coding BED list or the original BED list), and then counts the number of remaining mutations / the total size of the BED ranges provided.

What GitHub issue does your pull request address?

#668

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Are the BED ranges being handled as they should be? i.e. are there any places where something seems fishy? Perhaps a reduce needs to be used, or any parameters should be altered? (Keep in mind previously intersect with coding regions step was being completed by bedtools and now it is being done in R).
In the calculate_tmb function, I have it carry over the experimental_strategy and short_histology columns which is fine for the PBTA and TCGA data, but if we use this elsewhere, that will throw an error. Should I take that out of the function and just do a join after the calculation? Or is this fine for now?

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

No.

Results

What types of results are included (e.g., table, figure)?

No results yet. Once I get the code itself approved, I will need to run everything on AWS and get a new version of the TMB files as well as a new figure 2.

What is your summary of the results?

That will be in a subsequent PR when I run everything from soup to nuts and obtain a refreshed mutational landscape figure (fig2).

Reproducibility Checklist

These items were not affected by these changes.

The dependencies required to run the code in this pull request have been added to the project Dockerfile.
This analysis has been added to continuous integration.

Documentation Checklist

Need to do more updates the analysis README yet.

This analysis module has a README and it is up to date.
This analysis is recorded in the table in analyses/README.md and the entry is up to date.
The analytical code is documented and contains comments.

cansavvy · 2020-04-03T18:12:01Z

analyses/snv-callers/scripts/03-calculate_tmb.R

+  dplyr::select(Tumor_Sample_Barcode, target_bed, target_bed_path) %>%
+  dplyr::distinct()
+
+# Pull out file paths as a vector


There might be a shorter, less confusing way to set up this section?

I am not sure if there is a shorter way, but it is not confusing as if there aren't any alternative suggestions it will be fine to leave it as is.

cbethell

@cansavvy this looks great!

Thank you for the detailed outline of your approach, it was very helpful in my case (seeing as I am not as familiar with this analysis).

I have a few questions and suggestions below re some points that can use clarification.
Beyond these, after going through the function and and main R script code, I do not have any structural changes to suggest at this time.

analyses/snv-callers/scripts/03-calculate_tmb.R

cbethell · 2020-04-07T16:20:17Z

analyses/snv-callers/scripts/03-calculate_tmb.R

+  dplyr::select(Tumor_Sample_Barcode, target_bed, target_bed_path) %>%
+  dplyr::distinct()
+
+# Pull out file paths as a vector


I am not sure if there is a shorter way, but it is not confusing as if there aren't any alternative suggestions it will be fine to leave it as is.

cbethell · 2020-04-07T17:03:48Z

analyses/snv-callers/util/tmb_functions.R

+  # Calculate Tumor Mutational Burden a given sample in `Tumor_Sample_Barcode`
+  # given a target region BED frame. This function uses `snv_ranges_filter` to
+  # filter out SNVs outside those target BED ranges and uses the total size in
+  # bp of those BED ranges for the TMB denominator.


In the calculate_tmb function, I have it carry over the experimental_strategy and short_histology columns which is fine for the PBTA and TCGA data, but if we use this elsewhere, that will throw an error.

I like that this is a function so I would opt to make a note of this particular caveat in the function description here.

However, @jashapiro may have different feelings on this so you may want to wait to hear his thoughts on it.

My quote there was trying to ask whether I should leave out those column references from this function and tack on those columns outside this function. Is that what you are talking about @cbethell ?

Yes, that is correct.

I don't have a problem with this, though personally I would probably just leave it to the user to tack on histology outside the function. Mostly because who knows what the next user will want (short_histology vs integrated_diagnosis for example). But the space used by including some things is small, so I don't think it maters. But I think including experimental_strategy makes sense, since it is integral to interpretation of the results.

jashapiro

This looks good to me, with a few minor tweaks, mostly documentation and a few tweaks for clarity. I think the logic all works properly, though I admit I did not test it carefully.

.circleci/config.yml

analyses/snv-callers/scripts/03-calculate_tmb.R

jashapiro · 2020-04-08T14:02:17Z

analyses/snv-callers/scripts/03-calculate_tmb.R

@@ -242,63 +225,152 @@ if (opt$tcga) {
 # Add in metadata
 strelka_mutect_maf_df <- strelka_mutect_maf_df %>%
  dplyr::inner_join(metadata %>%


does anything get lost where we don't have matching rows "Tumor_Sample_Barcode"s? (i.e. is the inner_join equivalent to left_join here?)

I would hope we don't have samples in the MAF files that are not in metadata, but in case that were to happen, an inner_join makes sure this continues. But yes, this should essentially be a left_join unless metadata is messed up.

analyses/snv-callers/util/tmb_functions.R

jashapiro · 2020-04-08T16:00:08Z

analyses/snv-callers/util/tmb_functions.R

+  # Calculate Tumor Mutational Burden a given sample in `Tumor_Sample_Barcode`
+  # given a target region BED frame. This function uses `snv_ranges_filter` to
+  # filter out SNVs outside those target BED ranges and uses the total size in
+  # bp of those BED ranges for the TMB denominator.


I don't have a problem with this, though personally I would probably just leave it to the user to tack on histology outside the function. Mostly because who knows what the next user will want (short_histology vs integrated_diagnosis for example). But the space used by including some things is small, so I don't think it maters. But I think including experimental_strategy makes sense, since it is integral to interpretation of the results.

analyses/snv-callers/util/tmb_functions.R

cansavvy · 2020-04-08T17:15:02Z

I'm going to run this on AWS. In the meantime I want to leave this PR open in case running the whole thing brings up some points that need to be tweaked.

cansavvy · 2020-04-13T15:15:55Z

This runs successfully and is ready to be merged. The only thing that needs to be updated for this is that part of the TMB figure (the part in the figure generation script) is not adherent to the unified color palette. I'm going to file a separate PR to address this.

cansavvy added 5 commits April 3, 2020 09:18

Copies of old stuff that I can take pieces from

55c0a0b

BED ranges set up

a0e47c1

It's mostly working!

0eb4703

Few tinier edits before testing

dd816ae

Change back to normal files

c58bafc

cansavvy added the work in progress Used to label (non-draft) pull requests that are not yet ready for review label Apr 3, 2020

cansavvy added 2 commits April 3, 2020 14:04

Get rid of old scripts.

b573b10

Development remnant

19219f3

cansavvy commented Apr 3, 2020

View reviewed changes

cansavvy added 5 commits April 3, 2020 14:14

Run styler on it all!

e573f51

Change back the indexing change that apparently HGG was dependent on

c31084e

Retire old analyses from CircleCI

c09c604

Merge remote-tracking branch 'cbethell/master' into tcga-bed-files

691aaab

Add back TCGA processing

d5cfccf

cansavvy marked this pull request as ready for review April 6, 2020 15:46

cansavvy requested a review from cbethell April 6, 2020 15:51

cansavvy removed the work in progress Used to label (non-draft) pull requests that are not yet ready for review label Apr 6, 2020

cbethell reviewed Apr 7, 2020

View reviewed changes

jashapiro approved these changes Apr 8, 2020

View reviewed changes

cansavvy added 2 commits April 8, 2020 13:06

Incorporate @jashapiro doc and tidy suggestions

dd8fc15

Add TODOs to things we won't do just yet.

a2a53e1

cansavvy added 4 commits April 9, 2020 09:01

genome_size -> region_size

6dd2677

Write down 256GB to figures README

da4a24c

Drop Panel samples in calculate TMB

cc0d2e5

Use a better fix for the Panel problem

a2aacc4

cansavvy added 2 commits April 13, 2020 11:16

Merge branch 'master' into tcga-bed-files

17377cb

Refresh plot

777f617

jaclyn-taroni merged commit 66bb67a into AlexsLemonade:master Apr 15, 2020

jaclyn-taroni mentioned this pull request May 26, 2020

Updated analysis: TCGA TMB BED files update v16 #668

Closed

cansavvy deleted the tcga-bed-files branch August 13, 2020 11:47

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reconfigure how SNV-callers uses BED files for TMB calculations #671

Reconfigure how SNV-callers uses BED files for TMB calculations #671

cansavvy commented Apr 3, 2020 •

edited

Loading

cansavvy Apr 3, 2020

cbethell Apr 7, 2020

cbethell left a comment •

edited

Loading

cbethell Apr 7, 2020

cbethell Apr 7, 2020

cansavvy Apr 7, 2020

cbethell Apr 7, 2020

jashapiro Apr 8, 2020

jashapiro left a comment

jashapiro Apr 8, 2020

cansavvy Apr 8, 2020

jashapiro Apr 8, 2020

cansavvy commented Apr 8, 2020

cansavvy commented Apr 13, 2020

Reconfigure how SNV-callers uses BED files for TMB calculations #671

Reconfigure how SNV-callers uses BED files for TMB calculations #671

Conversation

cansavvy commented Apr 3, 2020 • edited Loading

Purpose/implementation Section

What scientific question is your analysis addressing?

What was your approach?

What GitHub issue does your pull request address?

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Results

What types of results are included (e.g., table, figure)?

What is your summary of the results?

Reproducibility Checklist

Documentation Checklist

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cbethell left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jashapiro left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cansavvy commented Apr 8, 2020

cansavvy commented Apr 13, 2020

cansavvy commented Apr 3, 2020 •

edited

Loading

cbethell left a comment •

edited

Loading