Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Reconfigure how SNV-callers uses BED files for TMB calculations #671

Merged
merged 20 commits into from
Apr 15, 2020

Conversation

cansavvy
Copy link
Collaborator

@cansavvy cansavvy commented Apr 3, 2020

Purpose/implementation Section

What scientific question is your analysis addressing?

Same scientific question, but now setting up for the revised BED files that have been added to the v16 release for TCGA data.
Because these BED files aren't assigned to biospecimen's soley by their experimental strategy, this means I had to re-structure how 03-calculate_tmb.R script processes target region BED files.

What was your approach?

  1. I reconfigured the calculate_tmb function to be set up for processing one biospecimen at a time (based on the Tumor_Sample_Barcode column. (Previously it calculated TMB for the whole MAF at once).

  2. Reconfigured the associated 03-calculate-tmb.R script to set up a column that specifies what BED file goes to which sample as a part of the metadata data frame called target_bed. This also creates an associated column based on the root_directory that has the full file path (called target_bed_path).

  3. Used the target_bed column to get down to the unique BED files associated with the whole set of samples and then read in each BED file and convert to a named list of Genomic Ranges. There is a second list made from this list which takes the intersection of these BED files and the coding regions BED file which is a new optparse passed to the script (this is used for the coding_tmb calculations).

  4. For each sample's, the calculate_tmb function takes the tumor_sample_barcode, the original MAF file, and the name of the associated BED file. This function does the same filtering out of mutations outside of the BED range (from either the coding BED list or the original BED list), and then counts the number of remaining mutations / the total size of the BED ranges provided.

What GitHub issue does your pull request address?

#668

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

  • Are the BED ranges being handled as they should be? i.e. are there any places where something seems fishy? Perhaps a reduce needs to be used, or any parameters should be altered? (Keep in mind previously intersect with coding regions step was being completed by bedtools and now it is being done in R).

  • In the calculate_tmb function, I have it carry over the experimental_strategy and short_histology columns which is fine for the PBTA and TCGA data, but if we use this elsewhere, that will throw an error. Should I take that out of the function and just do a join after the calculation? Or is this fine for now?

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

No.

Results

What types of results are included (e.g., table, figure)?

No results yet. Once I get the code itself approved, I will need to run everything on AWS and get a new version of the TMB files as well as a new figure 2.

What is your summary of the results?

That will be in a subsequent PR when I run everything from soup to nuts and obtain a refreshed mutational landscape figure (fig2).

Reproducibility Checklist

These items were not affected by these changes.

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • This analysis has been added to continuous integration.

Documentation Checklist

Need to do more updates the analysis README yet.

  • This analysis module has a README and it is up to date.
  • This analysis is recorded in the table in analyses/README.md and the entry is up to date.
  • The analytical code is documented and contains comments.

@cansavvy cansavvy added the work in progress Used to label (non-draft) pull requests that are not yet ready for review label Apr 3, 2020
dplyr::select(Tumor_Sample_Barcode, target_bed, target_bed_path) %>%
dplyr::distinct()

# Pull out file paths as a vector
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might be a shorter, less confusing way to set up this section?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if there is a shorter way, but it is not confusing as if there aren't any alternative suggestions it will be fine to leave it as is.

@cansavvy cansavvy marked this pull request as ready for review April 6, 2020 15:46
@cansavvy cansavvy requested a review from cbethell April 6, 2020 15:51
@cansavvy cansavvy removed the work in progress Used to label (non-draft) pull requests that are not yet ready for review label Apr 6, 2020
Copy link
Contributor

@cbethell cbethell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cansavvy this looks great!

Thank you for the detailed outline of your approach, it was very helpful in my case (seeing as I am not as familiar with this analysis).

I have a few questions and suggestions below re some points that can use clarification.
Beyond these, after going through the function and and main R script code, I do not have any structural changes to suggest at this time.

dplyr::select(Tumor_Sample_Barcode, target_bed, target_bed_path) %>%
dplyr::distinct()

# Pull out file paths as a vector
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if there is a shorter way, but it is not confusing as if there aren't any alternative suggestions it will be fine to leave it as is.

# Calculate Tumor Mutational Burden a given sample in `Tumor_Sample_Barcode`
# given a target region BED frame. This function uses `snv_ranges_filter` to
# filter out SNVs outside those target BED ranges and uses the total size in
# bp of those BED ranges for the TMB denominator.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the calculate_tmb function, I have it carry over the experimental_strategy and short_histology columns which is fine for the PBTA and TCGA data, but if we use this elsewhere, that will throw an error.

I like that this is a function so I would opt to make a note of this particular caveat in the function description here.

However, @jashapiro may have different feelings on this so you may want to wait to hear his thoughts on it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My quote there was trying to ask whether I should leave out those column references from this function and tack on those columns outside this function. Is that what you are talking about @cbethell ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is correct.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a problem with this, though personally I would probably just leave it to the user to tack on histology outside the function. Mostly because who knows what the next user will want (short_histology vs integrated_diagnosis for example). But the space used by including some things is small, so I don't think it maters. But I think including experimental_strategy makes sense, since it is integral to interpretation of the results.

Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, with a few minor tweaks, mostly documentation and a few tweaks for clarity. I think the logic all works properly, though I admit I did not test it carefully.

.circleci/config.yml Outdated Show resolved Hide resolved
.circleci/config.yml Outdated Show resolved Hide resolved
analyses/snv-callers/scripts/03-calculate_tmb.R Outdated Show resolved Hide resolved
@@ -242,63 +225,152 @@ if (opt$tcga) {
# Add in metadata
strelka_mutect_maf_df <- strelka_mutect_maf_df %>%
dplyr::inner_join(metadata %>%
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does anything get lost where we don't have matching rows "Tumor_Sample_Barcode"s? (i.e. is the inner_join equivalent to left_join here?)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would hope we don't have samples in the MAF files that are not in metadata, but in case that were to happen, an inner_join makes sure this continues. But yes, this should essentially be a left_join unless metadata is messed up.

analyses/snv-callers/util/tmb_functions.R Outdated Show resolved Hide resolved
# Calculate Tumor Mutational Burden a given sample in `Tumor_Sample_Barcode`
# given a target region BED frame. This function uses `snv_ranges_filter` to
# filter out SNVs outside those target BED ranges and uses the total size in
# bp of those BED ranges for the TMB denominator.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a problem with this, though personally I would probably just leave it to the user to tack on histology outside the function. Mostly because who knows what the next user will want (short_histology vs integrated_diagnosis for example). But the space used by including some things is small, so I don't think it maters. But I think including experimental_strategy makes sense, since it is integral to interpretation of the results.

analyses/snv-callers/util/tmb_functions.R Outdated Show resolved Hide resolved
analyses/snv-callers/util/tmb_functions.R Outdated Show resolved Hide resolved
analyses/snv-callers/util/tmb_functions.R Outdated Show resolved Hide resolved
@cansavvy
Copy link
Collaborator Author

cansavvy commented Apr 8, 2020

I'm going to run this on AWS. In the meantime I want to leave this PR open in case running the whole thing brings up some points that need to be tweaked.

@cansavvy
Copy link
Collaborator Author

This runs successfully and is ready to be merged. The only thing that needs to be updated for this is that part of the TMB figure (the part in the figure generation script) is not adherent to the unified color palette. I'm going to file a separate PR to address this.

@jaclyn-taroni jaclyn-taroni merged commit 66bb67a into AlexsLemonade:master Apr 15, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants