-
Notifications
You must be signed in to change notification settings - Fork 67
PR 1 of 3 TMB revamp: Separate TMB calculations as "coding only" and "all" #307
PR 1 of 3 TMB revamp: Separate TMB calculations as "coding only" and "all" #307
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. The main question I have is the TMB coding definition. I am sure you don't want to weigh into those definitions again, but it would seem logical that the denominator there would be the total coding length, not the sequenced length. Am I mistaken there? If not, you will need to add some logic to recalculate the denominator, but that could be combined with the target bed files to filter to just the coding mutations, without having to look at the Variant_Classification
column at all.
wgs_genome_size <- sum(wgs_bed[, 3] - wgs_bed[, 2]) | ||
wxs_exome_size <- sum(wxs_bed[, 3] - wxs_bed[, 2]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need different bed files for coding regions? Intersect the gtf codons with the wgs and wxs bed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah. Yea, I hadn't thought of this. It should change. I think what your describing above is what we should do for the coding only
TMBs, but for the all mutations
we should use the intersection of mutect and strelka's BED files and add that up. Does that sound right to you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That does sound correct. I hadn't thought about that either... But yes, we can't call mutations outside the intersection of the two.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Follow up question is, what gtf file do we use?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lancet methods list Gencode 31 as being used but I don't see the other callers list what Gencode version. I will dig deeper.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gencode 27 is what is included in the data download at the moment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other callers didn't restrict to coding regions, so no GTFs for those...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per our in person discussion, I will make the TMB denominator fixes in a subsequent PR so this PR doesn't get too out of control.
* initial restructure * Add in info from SNV consensus README * Update outdated text * Get rid of unnecessary 00-setup.R script * Take out 00-setup call from bash script * Add output summary, streamline wording * Make @jashapiro suggestions except MNV description * Updates to README to reflect #307 s changes * clarify Lancet statement and put link * Update mutation comparison explanation * Update analyses/snv-callers/README.md Co-Authored-By: jashapiro <jashapiro@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good (pending PRs 2 and 3)
Purpose/implementation Section
What scientific question is your analysis addressing?
Currently the TMB stats calculated use all mutations, but we should also have a separate TMB stats for coding mutations only.
What was your approach?
Strelka and Mutect are only used for the "all mutations" TMB, but the consensus file of Strelka, Lancet, and Mutect is used for the "coding only" mutations this was discussed on #305
This required some borrowing from the 02-merge-callers, and functionalization of
split_mnv
into its own script since it is now used by more than one scriptI'm going to add the doc changes that are associated with this to PR #304
What GitHub issue does your pull request address?
#305
Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.
Which areas should receive a particularly close look?
-Is the original data handling flow of MNVs and SNVs maintained? I am particularly worried about the integrity of the last chunk of data wrangling on
03-calculate_tmb.R
.Variant Classification
groups ofIGR
andSilent
the only ones that are considered as non-coding?Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?
No new tables yet, but this will require some downstream analyses like tmb_compare_tcga to be updated. Potentially mutational_signatures should be updated as well?
Results
What types of results are included (e.g., table, figure)?
No new results yet.
Reproducibility Checklist
No new packages were needed here.