Updated analysis: function `calculate_tmb` returns unfiltered mutation counts #724

yuankunzhu · 2020-07-13T12:34:25Z

What analysis module should be updated and why?

tmb_functions.R from snv-callers/util.
It returns all mutation counts from the input MAF without filtering, which caused high mutation counts in the WGS coding region output.
A quick look at the TMB TSV files shows that mutation_count in the coding region TSV is the same number as the total mutation count:

$ head pbta-snv-mutation-tmb-*.tsv | column -s $'\t' -t
==> pbta-snv-mutation-tmb-all.tsv <==
Tumor_Sample_Barcode                      experimental_strategy  short_histology  mutation_count  region_size  tmb
BS_1Q524P3B                               WGS                    HGAT             4168            2923762389   1.4255604407804017
BS_BD4RQ1G0                               WGS                    Schwannoma       241             2923762389   0.0824280389222833
BS_80X7AVCP                               WGS                    Embryonal Tumor  708             2923762389   0.24215374090031774
BS_1JMTTKMK                               WGS                    Medulloblastoma  3116            2923762389   1.0657500800076132
BS_9DM8H1RX                               WGS                    HGAT             638             2923762389   0.218211986856501
BS_4XPPZTGG                               WGS                    Other            245             2923762389   0.08379613915335854
BS_3J4T2YYW                               WGS                    Medulloblastoma  326             2923762389   0.11150016883263218
BS_A1DV9T7G                               WGS                    Neurofibroma     4357            2923762389   1.4902031766987067
BS_X7QJCVJB                               WGS                    Medulloblastoma  1992            2923762389   0.6813139150754702
==> pbta-snv-mutation-tmb-coding.tsv <==
Tumor_Sample_Barcode                      experimental_strategy  short_histology  mutation_count  region_size  tmb
BS_1Q524P3B                               WGS                    HGAT             4168            35717401     116.69382103137907
BS_BD4RQ1G0                               WGS                    Schwannoma       241             35717401     6.747411436795191
BS_80X7AVCP                               WGS                    Embryonal Tumor  708             35717401     19.822270942950187
BS_1JMTTKMK                               WGS                    Medulloblastoma  3116            35717401     87.24039019524405
BS_9DM8H1RX                               WGS                    HGAT             638             35717401     17.86244189491839
BS_4XPPZTGG                               WGS                    Other            245             35717401     6.8594016681112935
BS_3J4T2YYW                               WGS                    Medulloblastoma  326             35717401     9.127203852262374
BS_A1DV9T7G                               WGS                    Neurofibroma     4357            35717401     121.98535946106492
BS_X7QJCVJB                               WGS                    Medulloblastoma  1992            35717401     55.77113519541917
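The symptom can be reproduced by hand from the first row of each file above. This is a minimal sketch in base R; the variable names are illustrative and only the three numbers are taken from the output.

```r
# Values copied from the BS_1Q524P3B rows in the two TSV files above
mutation_count <- 4168
all_region_size <- 2923762389
coding_region_size <- 35717401

# TMB is the number of mutations per megabase of the surveyed region
tmb_all <- mutation_count / (all_region_size / 1000000)
tmb_coding <- mutation_count / (coding_region_size / 1000000)

# Reusing the genome-wide count with the smaller coding region inflates
# the coding TMB by the ratio of the two region sizes
inflation <- all_region_size / coding_region_size

print(c(tmb_all = tmb_all, tmb_coding = tmb_coding, inflation = inflation))
```

Because the same count is divided by a region that is roughly 82 times smaller, the coding TMB comes out roughly 82 times larger than the genome-wide TMB for every sample.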

more detailed investigation: https://www.notion.so/d3b/OpenPBTA-TMB-issue-investigation-07-2020-3698b1f726a44eb8a521d043321acf34

What changes need to be made? Please provide enough detail for another participant to make the update.

A preliminary investigation suggests that mutation_count should be calculated from filt_maf_df rather than sample_maf_df in the code below, but this needs a closer look and testing.

OpenPBTA-analysis/analyses/snv-callers/util/tmb_functions.R

Lines 132 to 144 in 66bb67a

    
    tmb <- sample_maf_df %>%
      dplyr::group_by(
        #TODO: Make this column passing stuff more flexible with some tidyeval maybe
        Tumor_Sample_Barcode = tumor_sample_barcode,
        experimental_strategy,
        short_histology
      ) %>%
      # Count number of mutations for that sample
      dplyr::summarize(
        mutation_count = dplyr::n(),
        region_size = bed_size,
        tmb = mutation_count / (region_size / 1000000)
      )
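The proposed change can be sketched on toy data as follows. This is an illustration under stated assumptions, not the actual patch: the `in_region` column, the sample values, and the `bed_size` value are hypothetical, and the `dplyr::filter()` call only stands in for whatever BED-based filtering produces filt_maf_df in the real function.

```r
library(dplyr)

# Toy stand-in for one sample's MAF rows. `in_region` is a hypothetical flag
# marking whether a mutation falls inside the BED region of interest.
sample_maf_df <- data.frame(
  Tumor_Sample_Barcode = "sample_A",
  experimental_strategy = "WGS",
  short_histology = "HGAT",
  in_region = c(TRUE, FALSE, FALSE, TRUE, FALSE)
)

# Hypothetical region size in base pairs
bed_size <- 2000000

# Stand-in for the filtering step that yields filt_maf_df
filt_maf_df <- sample_maf_df %>%
  dplyr::filter(in_region)

# Same summary as in the excerpt, but counting rows of the filtered data frame
tmb <- filt_maf_df %>%
  dplyr::group_by(
    Tumor_Sample_Barcode,
    experimental_strategy,
    short_histology
  ) %>%
  dplyr::summarize(
    mutation_count = dplyr::n(),
    region_size = bed_size,
    tmb = mutation_count / (region_size / 1000000),
    .groups = "drop"
  )

print(tmb)
```

In this toy case the unfiltered data frame has 5 rows but only 2 fall inside the region, so counting from filt_maf_df gives a mutation_count of 2. One thing that may be worth checking when testing: if the filter leaves no rows for a sample, a grouped summary like this returns zero rows, so that sample would be absent from the output rather than reported with a count of 0.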

What input data should be used? Which data were used in the version being updated?

# for PBTA
data/pbta-snv-strelka2.vep.maf.gz
data/pbta-snv-mutect2.vep.maf.gz
data/pbta-histologies.tsv

# for TCGA
data/pbta-tcga-snv-strelka2.vep.maf.gz
data/pbta-tcga-snv-mutect2.vep.maf.gz
data/pbta-tcga-manifest.tsv

# BED
data/gencode.v27.primary_assembly.annotation.gtf.gz
data/WXS.hg38.100bp_padded.bed
scratch/intersect_strelka_mutect_WGS.bed

When do you expect the revised analysis will be completed?

Who will complete the updated analysis?


jaclyn-taroni · 2020-07-13T13:41:44Z

Excellent find @yuankunzhu - thank you! As discussed via Slack, @yuankunzhu will file the bug fix this afternoon and we can have @cansavvy and @jashapiro take a look. The CCDL team can look at some of the downstream analyses and fix the plotting outlined in that Notion document.

yuankunzhu · 2020-07-13T17:17:31Z

Sounds good, @jaclyn-taroni. I just filed a simple PR for this: #727

cansavvy · 2020-07-17T14:59:56Z

Is there anything left to address for this issue or can we close it?

yuankunzhu · 2020-07-17T15:43:39Z

I think @jashapiro wants to make sure everything looks OK for the downstream analyses with the updated function from the commit here: #727 (review)?

But yeah, I'm OK with closing it, as my initial intent was just to raise the counting issue.

In addition, we could consider opening another ticket just for the plotting axes alignment, with a more specific description, if we decide that's an issue to address as well.

cansavvy · 2020-08-03T15:35:20Z

I have the axes alignment bit tracked here: cansavvy/openpbta-notebook-concept#9

So we can close this.

I can also copy over the issue I linked above to this current repository, but I didn't want to clutter up the issues here.

yuankunzhu added the updated analysis label Jul 13, 2020

yuankunzhu assigned jaclyn-taroni and cansavvy Jul 13, 2020

jashapiro mentioned this issue Jul 13, 2020

Updated analysis: filter to only non-synonymous mutations for TMB #726

Closed

yuankunzhu mentioned this issue Jul 13, 2020

🐛 use 'filt_maf_df' for 'mutation_count' in calculate_tmb() function #727

Merged

5 tasks

jaclyn-taroni removed their assignment Aug 3, 2020

cansavvy closed this as completed Aug 3, 2020

cansavvy mentioned this issue Jan 7, 2021

Updated analysis: PBTA vs TCGA TMB analysis #556

Closed
