Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Updated analysis: SNV consensus calling (for things other than substitutions) #990

Closed
jashapiro opened this issue Apr 9, 2021 · 29 comments

Comments

@jashapiro
Copy link
Member

What analysis module should be updated and why?

snv-callers, and in particular the generation of the consensus MAF file.

What changes need to be made? Please provide enough detail for another participant to make the update.

As noted in another review (#961 (review)), the current SNV consensus script may not handle combining indel variants correctly. Matching of individual calls across callers is currently performed only on the basis of start location, on the erroneous assumption that all calls are a single nucleotide in length. This is true for SNVs (and MNVs, all of which are reduced to SNVs before merging), but does not apply to the small indels that might occur.

INDEL variants should probably be treated entirely separately from substitutions, but the logic of how to do that is something that requires some investigation, as there are many possible edge cases.

  • The simple case is that start and end positions match across all three callers. Those we are almost certainly capturing, but they are probably being mixed up a bit with cases where...

  • Start matches, but callers choose different end positions. (And vice versa). These should be rare, but are certainly possible if one caller decides a deletion is 5 bp and another 6bp, for example. I guess I would go with majority rule among the callers, if possible, in this case?

  • Overlapping deletion calls with no shared start and end. I expect this to happen in cases where the first base(s) of a deletion is repeated at the end: For example, a CATATAT -> CAT deletion could start at position 2 or position 4 and different callers might make different, equally correct, decisions. The nice thing in this case is that they should match as to the reference and alt alleles. (Both are ATAT -> _, but this might not always be the case in situations like this.

  • 'Overlapping' insertions could be quite difficult to detect! A change of CAT -> CAAAT might put the inserted bases before or after the original A, and I don't know that we would be able to easily identify that...

This is not the complete list of edge cases.

We should start by pulling out all indel calls from strelka, mutect and lancet so we can investigate the properties a bit more carefully, then decide on a set of heuristics to guide the generation of an INDEL consensus.

I will also note that one option is to simply "trust" one caller for INDELs and use its calls exclusively, avoiding many of these potential issues. This has the advantage of being easy, but has obvious downsides as well!

What input data should be used? Which data were used in the version being updated?

maf files from strelka, mutect and lancet.

When do you expect the revised analysis will be completed?

1-2 weeks? Depends on results of initial investigations, and discussions about methods.

Who will complete the updated analysis?

@jashapiro, hopefully with input/discussion from others! (Tagging @kgaonkar6 in particular, as we will want to implement similar logic in #961 )

@jashapiro jashapiro self-assigned this Apr 9, 2021
@kgaonkar6
Copy link
Collaborator

kgaonkar6 commented Apr 13, 2021

I hadn't thought about all the possible representation for indels , but this is a great point of interest for consensus calling! I looked at what people have done before and found this normalization code from bcbio:
https://github.com/hongenxu/bcbio-nextgen/blob/master/bcbio/variation/normalize.py. They also do consensus ( which they call ensembl) calling and normalize their vcf calls by:

  1. Split multi-allelic variants into single sample records.
    Downstream tools like GEMINI don't handle multi-allelic variants.
    normalize splits these into multiple single allele variants.
    More generally, multi-allelic variants are tricky to represent in
    comparisons and storage. There are useful discussions on-going in the
    GA4GH about allele/count based approaches to handling this more
    generally:
    Multiple alts per variant? ga4gh/ga4gh-schemas#169
    There are multiple approaches to do the decomposition:
  2. Decompose biallelic block substitutions
    As part of normalization, decompose MNPs into SNPs, e.g. AG>CT -> A>C and G>T
    There are a few approaches:
  3. Normalizing indels
    Left-align and normalize indels, check if REF alleles match the reference.

I think we already do MNV to SNV in the consensus calling , in addition , they decompose biallelic substitutions and finally vt normalize to normalize the indel calls in vcf from all callers.

I also found some documentation and the pseudo-code here
image

Would this be something that can be of interest to run upstream before we process the variant calls in the repo to fix this issue of indel representation (and possibly will also fix MNVs upstream) or potentially use their pseudo-code to implement it for maf inputs?

Similar to the above bcbio normalization, normalization is also being developed as a step in our internal D3b consensus calling by @migbro as follows ( let me know if I missed something) :

He did share an output for the normalization are here's an example normalization a couple deletions below 🤔 .

Raw mutect2 maf

Hugo_Symbol Center NCBI_Build Chromosome Start_Position End_Position Strand Variant_Classification Variant_Type Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 dbSNP_RS dbSNP_Val_Status Tumor_Sample_Barcode Matched_Norm_Sample_Barcode Match_Norm_Seq_Allele1 Match_Norm_Seq_Allele2 Tumor_Validation_Allele1 Tumor_Validation_Allele2 Match_Norm_Validation_Allele1 Match_Norm_Validation_Allele2 Verification_Status Validation_Status Mutation_Status Sequencing_Phase Sequence_Source Validation_Method Score BAM_File Sequencer Tumor_Sample_UUID Matched_Norm_Sample_UUID HGVSc HGVSp HGVSp_Short Transcript_ID Exon_Number t_depth t_ref_count t_alt_count n_depth n_ref_count n_alt_count all_effects Allele Gene Feature Feature_type
LGR6 . GRCh38 chr1 202237984 202237984 + Intron DEL A A - rs113958266 NA BS_M5FM63EB BS_9H6Z0MEG A A NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA c.428+1991del NA NA ENST00000367278 NA 72 57 5 28 19 9 LGR6,intron_variant,,ENST00000255432,NM_021636.2;LGR6,intron_variant,,ENST00000367278,NM_001017403.1;LGR6,intron_variant,,ENST00000423542,;LGR6,intron_variant,,ENST00000439764,NM_001017404.1;LGR6,intron_variant,,ENST00000487787,;LGR6,downstream_gene_variant,,ENST00000503519,; C ENSG00000133067 ENST00000367278 Transcript
LGR6 . GRCh38 chr1 202239345 202239350 + Intron DEL GTGTGT GTGTGT - rs1460759624 NA BS_M5FM63EB BS_9H6Z0MEG GTGTGT GTGTGT NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA c.428+3352_428+3357del NA NA ENST00000367278 NA 68 58 2 16 15 1 LGR6,intron_variant,,ENST00000255432,NM_021636.2;LGR6,intron_variant,,ENST00000367278,NM_001017403.1;LGR6,intron_variant,,ENST00000423542,;LGR6,intron_variant,,ENST00000439764,NM_001017404.1;LGR6,intron_variant,,ENST00000487787,;LGR6,downstream_gene_variant,,ENST00000503519,; - ENSG00000133067 ENST00000367278 Transcript

Normalized mutect2 vcf in maf format

Hugo_Symbol Center NCBI_Build Chromosome Start_Position End_Position Strand Variant_Classification Variant_Type Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 dbSNP_RS dbSNP_Val_Status Tumor_Sample_Barcode Matched_Norm_Sample_Barcode Match_Norm_Seq_Allele1 Match_Norm_Seq_Allele2 Tumor_Sample_UUID Matched_Norm_Sample_UUID HGVSc HGVSp HGVSp_Short Transcript_ID Exon_Number t_depth t_ref_count t_alt_count n_depth n_ref_count n_alt_count all_effects Allele Gene Feature Feature_type Consequence cDNA_position CDS_position Protein_position Amino_acids Codons Existing_variation ALLELE_NUM DISTANCE STRAND_VEP SYMBOL SYMBOL_SOURCE HGNC_ID BIOTYPE
LGR6 . GRCh38 chr1 202237984 202237984 + Intron DEL A A - rs113958266 NA BS_M5FM63EB BS_9H6Z0MEG A A NA NA c.428+2005del NA NA ENST00000367278 NA 72 57 5 28 19 9 LGR6,intron_variant,,ENST00000255432,NM_021636.2;LGR6,intron_variant,,ENST00000367278,NM_001017403.1;LGR6,intron_variant,,ENST00000423542,;LGR6,intron_variant,,ENST00000439764,NM_001017404.1;LGR6,intron_variant,,ENST00000487787,;LGR6,downstream_gene_variant,,ENST00000503519,; - ENSG00000133067 ENST00000367278 Transcript intron_variant NA NA NA NA NA rs113958266 1 NA 1 LGR6 HGNC HGNC:19719 protein_coding
LGR6 . GRCh38 chr1 202237983 202237983 + Intron SNP C C A rs76646640 NA BS_M5FM63EB BS_9H6Z0MEG C C NA NA c.428+1990C>A NA NA ENST00000367278 NA 72 57 5 28 19 0 LGR6,intron_variant,,ENST00000255432,NM_021636.2;LGR6,intron_variant,,ENST00000367278,NM_001017403.1;LGR6,intron_variant,,ENST00000423542,;LGR6,intron_variant,,ENST00000439764,NM_001017404.1;LGR6,intron_variant,,ENST00000487787,;LGR6,downstream_gene_variant,,ENST00000503519,; A ENSG00000133067 ENST00000367278 Transcript intron_variant NA NA NA NA NA rs76646640 1 NA 1 LGR6 HGNC HGNC:19719 protein_coding
LGR6 . GRCh38 chr1 202237983 202237984 + Intron INS - - A rs5780111 NA BS_M5FM63EB BS_9H6Z0MEG - - NA NA c.428+2005dup NA NA ENST00000367278 NA 72 57 5 28 19 0 LGR6,intron_variant,,ENST00000255432,NM_021636.2;LGR6,intron_variant,,ENST00000367278,NM_001017403.1;LGR6,intron_variant,,ENST00000423542,;LGR6,intron_variant,,ENST00000439764,NM_001017404.1;LGR6,intron_variant,,ENST00000487787,;LGR6,downstream_gene_variant,,ENST00000503519,; A ENSG00000133067 ENST00000367278 Transcript intron_variant NA NA NA NA NA rs5780111 1 NA 1 LGR6 HGNC HGNC:19719 protein_coding
LGR6 . GRCh38 chr1 202239345 202239350 + Intron DEL GTGTGT GTGTGT - rs1460759624 NA BS_M5FM63EB BS_9H6Z0MEG GTGTGT GTGTGT NA NA c.428+3387_428+3392del NA NA ENST00000367278 NA 68 58 2 16 15 1 LGR6,intron_variant,,ENST00000255432,NM_021636.2;LGR6,intron_variant,,ENST00000367278,NM_001017403.1;LGR6,intron_variant,,ENST00000423542,;LGR6,intron_variant,,ENST00000439764,NM_001017404.1;LGR6,intron_variant,,ENST00000487787,;LGR6,downstream_gene_variant,,ENST00000503519,; - ENSG00000133067 ENST00000367278 Transcript intron_variant NA NA NA NA NA rs1460759624 1 NA 1 LGR6 HGNC HGNC:19719 protein_coding
LGR6 . GRCh38 chr1 202239345 202239346 + Intron DEL GT GT - rs1392575432 NA BS_M5FM63EB BS_9H6Z0MEG GT GT NA NA c.428+3391_428+3392del NA NA ENST00000367278 NA 68 58 8 16 15 0 LGR6,intron_variant,,ENST00000255432,NM_021636.2;LGR6,intron_variant,,ENST00000367278,NM_001017403.1;LGR6,intron_variant,,ENST00000423542,;LGR6,intron_variant,,ENST00000439764,NM_001017404.1;LGR6,intron_variant,,ENST00000487787,;LGR6,downstream_gene_variant,,ENST00000503519,; - ENSG00000133067 ENST00000367278 Transcript intron_variant NA NA NA NA NA rs1392575432 1 NA 1 LGR6 HGNC HGNC:19719 protein_coding

@jashapiro
Copy link
Member Author

Thanks @kgaonkar6! I think you see some of the issues I was thinking about. I think the MNV issue should be fine; we were sort of saved by the fact that strelka2 does not call any MNVs, so we had to decompose the others right off. But INDELs remain an issue that I guess we never really resolved.

It does seem to me that applying a normalization algorithm to all of the vcf files would be worth doing for these data, if it was not done already.

That said, I'm definitely confused by some of the output from normalization you are presenting from @migbro.

  • The first site was something I also noticed with mutect2 (and only mutect2) where it would call a DEL but the 'Allele' column reflected a SNP (which is not the same as either Tumor_Seq_Allele) Where does the C come from there? Is this data that was in the VCF but did not make it to the MAF? If so, why do they all end up with the same level of support as shown in t_ref and t_alt?) This may be a mutect2 specific thing? I will also note that mutect2 seems to have an odd preference for even sized deletions: it was the only caller with more 2 nt deletions than 1 nt, so I kind of don't trust it and worry there may have been some underlying bugs.
  • Similarly, I don't understand why the second deletion is being split into two overlapping deletions. In this case there is at least different levels of support for the two introns, so can I assume there is some data in the VCF that caused this which we could not see in the MAF?

If the answers to my confusion involve the original VCF files, then it seems like we will have to have you do a normalization step, which would potentially improve some of the indel consensus calls.

Once that is done we would probably want to do some quick analysis of the hotspot calls to see what effects requiring 2 or 3 callers to agree might have. I remain concerned about including variants even in the "scavenged" set that are only called by
a single caller, especially if that caller is VarDict.

@jharenza
Copy link
Collaborator

If the answers to my confusion involve the original VCF files, then it seems like we will have to have you do a normalization step, which would potentially improve some of the indel consensus calls.

@migbro and @yuankunzhu - do we have this performed pre-d3b consensus?

@migbro
Copy link
Contributor

migbro commented Apr 13, 2021

Not yet. I currently have in review a workflow to bring our vcfs to a new spec. This means the following will happen:

  • Normalize the vcf using the above described method
  • Run annotation subwf:
    • add standard FORMAT fields: ONLY FOR STRELKA2 (GT, AF, AD). This helps for downstream compatibility for some tools
    • VEP annotate
    • Annotate using bcftools to add AF from gnomad (the reference that is used for mutect2)
    • Add soft filter, recommend AF < 0.001 and norm sample DP <= 7
    • Annotate hotspots (protected vcf final)
    • Run vcf2maf (protected maf final)
    • Hard filter selecting PASS or HotSpotAllele=1 (public vcf final)
    • Run vcf2maf on that output for public version (public maf final)
      We hope to have this reviewed and approved by the end of the week. After which the ops team will test a small subset to shake out any unexpected kinks, double check inputs. Then we will run on all PBTA. Run time should be around 20 minutes per caller, per sample. So a rosy estimate once we feel good about going full throttle:
      (1000 sample x 20 minutes)/700 concurrent jobs = less than an hour. Then adding metadata to files...prolly half a day maybe? Things like file export to bucket and KF portal registration are longer and I cannot predict, but if you can work off of cavatica until then, should be good.

@migbro
Copy link
Contributor

migbro commented Apr 13, 2021

Ah! I knew that estimate was a little too good, multiply that by 4. so more like 4 hours is the rosy estimate to generate outputs start-to-finish.

@jashapiro
Copy link
Member Author

Thanks for your work on this @migbro! I noted elsewhere, but apparently not in this issue that we will also need to update the TCGA set that we are using for comparison here. So the time estimate is going to be even longer, unfortunately.

@migbro
Copy link
Contributor

migbro commented Apr 15, 2021

Sure thing. The PR is approved, and @zhangb1 and @yuankunzhu are discussing today how to organize and run this request. As for adding TCGA to that request, I think we'd have to meet to see if we can make that happen, and we'd to loop in someone like @jharenza or someone else who was familiar with what inputs Teja used when she worked with these file sto see if it's doable.

@migbro
Copy link
Contributor

migbro commented Apr 22, 2021

Ok, a quick update on this. TCGA is done in terms of normalizing vcfs and creating per-caller maf files. There are two sets of outputs for each caller, one considered "protected" the other "public". This is because of a recent effort to attempt to employ "germline masking." For us, that entails dropping calls that have a norm depth < 8 and/or gnomad AD > 0.001 for the public version (except hotspots). Hotspots using v2 cancer hotspots are also marked, you can of course add more or just ignore them. The protected version will just have a soft filter - meaning calls meeting that criteria just have a not in the FILTER field, but are NOT dropped. I am not sure of what comes next, but the files are on cavatica here: https://cavatica.sbgenomics.com/u/kfdrc-harmonization/openpbta-tcga-ad-hoc-caller-rerun/
and I can add whomever needs to be added for the next steps.
As for PBTA, that normalization is in progress and if all goes well, it should be done by end of day. Then the ops team will add metadata to those files.

As for this protected vs public thing, D3b, for our public cavatica projects and provisional pedcbioportal projects will be using those public outputs. I am not sure if openPBTA is expected/encouraged to follow suit, but just putting it out there in case someone is wondering why there are two sets of outputs, etc. No one seems to have thought of the ramifications for groups outside of D3b yet, or given me any guidance on that 😬 Maybe we ask Adam R. about it?

@migbro
Copy link
Contributor

migbro commented Apr 23, 2021

Ok, these files:

-rw-rw-r-- 1 ubuntu ubuntu 202M Apr 23 21:57 pbta-snv-lancet.vep.maf.gz
-rw-rw-r-- 1 ubuntu ubuntu 772M Apr 23 22:01 pbta-snv-mutect2.vep.maf.gz
-rw-rw-r-- 1 ubuntu ubuntu 632M Apr 23 21:33 pbta-snv-strelka2.vep.maf.gz
-rw-rw-r-- 1 ubuntu ubuntu 268M Apr 23 21:58 pbta-snv-vardict.vep.maf.gz
-rw-rw-r-- 1 ubuntu ubuntu  15M Apr 23 21:36 pbta-tcga-snv-lancet.vep.maf.gz
-rw-rw-r-- 1 ubuntu ubuntu  13M Apr 23 21:52 pbta-tcga-snv-mutect2.vep.maf.gz
-rw-rw-r-- 1 ubuntu ubuntu  16M Apr 22 22:09 pbta-tcga-snv-strelka2.vep.maf.gz

have been uploaded here: https://s3.console.aws.amazon.com/s3/buckets/kf-openaccess-us-east-1-prd-pbta?prefix=data/ (as s3 uri: s3://kf-openaccess-us-east-1-prd-pbta/data/release-v19-20210423/)
I also updated the md5 files

These mafs are those public ones that I was talking about, as @jharenza has recommended that I merge those and add, which is nice to have that kind of synergy (-2 points for business jargon, I know!).
Qualitatively, you won't see too much dropped except for vardict. That got real slimmed down!
Please let me know if there are any issues!

@jharenza
Copy link
Collaborator

thank you both so much!!

@jashapiro
Copy link
Member Author

@migbro Can you clarify a couple of the added columns in the new MAF files? Specifically, what are MS and FETS? HotSpotAllele is pretty self explanatory.

There are also a few things that were removed, most notably Entrez IDs. I don't know if any packages use the Entrez IDs, but that is something we will have to look out for. (I note that validation & ExAC info is now removed as well, but those seemed to all be NA in the old data, so that seems totally reasonable!)

Noting also that we will need to update https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/doc/format/vep-maf.md with the new file format.

@jashapiro
Copy link
Member Author

jashapiro commented Apr 26, 2021

I assumed that the column names were consistent, but that doesn't seem to be the case!
MS and FETS are only in lancet
MQ, MQ0, and QSI are in strelka2
MBQ TLOD in mutect2
MSI MSILEN SOR SSF in vardict

Should the columns be in the MAF files at all, or can they be harmonized?

@migbro
Copy link
Contributor

migbro commented Apr 26, 2021

Hi @jashapiro , sure! So the added columns were columns that I subjectively thought might be helpful output that each caller does that others might not. As far as I know, the MAF spec does allow for users to add additional columns on top of the "core" set. Descriptions are from the vcf headers below
Lancet:
MS : Microsatellite mutation (format: #LEN#MOTIF)
FETS: phred-scaled p-value of the Fisher's exact test for tumor-normal allele counts (right-sided)
Strelka2:
MQ: RMS Mapping Quality
MQ0: Total Mapping Quality Zero Reads
QSI: Quality score for any somatic variant, ie. for the ALT haplotype to be present at a significantly different frequency in the tumor and normal
Mutect2:
MBQ: median base quality
TLOD: Log odds ratio score for variant
VarDict:
MSI: MicroSatellite. > 1 indicates MSI
MSILEN: MSI unit repeat length in bp
SOR: Odds ratio
SSF: P-value
They don't have to be in there, but there aren't any plans to create an version without it. I apologize if it causes some trouble. I had asked in separate ticket internal ticket if your consensus method would/could still work with these updates and I was assured they would.
If it helps, that all appear at the end so all the preceding columns are the same, so the additional columns all start right after vcf_pos in the maf

@jashapiro
Copy link
Member Author

jashapiro commented Apr 26, 2021

Thanks @migbro! Yeah, there are some small updates to the consensus scripts that I needed to make, but nothing major. I am dropping those columns for the consensus, but I don't think there is any problem leaving them in the individual caller files, as long as they are documented.

@jashapiro
Copy link
Member Author

@migbro One followup: it does seem that the Entrez_Gene_Id is standard based on https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/

As I said, I don't expect to use it, but if the column is expected by others it could definitely cause some trouble at times. Was this removal intentional?

@migbro
Copy link
Contributor

migbro commented Apr 26, 2021

Hi @jashapiro , yeah, that's an interesting point. So, it was intentional because we haven't been filling it in. I felt comfortable dropping it following what cBioPortal deemed as minimal for a maf https://docs.cbioportal.org/5.1-data-loading/data-loading/file-formats#minimal-maf-format, which is admittedly, not much! I can ping Kyle Hernandez from U Chicago to get his opinion on it, who was heavily involved in creating the GDC spec

@migbro
Copy link
Contributor

migbro commented Apr 26, 2021

@jashapiro , yes, you are right that is a required column (one of 30), and Kyle has pointed me to a definitive source for that. I will do the following today:

  • I will ensure that all required columns exists
  • Test output with the maf tools R package
  • Update openPBTA mafs
    Hopefully you can still work with what is there while I do this. Is it ok to overwrite the files once I have the new ones?

@jashapiro
Copy link
Member Author

@migbro Thanks for looking into it! Yes, it should be fine to overwrite the files. Just let me know when the new ones are up!

@migbro
Copy link
Contributor

migbro commented Apr 27, 2021

@jashapiro it's up!

@jashapiro
Copy link
Member Author

@jharenza The consensus files based on the updated MAFs are now uploaded to https://open-pbta.s3.us-east-1.amazonaws.com/data/snv-consensus/snv-consensus-20210427.zip

I have not checked them in detail, but they seem to be about the right size: a bit smaller than previously, but based on some of the shrinkage in the input files, that seems reasonable?

I hadn't fully processed how much VarDict results shrank! From 4.1 GB to ~270 MB!!! @migbro Am I understanding correctly that this is the result of on filtering on sequencing depth and population frequency? I'm kind of in shock that the change was that dramatic. I wouldn't have expected it to fall below all other callers from where it started, since I would expect that those filters should be the same across callers by site, and any true VarDict calls would be in at least one other caller.

If we weren't so far down the line in analysis, I might consider adding it back to the consensus... I'm running the caller comparison notebook now though, so that may give some insight. I don't actually think the change will be hard, so we might think about it for v20, depending on results.

@migbro
Copy link
Contributor

migbro commented Apr 27, 2021

@jashapiro I know right?! I took a peek at the CBTN portion, and summarized the FILTER events:

40961568 GNOMAD_AF_HIGH
1387325 GNOMAD_AF_HIGH;NORM_DP_LOW
  40598 NORM_DP_LOW
2434899 PASS

High gnomad > 0.001 filter is def the big culler of variants.

@jashapiro
Copy link
Member Author

@jashapiro I know right?! I took a peek at the CBTN portion, and summarized the FILTER events:

40961568 GNOMAD_AF_HIGH
1387325 GNOMAD_AF_HIGH;NORM_DP_LOW
  40598 NORM_DP_LOW
2434899 PASS

High gnomad > 0.001 filter is def the big culler of variants.

Yeah, but what I don't understand is why so many got culled from vardict that didn't get culled other places. Here is an upset plot from the previous data:
image
And with the new data:
image

(Other comparisons can be seen here: https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1033/files?short_path=726dc81#diff-726dc818248c6115d51dc51d7f200a1a19b23ecceaf77365ddb2c993057ad0e9)

If we focus on Strelka2, mutect2 and VarDict, we see that before they shared ~3.7 million calls. After, they share only 460K. So it seems like calls are being removed from VarDict that are surviving in mutect & Strelka2. This seems really odd to me, because those should all share the same gnomad frequencies.

By contrast, the mutect + strelka set goes up from 500K to 3.4 million, again suggesting that there are things lost from only VarDict.

Notably, the mutect + strelka + lancet set (our current consensus) stays about the same, so this won't have a big downstream effect at the moment. This makes me less inclined to revisit the consensus criteria.

@migbro
Copy link
Contributor

migbro commented Apr 27, 2021

Thanks for looking into this! In looking at the pre-pass file from vardict on a hunch, it seems that VarDict has it's own INFO/AF field, unrelated to gnomad AF tagging: ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">. If I had to guess, what isn't getting overwritten by gnomad AF annotation remains. Oh boy, quite the frustration. Luckily, or logically, the other callers don't have this INFO/AF field, so that's why they look reasonable.

@migbro
Copy link
Contributor

migbro commented Apr 27, 2021

So in short, I'd say your current consensus method is probably safe since you are excluding VarDict. In the meantime, I will modify how the annotation is done for VarDict first, but generally will likely strip any existing INFO/AF field and replace in so that only the intended INFO/AF field from gnomad is populated. Thanks again!

@migbro
Copy link
Contributor

migbro commented Apr 27, 2021

A quick update, so after adding in a pre-strip INFO step, the numbers look more "sane." A single sample example:
Vardict pre-strip:

  42331 GNOMAD_AF_HIGH
    349 GNOMAD_AF_HIGH;NORM_DP_LOW
     11 NORM_DP_LOW
   2179 PASS

Post strip:

  10661 GNOMAD_AF_HIGH
    101 GNOMAD_AF_HIGH;NORM_DP_LOW
    259 NORM_DP_LOW
  33849 PASS

For comparison, strelka2:

   1692 GNOMAD_AF_HIGH
     18 GNOMAD_AF_HIGH;NORM_DP_LOW
     34 NORM_DP_LOW
   3651 PASS

I'll re-run vardict with this new wf and update the vardict maf when completed.

@jashapiro jashapiro mentioned this issue Apr 27, 2021
5 tasks
@migbro
Copy link
Contributor

migbro commented Apr 28, 2021

Ok, vardict has been updated

@kgaonkar6
Copy link
Collaborator

thanks @migbro! I'll create the CI files for vardict.

@jashapiro just wanted to check if these tmb files in the link https://open-pbta.s3.us-east-1.amazonaws.com/data/snv-consensus/snv-consensus-20210427.zip should be renamed to the file naming convention used in v18?

filenames currrent link
pbta-snv-mutation-tmb-all.tsv
pbta-snv-mutation-tmb-coding.tsv

filenames in v18:
pbta-snv-consensus-mutation-tmb-all.tsv
pbta-snv-consensus-mutation-tmb-coding.tsv

It seems to me the CI script will still work since it looks for all pbta-snv* with the current names though.

@jashapiro
Copy link
Member Author

@jashapiro just wanted to check if these tmb files in the link https://open-pbta.s3.us-east-1.amazonaws.com/data/snv-consensus/snv-consensus-20210427.zip should be renamed to the file naming convention used in v18?

filenames currrent link
pbta-snv-mutation-tmb-all.tsv
pbta-snv-mutation-tmb-coding.tsv

filenames in v18:
pbta-snv-consensus-mutation-tmb-all.tsv
pbta-snv-consensus-mutation-tmb-coding.tsv

It seems to me the CI script will still work since it looks for all pbta-snv* with the current names though.

Yes, we should keep the file names the same as before! Sorry about that: I didn't confirm that the script output file names matched.

@jashapiro
Copy link
Member Author

With the changes to normalized vcfs and updated files associated with the release of v19 #1026, I believe this can be closed. While it is possible that more improvements could be made, I think that would require substantial time and effort to fully assess and implement.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

4 participants