Updated analysis: SNV consensus calling (for things other than substitutions) #990

jashapiro · 2021-04-09T19:10:29Z

What analysis module should be updated and why?

snv-callers, and in particular the generation of the consensus MAF file.

What changes need to be made? Please provide enough detail for another participant to make the update.

As noted in another review (#961 (review)), the current SNV consensus script may not handle combining indel variants correctly. Matching of individual calls across callers is currently performed only on the basis of start location, on the erroneous assumption that all calls are a single nucleotide in length. This is true for SNVs (and MNVs, all of which are reduced to SNVs before merging), but does not apply to the small indels that might occur.

INDEL variants should probably be treated entirely separately from substitutions, but the logic of how to do that is something that requires some investigation, as there are many possible edge cases.

The simple case is that start and end positions match across all three callers. Those we are almost certainly capturing, but they are probably being mixed up a bit with cases where...
Start matches, but callers choose different end positions. (And vice versa). These should be rare, but are certainly possible if one caller decides a deletion is 5 bp and another 6bp, for example. I guess I would go with majority rule among the callers, if possible, in this case?
Overlapping deletion calls with no shared start and end. I expect this to happen in cases where the first base(s) of a deletion is repeated at the end: For example, a CATATAT -> CAT deletion could start at position 2 or position 4 and different callers might make different, equally correct, decisions. The nice thing in this case is that they should match as to the reference and alt alleles. (Both are ATAT -> _, but this might not always be the case in situations like this.
'Overlapping' insertions could be quite difficult to detect! A change of CAT -> CAAAT might put the inserted bases before or after the original A, and I don't know that we would be able to easily identify that...

This is not the complete list of edge cases.

We should start by pulling out all indel calls from strelka, mutect and lancet so we can investigate the properties a bit more carefully, then decide on a set of heuristics to guide the generation of an INDEL consensus.

I will also note that one option is to simply "trust" one caller for INDELs and use its calls exclusively, avoiding many of these potential issues. This has the advantage of being easy, but has obvious downsides as well!

What input data should be used? Which data were used in the version being updated?

maf files from strelka, mutect and lancet.

When do you expect the revised analysis will be completed?

1-2 weeks? Depends on results of initial investigations, and discussions about methods.

Who will complete the updated analysis?

@jashapiro, hopefully with input/discussion from others! (Tagging @kgaonkar6 in particular, as we will want to implement similar logic in #961 )

The text was updated successfully, but these errors were encountered:

kgaonkar6 · 2021-04-13T16:35:02Z

I hadn't thought about all the possible representation for indels , but this is a great point of interest for consensus calling! I looked at what people have done before and found this normalization code from bcbio:
https://github.com/hongenxu/bcbio-nextgen/blob/master/bcbio/variation/normalize.py. They also do consensus ( which they call ensembl) calling and normalize their vcf calls by:

Split multi-allelic variants into single sample records.
Downstream tools like GEMINI don't handle multi-allelic variants.
normalize splits these into multiple single allele variants.
More generally, multi-allelic variants are tricky to represent in
comparisons and storage. There are useful discussions on-going in the
GA4GH about allele/count based approaches to handling this more
generally:
Multiple alts per variant? ga4gh/ga4gh-schemas#169
There are multiple approaches to do the decomposition:
- vt decompose: https://github.com/atks/vt
- vcfbreakmulti: https://github.com/ekg/vcflib
- bcftools norm -m '-any': https://samtools.github.io/bcftools/bcftools.html
- vcf_parser --split: https://github.com/moonso/vcf_parser https://github.com/moonso/genmod
  vt handles the most cases cleanly and correctly reconstructs PLs for
  each genotype, removing other FORMAT items which are not changed so
  the resulting VCF is still valid.
Decompose biallelic block substitutions
As part of normalization, decompose MNPs into SNPs, e.g. AG>CT -> A>C and G>T
There are a few approaches:
- vcfallelicprimitives -t DECOMPOSED --keep-geno: https://github.com/vcflib/vcflib#vcfallelicprimitives
- vt decompose_blocksub: https://genome.sph.umich.edu/wiki/Vt#Decompose_biallelic_block_substitutions
  with -a, align/aggressive mode is turned on, which runs 3-4 longer but resolves
  additional clumped SNPs
  We are using vcfallelicprimitives in this module.
Normalizing indels
Left-align and normalize indels, check if REF alleles match the reference.
- bcftools norm: https://samtools.github.io/bcftools/bcftools.html
- vt normalize: https://github.com/atks/vt
  We are using vt normalize in this module.

I think we already do MNV to SNV in the consensus calling , in addition , they decompose biallelic substitutions and finally vt normalize to normalize the indel calls in vcf from all callers.

I also found some documentation and the pseudo-code here

Would this be something that can be of interest to run upstream before we process the variant calls in the repo to fix this issue of indel representation (and possibly will also fix MNVs upstream) or potentially use their pseudo-code to implement it for maf inputs?

Similar to the above bcbio normalization, normalization is also being developed as a step in our internal D3b consensus calling by @migbro as follows ( let me know if I missed something) :

bcftools norm -m '-any': https://samtools.github.io/bcftools/bcftools.html
vcfallelicprimitives -t DECOMPOSED --keep-geno: https://github.com/vcflib/vcflib#vcfallelicprimitives
vt normalize: https://github.com/atks/vt

He did share an output for the normalization are here's an example normalization a couple deletions below 🤔 .

Raw mutect2 maf

Hugo_Symbol	Center	NCBI_Build	Chromosome	Start_Position	End_Position	Strand	Variant_Classification	Variant_Type	Reference_Allele	Tumor_Seq_Allele1	Tumor_Seq_Allele2	dbSNP_RS	dbSNP_Val_Status	Tumor_Sample_Barcode	Matched_Norm_Sample_Barcode	Match_Norm_Seq_Allele1	Match_Norm_Seq_Allele2	Tumor_Validation_Allele1	Tumor_Validation_Allele2	Match_Norm_Validation_Allele1	Match_Norm_Validation_Allele2	Verification_Status	Validation_Status	Mutation_Status	Sequencing_Phase	Sequence_Source	Validation_Method	Score	BAM_File	Sequencer	Tumor_Sample_UUID	Matched_Norm_Sample_UUID	HGVSc	HGVSp	HGVSp_Short	Transcript_ID	Exon_Number	t_depth	t_ref_count	t_alt_count	n_depth	n_ref_count	n_alt_count	all_effects	Allele	Gene	Feature	Feature_type
LGR6	.	GRCh38	chr1	202237984	202237984	+	Intron	DEL	A	A	-	rs113958266	NA	BS_M5FM63EB	BS_9H6Z0MEG	A	A	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	c.428+1991del	NA	NA	ENST00000367278	NA	72	57	5	28	19	9	LGR6,intron_variant,,ENST00000255432,NM_021636.2;LGR6,intron_variant,,ENST00000367278,NM_001017403.1;LGR6,intron_variant,,ENST00000423542,;LGR6,intron_variant,,ENST00000439764,NM_001017404.1;LGR6,intron_variant,,ENST00000487787,;LGR6,downstream_gene_variant,,ENST00000503519,;	C	ENSG00000133067	ENST00000367278	Transcript
LGR6	.	GRCh38	chr1	202239345	202239350	+	Intron	DEL	GTGTGT	GTGTGT	-	rs1460759624	NA	BS_M5FM63EB	BS_9H6Z0MEG	GTGTGT	GTGTGT	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	c.428+3352_428+3357del	NA	NA	ENST00000367278	NA	68	58	2	16	15	1	LGR6,intron_variant,,ENST00000255432,NM_021636.2;LGR6,intron_variant,,ENST00000367278,NM_001017403.1;LGR6,intron_variant,,ENST00000423542,;LGR6,intron_variant,,ENST00000439764,NM_001017404.1;LGR6,intron_variant,,ENST00000487787,;LGR6,downstream_gene_variant,,ENST00000503519,;	-	ENSG00000133067	ENST00000367278	Transcript

Normalized mutect2 vcf in maf format

Hugo_Symbol	Center	NCBI_Build	Chromosome	Start_Position	End_Position	Strand	Variant_Classification	Variant_Type	Reference_Allele	Tumor_Seq_Allele1	Tumor_Seq_Allele2	dbSNP_RS	dbSNP_Val_Status	Tumor_Sample_Barcode	Matched_Norm_Sample_Barcode	Match_Norm_Seq_Allele1	Match_Norm_Seq_Allele2	Tumor_Sample_UUID	Matched_Norm_Sample_UUID	HGVSc	HGVSp	HGVSp_Short	Transcript_ID	Exon_Number	t_depth	t_ref_count	t_alt_count	n_depth	n_ref_count	n_alt_count	all_effects	Allele	Gene	Feature	Feature_type	Consequence	cDNA_position	CDS_position	Protein_position	Amino_acids	Codons	Existing_variation	ALLELE_NUM	DISTANCE	STRAND_VEP	SYMBOL	SYMBOL_SOURCE	HGNC_ID	BIOTYPE
LGR6	.	GRCh38	chr1	202237984	202237984	+	Intron	DEL	A	A	-	rs113958266	NA	BS_M5FM63EB	BS_9H6Z0MEG	A	A	NA	NA	c.428+2005del	NA	NA	ENST00000367278	NA	72	57	5	28	19	9	LGR6,intron_variant,,ENST00000255432,NM_021636.2;LGR6,intron_variant,,ENST00000367278,NM_001017403.1;LGR6,intron_variant,,ENST00000423542,;LGR6,intron_variant,,ENST00000439764,NM_001017404.1;LGR6,intron_variant,,ENST00000487787,;LGR6,downstream_gene_variant,,ENST00000503519,;	-	ENSG00000133067	ENST00000367278	Transcript	intron_variant	NA	NA	NA	NA	NA	rs113958266	1	NA	1	LGR6	HGNC	HGNC:19719	protein_coding
LGR6	.	GRCh38	chr1	202237983	202237983	+	Intron	SNP	C	C	A	rs76646640	NA	BS_M5FM63EB	BS_9H6Z0MEG	C	C	NA	NA	c.428+1990C>A	NA	NA	ENST00000367278	NA	72	57	5	28	19	0	LGR6,intron_variant,,ENST00000255432,NM_021636.2;LGR6,intron_variant,,ENST00000367278,NM_001017403.1;LGR6,intron_variant,,ENST00000423542,;LGR6,intron_variant,,ENST00000439764,NM_001017404.1;LGR6,intron_variant,,ENST00000487787,;LGR6,downstream_gene_variant,,ENST00000503519,;	A	ENSG00000133067	ENST00000367278	Transcript	intron_variant	NA	NA	NA	NA	NA	rs76646640	1	NA	1	LGR6	HGNC	HGNC:19719	protein_coding
LGR6	.	GRCh38	chr1	202237983	202237984	+	Intron	INS	-	-	A	rs5780111	NA	BS_M5FM63EB	BS_9H6Z0MEG	-	-	NA	NA	c.428+2005dup	NA	NA	ENST00000367278	NA	72	57	5	28	19	0	LGR6,intron_variant,,ENST00000255432,NM_021636.2;LGR6,intron_variant,,ENST00000367278,NM_001017403.1;LGR6,intron_variant,,ENST00000423542,;LGR6,intron_variant,,ENST00000439764,NM_001017404.1;LGR6,intron_variant,,ENST00000487787,;LGR6,downstream_gene_variant,,ENST00000503519,;	A	ENSG00000133067	ENST00000367278	Transcript	intron_variant	NA	NA	NA	NA	NA	rs5780111	1	NA	1	LGR6	HGNC	HGNC:19719	protein_coding
LGR6	.	GRCh38	chr1	202239345	202239350	+	Intron	DEL	GTGTGT	GTGTGT	-	rs1460759624	NA	BS_M5FM63EB	BS_9H6Z0MEG	GTGTGT	GTGTGT	NA	NA	c.428+3387_428+3392del	NA	NA	ENST00000367278	NA	68	58	2	16	15	1	LGR6,intron_variant,,ENST00000255432,NM_021636.2;LGR6,intron_variant,,ENST00000367278,NM_001017403.1;LGR6,intron_variant,,ENST00000423542,;LGR6,intron_variant,,ENST00000439764,NM_001017404.1;LGR6,intron_variant,,ENST00000487787,;LGR6,downstream_gene_variant,,ENST00000503519,;	-	ENSG00000133067	ENST00000367278	Transcript	intron_variant	NA	NA	NA	NA	NA	rs1460759624	1	NA	1	LGR6	HGNC	HGNC:19719	protein_coding
LGR6	.	GRCh38	chr1	202239345	202239346	+	Intron	DEL	GT	GT	-	rs1392575432	NA	BS_M5FM63EB	BS_9H6Z0MEG	GT	GT	NA	NA	c.428+3391_428+3392del	NA	NA	ENST00000367278	NA	68	58	8	16	15	0	LGR6,intron_variant,,ENST00000255432,NM_021636.2;LGR6,intron_variant,,ENST00000367278,NM_001017403.1;LGR6,intron_variant,,ENST00000423542,;LGR6,intron_variant,,ENST00000439764,NM_001017404.1;LGR6,intron_variant,,ENST00000487787,;LGR6,downstream_gene_variant,,ENST00000503519,;	-	ENSG00000133067	ENST00000367278	Transcript	intron_variant	NA	NA	NA	NA	NA	rs1392575432	1	NA	1	LGR6	HGNC	HGNC:19719	protein_coding

jashapiro · 2021-04-13T18:58:24Z

Thanks @kgaonkar6! I think you see some of the issues I was thinking about. I think the MNV issue should be fine; we were sort of saved by the fact that strelka2 does not call any MNVs, so we had to decompose the others right off. But INDELs remain an issue that I guess we never really resolved.

It does seem to me that applying a normalization algorithm to all of the vcf files would be worth doing for these data, if it was not done already.

That said, I'm definitely confused by some of the output from normalization you are presenting from @migbro.

The first site was something I also noticed with mutect2 (and only mutect2) where it would call a DEL but the 'Allele' column reflected a SNP (which is not the same as either Tumor_Seq_Allele) Where does the C come from there? Is this data that was in the VCF but did not make it to the MAF? If so, why do they all end up with the same level of support as shown in t_ref and t_alt?) This may be a mutect2 specific thing? I will also note that mutect2 seems to have an odd preference for even sized deletions: it was the only caller with more 2 nt deletions than 1 nt, so I kind of don't trust it and worry there may have been some underlying bugs.
Similarly, I don't understand why the second deletion is being split into two overlapping deletions. In this case there is at least different levels of support for the two introns, so can I assume there is some data in the VCF that caused this which we could not see in the MAF?

If the answers to my confusion involve the original VCF files, then it seems like we will have to have you do a normalization step, which would potentially improve some of the indel consensus calls.

Once that is done we would probably want to do some quick analysis of the hotspot calls to see what effects requiring 2 or 3 callers to agree might have. I remain concerned about including variants even in the "scavenged" set that are only called by
a single caller, especially if that caller is VarDict.

jharenza · 2021-04-13T20:29:12Z

If the answers to my confusion involve the original VCF files, then it seems like we will have to have you do a normalization step, which would potentially improve some of the indel consensus calls.

@migbro and @yuankunzhu - do we have this performed pre-d3b consensus?

migbro · 2021-04-13T20:59:57Z

Not yet. I currently have in review a workflow to bring our vcfs to a new spec. This means the following will happen:

Normalize the vcf using the above described method
Run annotation subwf:
- add standard FORMAT fields: ONLY FOR STRELKA2 (GT, AF, AD). This helps for downstream compatibility for some tools
- VEP annotate
- Annotate using bcftools to add AF from gnomad (the reference that is used for mutect2)
- Add soft filter, recommend AF < 0.001 and norm sample DP <= 7
- Annotate hotspots (protected vcf final)
- Run vcf2maf (protected maf final)
- Hard filter selecting PASS or HotSpotAllele=1 (public vcf final)
- Run vcf2maf on that output for public version (public maf final)
  We hope to have this reviewed and approved by the end of the week. After which the ops team will test a small subset to shake out any unexpected kinks, double check inputs. Then we will run on all PBTA. Run time should be around 20 minutes per caller, per sample. So a rosy estimate once we feel good about going full throttle:
  (1000 sample x 20 minutes)/700 concurrent jobs = less than an hour. Then adding metadata to files...prolly half a day maybe? Things like file export to bucket and KF portal registration are longer and I cannot predict, but if you can work off of cavatica until then, should be good.

migbro · 2021-04-13T21:13:22Z

Ah! I knew that estimate was a little too good, multiply that by 4. so more like 4 hours is the rosy estimate to generate outputs start-to-finish.

jashapiro · 2021-04-14T14:58:59Z

Thanks for your work on this @migbro! I noted elsewhere, but apparently not in this issue that we will also need to update the TCGA set that we are using for comparison here. So the time estimate is going to be even longer, unfortunately.

migbro · 2021-04-15T15:08:45Z

Sure thing. The PR is approved, and @zhangb1 and @yuankunzhu are discussing today how to organize and run this request. As for adding TCGA to that request, I think we'd have to meet to see if we can make that happen, and we'd to loop in someone like @jharenza or someone else who was familiar with what inputs Teja used when she worked with these file sto see if it's doable.

migbro · 2021-04-22T15:35:46Z

Ok, a quick update on this. TCGA is done in terms of normalizing vcfs and creating per-caller maf files. There are two sets of outputs for each caller, one considered "protected" the other "public". This is because of a recent effort to attempt to employ "germline masking." For us, that entails dropping calls that have a norm depth < 8 and/or gnomad AD > 0.001 for the public version (except hotspots). Hotspots using v2 cancer hotspots are also marked, you can of course add more or just ignore them. The protected version will just have a soft filter - meaning calls meeting that criteria just have a not in the FILTER field, but are NOT dropped. I am not sure of what comes next, but the files are on cavatica here: https://cavatica.sbgenomics.com/u/kfdrc-harmonization/openpbta-tcga-ad-hoc-caller-rerun/
and I can add whomever needs to be added for the next steps.
As for PBTA, that normalization is in progress and if all goes well, it should be done by end of day. Then the ops team will add metadata to those files.

As for this protected vs public thing, D3b, for our public cavatica projects and provisional pedcbioportal projects will be using those public outputs. I am not sure if openPBTA is expected/encouraged to follow suit, but just putting it out there in case someone is wondering why there are two sets of outputs, etc. No one seems to have thought of the ramifications for groups outside of D3b yet, or given me any guidance on that 😬 Maybe we ask Adam R. about it?

migbro · 2021-04-23T22:17:38Z

Ok, these files:

-rw-rw-r-- 1 ubuntu ubuntu 202M Apr 23 21:57 pbta-snv-lancet.vep.maf.gz
-rw-rw-r-- 1 ubuntu ubuntu 772M Apr 23 22:01 pbta-snv-mutect2.vep.maf.gz
-rw-rw-r-- 1 ubuntu ubuntu 632M Apr 23 21:33 pbta-snv-strelka2.vep.maf.gz
-rw-rw-r-- 1 ubuntu ubuntu 268M Apr 23 21:58 pbta-snv-vardict.vep.maf.gz
-rw-rw-r-- 1 ubuntu ubuntu  15M Apr 23 21:36 pbta-tcga-snv-lancet.vep.maf.gz
-rw-rw-r-- 1 ubuntu ubuntu  13M Apr 23 21:52 pbta-tcga-snv-mutect2.vep.maf.gz
-rw-rw-r-- 1 ubuntu ubuntu  16M Apr 22 22:09 pbta-tcga-snv-strelka2.vep.maf.gz

have been uploaded here: https://s3.console.aws.amazon.com/s3/buckets/kf-openaccess-us-east-1-prd-pbta?prefix=data/ (as s3 uri: s3://kf-openaccess-us-east-1-prd-pbta/data/release-v19-20210423/)
I also updated the md5 files

These mafs are those public ones that I was talking about, as @jharenza has recommended that I merge those and add, which is nice to have that kind of synergy (-2 points for business jargon, I know!).
Qualitatively, you won't see too much dropped except for vardict. That got real slimmed down!
Please let me know if there are any issues!

jharenza · 2021-04-23T23:07:47Z

thank you both so much!!

jashapiro · 2021-04-26T13:17:57Z

@migbro Can you clarify a couple of the added columns in the new MAF files? Specifically, what are MS and FETS? HotSpotAllele is pretty self explanatory.

There are also a few things that were removed, most notably Entrez IDs. I don't know if any packages use the Entrez IDs, but that is something we will have to look out for. (I note that validation & ExAC info is now removed as well, but those seemed to all be NA in the old data, so that seems totally reasonable!)

Noting also that we will need to update https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/doc/format/vep-maf.md with the new file format.

jashapiro · 2021-04-26T14:12:21Z

I assumed that the column names were consistent, but that doesn't seem to be the case!
MS and FETS are only in lancet
MQ, MQ0, and QSI are in strelka2
MBQ TLOD in mutect2
MSI MSILEN SOR SSF in vardict

Should the columns be in the MAF files at all, or can they be harmonized?

migbro · 2021-04-26T14:26:58Z

Hi @jashapiro , sure! So the added columns were columns that I subjectively thought might be helpful output that each caller does that others might not. As far as I know, the MAF spec does allow for users to add additional columns on top of the "core" set. Descriptions are from the vcf headers below
Lancet:
MS : Microsatellite mutation (format: #LEN#MOTIF)
FETS: phred-scaled p-value of the Fisher's exact test for tumor-normal allele counts (right-sided)
Strelka2:
MQ: RMS Mapping Quality
MQ0: Total Mapping Quality Zero Reads
QSI: Quality score for any somatic variant, ie. for the ALT haplotype to be present at a significantly different frequency in the tumor and normal
Mutect2:
MBQ: median base quality
TLOD: Log odds ratio score for variant
VarDict:
MSI: MicroSatellite. > 1 indicates MSI
MSILEN: MSI unit repeat length in bp
SOR: Odds ratio
SSF: P-value
They don't have to be in there, but there aren't any plans to create an version without it. I apologize if it causes some trouble. I had asked in separate ticket internal ticket if your consensus method would/could still work with these updates and I was assured they would.
If it helps, that all appear at the end so all the preceding columns are the same, so the additional columns all start right after vcf_pos in the maf

jashapiro · 2021-04-26T14:30:55Z

Thanks @migbro! Yeah, there are some small updates to the consensus scripts that I needed to make, but nothing major. I am dropping those columns for the consensus, but I don't think there is any problem leaving them in the individual caller files, as long as they are documented.

jashapiro · 2021-04-26T15:14:00Z

@migbro One followup: it does seem that the Entrez_Gene_Id is standard based on https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/

As I said, I don't expect to use it, but if the column is expected by others it could definitely cause some trouble at times. Was this removal intentional?

migbro · 2021-04-26T15:24:27Z

Hi @jashapiro , yeah, that's an interesting point. So, it was intentional because we haven't been filling it in. I felt comfortable dropping it following what cBioPortal deemed as minimal for a maf https://docs.cbioportal.org/5.1-data-loading/data-loading/file-formats#minimal-maf-format, which is admittedly, not much! I can ping Kyle Hernandez from U Chicago to get his opinion on it, who was heavily involved in creating the GDC spec

migbro · 2021-04-26T15:56:33Z

@jashapiro , yes, you are right that is a required column (one of 30), and Kyle has pointed me to a definitive source for that. I will do the following today:

I will ensure that all required columns exists
Test output with the maf tools R package
Update openPBTA mafs
Hopefully you can still work with what is there while I do this. Is it ok to overwrite the files once I have the new ones?

jashapiro · 2021-04-26T15:59:02Z

@migbro Thanks for looking into it! Yes, it should be fine to overwrite the files. Just let me know when the new ones are up!

migbro · 2021-04-27T02:01:40Z

@jashapiro it's up!

jashapiro · 2021-04-27T14:11:48Z

@jharenza The consensus files based on the updated MAFs are now uploaded to https://open-pbta.s3.us-east-1.amazonaws.com/data/snv-consensus/snv-consensus-20210427.zip

I have not checked them in detail, but they seem to be about the right size: a bit smaller than previously, but based on some of the shrinkage in the input files, that seems reasonable?

I hadn't fully processed how much VarDict results shrank! From 4.1 GB to ~270 MB!!! @migbro Am I understanding correctly that this is the result of on filtering on sequencing depth and population frequency? I'm kind of in shock that the change was that dramatic. I wouldn't have expected it to fall below all other callers from where it started, since I would expect that those filters should be the same across callers by site, and any true VarDict calls would be in at least one other caller.

If we weren't so far down the line in analysis, I might consider adding it back to the consensus... I'm running the caller comparison notebook now though, so that may give some insight. I don't actually think the change will be hard, so we might think about it for v20, depending on results.

migbro · 2021-04-27T15:21:57Z

@jashapiro I know right?! I took a peek at the CBTN portion, and summarized the FILTER events:

40961568 GNOMAD_AF_HIGH
1387325 GNOMAD_AF_HIGH;NORM_DP_LOW
  40598 NORM_DP_LOW
2434899 PASS

High gnomad > 0.001 filter is def the big culler of variants.

jashapiro · 2021-04-27T17:02:47Z

@jashapiro I know right?! I took a peek at the CBTN portion, and summarized the FILTER events:
40961568 GNOMAD_AF_HIGH
1387325 GNOMAD_AF_HIGH;NORM_DP_LOW
  40598 NORM_DP_LOW
2434899 PASS
High gnomad > 0.001 filter is def the big culler of variants.

Yeah, but what I don't understand is why so many got culled from vardict that didn't get culled other places. Here is an upset plot from the previous data:

And with the new data:

(Other comparisons can be seen here: https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1033/files?short_path=726dc81#diff-726dc818248c6115d51dc51d7f200a1a19b23ecceaf77365ddb2c993057ad0e9)

If we focus on Strelka2, mutect2 and VarDict, we see that before they shared ~3.7 million calls. After, they share only 460K. So it seems like calls are being removed from VarDict that are surviving in mutect & Strelka2. This seems really odd to me, because those should all share the same gnomad frequencies.

By contrast, the mutect + strelka set goes up from 500K to 3.4 million, again suggesting that there are things lost from only VarDict.

Notably, the mutect + strelka + lancet set (our current consensus) stays about the same, so this won't have a big downstream effect at the moment. This makes me less inclined to revisit the consensus criteria.

migbro · 2021-04-27T17:40:17Z

Thanks for looking into this! In looking at the pre-pass file from vardict on a hunch, it seems that VarDict has it's own INFO/AF field, unrelated to gnomad AF tagging: ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">. If I had to guess, what isn't getting overwritten by gnomad AF annotation remains. Oh boy, quite the frustration. Luckily, or logically, the other callers don't have this INFO/AF field, so that's why they look reasonable.

migbro · 2021-04-27T17:41:40Z

So in short, I'd say your current consensus method is probably safe since you are excluding VarDict. In the meantime, I will modify how the annotation is done for VarDict first, but generally will likely strip any existing INFO/AF field and replace in so that only the intended INFO/AF field from gnomad is populated. Thanks again!

migbro · 2021-04-27T20:23:46Z

A quick update, so after adding in a pre-strip INFO step, the numbers look more "sane." A single sample example:
Vardict pre-strip:

  42331 GNOMAD_AF_HIGH
    349 GNOMAD_AF_HIGH;NORM_DP_LOW
     11 NORM_DP_LOW
   2179 PASS

Post strip:

  10661 GNOMAD_AF_HIGH
    101 GNOMAD_AF_HIGH;NORM_DP_LOW
    259 NORM_DP_LOW
  33849 PASS

For comparison, strelka2:

   1692 GNOMAD_AF_HIGH
     18 GNOMAD_AF_HIGH;NORM_DP_LOW
     34 NORM_DP_LOW
   3651 PASS

I'll re-run vardict with this new wf and update the vardict maf when completed.

migbro · 2021-04-28T00:07:50Z

Ok, vardict has been updated

kgaonkar6 · 2021-04-28T13:09:29Z

thanks @migbro! I'll create the CI files for vardict.

@jashapiro just wanted to check if these tmb files in the link https://open-pbta.s3.us-east-1.amazonaws.com/data/snv-consensus/snv-consensus-20210427.zip should be renamed to the file naming convention used in v18?

filenames currrent link
pbta-snv-mutation-tmb-all.tsv
pbta-snv-mutation-tmb-coding.tsv

filenames in v18:
pbta-snv-consensus-mutation-tmb-all.tsv
pbta-snv-consensus-mutation-tmb-coding.tsv

It seems to me the CI script will still work since it looks for all pbta-snv* with the current names though.

jashapiro · 2021-04-28T13:16:58Z

@jashapiro just wanted to check if these tmb files in the link https://open-pbta.s3.us-east-1.amazonaws.com/data/snv-consensus/snv-consensus-20210427.zip should be renamed to the file naming convention used in v18?

filenames currrent link
pbta-snv-mutation-tmb-all.tsv
pbta-snv-mutation-tmb-coding.tsv

filenames in v18:
pbta-snv-consensus-mutation-tmb-all.tsv
pbta-snv-consensus-mutation-tmb-coding.tsv

It seems to me the CI script will still work since it looks for all pbta-snv* with the current names though.

Yes, we should keep the file names the same as before! Sorry about that: I didn't confirm that the script output file names matched.

jashapiro · 2021-05-25T17:30:46Z

With the changes to normalized vcfs and updated files associated with the release of v19 #1026, I believe this can be closed. While it is possible that more improvements could be made, I think that would require substantial time and effort to fully assess and implement.

jashapiro added the updated analysis label Apr 9, 2021

jashapiro self-assigned this Apr 9, 2021

jharenza added the blocking release label Apr 13, 2021

jharenza mentioned this issue Apr 23, 2021

INDEL renormalization of MAFs #1024

Closed

jharenza mentioned this issue Apr 23, 2021

Updated manuscript: SNV calls AlexsLemonade/OpenPBTA-manuscript#130

Closed

jashapiro mentioned this issue Apr 26, 2021

#819 Part2 : Scavenge back hotspots to add to consensus calls #961

Merged

5 tasks

jashapiro mentioned this issue Apr 27, 2021

Update MAF fields for v19 SNV consensus #1033

Merged

5 tasks

jashapiro mentioned this issue Apr 27, 2021

V19 ci files #1038

Merged

5 tasks

This was referenced May 3, 2021

PBTA Histologies v19 Part 8 of N: Run molecular subtyping #1028

Merged

#819 Part1 update: Script to add maf col types #1050

Merged

jashapiro closed this as completed May 25, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Updated analysis: SNV consensus calling (for things other than substitutions) #990

Updated analysis: SNV consensus calling (for things other than substitutions) #990

jashapiro commented Apr 9, 2021

kgaonkar6 commented Apr 13, 2021 •

edited

Loading

jashapiro commented Apr 13, 2021

jharenza commented Apr 13, 2021

migbro commented Apr 13, 2021

migbro commented Apr 13, 2021

jashapiro commented Apr 14, 2021

migbro commented Apr 15, 2021

migbro commented Apr 22, 2021

migbro commented Apr 23, 2021

jharenza commented Apr 23, 2021

jashapiro commented Apr 26, 2021

jashapiro commented Apr 26, 2021 •

edited

Loading

migbro commented Apr 26, 2021

jashapiro commented Apr 26, 2021 •

edited

Loading

jashapiro commented Apr 26, 2021

migbro commented Apr 26, 2021

migbro commented Apr 26, 2021 •

edited

Loading

jashapiro commented Apr 26, 2021

migbro commented Apr 27, 2021

jashapiro commented Apr 27, 2021

migbro commented Apr 27, 2021 •

edited

Loading

jashapiro commented Apr 27, 2021

migbro commented Apr 27, 2021

migbro commented Apr 27, 2021

migbro commented Apr 27, 2021

migbro commented Apr 28, 2021

kgaonkar6 commented Apr 28, 2021

jashapiro commented Apr 28, 2021

jashapiro commented May 25, 2021

Updated analysis: SNV consensus calling (for things other than substitutions) #990

Updated analysis: SNV consensus calling (for things other than substitutions) #990

Comments

jashapiro commented Apr 9, 2021

What analysis module should be updated and why?

What changes need to be made? Please provide enough detail for another participant to make the update.

What input data should be used? Which data were used in the version being updated?

When do you expect the revised analysis will be completed?

Who will complete the updated analysis?

kgaonkar6 commented Apr 13, 2021 • edited Loading

jashapiro commented Apr 13, 2021

jharenza commented Apr 13, 2021

migbro commented Apr 13, 2021

migbro commented Apr 13, 2021

jashapiro commented Apr 14, 2021

migbro commented Apr 15, 2021

migbro commented Apr 22, 2021

migbro commented Apr 23, 2021

jharenza commented Apr 23, 2021

jashapiro commented Apr 26, 2021

jashapiro commented Apr 26, 2021 • edited Loading

migbro commented Apr 26, 2021

jashapiro commented Apr 26, 2021 • edited Loading

jashapiro commented Apr 26, 2021

migbro commented Apr 26, 2021

migbro commented Apr 26, 2021 • edited Loading

jashapiro commented Apr 26, 2021

migbro commented Apr 27, 2021

jashapiro commented Apr 27, 2021

migbro commented Apr 27, 2021 • edited Loading

jashapiro commented Apr 27, 2021

migbro commented Apr 27, 2021

migbro commented Apr 27, 2021

migbro commented Apr 27, 2021

migbro commented Apr 28, 2021

kgaonkar6 commented Apr 28, 2021

jashapiro commented Apr 28, 2021

jashapiro commented May 25, 2021

kgaonkar6 commented Apr 13, 2021 •

edited

Loading

jashapiro commented Apr 26, 2021 •

edited

Loading

jashapiro commented Apr 26, 2021 •

edited

Loading

migbro commented Apr 26, 2021 •

edited

Loading

migbro commented Apr 27, 2021 •

edited

Loading