insertion REF alleles and deletion ALT alleles are always set to "N" #422

Open
eblerjana opened this issue Aug 28, 2023 · 4 comments
@eblerjana

Hi,

I'm working with SV callsets produced by sniffles2 (v2.0.7). For deletions, the ALT sequence in the VCFs is always set to N, while the REF field contains the reference sequence. For insertions, it is the other way around: the REF allele is always N (which does not match the actual reference sequence at these positions). This leads to several problems when applying tools like bcftools to post-process these VCFs (e.g. bcftools norm --check-ref reports mismatches with the reference genome when REF is set to N).

To me, it looks like the N in the REF/ALT field represents an empty sequence. If this is the case, can I fix my sniffles2 VCFs by prepending the reference base that precedes the variant to both the REF and ALT alleles, as required by the VCF specification (4.2):

For simple insertions and deletions in which either the REF or one of the ALT alleles would otherwise be null/empty, the REF and ALT Strings must include the base before the event (which must be reflected in the POS field)

Or is there already a script that fixes the VCFs so that they contain the actual reference sequence instead of N?
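
The padding fix described above could be sketched roughly as follows. This is not a sniffles or bcftools feature, just an illustration. The function name `pad_record` is hypothetical, and it assumes (unverified for sniffles2 output) that a lone "N" stands for an empty allele and that POS is the 1-based position of the first affected base. The check on REF is only a guard against applying the shift with the wrong coordinate convention.

```python
def pad_record(pos, ref, alt, ref_seq):
    """Prepend the base before the event to REF and ALT and shift POS left by one.

    pos     -- 1-based POS as written in the VCF (ASSUMED to be the first affected base)
    ref/alt -- REF and ALT strings; a lone "N" is ASSUMED to mean "empty allele"
    ref_seq -- sequence of the whole chromosome as a plain string
    """
    if pos < 2:
        raise ValueError("no reference base to the left of position 1")

    # Treat the placeholder "N" as an empty allele.
    ref_core = "" if ref == "N" else ref
    alt_core = "" if alt == "N" else alt
    if ref_core == alt_core:
        raise ValueError("REF and ALT are identical after removing the placeholder")

    # Guard: a non-empty REF must agree with the reference at POS.
    start = pos - 1  # 0-based index of POS
    if ref_core and ref_seq[start:start + len(ref_core)].upper() != ref_core.upper():
        raise ValueError("REF does not match the reference at POS")

    # Base immediately to the left of POS (0-based index pos - 2).
    anchor = ref_seq[pos - 2]
    return pos - 1, anchor + ref_core, anchor + alt_core


# Toy chromosome, not real data.
chrom_seq = "ACGTACGT"
print(pad_record(4, "N", "GG", chrom_seq))   # insertion record
print(pad_record(4, "TAC", "N", chrom_seq))  # deletion record
```

INFO fields such as END and SVLEN are left untouched here; whether they need adjusting depends on how sniffles2 defines POS, which should be checked against real output before using anything like this.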

Thanks,
Jana

@defendant602

defendant602 commented Oct 12, 2023

Got the same problem: the REF allele of insertions and the ALT allele of deletions are always N (sniffles version 2.2).

@yaningyang

sniffles contains an optional parameter --reference
--reference reference.fasta (Optional) Reference sequence the reads were aligned against. To enable output of deletion SV sequences, this parameter must be set. (default: None)

@nextgenusfs

nextgenusfs commented Dec 6, 2023

--reference does not fix the issue in v2.2; the reference alleles are still "N". This seems to be incorrect according to the VCF 4.2 spec. To fix it, it seems the position should be moved "left" by 1 bp and the base at that position used as the ref allele for INS and the alt allele for DEL. Here I'm trying to call variants from a de novo assembly against the reference.

$ sniffles --version
Sniffles2, Version 2.2

$ minimap2 -ax asm5 genome.fasta query-genome.fasta | samtools sort -o sniff.bam - 

$ sniffles -i sniff.bam -t 1 --no-qc --reference genome.fasta -v sniff.vcf

And then here is an example of an INS where ref allele is N. Note I've shortened the ALT allele sequence here for readability.

chr3	84234	Sniffles2.INS.1S1	N	ATAA...AATTC	60	PASS	PRECISE;SVTYPE=INS;SVLEN=4658;END=84234;SUPPORT=1;COVERAGE=1,1,1,1,1;STRAND=+;AF=1.000;STDEV_LEN=0;STDEV_POS=0;SUPPORT_LONG=0	GT:GQ:DR:DV	1/1:5:0:2

And then also an example of a DEL where ALT == 'N' (shortened the REF allele here for readability).

chr6    404818  Sniffles2.DEL.6S6       CCG...GCGA      N       60      SUPPORT_MIN     PRECISE;SVTYPE=DEL;SVLEN=-966;END=405784;SUPPORT=1;COVERAGE=1,1,1,1,1;STRAND=+;AF=1.000;STDEV_LEN=0;STDEV_POS=0      GT:GQ:DR:DV     1/1:2:0:1
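
To see how many records in a VCF are affected, something like the following sketch could be used. The function name `find_placeholder_alleles` and the toy records are hypothetical. It assumes tab-separated VCF data lines and an SVTYPE entry in INFO, as in the records above.

```python
def find_placeholder_alleles(vcf_lines):
    """Return (chrom, pos, id, field) for INS records with REF == "N"
    and DEL records with ALT == "N"."""
    hits = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF data line
        chrom, pos, rec_id, ref, alt = fields[:5]

        # Pull SVTYPE out of the INFO column, if present.
        svtype = None
        for entry in fields[7].split(";"):
            if entry.startswith("SVTYPE="):
                svtype = entry[len("SVTYPE="):]

        if svtype == "INS" and ref == "N":
            hits.append((chrom, int(pos), rec_id, "REF"))
        elif svtype == "DEL" and alt == "N":
            hits.append((chrom, int(pos), rec_id, "ALT"))
    return hits


# Toy records, not taken from a real callset.
records = [
    "##fileformat=VCFv4.2",
    "chr1\t5\tINS.1\tN\tACGT\t60\tPASS\tSVTYPE=INS;SVLEN=4",
    "chr1\t9\tDEL.1\tACG\tN\t60\tPASS\tSVTYPE=DEL;SVLEN=-3",
    "chr1\t20\tSNV.1\tA\tG\t60\tPASS\t.",
]
for hit in find_placeholder_alleles(records):
    print(hit)
```

For a real file, the lines can be read with `open("sniff.vcf")` and passed in directly; compressed VCFs would need `gzip.open(..., "rt")` instead.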

@ethering

I wonder if this is the same issue that I raised for Survivor, which perhaps would be better served here:
fritzsedlazeck/SURVIVOR#202
