Corrupted sequences in Fauna #16
Comments
There's at least one other corrupted strain - this one in the h5nx mp segment:
A simplistic check for segments which have part of their strain in the first 100 nucs identifies one more, so in total there are:
data/h5n1_pb1.fasta
>A/egret/Korea/22WC603/2023
aegretkoreawcpbatggatgtcaatccgactttacttttcttaaaagtgccagcgcaaaatgccataagtaccacattcccttatactggagatcctc
data/h5n1_pb2.fasta
>A/egret/Korea/22WC603/2023
aegretkoreawcpbatggagagaataaaagaactaagagatttaatgtcgcagtctcgcactcgcgagatactgacaaaaaccactgtggaccatatgg
data/h5nx_mp.fasta
>A/chicken/Poland/003/2020
achickenpolandhnmpagatattgaaagatgagtcttctaaccgaggtcgaaacgtacgttctctctatcgtcccgtcaggccccctcaaagccgaga
data/h5nx_pb1.fasta
>A/egret/Korea/22WC603/2023
aegretkoreawcpbatggatgtcaatccgactttacttttcttaaaagtgccagcgcaaaatgccataagtaccacattcccttatactggagatcctc
data/h5nx_pb2.fasta
>A/egret/Korea/22WC603/2023
aegretkoreawcpbatggagagaataaaagaactaagagatttaatgtcgcagtctcgcactcgcgagatactgacaaaaaccactgtggaccatatgg
data/h9n2_pb2.fasta
>A/environment/Wuxi/2505/2014
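aenvironmentwuxihnatgggaagaataaaagaactaagagatttgatgtcacagtctcgcactcgcgagatactgacaaaaacaacagtggaccata
Checked via

    import sys
    import re

    # Strain-name words made up only of a/t/g/c are skipped, since they
    # could match genuine sequence by chance.
    r = re.compile('^[atgc]+$')
    words = []
    with open(sys.argv[1]) as fh:
        for line in fh:
            if line.startswith('>'):
                # Header line: keep the strain name and its longer words
                strain = line.lstrip('>').split('|')[0].strip()
                words = [w.lower() for w in strain.split('/') if len(w) > 3 and not r.match(w.lower())]
            else:
                # Sequence line: flag it if any strain word appears in the first 100 nucs
                if any(word in line[0:100] for word in words):
                    print(f">{strain}\n{line[0:100]}")

Then run via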
Interesting... I suspect this corruption originally came from GISAID. I don't see how this could have come from the fauna upload process. If they are correct in GISAID, then they can be re-uploaded with the
Strains excluded in d6f8330
Fauna output has (what I presume is) corrupted data for certain sequences such as:
(Notice the start of the nuc sequence includes parts of the strain name)
This data raises no errors during the parse, filter, align steps but IQ-TREE will crash with a warning:
I observed this while running builds on AWS to test #11 and I can reproduce locally as well (it's somewhat stochastic, as it depends on whether the sequence makes it through filtering). I think this may be the only such strain, but it is corrupted for pb1 and pb2.
I suggest we add it to the exclude list, unless others have more knowledge here?
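As a possible complement to the exclude list, here is a rough sketch of a pre-flight check that could run before the tree step. It does not look for the strain name at all. It flags any record whose sequence contains characters outside the IUPAC nucleotide alphabet (the corrupted prefixes above contain letters such as `e`, `o` and `p`). The function names and the `VALID` set are made up for illustration and are not part of fauna or the build. A corrupted prefix that happens to use only IUPAC letters would slip through, so this is a cheap first filter and not a replacement for the strain-name check.

```python
import sys

# Assumption: IUPAC nucleotide codes plus the gap character are the only
# characters expected in fauna FASTA output; anything else is suspect.
VALID = set("acgturyswkmbdhvn-")


def read_fasta(lines):
    """Yield (header, sequence) pairs; tolerates sequences wrapped over lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def invalid_characters(sequence):
    """Return the sorted list of characters that are not in VALID."""
    return sorted(set(sequence.lower()) - VALID)


def find_suspect_records(lines):
    """Return [(strain, bad_chars)] for records with unexpected characters."""
    suspects = []
    for header, seq in read_fasta(lines):
        bad = invalid_characters(seq)
        if bad:
            # Strain name is taken as the first '|'-separated header field
            suspects.append((header.split("|")[0], bad))
    return suspects


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as fh:
        for strain, bad in find_suspect_records(fh):
            print(f"{strain}\t{''.join(bad)}")
```

Run against one of the files listed above (for example `data/h5n1_pb1.fasta`), it should print one line per suspect strain along with the offending characters. Those lines could then be reviewed by hand before adding anything to the exclude list.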