Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Corrupted sequences in Fauna #16

Closed
jameshadfield opened this issue Apr 24, 2024 · 4 comments
Closed

Corrupted sequences in Fauna #16

jameshadfield opened this issue Apr 24, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@jameshadfield
Copy link
Member

jameshadfield commented Apr 24, 2024

Fauna output has (what I presume is) corrupted data for certain sequences such as:

>A/egret/Korea/22WC603/2023|avian_flu|EPIEPI2738274|2023-03-06|japan_korea|south_korea|south_korea|south_korea|avian|?|h5n1|national_institute_of_wildlife_disease_control_and_prevention(niwdc)|national_institute_of_wildlife_disease_control_and_prevention|?|?|2.3.4.4b|?
aegretkoreawcpbatggatgtcaatccgactttacttttcttaaaagtgccagcgcaaaatgccataagtacc...

(Notice the start of the nuc sequence includes parts of the strain name)

This data raises no errors during the parse, filter, align steps but IQ-TREE will crash with a warning:

ERROR: Sequence A_DELIM-BZEZHVQHONIWPWOQAAAV_egret_DELIM-BZEZHVQHONIWPWOQAAAV_Korea_DELIM-BZEZHVQHONIWPWOQAAAV_22WC603_DELIM-BZEZHVQHONIWPWOQAAAV_2023 has invalid character E at site 17
...

I observed this while running builds on AWS to test #11 and I can reproduce locally as well (it's somewhat stochastic as it depends on if the sequence makes it though filtering. I think this may be the only such strain, but it is corrupted for pb1 and pb2.

I suggest we add it to the exclude list, unless others have more knowledge here?

@jameshadfield jameshadfield added the bug Something isn't working label Apr 24, 2024
@jameshadfield jameshadfield changed the title Corrupted sequences in Faunta Corrupted sequences in Fauna Apr 24, 2024
@jameshadfield
Copy link
Member Author

There's at least one other corrupted strain - this one in the h5nx mp segment:

>A/chicken/Poland/003/2020|avian_flu|EPI1800021|2020-01-02|europe|poland|poland|poland|avian|domestic|h5n8|national_veterinary_research_institut_poland,_piwet_pib|national_veterinary_research_institut_poland,_piwet_pib|E. Swieton, K. Smietanka|?|2.3.4.4b|?
achickenpolandhnmpagatattgaaagatgagtcttctaaccgaggtcgaaacgtacgttct...

@jameshadfield
Copy link
Member Author

A simplistic check for segments which have part of their strain in the first 100 nucs identifies one more, so in total there are:

data/h5n1_pb1.fasta
>A/egret/Korea/22WC603/2023
aegretkoreawcpbatggatgtcaatccgactttacttttcttaaaagtgccagcgcaaaatgccataagtaccacattcccttatactggagatcctc

data/h5n1_pb2.fasta
>A/egret/Korea/22WC603/2023
aegretkoreawcpbatggagagaataaaagaactaagagatttaatgtcgcagtctcgcactcgcgagatactgacaaaaaccactgtggaccatatgg

data/h5nx_mp.fasta
>A/chicken/Poland/003/2020
achickenpolandhnmpagatattgaaagatgagtcttctaaccgaggtcgaaacgtacgttctctctatcgtcccgtcaggccccctcaaagccgaga

data/h5nx_pb1.fasta
>A/egret/Korea/22WC603/2023
aegretkoreawcpbatggatgtcaatccgactttacttttcttaaaagtgccagcgcaaaatgccataagtaccacattcccttatactggagatcctc

data/h5nx_pb2.fasta
>A/egret/Korea/22WC603/2023
aegretkoreawcpbatggagagaataaaagaactaagagatttaatgtcgcagtctcgcactcgcgagatactgacaaaaaccactgtggaccatatgg

data/h9n2_pb2.fasta
>A/environment/Wuxi/2505/2014
aenvironmentwuxihnatgggaagaataaaagaactaagagatttgatgtcacagtctcgcactcgcgagatactgacaaaaacaacagtggaccata

Checked via check-fauna.py:

import sys
import re
r = re.compile('^[atgc]+$')
with open(sys.argv[1]) as fh:
    for line in fh:
        if line.startswith('>'):
            strain = line.lstrip('>').split('|')[0]
            words = [w.lower() for w in strain.split('/') if len(w)>3 and not r.match(w.lower())]
        else:
            if any([word in line[0:100] for word in words]):
                print(f">{strain}\n{line[0:100]}")

Then run via for i in data/*; do echo $i && python check-fauna.py $i; done

@joverlee521
Copy link
Contributor

Interesting...I suspect this corruption originally come from GISAID. I don't see how this could have come from the fauna upload process.

If they are correct in GISAID, then they can be re-uploaded with the --overwrite flag to update the existing sequence.

@jameshadfield
Copy link
Member Author

Strains excluded in d6f8330

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants