Corrupted sequences in Fauna #16

jameshadfield · 2024-04-24T08:08:29Z

Fauna output has (what I presume is) corrupted data for certain sequences such as:

>A/egret/Korea/22WC603/2023|avian_flu|EPIEPI2738274|2023-03-06|japan_korea|south_korea|south_korea|south_korea|avian|?|h5n1|national_institute_of_wildlife_disease_control_and_prevention(niwdc)|national_institute_of_wildlife_disease_control_and_prevention|?|?|2.3.4.4b|?
aegretkoreawcpbatggatgtcaatccgactttacttttcttaaaagtgccagcgcaaaatgccataagtacc...

(Notice the start of the nuc sequence includes parts of the strain name)

This data raises no errors during the parse, filter, align steps but IQ-TREE will crash with a warning:

ERROR: Sequence A_DELIM-BZEZHVQHONIWPWOQAAAV_egret_DELIM-BZEZHVQHONIWPWOQAAAV_Korea_DELIM-BZEZHVQHONIWPWOQAAAV_22WC603_DELIM-BZEZHVQHONIWPWOQAAAV_2023 has invalid character E at site 17
...

I observed this while running builds on AWS to test #11 and I can reproduce locally as well (it's somewhat stochastic as it depends on if the sequence makes it though filtering. I think this may be the only such strain, but it is corrupted for pb1 and pb2.

I suggest we add it to the exclude list, unless others have more knowledge here?

The text was updated successfully, but these errors were encountered:

jameshadfield · 2024-04-24T08:20:57Z

There's at least one other corrupted strain - this one in the h5nx mp segment:

>A/chicken/Poland/003/2020|avian_flu|EPI1800021|2020-01-02|europe|poland|poland|poland|avian|domestic|h5n8|national_veterinary_research_institut_poland,_piwet_pib|national_veterinary_research_institut_poland,_piwet_pib|E. Swieton, K. Smietanka|?|2.3.4.4b|?
achickenpolandhnmpagatattgaaagatgagtcttctaaccgaggtcgaaacgtacgttct...

jameshadfield · 2024-04-24T09:21:54Z

A simplistic check for segments which have part of their strain in the first 100 nucs identifies one more, so in total there are:

data/h5n1_pb1.fasta
>A/egret/Korea/22WC603/2023
aegretkoreawcpbatggatgtcaatccgactttacttttcttaaaagtgccagcgcaaaatgccataagtaccacattcccttatactggagatcctc

data/h5n1_pb2.fasta
>A/egret/Korea/22WC603/2023
aegretkoreawcpbatggagagaataaaagaactaagagatttaatgtcgcagtctcgcactcgcgagatactgacaaaaaccactgtggaccatatgg

data/h5nx_mp.fasta
>A/chicken/Poland/003/2020
achickenpolandhnmpagatattgaaagatgagtcttctaaccgaggtcgaaacgtacgttctctctatcgtcccgtcaggccccctcaaagccgaga

data/h5nx_pb1.fasta
>A/egret/Korea/22WC603/2023
aegretkoreawcpbatggatgtcaatccgactttacttttcttaaaagtgccagcgcaaaatgccataagtaccacattcccttatactggagatcctc

data/h5nx_pb2.fasta
>A/egret/Korea/22WC603/2023
aegretkoreawcpbatggagagaataaaagaactaagagatttaatgtcgcagtctcgcactcgcgagatactgacaaaaaccactgtggaccatatgg

data/h9n2_pb2.fasta
>A/environment/Wuxi/2505/2014
aenvironmentwuxihnatgggaagaataaaagaactaagagatttgatgtcacagtctcgcactcgcgagatactgacaaaaacaacagtggaccata

Checked via check-fauna.py:

import sys
import re
r = re.compile('^[atgc]+$')
with open(sys.argv[1]) as fh:
    for line in fh:
        if line.startswith('>'):
            strain = line.lstrip('>').split('|')[0]
            words = [w.lower() for w in strain.split('/') if len(w)>3 and not r.match(w.lower())]
        else:
            if any([word in line[0:100] for word in words]):
                print(f">{strain}\n{line[0:100]}")

Then run via for i in data/*; do echo $i && python check-fauna.py $i; done

joverlee521 · 2024-04-24T17:28:47Z

Interesting...I suspect this corruption originally come from GISAID. I don't see how this could have come from the fauna upload process.

If they are correct in GISAID, then they can be re-uploaded with the --overwrite flag to update the existing sequence.

jameshadfield · 2024-04-24T21:24:23Z

Strains excluded in d6f8330

jameshadfield added the bug Something isn't working label Apr 24, 2024

jameshadfield changed the title ~~Corrupted sequences in Faunta~~ Corrupted sequences in Fauna Apr 24, 2024

jameshadfield closed this as completed Apr 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Corrupted sequences in Fauna #16

Corrupted sequences in Fauna #16

jameshadfield commented Apr 24, 2024 •

edited

Loading

jameshadfield commented Apr 24, 2024

jameshadfield commented Apr 24, 2024

joverlee521 commented Apr 24, 2024

jameshadfield commented Apr 24, 2024

Corrupted sequences in Fauna #16

Corrupted sequences in Fauna #16

Comments

jameshadfield commented Apr 24, 2024 • edited Loading

jameshadfield commented Apr 24, 2024

jameshadfield commented Apr 24, 2024

joverlee521 commented Apr 24, 2024

jameshadfield commented Apr 24, 2024

jameshadfield commented Apr 24, 2024 •

edited

Loading