UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb1 #17

LynnLy · 2022-07-13T18:04:13Z

Hi there,

I used Hapo-G on my genome, starting with a bam file created by bwa-mem2. I installed Hapo-G via conda. My command was:

hapog.py \
    --genome flye/assembly.fasta \
    -b flye/paired.mapping.bam \
    -o 1_hapo_g/ -t 5 -u

The resulting file in hapog_results/hapog.fasta mostly contains the polished sequences, but it has some nonsense lines at the ends of some contigs which break downstream analysis. For example:

GAAGTGCTCAAGGTCCCTTCTTTATACTCCACCACTCTCGTGTTTATCGTCCCGAACCTT
:
00887:409:HN3LMDRXY:2:2259:2293:9940TC!<90><B7><B2><F5>MU <B1><86>^A

or

GGATGCTCGTTTGTCCATTGTTTGTCCATCTAAG<B1><86>^A

The error message was:

Traceback (most recent call last):
  File "/home/OXFORDNANOLABS/lly/miniconda3/envs/hapog/bin/hapog.py", line 161, in <module>
    pipeline.include_unpolished(args.input_genome)
  File "/mmfs1/uslinuxhome/lly/miniconda3/envs/hapog/bin/lib/pipeline.py", line 246, in include_unpolished
    for line in open("hapog_results/hapog.fasta"):
  File "/home/OXFORDNANOLABS/lly/miniconda3/envs/hapog/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb1 in position 3574: invalid start byte
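
For anyone hitting the same traceback, the offending byte can be located without triggering the error by reading the file in binary mode. The following is a minimal sketch, not part of Hapo-G: the function name `first_invalid_utf8` and the `context` parameter are hypothetical, and the whole file is read into memory, which assumes it is small enough (for example a single chunk FASTA).

```python
def first_invalid_utf8(path, context=20):
    """Return (offset, nearby_bytes) for the first byte that is not valid
    UTF-8, or None if the whole file decodes cleanly.

    Hypothetical helper for illustration; not part of Hapo-G.
    """
    # Binary mode: no decoding happens while reading, so nothing can raise here.
    with open(path, "rb") as handle:
        data = handle.read()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the byte offset reported as "position N" in the traceback.
        start = max(0, err.start - context)
        return err.start, data[start:err.start + context]
    return None
```

Calling `first_invalid_utf8("hapog_results/hapog.fasta")` returns the byte offset together with the raw bytes around it, which makes it easier to see what precedes the corruption.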


bistace · 2022-07-18T06:19:52Z

Hello,

Just to let you know, I am working on this issue and will get back to you shortly with a possible fix.

bistace · 2022-07-19T09:15:14Z

Hello,

I couldn't replicate your issue but still tried to push a potential fix. Could you please update your conda installation to Hapo-G 1.3.1 and rerun your command?

LynnLy · 2022-07-20T23:40:56Z

I updated to 1.3.1 and have a slightly different error now. It looks like the fasta files under hapog_chunks (e.g. hapog_chunks/chunks_2.fasta) contain the invalid text, but the fasta files under chunks do not.

Traceback (most recent call last):
  File "/home/OXFORDNANOLABS/lly/miniconda3/envs/hapog/bin/hapog.py", line 152, in <module>
    pipeline.merge_results(int(args.threads))
  File "/mmfs1/uslinuxhome/lly/miniconda3/envs/hapog/bin/lib/pipeline.py", line 189, in merge_results
    shutil.copyfileobj(fd, out)
  File "/home/OXFORDNANOLABS/lly/miniconda3/envs/hapog/lib/python3.10/shutil.py", line 195, in copyfileobj
    buf = fsrc_read(length)
  File "/home/OXFORDNANOLABS/lly/miniconda3/envs/hapog/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8f in position 15018: invalid start byte

Here is an example of the weird lines. Most of it looks like gibberish to me, but "A00887:498:H3VLKDRX2:1:2137:3631:19304" is one of my read names. Please let me know if it would be helpful to send you a snippet of the fasta or other files!

GATTTATCGAACGCAGAGTGGCTGATCGTGTGCCTTTCCCCCTGTTAAACATGTTGGAAT
CTATC0TAATG9AACAAGAGCXTTGTTTCAAGTTTGCAATAAAGATAATCTATTGATTCC
TCTTTCTGAAGGCATCATCAAACGGGTTATTTGTAAACGTGCTTTTTTGTTTCT<88>CCTTC
TTTCCTGTTTTCCTTCTTH<88>^T^QD^Q^TH^X(B"^X^R<82>H<82><82>!^X!<84>^XA^Q^Q^QA^X^Q^R^Q<84>"^R<81>^Q^Q^X<88>((
<82><88>!<81>!^Q"<82>(<88><88>^X<82>^X<81>!^X^X<88>%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%^Y%%%%%%%%^Y%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%^Y%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%^K%%%%%%%%%%%%%%%%%%%%%NMC^EMDZ102^GG3A128MCZ10M1I23M2D
21M1I168M26SASC<CF>XSC<A5>M,3;XSC^^GAG^P^BA F<DB>G<B9>U<B0><A6><DC>G<B9>U<A3>(^C<E9>
^A<B3><EA><D0>`<DB>G<B9>U
<EF>^A^B`0A00887:498:H3VLKDRX2:1:2137:3631:19304!<D0><C5><DB>G<B9>U!P<DC>G<B9>U AA0
0887:498:H3VLKDRX2:1:2230:4878:16548GGCG<B9>U@AA00887:498:H3VLK
DRX2:1:2163:5032:5415@!<C0>1<DC>G<B9>U !<E0><F7><DB>G<B9>UDRX2:1:2!^P<DC><DB>G<B9>U 1A00887
:498:H3VLKDRX2:1:2214:16984:25551!`<A2><DB>G<B9>U Q^A`"<DC>G<B9>U<D8>\^Z<E9>Q+CAGTG
AGCGTTTCTCTTTCATGCTTGGCATAGGCTTTTACGACCTCGACGAGATTGGGAATGAGG

bistace · 2022-07-21T07:37:02Z

If it is possible, could you grant us access to your assembly fasta and fastq reads so we can check ourselves that everything is fine with your files?

Otherwise, there may be unexpected characters in your fasta/fastq files (for example, the ^R looks like a Ctrl+R control character). Could you please check that your input files are correctly formatted?
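
One way to run such a check is to scan the input files in binary mode and report any byte that is not ordinary printable text. This is only a sketch under stated assumptions: the files are uncompressed, "expected" means printable ASCII plus tab, newline and carriage return, and the names `ALLOWED` and `find_unexpected_bytes` are hypothetical rather than anything shipped with Hapo-G.

```python
import string

# Assumption: a well-formed FASTA/FASTQ file holds only printable ASCII
# plus tab, newline and carriage return. Vertical tab (0x0b) and form
# feed (0x0c) are excluded on purpose.
ALLOWED = frozenset(string.printable.encode("ascii")) - {0x0B, 0x0C}


def find_unexpected_bytes(path, max_hits=10):
    """Return up to max_hits tuples of (line_number, [hex byte values])
    for lines containing bytes outside ALLOWED."""
    hits = []
    with open(path, "rb") as handle:
        for line_number, line in enumerate(handle, start=1):
            # Iterating over a bytes object yields integers.
            bad = sorted({byte for byte in line if byte not in ALLOWED})
            if bad:
                hits.append((line_number, [hex(byte) for byte in bad]))
                if len(hits) >= max_hits:
                    break
    return hits
```

An empty list means no suspicious bytes were found. A Ctrl+R character would show up as `0x12`.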

LynnLy · 2022-07-21T22:40:38Z

I was able to subset my data to one contig on which the error is reproducible. Do you have a preferred email address so that I can share the data privately?
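
For readers who want to build a similar single-contig test case, the FASTA side of the subsetting can be sketched in plain Python as below. The function name `extract_contig` and its arguments are hypothetical, and the input is assumed to be a plain-text FASTA whose record ID is the first whitespace-delimited token of the header. The matching reads still have to be subset separately, for example with `samtools view` on an indexed BAM.

```python
def extract_contig(fasta_path, contig_name, out_path):
    """Write the single record whose ID equals contig_name to out_path.

    Returns True if the contig was found, False otherwise.
    Hypothetical helper for illustration; not part of Hapo-G.
    """
    found = False
    writing = False
    with open(fasta_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                if found:
                    # The wanted record has ended; stop reading.
                    break
                fields = line[1:].split()
                writing = bool(fields) and fields[0] == contig_name
                found = writing
            if writing:
                dst.write(line)
    return found
```

For example, `extract_contig("flye/assembly.fasta", "contig_2", "contig_2.fasta")` would copy only the record with that ID, where `contig_2` is a placeholder name.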

bistace · 2022-07-22T05:37:01Z

That's great! You can send it to bistace@genoscope.cns.fr and we will have a look at it.

bistace added a commit that referenced this issue Jul 19, 2022

Potential fix for #17

d7d1c81

bistace closed this as completed Aug 1, 2022
