UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb1 #17

Closed
LynnLy opened this issue Jul 13, 2022 · 6 comments

LynnLy commented Jul 13, 2022

Hi there,

I used Hapo-G on my genome, starting with a bam file created by bwa-mem2. I installed Hapo-G via conda. My command was:

hapog.py \
    --genome flye/assembly.fasta \
    -b flye/paired.mapping.bam \
    -o 1_hapo_g/ -t 5 -u 

The resulting file, hapog_results/hapog.fasta, mostly contains the polished sequences, but some contigs end with nonsense lines that break downstream analysis. For example:

GAAGTGCTCAAGGTCCCTTCTTTATACTCCACCACTCTCGTGTTTATCGTCCCGAACCTT
:
00887:409:HN3LMDRXY:2:2259:2293:9940TC!<90><B7><B2><F5>MU <B1><86>^A

or

GGATGCTCGTTTGTCCATTGTTTGTCCATCTAAG<B1><86>^A

The error message was:

Traceback (most recent call last):
  File "/home/OXFORDNANOLABS/lly/miniconda3/envs/hapog/bin/hapog.py", line 161, in <module>
    pipeline.include_unpolished(args.input_genome)
  File "/mmfs1/uslinuxhome/lly/miniconda3/envs/hapog/bin/lib/pipeline.py", line 246, in include_unpolished
    for line in open("hapog_results/hapog.fasta"):
  File "/home/OXFORDNANOLABS/lly/miniconda3/envs/hapog/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb1 in position 3574: invalid start byte
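
The offending byte can be inspected by reading the file in binary mode and catching the decode error. This is a minimal sketch, assuming the file fits in memory; `find_invalid_utf8` is a hypothetical helper and not part of Hapo-G. The position in the traceback may be relative to the chunk being decoded rather than to the start of the file, so the offset found here can differ from it.

```python
from pathlib import Path


def find_invalid_utf8(data: bytes, context: int = 20):
    """Return (offset, surrounding bytes) of the first undecodable byte, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte that could not be decoded.
        start = max(err.start - context, 0)
        return err.start, data[start:err.start + context]
    return None


if __name__ == "__main__":
    # Path taken from the traceback above; adjust as needed.
    path = Path("hapog_results/hapog.fasta")
    if path.exists():
        print(find_invalid_utf8(path.read_bytes()))
```

The returned slice shows the bytes on either side of the failure, which helps tell a stray character from a larger block of binary data.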
Collaborator

bistace commented Jul 18, 2022

Hello,

Just to let you know that I am working on this issue. I will get back to you shortly with a possible fix.

bistace added a commit that referenced this issue Jul 19, 2022
Collaborator

bistace commented Jul 19, 2022

Hello,

I couldn't replicate your issue, but I pushed a potential fix anyway. Could you please update your conda installation to Hapo-G 1.3.1 and rerun your command?

Author

LynnLy commented Jul 20, 2022

I updated to 1.3.1 and have a slightly different error now. It looks like the fasta files under hapog_chunks (e.g. hapog_chunks/chunks_2.fasta) contain the invalid text, but the fasta files under chunks do not.

Traceback (most recent call last):
  File "/home/OXFORDNANOLABS/lly/miniconda3/envs/hapog/bin/hapog.py", line 152, in <module>
    pipeline.merge_results(int(args.threads))
  File "/mmfs1/uslinuxhome/lly/miniconda3/envs/hapog/bin/lib/pipeline.py", line 189, in merge_results
    shutil.copyfileobj(fd, out)
  File "/home/OXFORDNANOLABS/lly/miniconda3/envs/hapog/lib/python3.10/shutil.py", line 195, in copyfileobj
    buf = fsrc_read(length)
  File "/home/OXFORDNANOLABS/lly/miniconda3/envs/hapog/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8f in position 15018: invalid start byte

An example of the weird lines is below. It mostly looks like gibberish to me, but "A00887:498:H3VLKDRX2:1:2137:3631:19304" is one of my read names. Please let me know if it would be helpful to send you a snippet of the fasta or other files!

GATTTATCGAACGCAGAGTGGCTGATCGTGTGCCTTTCCCCCTGTTAAACATGTTGGAAT
CTATC0TAATG9AACAAGAGCXTTGTTTCAAGTTTGCAATAAAGATAATCTATTGATTCC
TCTTTCTGAAGGCATCATCAAACGGGTTATTTGTAAACGTGCTTTTTTGTTTCT<88>CCTTC
TTTCCTGTTTTCCTTCTTH<88>^T^QD^Q^TH^X(B"^X^R<82>H<82><82>!^X!<84>^XA^Q^Q^QA^X^Q^R^Q<84>"^R<81>^Q^Q^X<88>((
<82><88>!<81>!^Q"<82>(<88><88>^X<82>^X<81>!^X^X<88>%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%^Y%%%%%%%%^Y%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%^Y%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%^K%%%%%%%%%%%%%%%%%%%%%NMC^EMDZ102^GG3A128MCZ10M1I23M2D
21M1I168M26SASC<CF>XSC<A5>M,3;XSC^^GAG^P^BA F<DB>G<B9>U<B0><A6><DC>G<B9>U<A3>(^C<E9>
^A<B3><EA><D0>`<DB>G<B9>U
<EF>^A^B`0A00887:498:H3VLKDRX2:1:2137:3631:19304!<D0><C5><DB>G<B9>U!P<DC>G<B9>U AA0
0887:498:H3VLKDRX2:1:2230:4878:16548GGCG<B9>U@AA00887:498:H3VLK
DRX2:1:2163:5032:5415@!<C0>1<DC>G<B9>U !<E0><F7><DB>G<B9>UDRX2:1:2!^P<DC><DB>G<B9>U 1A00887
:498:H3VLKDRX2:1:2214:16984:25551!`<A2><DB>G<B9>U Q^A`"<DC>G<B9>U<D8>\^Z<E9>Q+CAGTG
AGCGTTTCTCTTTCATGCTTGGCATAGGCTTTTACGACCTCGACGAGATTGGGAATGAGG
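
One way to confirm which chunk files are affected is to scan each file in binary mode for bytes outside plain printable text. This is a rough sketch: the directory names come from the comment above, the `*.fasta` pattern is an assumption, and `has_binary_junk` and `files_with_binary_junk` are hypothetical helpers.

```python
from pathlib import Path

# Bytes expected in a plain-text FASTA: printable ASCII plus newline, CR, tab.
ALLOWED = bytes(range(0x20, 0x7F)) + b"\n\r\t"


def has_binary_junk(data: bytes) -> bool:
    """True if any byte falls outside the allowed plain-text set."""
    # translate() with delete= strips the allowed bytes; any leftovers are junk.
    return bool(data.translate(None, delete=ALLOWED))


def files_with_binary_junk(directory: str, pattern: str = "*.fasta"):
    """Return sorted names of matching files that contain unexpected bytes."""
    return sorted(
        p.name for p in Path(directory).glob(pattern)
        if has_binary_junk(p.read_bytes())
    )


if __name__ == "__main__":
    for d in ("chunks", "hapog_chunks"):
        if Path(d).is_dir():
            print(d, files_with_binary_junk(d))
```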

Collaborator

bistace commented Jul 21, 2022

If it is possible, could you grant us access to your assembly fasta and fastq reads so we can check ourselves that everything is fine with your files?

Otherwise, it seems that there could be unexpected characters in your fasta/fastq files (for example, the ^R looks like a Ctrl+R control character). Could you please check that your input files are correctly formatted?
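
A check of this kind on an input FASTA could look like the sketch below. It assumes nucleotide sequences using IUPAC codes; `invalid_fasta_lines` is a hypothetical helper, and FASTQ files would need separate handling because quality lines use a wider character set.

```python
# IUPAC nucleotide codes (both cases) plus gap/stop symbols sometimes seen in FASTA.
VALID_SEQ = set(b"ACGTUNRYSWKMBDHVacgtunryswkmbdhv-*")


def invalid_fasta_lines(lines):
    """Yield (line_number, offending_bytes) for sequence lines with unexpected bytes."""
    for number, line in enumerate(lines, start=1):
        line = line.rstrip(b"\r\n")
        if not line or line.startswith(b">"):
            continue  # skip blank lines and headers
        bad = bytes(b for b in line if b not in VALID_SEQ)
        if bad:
            yield number, bad


if __name__ == "__main__":
    import sys

    # Usage: python check_fasta.py assembly.fasta [more.fasta ...]
    for path in sys.argv[1:]:
        with open(path, "rb") as handle:
            for number, bad in invalid_fasta_lines(handle):
                print(f"{path}:{number}: unexpected bytes {bad!r}")
```

Reading in binary mode means the check itself cannot fail with a decode error on the bytes it is trying to find.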

Author

LynnLy commented Jul 21, 2022

I was able to subset my data to one contig on which the error is reproducible. Do you have a preferred email address so that I can share the data privately?
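
The thread does not say how the subset was made. As an illustration only, pulling a single contig out of an assembly FASTA could be sketched as follows; `extract_contig` and the contig names are hypothetical, and the matching reads would also need to be subset, which is not shown.

```python
def extract_contig(lines, name):
    """Yield the FASTA lines of the record whose header ID equals `name`."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            # The record ID is the first whitespace-delimited token after '>'.
            fields = line[1:].split()
            keep = bool(fields) and fields[0] == name
        if keep:
            yield line


if __name__ == "__main__":
    demo = [">contig_1\n", "ACGT\n", ">contig_2\n", "GGGG\n"]
    print("".join(extract_contig(demo, "contig_2")), end="")
```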

Collaborator

bistace commented Jul 22, 2022

That's great! You can send it to bistace@genoscope.cns.fr and we will have a look at it.

bistace closed this as completed Aug 1, 2022