some images translated to text using Tesseract 4 throw an error regarding "contains_unichar_id" #1205

Closed
sallyhill opened this issue Nov 9, 2017 · 46 comments

Comments

@sallyhill

sallyhill commented Nov 9, 2017

Environment

  • Tesseract Version: 4.00.00alpha
  • Commit Number: I used brew install tesseract --HEAD to install
  • Platform: 15.6.0 Darwin Kernel Version 15.6.0: Mon Oct 2 22:20:08 PDT 2017; root:xnu-3248.71.4~1/RELEASE_X86_64 x86_64 (osx)
  • Files affected: bptbo, jithy, lrggj, ouifv

Current Behavior:

Converting the attached images to a string raises a TesseractError that prints: (-6, 'contains_unichar_id(unichar_id):Error:Assert failed:in file ../ccutil/unicharset.h, line 513')

Expected Behavior:

No error.

Suggested Fix:

I am not sure. Right now I'm just running pytesseract.image_to_string in a try block
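
The try-block workaround could be sketched roughly as follows. This is only an illustration: the helper name `ocr_batch`, the injectable `ocr` callable, and the `.png` file names in the usage note are assumptions, not something taken from the report. It relies on pytesseract raising `TesseractError` (a `RuntimeError` subclass) when the tesseract process aborts.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple


def _default_ocr(path: str) -> str:
    # Imported lazily so the helper can be exercised without pytesseract installed.
    import pytesseract

    # image_to_string accepts a file path as well as an image object.
    return pytesseract.image_to_string(path)


def ocr_batch(
    paths: Iterable[str],
    ocr: Optional[Callable[[str], str]] = None,
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (texts, failures); failures maps each failing path to its error text."""
    run = ocr or _default_ocr
    texts: Dict[str, str] = {}
    failures: Dict[str, str] = {}
    for path in paths:
        try:
            texts[path] = run(path)
        except RuntimeError as exc:
            # pytesseract.TesseractError derives from RuntimeError.
            failures[path] = str(exc)
    return texts, failures
```

Calling `ocr_batch(["bptbo.png", "jithy.png"])` (hypothetical file names) keeps the batch running and collects the failing images separately, so they can be attached to a report like this one.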

@sallyhill
Author

A couple thousand other images were processed without problems; only these images triggered the error.

@psinger

psinger commented Dec 18, 2017

Any solution to that? Having similar issues.

@amitdo
Collaborator

amitdo commented Jan 8, 2018

@sallyhill , @psinger
Does this happen with:

  • --oem 1
  • --oem 0

?
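
One way to run this check over an image is sketched below. The function names and the image name in the test are hypothetical; the `--oem` values follow the list printed by `tesseract --help-oem`, and `stdout` is used as the output base so no files are written.

```python
import subprocess
from typing import Dict, Iterable, List

# Engine modes as listed by `tesseract --help-oem`.
OEM_MODES = {
    0: "legacy engine only",
    1: "LSTM engine only",
    2: "legacy + LSTM engines",
    3: "default, based on what is available",
}


def build_command(image: str, oem: int) -> List[str]:
    """Build the tesseract command line for one engine mode."""
    if oem not in OEM_MODES:
        raise ValueError("unknown OCR engine mode: %r" % (oem,))
    return ["tesseract", image, "stdout", "--oem", str(oem)]


def try_modes(image: str, modes: Iterable[int] = (0, 1, 2, 3)) -> Dict[int, int]:
    """Run tesseract once per engine mode and return the exit code of each run."""
    results: Dict[int, int] = {}
    for oem in modes:
        proc = subprocess.run(
            build_command(image, oem),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        results[oem] = proc.returncode
    return results
```

A non-zero (on Linux and macOS typically negative) exit code for one mode while the others return 0 points at the engine combination rather than at the image itself.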

@psinger

psinger commented Jan 9, 2018

@amitdo

I just tried it, and it works with both options.

Any idea what's going on?

@amitdo
Collaborator

amitdo commented Jan 9, 2018

Seems like a bug in combining the two OCR engines.

@psinger

psinger commented Jan 9, 2018

Any way to track this down further?

@amitdo
Collaborator

amitdo commented Jan 9, 2018

You can use GDB to see the function call chain.
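
For anyone who wants to try this, a minimal sketch of a non-interactive GDB run follows. The helper names are made up for illustration; the GDB options used (`-batch`, `-ex`, `--args`) are standard, and a build with debug symbols is assumed to be needed for a readable trace.

```python
import subprocess
from typing import List


def gdb_backtrace_command(image: str, *extra_args: str) -> List[str]:
    """Build a gdb invocation that runs tesseract and prints a backtrace when it stops."""
    return [
        "gdb", "-batch",
        "-ex", "run",  # start the program under gdb
        "-ex", "bt",   # print the call chain once it stops on the failed assert
        "--args", "tesseract", image, "stdout",
    ] + list(extra_args)


def capture_backtrace(image: str, *extra_args: str) -> str:
    """Run gdb and return everything it printed, including the backtrace."""
    proc = subprocess.run(
        gdb_backtrace_command(image, *extra_args),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return proc.stdout
```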

Frankly, I only use --oem 1 (or 3 with best/fast traineddata), so I'm not so motivated to invest time in this issue. Sorry.

@syzer

syzer commented Jan 22, 2018

👍

@stweil
Contributor

stweil commented Feb 8, 2018

I get the reported assertion with the second image (all other images work for me) and will have a look.

@amitdo
Collaborator

amitdo commented Feb 8, 2018

@stweil,

Same assert was reported in:
#1154 #1177 #1181 #1222 #1223 #1232 #1237 #1307
Also see PR #1286

@Shreeshrii
Collaborator

@stweil #1423

@Shreeshrii
Collaborator

New report #1601

@Shreeshrii
Collaborator

@zdenop Please label as bug.

@zdenop zdenop added the bug label May 25, 2018
@ghost

ghost commented May 31, 2018

Have you found any solution for this? My PDF contains both Arabic and English, and I'm facing the same issue:
contains_unichar_id(unichar_id):Error:Assert failed:in file c:\projects\github\tesseract-ocr\src\ccutil\unicharset.h, line 511
Exception in thread "main" java.lang.Error: Invalid memory access
at com.sun.jna.Native.invokePointer(Native Method)
at com.sun.jna.Function.invokePointer(Function.java:470)
at com.sun.jna.Function.invoke(Function.java:404)
at com.sun.jna.Function.invoke(Function.java:315)
at com.sun.jna.Library$Handler.invoke(Library.java:212)
at com.sun.proxy.$Proxy1.TessBaseAPIGetUTF8Text(Unknown Source)
at net.sourceforge.tess4j.Tesseract.getOCRText(Tesseract.java:433)
at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:288)
at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:209)
at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:193)

@syzer

syzer commented May 31, 2018

Yeah, I made a patch for it that removes this assert. It works okay-ish, but it doesn't really solve the issue.

@ghost

ghost commented Jun 1, 2018

Thanks, syzer. Where can I get the patch? Please share it.
Could you also guide me on preparing trained data?
Regards

@Shreeshrii
Collaborator

Please see #1286 for the patch.

It has not been merged yet.

If you try it please provide feedback.

@ghost

ghost commented Jun 1, 2018

Please publish a standard jar file so that we can try it. Could you also guide me on creating a traineddata file?

thanks

@danablanc

Hi.
I have the same issue, using Tesseract Open Source OCR Engine vv4.0.0-beta.1.20180608 with Leptonica for Windows. How can I get this patch?

@Shugyousha

Shugyousha commented Aug 9, 2018

I can reproduce this, and since I haven't seen a stack trace for it yet, I will post the one I have:

contains_unichar_id(unichar_id):Error:Assert failed:in file ../../src/ccutil/unicharset.h, line 511

Thread 1 "tesseract" received signal SIGSEGV, Segmentation fault.
ERRCODE::error (this=this@entry=0x7ffff7d774c8 <_ZL13ASSERT_FAILED>, caller=caller@entry=0x7ffff7874630 "contains_unichar_id(unichar_id)", acti
    format=format@entry=0x7ffff7871e41 "in file %s, line %d") at errcode.cpp:86
86            if (!*p)
(gdb) bt
#0  ERRCODE::error (this=this@entry=0x7ffff7d774c8 <_ZL13ASSERT_FAILED>, caller=caller@entry=0x7ffff7874630 "contains_unichar_id(unichar_id)", action=action@entry=ABORT,
    format=format@entry=0x7ffff7871e41 "in file %s, line %d") at errcode.cpp:86
#1  0x00007ffff77c5ef4 in UNICHARSET::get_isdigit (unichar_id=297, this=0x5555559ac990) at ../../src/ccutil/unicharset.h:511
#2  tesseract::Dict::char_for_dawg (dawg=0x555556c3f2d0, ch=297, this=0x555555dfb120) at dict.h:435
#3  tesseract::Dict::def_letter_is_okay(void*, int, bool) const () at dict.cpp:413
#4  0x00007ffff77c624e in tesseract::Dict::valid_word(WERD_CHOICE const&, bool) const () at ../../src/ccstruct/ratngs.h:314
#5  0x00007ffff76c437b in tesseract::Tesseract::recog_word(WERD_RES*) () at tfacepp.cpp:69
#6  0x00007ffff76c1ed3 in tesseract::Tesseract::tess_segment_pass_n (this=this@entry=0x7ffff7fd2010, pass_n=pass_n@entry=1, word=word@entry=0x55555ad33a20) at tessbox.cpp:48
#7  0x00007ffff7674b8e in tesseract::Tesseract::match_word_pass_n(int, WERD_RES*, ROW*, BLOCK*) () at control.cpp:1644
#8  0x00007ffff7674d89 in tesseract::Tesseract::classify_word_pass1 (this=0x7ffff7fd2010, word_data=..., in_word=0x55555acd0780, out_words=<optimized out>)
    at control.cpp:1450
#9  0x00007ffff7676114 in tesseract::Tesseract::RetryWithLanguage(tesseract::WordData const&, void (tesseract::Tesseract::*)(tesseract::WordData const&, WERD_RES**, tesseract::PointerVector<WERD_RES>*), bool, WERD_RES**, tesseract::PointerVector<WERD_RES>*) () at control.cpp:923
#10 0x00007ffff7676944 in tesseract::Tesseract::classify_word_and_language(int, PAGE_RES_IT*, tesseract::WordData*) () at ../../src/ccutil/genericvector.h:716
#11 0x00007ffff767a189 in tesseract::Tesseract::RecogAllWordsPassN(int, ETEXT_DESC*, PAGE_RES_IT*, GenericVector<tesseract::WordData>*) () at control.cpp:276
#12 0x00007ffff767ba43 in tesseract::Tesseract::recog_all_words(PAGE_RES*, ETEXT_DESC*, TBOX const*, char const*, int) () at control.cpp:369
#13 0x00007ffff7663c6e in tesseract::TessBaseAPI::Recognize(ETEXT_DESC*) () at baseapi.cpp:907
#14 0x00007ffff7664002 in tesseract::TessBaseAPI::ProcessPage (this=this@entry=0x5555557592c0 <main::api>, pix=0x55555598a720, page_index=page_index@entry=0,
    filename=filename@entry=0x7fffffffe5fa "0003.jpg", retry_config=retry_config@entry=0x0, timeout_millisec=timeout_millisec@entry=0, renderer=0x555555983800)
    at baseapi.cpp:1217
#15 0x00007ffff7666fe9 in tesseract::TessBaseAPI::ProcessPagesInternal(char const*, char const*, int, tesseract::TessResultRenderer*) () at baseapi.cpp:1169
#16 0x00007ffff766711e in tesseract::TessBaseAPI::ProcessPages (this=this@entry=0x5555557592c0 <main::api>, filename=filename@entry=0x7fffffffe5fa "0003.jpg",
    retry_config=retry_config@entry=0x0, timeout_millisec=timeout_millisec@entry=0, renderer=<optimized out>) at baseapi.cpp:1070
#17 0x0000555555556c73 in main () at ../../src/ccutil/genericvector.h:716
#18 0x00007ffff67ff06b in __libc_start_main () from /usr/lib/libc.so.6
#19 0x000055555555729a in _start () at tesseractmain.cpp:602

Looks like the unichar ID passed to get_isdigit (297 in frame #1) is not contained in the unicharset, so the contains_unichar_id assertion fails. Not sure how and why we end up there though.
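
As a toy model of frame #1 in the trace (not Tesseract's actual code), assume that `contains_unichar_id` is essentially a range check on the ID. Under that assumption, querying an ID larger than the character set fails the assertion before any digit test happens. The class name and the tiny character set below are invented for illustration.

```python
class ToyUnicharset:
    """Minimal stand-in for a character set indexed by integer IDs."""

    def __init__(self, unichars):
        self._unichars = list(unichars)

    def contains_unichar_id(self, unichar_id: int) -> bool:
        # Assumption: the real check boils down to "is this ID within the table?".
        return 0 <= unichar_id < len(self._unichars)

    def get_isdigit(self, unichar_id: int) -> bool:
        # Mirrors the assert-then-lookup order seen in the backtrace.
        if not self.contains_unichar_id(unichar_id):
            raise AssertionError("contains_unichar_id(unichar_id)")
        return self._unichars[unichar_id].isdigit()
```

With a six-entry set such as `ToyUnicharset("abc123")`, `get_isdigit(3)` answers normally, while `get_isdigit(297)` raises the assertion, which is the shape of failure one would expect if an ID from a larger character set were looked up in a smaller one.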

@Shreeshrii
Collaborator

Please check the version of traineddata file that you are using.

Also try with traineddata from tessdata_fast and tessdata_best. Do you get the same error?

@Shugyousha

Shugyousha commented Aug 9, 2018 via email

@stweil
Contributor

stweil commented Aug 10, 2018

The issue only occurs with models from tessdata (starting with commit d87b3c) and OCR engine mode 2.

@Shreeshrii
Collaborator

The issue only occurs with models from tessdata (starting with commit d87b3c) and OCR engine mode 2.

That commit is 'Updated LSTM Models to integerized tessdata_best'.

The earlier commit by Ray was on Nov 29, 2016
Added LSTM models+lang models to 101 langs.

However, after that the format of traineddata files has changed to include the recoder. If I remember correctly, those LSTM models do not work/produce accurate recognition results with current code.

2017-07-14 (dc8745e) Ray Smith: Move LSTM unicharset and recoder to traineddata with version string part1. Backwards compatible - maybe.

@stweil
Contributor

stweil commented Sep 17, 2018

I consider this to be one of the most important bugs which I'd like to get fixed for 4.0.0, even if it only occurs with models from https://github.com/tesseract-ocr/traineddata when both the old and new OCR engines are used (which is still the default). Several possible solutions exist:

  1. Fix it. That's my favourite solution, but I still could not solve it. It would help to have a very short and simple text which triggers the problem (or if someone else finds the correct fix). Removing the assertion is not the correct fix!
  2. Avoid it. That would require changing the default: --oem 3 would no longer be "based on what is available", but "best which is available". Drawback: People would still get the error when running with --oem 2.

@amitdo
Collaborator

amitdo commented Sep 17, 2018

"best which is available"

Should be:
best if available,
else legacy if available,
else exit with an error "not a valid traineddata"
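
The fallback order proposed here can be written down as a small decision function. This is a sketch of the proposal only, not of the code that was eventually merged; it assumes "best" refers to the LSTM engine, and the names `Engine` and `choose_engine` are hypothetical.

```python
from enum import Enum


class Engine(Enum):
    LSTM = "lstm"      # assumed to be the "best" engine in the proposal
    LEGACY = "legacy"


def choose_engine(has_lstm: bool, has_legacy: bool) -> Engine:
    """Pick the engine for the default mode from what the traineddata contains."""
    if has_lstm:
        return Engine.LSTM
    if has_legacy:
        return Engine.LEGACY
    # Neither component is present, so the file cannot be used at all.
    raise ValueError("not a valid traineddata")
```

Under this rule, a traineddata file that contains both engines would run LSTM only by default, which avoids the combined mode that triggers the assertion.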

@Shreeshrii
Collaborator

It will be helpful if @jbreiden can check whether this error also happens with Google's version of tesseract.

@stweil
Contributor

stweil commented Oct 1, 2018

See discussion #1849 with some ideas for workaround solutions.

@stweil stweil added this to the 4.0.0 milestone Oct 1, 2018
@amitdo
Collaborator

amitdo commented Oct 3, 2018

@stweil, since we want to release 4.0.0 in the next 2-3 weeks and we still don't have a fix for this issue, I think we need to move to plan B (make a workaround).

@stweil
Contributor

stweil commented Oct 6, 2018

We don't. I found a fix today. See pull request #1954.

@amitdo
Collaborator

amitdo commented Oct 6, 2018

Thanks!

I assume it also solves the other similar reports, right?
#1205 (comment)

@stweil
Contributor

stweil commented Oct 6, 2018

Yes, I assume so. @sallyhill, @psinger please test the new code.

@ingwinlu

ingwinlu commented Jun 21, 2019

Unfortunately, this issue still persists with releases containing the above bugfix (4.0.0 on Arch Linux).

➜  ~/projects/tesseract git:(master) tesseract --version
tesseract 4.0.0
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.2) : libpng 1.6.37 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 1.0.2
 Found AVX2
 Found AVX
 Found SSE
(gdb) bt
#0  0x00007effa32860fb in ERRCODE::error(char const*, TessErrorLogCode, char const*, ...) const ()
   from /usr/lib/libtesseract.so.4
#1  0x00007effa31f2a84 in tesseract::Dict::case_ok(WERD_CHOICE const&, UNICHARSET const&) const ()
   from /usr/lib/libtesseract.so.4
#2  0x00007effa31fec28 in tesseract::Dict::AcceptableResult(WERD_RES*) const () from /usr/lib/libtesseract.so.4
#3  0x00007effa30cc734 in tesseract::Tesseract::match_word_pass_n(int, WERD_RES*, ROW*, BLOCK*) ()
   from /usr/lib/libtesseract.so.4
#4  0x00007effa30cc7fa in tesseract::Tesseract::classify_word_pass1(tesseract::WordData const&, WERD_RES**, tesseract::PointerVector<WERD_RES>*) () from /usr/lib/libtesseract.so.4
#5  0x00007effa30ce0c7 in tesseract::Tesseract::RetryWithLanguage(tesseract::WordData const&, void (tesseract::Tesseract::*)(tesseract::WordData const&, WERD_RES**, tesseract::PointerVector<WERD_RES>*), bool, WERD_RES**, tesseract::PointerVector<WERD_RES>*) () from /usr/lib/libtesseract.so.4
#6  0x00007effa30ce7f1 in tesseract::Tesseract::classify_word_and_language(int, PAGE_RES_IT*, tesseract::WordData*)
    () from /usr/lib/libtesseract.so.4
#7  0x00007effa30d1240 in tesseract::Tesseract::RecogAllWordsPassN(int, ETEXT_DESC*, PAGE_RES_IT*, GenericVector<tesseract::WordData>*) () from /usr/lib/libtesseract.so.4
#8  0x00007effa30d2f84 in tesseract::Tesseract::recog_all_words(PAGE_RES*, ETEXT_DESC*, TBOX const*, char const*, int) () from /usr/lib/libtesseract.so.4
#9  0x00007effa30bc6b3 in tesseract::TessBaseAPI::Recognize(ETEXT_DESC*) () from /usr/lib/libtesseract.so.4
#10 0x00007effa30bca2b in tesseract::TessBaseAPI::ProcessPage(Pix*, int, char const*, char const*, int, tesseract::TessResultRenderer*) () from /usr/lib/libtesseract.so.4
#11 0x00007effa30bd6f5 in tesseract::TessBaseAPI::ProcessPagesInternal(char const*, char const*, int, tesseract::TessResultRenderer*) () from /usr/lib/libtesseract.so.4
#12 0x00007effa30bd8af in tesseract::TessBaseAPI::ProcessPages(char const*, char const*, int, tesseract::TessResultRenderer*) () from /usr/lib/libtesseract.so.4
#13 0x000055bb5496cc96 in main ()

The bad news is that I cannot share the file causing it.

@amitdo
Collaborator

amitdo commented Jun 21, 2019

Try using --oem 1 as a workaround.

@stweil
Contributor

stweil commented Jun 21, 2019

@ingwinlu, it would help to have a reproducible test case. Perhaps you can find a shareable image, or you can send me your image via e-mail.

@buerge3

buerge3 commented Jan 3, 2020

I get the error: "Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica contains_unichar_id(unichar_id):Error:Assert failed:in file ../ccutil/unicharset.h, line 513"
When running it on the following test image:
filter
The problem persists even when running with --oem 1

@zdenop
Contributor

zdenop commented Jan 3, 2020

Your Tesseract version is very, very old. Use the latest code when dealing with an issue.

@buerge3

buerge3 commented Jan 3, 2020

I have the latest version:
image

@zdenop
Contributor

zdenop commented Jan 3, 2020

You wrote:

I get the error: "Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica contains_unichar_id(unichar_id):Error:Assert failed:in file ../ccutil/unicharset.h, line 513"

@buerge3

buerge3 commented Jan 3, 2020

Yes, that is the error I am getting. I could not find any instructions for installing Tesseract on RedHat, so I used the instructions given by this guy's blog:
https://www.mail-archive.com/tesseract-ocr@googlegroups.com/msg15794.html

@zdenop
Contributor

zdenop commented Jan 3, 2020

If you get that error, you are not using the latest code/version, and it is not a Tesseract issue.

@buerge3

buerge3 commented Jan 3, 2020

I uninstalled Tesseract and reinstalled it using the instructions given here: https://github.com/tesseract-ocr/tesseract/wiki
The problem still persists. I notice that tesseract-lang is only version 4.00, which does not match version 4.1.0 of Tesseract itself. Could this be what is causing the issue, and if so, how do I get the most recent version of tesseract-lang?

@Hemant2022

I am getting the same error even when I use no config. Is this issue still closed?

@Shreeshrii
Collaborator

Please post tesseract version, which traineddata you used and the image giving error.
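
A small helper along these lines can collect the version for such a report. It is only a sketch with hypothetical function names: it assumes the first line of `tesseract --version` has the form seen earlier in this thread (`tesseract 4.0.0`, possibly with a leading `v`), and it merges stderr into stdout because some builds print the version there.

```python
import re
import subprocess
from typing import Optional


def parse_version(output: str) -> Optional[str]:
    """Extract the version string from `tesseract --version` output."""
    match = re.search(r"^tesseract\s+v?(\S+)", output, flags=re.MULTILINE)
    return match.group(1) if match else None


def installed_version() -> Optional[str]:
    """Ask the tesseract binary on PATH for its version."""
    proc = subprocess.run(
        ["tesseract", "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return parse_version(proc.stdout)
```

The installed languages can be listed separately with `tesseract --list-langs`; which traineddata repository they came from still has to be stated by hand.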
