some images translated to text using Tesseract 4 throw an error regarding "contains_unichar_id" #1205

sallyhill · 2017-11-09T19:36:36Z

Environment

Tesseract Version: 4.00.00alpha
Commit Number: I used brew install tesseract --HEAD to install
Platform: 15.6.0 Darwin Kernel Version 15.6.0: Mon Oct 2 22:20:08 PDT 2017; root:xnu-3248.71.4~1/RELEASE_X86_64 x86_64 (osx)
Files affected:

Current Behavior:

Calling image_to_string on the attached files throws a TesseractError that prints: (-6, 'contains_unichar_id(unichar_id):Error:Assert failed:in file ../ccutil/unicharset.h, line 513')

Expected Behavior:

No error.

Suggested Fix:

I am not sure. Right now I'm just running pytesseract.image_to_string in a try block
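
A try-block workaround like the one described here might look roughly like the sketch below. The helper name `ocr_many` and its parameters are hypothetical; the OCR call is passed in as a function so the sketch does not depend on a particular pytesseract version.

```python
from typing import Callable, Dict, Iterable, Tuple, Type


def ocr_many(
    paths: Iterable[str],
    ocr: Callable[[str], str],
    error_type: Type[Exception] = Exception,
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Run `ocr` on every path, collecting failures instead of stopping."""
    texts: Dict[str, str] = {}
    failures: Dict[str, str] = {}
    for path in paths:
        try:
            texts[path] = ocr(path)
        except error_type as exc:
            # Remember which file failed so it can be attached to a bug report.
            failures[path] = str(exc)
    return texts, failures
```

With pytesseract, `ocr` could be something like `lambda p: pytesseract.image_to_string(Image.open(p))`, and `error_type` could be the `TesseractError` mentioned in the report. Keeping the list of failing files makes it easier to share a reproducible test case later.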

sallyhill · 2017-11-09T19:43:24Z

A couple thousand other images were processed without problems; only these images triggered the error.

psinger · 2017-12-18T15:31:49Z

Any solution to that? Having similar issues.

amitdo · 2018-01-08T20:35:21Z

@sallyhill , @psinger
Does this happen with:

--oem 1
--oem 0

?

psinger · 2018-01-09T08:42:22Z

@amitdo

I just tried it, and it works with both options.

Any idea what's going on?

amitdo · 2018-01-09T10:07:13Z

Seems like a bug in combining the two OCR engines.
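
To narrow this down on a given image, one could run the same OCR call once per engine mode and note which modes fail. This is a minimal sketch: `probe_oem_modes` and `run` are hypothetical names, and the only Tesseract-specific part is the `--oem` flag (0 = legacy only, 1 = LSTM only, 2 = legacy + LSTM, 3 = default).

```python
from typing import Callable, Dict, Iterable


def probe_oem_modes(
    run: Callable[[str], str],
    modes: Iterable[int] = (0, 1, 2, 3),
) -> Dict[int, str]:
    """Call `run` once per engine mode and record 'ok' or the error text."""
    outcome: Dict[int, str] = {}
    for mode in modes:
        config = f"--oem {mode}"
        try:
            run(config)
            outcome[mode] = "ok"
        except Exception as exc:  # broad on purpose: this is a diagnostic probe
            outcome[mode] = f"failed: {exc}"
    return outcome
```

With pytesseract, `run` could be `lambda cfg: pytesseract.image_to_string(image, config=cfg)`. Note that `--oem 0` needs a traineddata file that still contains the legacy model.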

psinger · 2018-01-09T15:59:14Z

Any way to track this down further?

amitdo · 2018-01-09T18:13:39Z

You can use GDB to see the function call chain.

Frankly, I only use --oem 1 (or 3 with best/fast traineddata), so I'm not so motivated to invest time on this issue. Sorry.
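
For anyone who wants to follow the GDB suggestion, the sketch below builds a non-interactive gdb command line that runs tesseract and prints the backtrace when it stops. The function name and the default of `--oem 2` are assumptions for illustration; `-batch`, `-ex` and `--args` are standard gdb options.

```python
from typing import List


def gdb_backtrace_command(image: str, oem: int = 2) -> List[str]:
    """Build a gdb invocation that runs tesseract and prints a backtrace."""
    return [
        "gdb", "-batch",
        "-ex", "run",   # start the program under the debugger
        "-ex", "bt",    # print the call chain once it stops on the abort
        "--args", "tesseract", image, "stdout", "--oem", str(oem),
    ]
```

The resulting list can be passed to `subprocess.run`. A build with debug symbols gives a far more useful trace.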

syzer · 2018-01-22T14:39:19Z

👍

stweil · 2018-02-08T14:55:06Z

I get the reported assertion with the second image (all other images work for me) and will have a look.

amitdo · 2018-02-08T16:35:54Z

@stweil,

Same assert was reported in:
#1154 #1177 #1181 #1222 #1223 #1232 #1237 #1307
Also see PR #1286

Shreeshrii · 2018-03-30T09:28:08Z

@stweil #1423

Shreeshrii · 2018-05-25T08:24:35Z

New report #1601

Shreeshrii · 2018-05-25T08:26:22Z

@zdenop Please label as bug.

ghost · 2018-05-31T17:41:05Z

Have you found any solution for this? My PDF contains both Arabic and English, and I'm facing the same issue:
contains_unichar_id(unichar_id):Error:Assert failed:in file c:\projects\github\tesseract-ocr\src\ccutil\unicharset.h, line 511
Exception in thread "main" java.lang.Error: Invalid memory access
at com.sun.jna.Native.invokePointer(Native Method)
at com.sun.jna.Function.invokePointer(Function.java:470)
at com.sun.jna.Function.invoke(Function.java:404)
at com.sun.jna.Function.invoke(Function.java:315)
at com.sun.jna.Library$Handler.invoke(Library.java:212)
at com.sun.proxy.$Proxy1.TessBaseAPIGetUTF8Text(Unknown Source)
at net.sourceforge.tess4j.Tesseract.getOCRText(Tesseract.java:433)
at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:288)
at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:209)
at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:193)

syzer · 2018-05-31T18:53:25Z

Yeah, I made a patch for it that removes this assert. It works okay-ish, but it doesn't really solve the underlying issue.

ghost · 2018-06-01T08:44:27Z

Thanks, syzer. Where can I get the patch? Please share.
Could you also guide me on preparing trained data?
Regards

Shreeshrii · 2018-06-01T15:26:31Z

Please see #1286
for the patch.

It has not been merged yet.

If you try it please provide feedback.

ghost · 2018-06-01T15:50:59Z

Please publish a standard jar file so that we can try it. And could you please guide me on creating a traineddata file?

thanks

danablanc · 2018-07-23T07:55:52Z

Hi.
I have the same issue, using Tesseract Open Source OCR Engine vv4.0.0-beta.1.20180608 with Leptonica for Windows. How can I get this patch?

Shugyousha · 2018-08-09T13:01:58Z

I can reproduce this and since I haven't seen a stack trace for this yet I will post the one I have:

contains_unichar_id(unichar_id):Error:Assert failed:in file ../../src/ccutil/unicharset.h, line 511

Thread 1 "tesseract" received signal SIGSEGV, Segmentation fault.
ERRCODE::error (this=this@entry=0x7ffff7d774c8 <_ZL13ASSERT_FAILED>, caller=caller@entry=0x7ffff7874630 "contains_unichar_id(unichar_id)", acti
    format=format@entry=0x7ffff7871e41 "in file %s, line %d") at errcode.cpp:86
86            if (!*p)
(gdb) bt
#0  ERRCODE::error (this=this@entry=0x7ffff7d774c8 <_ZL13ASSERT_FAILED>, caller=caller@entry=0x7ffff7874630 "contains_unichar_id(unichar_id)", action=action@entry=ABORT,
    format=format@entry=0x7ffff7871e41 "in file %s, line %d") at errcode.cpp:86
#1  0x00007ffff77c5ef4 in UNICHARSET::get_isdigit (unichar_id=297, this=0x5555559ac990) at ../../src/ccutil/unicharset.h:511
#2  tesseract::Dict::char_for_dawg (dawg=0x555556c3f2d0, ch=297, this=0x555555dfb120) at dict.h:435
#3  tesseract::Dict::def_letter_is_okay(void*, int, bool) const () at dict.cpp:413
#4  0x00007ffff77c624e in tesseract::Dict::valid_word(WERD_CHOICE const&, bool) const () at ../../src/ccstruct/ratngs.h:314
#5  0x00007ffff76c437b in tesseract::Tesseract::recog_word(WERD_RES*) () at tfacepp.cpp:69
#6  0x00007ffff76c1ed3 in tesseract::Tesseract::tess_segment_pass_n (this=this@entry=0x7ffff7fd2010, pass_n=pass_n@entry=1, word=word@entry=0x55555ad33a20) at tessbox.cpp:48
#7  0x00007ffff7674b8e in tesseract::Tesseract::match_word_pass_n(int, WERD_RES*, ROW*, BLOCK*) () at control.cpp:1644
#8  0x00007ffff7674d89 in tesseract::Tesseract::classify_word_pass1 (this=0x7ffff7fd2010, word_data=..., in_word=0x55555acd0780, out_words=<optimized out>)
    at control.cpp:1450
#9  0x00007ffff7676114 in tesseract::Tesseract::RetryWithLanguage(tesseract::WordData const&, void (tesseract::Tesseract::*)(tesseract::WordData const&, WERD_RES**, tesseract::PointerVector<WERD_RES>*), bool, WERD_RES**, tesseract::PointerVector<WERD_RES>*) () at control.cpp:923
#10 0x00007ffff7676944 in tesseract::Tesseract::classify_word_and_language(int, PAGE_RES_IT*, tesseract::WordData*) () at ../../src/ccutil/genericvector.h:716
#11 0x00007ffff767a189 in tesseract::Tesseract::RecogAllWordsPassN(int, ETEXT_DESC*, PAGE_RES_IT*, GenericVector<tesseract::WordData>*) () at control.cpp:276
#12 0x00007ffff767ba43 in tesseract::Tesseract::recog_all_words(PAGE_RES*, ETEXT_DESC*, TBOX const*, char const*, int) () at control.cpp:369
#13 0x00007ffff7663c6e in tesseract::TessBaseAPI::Recognize(ETEXT_DESC*) () at baseapi.cpp:907
#14 0x00007ffff7664002 in tesseract::TessBaseAPI::ProcessPage (this=this@entry=0x5555557592c0 <main::api>, pix=0x55555598a720, page_index=page_index@entry=0,
    filename=filename@entry=0x7fffffffe5fa "0003.jpg", retry_config=retry_config@entry=0x0, timeout_millisec=timeout_millisec@entry=0, renderer=0x555555983800)
    at baseapi.cpp:1217
#15 0x00007ffff7666fe9 in tesseract::TessBaseAPI::ProcessPagesInternal(char const*, char const*, int, tesseract::TessResultRenderer*) () at baseapi.cpp:1169
#16 0x00007ffff766711e in tesseract::TessBaseAPI::ProcessPages (this=this@entry=0x5555557592c0 <main::api>, filename=filename@entry=0x7fffffffe5fa "0003.jpg",
    retry_config=retry_config@entry=0x0, timeout_millisec=timeout_millisec@entry=0, renderer=<optimized out>) at baseapi.cpp:1070
#17 0x0000555555556c73 in main () at ../../src/ccutil/genericvector.h:716
#18 0x00007ffff67ff06b in __libc_start_main () from /usr/lib/libc.so.6
#19 0x000055555555729a in _start () at tesseractmain.cpp:602

Looks like the unichar ID passed to get_isdigit (297 in frame #1) is not contained in the UNICHARSET being queried, which trips the contains_unichar_id assertion. Not sure how or why we end up there though.
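
To make the failing check in frame #1 concrete, here is a toy model in Python. It is not Tesseract's code: `ToyUnicharset` is a hypothetical stand-in, and the set sizes are invented. It only illustrates how an ID can be valid for one character set and out of range for another; whether that is what happens in this trace is not established at this point in the thread.

```python
class ToyUnicharset:
    """Toy stand-in for a character set: ids are indexes into a list."""

    def __init__(self, chars):
        self._chars = list(chars)

    def contains_unichar_id(self, unichar_id: int) -> bool:
        return 0 <= unichar_id < len(self._chars)

    def get_isdigit(self, unichar_id: int) -> bool:
        # Same shape as the failing check: validate the id before using it.
        assert self.contains_unichar_id(unichar_id), "contains_unichar_id(unichar_id)"
        return self._chars[unichar_id].isdigit()


small = ToyUnicharset("abc123")              # 6 entries, ids 0..5
large = ToyUnicharset("abc123" + "x" * 300)  # 306 entries, ids 0..305

print(large.get_isdigit(297))   # id 297 exists in the larger set
try:
    small.get_isdigit(297)      # same id, but queried against the smaller set
except AssertionError as err:
    print("assert failed:", err)
```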

Shreeshrii · 2018-08-09T14:33:57Z

Please check the version of traineddata file that you are using.

Also try with traineddata from tessdata_fast and tessdata_best. Do you get the same error?

Shugyousha · 2018-08-09T17:00:24Z

On Thu, Aug 09, 2018 at 07:34:40AM -0700, Shreeshrii wrote:

> Please check the version of traineddata file that you are using.

I used a roughly two-week-old version of the models in the tesseract-data github repo.

> Also try with traineddata from tessdata_fast and tessdata_best. Do you get the same error?

Sadly I don't have access to the installation at the moment because I am off work and will be going on holiday tomorrow. I will make a note in my calendar to check this after I am back. Cheers, Silvan

stweil · 2018-08-10T14:11:47Z

The issue only occurs with models from tessdata (starting with commit d87b3c) and OCR engine mode 2.

Shreeshrii · 2018-08-21T09:30:04Z

> The issue only occurs with models from tessdata (starting with commit d87b3c) and OCR engine mode 2.

That commit is "Updated LSTM Models to integerized tessdata_best".

The earlier commit by Ray was on Nov 29, 2016: "Added LSTM models+lang models to 101 langs."

However, after that the format of traineddata files has changed to include the recoder. If I remember correctly, those LSTM models do not work/produce accurate recognition results with current code.

2017-07-14 (dc8745e) Ray Smith: Move LSTM unicharset and recoder to traineddata with version string part1. Backwards compatible - maybe.

stweil · 2018-09-17T18:03:29Z

I consider this to be one of the most important bugs which I'd like to get fixed for 4.0.0, even if it only occurs with models from https://github.com/tesseract-ocr/traineddata when both old and new OCR engine are used (which is still the default). Several possible solutions exist:

Fix it. That's my favourite solution, but I still could not solve it. It would help to have a very short and simple text which triggers the problem (or if someone else finds the correct fix). Removing the assertion is not the correct fix!
Avoid it. That would require changing the default: --oem 3 would no longer be "based on what is available", but "best which is available". Drawback: People would still get the error when running with --oem 2.

amitdo · 2018-09-17T19:14:26Z

"best which is available"

Should be:
best if available,
else legacy if available,
else exit with an error "not a valid traineddata"
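
The fallback order proposed here can be written down as a small decision function. This is only a sketch of the proposal, not Tesseract's implementation: the function name and the boolean inputs are hypothetical, and "best" is read as the LSTM engine, which is an assumption.

```python
def choose_default_engine(has_lstm: bool, has_legacy: bool) -> str:
    """Pick the engine for the default mode using the proposed fallback order."""
    if has_lstm:
        # Assumption: "best" in the proposal means the LSTM engine.
        return "lstm"
    if has_legacy:
        return "legacy"
    # Neither model is present in the traineddata file.
    raise ValueError("not a valid traineddata")
```

Under this rule the default mode never combines the two engines, so the combined path would only be reached when it is requested explicitly with `--oem 2`.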

Shreeshrii · 2018-09-17T20:39:34Z

It will be helpful if @jbreiden can check whether this error also happens with Google's version of tesseract.

stweil · 2018-10-01T15:17:46Z

See discussion #1849 with some ideas for workaround solutions.

amitdo · 2018-10-03T18:25:02Z

@stweil, since we want to release 4.0.0 in the next 2-3 weeks and we still don't have a fix for this issue, I think we need to move to plan B (make a workaround).

stweil · 2018-10-06T11:26:28Z

We don't. I found a fix today. See pull request #1954.

amitdo · 2018-10-06T13:11:36Z

Thanks!

I assume it also solves the other similar reports, right?
#1205 (comment)

stweil · 2018-10-06T13:21:06Z

Yes, I assume so. @sallyhill, @psinger please test the new code.

ingwinlu · 2019-06-21T07:43:02Z

Unfortunately, this issue still persists with releases containing the above bugfix (4.0.0 on Arch Linux):

➜  ~/projects/tesseract git:(master) tesseract --version
tesseract 4.0.0
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.2) : libpng 1.6.37 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 1.0.2
 Found AVX2
 Found AVX
 Found SSE

(gdb) bt
#0  0x00007effa32860fb in ERRCODE::error(char const*, TessErrorLogCode, char const*, ...) const ()
   from /usr/lib/libtesseract.so.4
#1  0x00007effa31f2a84 in tesseract::Dict::case_ok(WERD_CHOICE const&, UNICHARSET const&) const ()
   from /usr/lib/libtesseract.so.4
#2  0x00007effa31fec28 in tesseract::Dict::AcceptableResult(WERD_RES*) const () from /usr/lib/libtesseract.so.4
#3  0x00007effa30cc734 in tesseract::Tesseract::match_word_pass_n(int, WERD_RES*, ROW*, BLOCK*) ()
   from /usr/lib/libtesseract.so.4
#4  0x00007effa30cc7fa in tesseract::Tesseract::classify_word_pass1(tesseract::WordData const&, WERD_RES**, tesseract::PointerVector<WERD_RES>*) () from /usr/lib/libtesseract.so.4
#5  0x00007effa30ce0c7 in tesseract::Tesseract::RetryWithLanguage(tesseract::WordData const&, void (tesseract::Tesseract::*)(tesseract::WordData const&, WERD_RES**, tesseract::PointerVector<WERD_RES>*), bool, WERD_RES**, tesseract::PointerVector<WERD_RES>*) () from /usr/lib/libtesseract.so.4
#6  0x00007effa30ce7f1 in tesseract::Tesseract::classify_word_and_language(int, PAGE_RES_IT*, tesseract::WordData*)
    () from /usr/lib/libtesseract.so.4
#7  0x00007effa30d1240 in tesseract::Tesseract::RecogAllWordsPassN(int, ETEXT_DESC*, PAGE_RES_IT*, GenericVector<tesseract::WordData>*) () from /usr/lib/libtesseract.so.4
#8  0x00007effa30d2f84 in tesseract::Tesseract::recog_all_words(PAGE_RES*, ETEXT_DESC*, TBOX const*, char const*, int) () from /usr/lib/libtesseract.so.4
#9  0x00007effa30bc6b3 in tesseract::TessBaseAPI::Recognize(ETEXT_DESC*) () from /usr/lib/libtesseract.so.4
#10 0x00007effa30bca2b in tesseract::TessBaseAPI::ProcessPage(Pix*, int, char const*, char const*, int, tesseract::TessResultRenderer*) () from /usr/lib/libtesseract.so.4
#11 0x00007effa30bd6f5 in tesseract::TessBaseAPI::ProcessPagesInternal(char const*, char const*, int, tesseract::TessResultRenderer*) () from /usr/lib/libtesseract.so.4
#12 0x00007effa30bd8af in tesseract::TessBaseAPI::ProcessPages(char const*, char const*, int, tesseract::TessResultRenderer*) () from /usr/lib/libtesseract.so.4
#13 0x000055bb5496cc96 in main ()

The bad news is that I cannot share the file causing it.

amitdo · 2019-06-21T21:09:14Z

Try using --oem 1 as a workaround.

stweil · 2019-06-21T21:27:46Z

@ingwinlu, it would help to have a reproducible test case. Perhaps you can find a shareable image, or you can send me your image via e-mail.

buerge3 · 2020-01-03T18:17:13Z

I get the error: "Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica contains_unichar_id(unichar_id):Error:Assert failed:in file ../ccutil/unicharset.h, line 513"
When running it on the following test image:

The problem persists even when running with --oem 1

zdenop · 2020-01-03T18:19:56Z

Your tesseract version is very, very old. Use the latest code when reporting an issue.

buerge3 · 2020-01-03T18:26:29Z

I have the latest version.

zdenop · 2020-01-03T18:31:41Z

you wrote:

I get the error: "Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica contains_unichar_id(unichar_id):Error:Assert failed:in file ../ccutil/unicharset.h, line 513"

buerge3 · 2020-01-03T18:39:26Z

Yes, that is the error I am getting. I could not find any instructions for installing Tesseract on RedHat, so I used the instructions given by this guy's blog:
https://www.mail-archive.com/tesseract-ocr@googlegroups.com/msg15794.html

zdenop · 2020-01-03T18:48:21Z

If you get that error, you are not using the latest code/version. And it is not a tesseract issue.

buerge3 · 2020-01-03T19:05:34Z

I uninstalled Tesseract and reinstalled it using the instructions given here: https://github.com/tesseract-ocr/tesseract/wiki
The problem still persists. I notice that tesseract-lang is only version 4.00, which does not match the version 4.1.0 of tesseract itself. Could this be what is causing the issue, and if so then how do I get the most recent version of tesseract-lang?

Hemant2022 · 2020-12-11T12:02:01Z

I am getting the same error even when I use no config. Is this issue still closed?

Shreeshrii · 2020-12-11T12:27:16Z

Please post tesseract version, which traineddata you used and the image giving error.
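
Since several reports in this thread turned out to hinge on which version was really installed, a small helper for pulling the version out of either the `tesseract --version` output or the banner in an error message may help. This is a sketch with a hypothetical function name, and the regular expression only covers the two formats quoted in this thread.

```python
import re
from typing import Optional


def extract_tesseract_version(text: str) -> Optional[str]:
    """Return the version token after 'tesseract ' or 'Engine v', if any."""
    match = re.search(r"(?:tesseract |Engine v)(\d[\w.\-]*)", text)
    return match.group(1) if match else None
```

The input can be the text printed by `tesseract --version`; `tesseract --list-langs` shows which traineddata files the installation can see.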

amitdo mentioned this issue Jan 18, 2018

get_isdigit() failing causes program to abort #1279

Closed

amitdo mentioned this issue Feb 8, 2018

RFC: Remove the legacy OCR Engine #707

Closed

Shreeshrii mentioned this issue May 25, 2018

Segmentation fault OCRing a washed out image #1601

Closed

zdenop added the bug label May 25, 2018

bozhodimitrov mentioned this issue Aug 18, 2018

Unicharset error madmaze/pytesseract#146

Closed

amitdo mentioned this issue Aug 19, 2018

Specific file causes error #1849

Closed

Shreeshrii mentioned this issue Sep 22, 2018

Recognition fails with assertion contains_unichar_id(unichar_id) #1925

Closed

stweil mentioned this issue Oct 1, 2018

Tesseract contains_unichar_id error #1237

Closed

stweil added this to the 4.0.0 milestone Oct 1, 2018

stweil mentioned this issue Oct 6, 2018

Fix use of wrong UNICHARSET #1954

Merged

zdenop closed this as completed Oct 6, 2018

stweil mentioned this issue Oct 6, 2018

dict: Wrong values for hyphen_unichar_id_ and slash_unichar_id_ in comparision #1955

Open

zdenop mentioned this issue Oct 9, 2018

Tesseract crashes when processing certain documents #1181

Closed

amitdo added the unexpected termination label Mar 24, 2021

This was referenced May 21, 2021

Recurring error cfculhane/AnkiOCR#20

Closed

fix #20 cfculhane/AnkiOCR#21

Closed
