Segmentation fault OCRing a washed out image #1601
Comments
Duplicate |
@konstantin-dzreev I am not able to reproduce the error regarding unichar-id and core-dump that you are getting (pasted below)
My version info is the same:
The only difference I see is that you have:
libpng version is also different. @stweil Can that make a difference? |
I found a different issue while processing this image with gdb (hoping to trace the crash). Edit: made new issue #1603 |
Related to issue #1603, using the image posted here by the OP. Log file attached here. |
@zdenop Please reopen the issue. The title could be edited to say 'psm 6 producing gibberish' Thanks! |
@Shreeshrii: your observation is different from the original issue report (which is a duplicate of already open issues). Renaming it would just produce chaos... |
@zdenop Good point. I will open a different issue for it and delete the comments from here. Thanks, |
OK, I tested it.
|
@amitdo tesseract and leptonica version, please! |
The fast, best and original (lstm+legacy) tessdata do not crash. |
Which version does the current tessdata correspond to? |
|
The 2 latest commits both crash. |
of tessdata? |
tessdata repo: v1, v2, v3 |
@stweil had removed cube components from some traineddata files recently.
It is possible that the components say 'cube' but are used by the legacy
engine.
I just refreshed my tessdata files, will test again.
ShreeDevi
…____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
On Fri, May 25, 2018 at 9:43 PM, Amit D. wrote:
tessdata repo
v1
Nov 28, 2016
https://github.com/tesseract-ocr/tessdata/blob/4592b8d453889181e01982d22328b5846765eaad/eng.traineddata
Does not crash.
v2
March 22, 2018
https://github.com/tesseract-ocr/tessdata/blob/d87b3cbc75555bd3282e0cadab5e159e2d468396/eng.traineddata
Crash!
v3
May 10, 2018
https://github.com/tesseract-ocr/tessdata/blob/c2b2e0df86272ce11be323f23f96cf656565ed41/eng.traineddata
Crash!
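To repeat this comparison locally, the three revisions can be fetched side by side. The following is only a sketch: the commit hashes and crash results are copied from the list above, while the helper names (`raw_url`, `local_dir`) and the `tessdata-<label>` directory layout are hypothetical. It relies on GitHub serving the file itself when `/blob/` in a file URL is replaced by `/raw/`.

```python
# Sketch: map the three eng.traineddata revisions from this thread to
# download URLs and separate local directories (one per revision).

BASE = "https://github.com/tesseract-ocr/tessdata/blob"

# label -> (commit hash from the thread, crash observed in the thread)
VERSIONS = {
    "v1": ("4592b8d453889181e01982d22328b5846765eaad", False),
    "v2": ("d87b3cbc75555bd3282e0cadab5e159e2d468396", True),
    "v3": ("c2b2e0df86272ce11be323f23f96cf656565ed41", True),
}


def blob_url(commit: str, lang: str = "eng") -> str:
    """Page URL of the traineddata file at a given commit."""
    return f"{BASE}/{commit}/{lang}.traineddata"


def raw_url(url: str) -> str:
    """Turn a GitHub /blob/ page URL into a /raw/ download URL."""
    return url.replace("/blob/", "/raw/", 1)


def local_dir(label: str) -> str:
    """Hypothetical directory name to pass to --tessdata-dir."""
    return f"tessdata-{label}"


if __name__ == "__main__":
    for label, (commit, crashed) in VERSIONS.items():
        status = "crash reported" if crashed else "no crash reported"
        print(f"{label}: {raw_url(blob_url(commit))} -> {local_dir(label)}/ ({status})")
```

Each directory can then be passed to `--tessdata-dir` in turn, so that only the traineddata file changes between runs.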
|
If I use the manually binarized image I provided earlier, with those 2 newer traineddata from the tessdata repo, then there is no crash |
OK. I can reproduce the crash after updating the tessdata files. Here is
the backtrace.
Starting program: /usr/local/bin/tesseract 1601.png - --tessdata-dir ../tessdata
[Thread debugging using libthread_db enabled]
Using host libthread_db library
"/lib/powerpc64le-linux-gnu/libthread_db.so.1".
Error in boxClipToRectangle: box outside rectangle
Error in pixScanForForeground: invalid box
[New Thread 0x3fffb6c5f100 (LWP 8284)]
[New Thread 0x3fffb645f100 (LWP 8285)]
[New Thread 0x3fffb5c5f100 (LWP 8286)]
contains_unichar_id(unichar_id):Error:Assert failed:in file
../../src/ccutil/unicharset.h, line 511
Thread 1 "tesseract" received signal SIGSEGV, Segmentation fault.
0x00003fffb7bea350 in ERRCODE::error (this=<optimized out>,
caller=<optimized out>, action=<optimized out>,
format=0x3fffb7c045d0 "in file %s, line %d") at errcode.cpp:86
86 if (!*p)
(gdb) backtrace
#0 0x00003fffb7bea350 in ERRCODE::error (this=<optimized out>,
caller=<optimized out>, action=<optimized out>,
format=0x3fffb7c045d0 "in file %s, line %d") at errcode.cpp:86
#1 0x00003fffb7b3c7bc in UNICHARSET::get_isdigit (unichar_id=112,
this=0x102a95c0) at ../../src/ccutil/unicharset.h:511
#2 tesseract::Dict::char_for_dawg (dawg=0x113d11b0, ch=<optimized out>,
this=0x1058cb30) at dict.h:434
#3 tesseract::Dict::def_letter_is_okay (this=0x1058cb30,
void_dawg_args=0x3fffffffdf50, unichar_id=<optimized out>,
word_end=<optimized out>) at dict.cpp:413
#4 0x00003fffb7b3cf9c in tesseract::Dict::valid_word (this=0x1058cb30,
word=..., numbers_ok=<optimized out>) at dict.cpp:758
#5 0x00003fffb7af4324 in tesseract::Dict::valid_word (word=...,
this=<optimized out>) at ../../src/dict/dict.h:463
#6 tesseract::Wordrec::dict_word (this=<optimized out>, word=...) at
tface.cpp:129
#7 0x00003fffb7a37ee8 in tesseract::Tesseract::recog_word
(this=0x102442c0, word=0x12418860) at tfacepp.cpp:69
#8 0x00003fffb7a25554 in tesseract::Tesseract::tess_segment_pass_n
(this=0x102442c0, pass_n=<optimized out>, word=0x12418860)
at tessbox.cpp:49
#9 0x00003fffb79e01c4 in tesseract::Tesseract::match_word_pass_n
(this=<optimized out>, pass_n=<optimized out>,
word=<optimized out>, row=<optimized out>, block=<optimized out>) at
control.cpp:1580
#10 0x00003fffb79e0464 in tesseract::Tesseract::classify_word_pass1
(this=0x102442c0, word_data=..., in_word=<optimized out>,
out_words=<optimized out>) at control.cpp:1392
#11 0x00003fffb79e195c in tesseract::Tesseract::RetryWithLanguage
(this=0x102442c0, word_data=..., recognizer=<optimized out>,
debug=<optimized out>, in_word=0x1177aec0, best_words=0x3fffffffe508)
at control.cpp:899
#12 0x00003fffb79e21c0 in tesseract::Tesseract::classify_word_and_language
(this=0x102442c0, pass_n=<optimized out>,
pr_it=0x3fffffffe710, word_data=0x1177a888) at control.cpp:1315
#13 0x00003fffb79e5974 in tesseract::Tesseract::RecogAllWordsPassN
(this=0x102442c0, pass_n=<optimized out>, monitor=0x0,
pr_it=0x3fffffffe710, words=0x3fffffffe6f0) at control.cpp:266
#14 0x00003fffb79e7660 in tesseract::Tesseract::recog_all_words
(this=0x102442c0, page_res=0x1174fe20, monitor=0x0,
target_word_box=0x0, word_config=0x0, dopasses=<optimized out>) at
control.cpp:353
#15 0x00003fffb79c9b68 in tesseract::TessBaseAPI::Recognize
(this=0x10020270 <main::api>, monitor=0x0) at baseapi.cpp:870
#16 0x00003fffb79ca10c in tesseract::TessBaseAPI::ProcessPage
(this=0x10020270 <main::api>, pix=0x1027aa60,
page_index=<optimized out>, filename=<optimized out>, retry_config=0x0,
timeout_millisec=<optimized out>, renderer=0x1029de40)
at baseapi.cpp:1176
#17 0x00003fffb79cdaf0 in tesseract::TessBaseAPI::ProcessPagesInternal
(this=0x10020270 <main::api>,
filename=0x3ffffffff792 "1601.png", retry_config=0x0,
timeout_millisec=<optimized out>, renderer=0x1029de40) at baseapi.cpp:1132
#18 0x00003fffb79ce0d8 in tesseract::TessBaseAPI::ProcessPages
(this=<optimized out>, filename=<optimized out>,
retry_config=<optimized out>, timeout_millisec=<optimized out>,
renderer=<optimized out>) at baseapi.cpp:1032
#19 0x0000000010002d6c in main (argc=<optimized out>, argv=0x3ffffffff4b8)
at tesseractmain.cpp:547
(gdb) quit
|
The version without cube is one of the two that crashes. |
Commit d87b3cb (tesseract-ocr/tessdata@d87b3cb), by Shreeshrii on Mar 22: "Update LSTM Models to integerized tessdata_best for files < 25mb".
Will have to check that one.
|
OK, it looks like both --oem 0 and --oem 1 work individually with the
current traineddata.
However, if --oem 2 is used, it crashes.
tesseract 1601.png - --tessdata-dir ../tessdata --oem 2
Error in boxClipToRectangle: box outside rectangle
Error in pixScanForForeground: invalid box
contains_unichar_id(unichar_id):Error:Assert failed:in file ../../src/ccutil/unicharset.h, line 511
Segmentation fault (core dumped)
There is no eng.config file, so by default tesseract is using both engines.
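The same check can be scripted so that all three engine modes are run against one image and the outcomes compared. This is a minimal sketch: the command line is the one shown above, but the function names and the injectable `runner` parameter are hypothetical conveniences, not part of tesseract. It relies on Python's `subprocess` reporting a child killed by signal N as return code -N on POSIX systems (SIGSEGV is signal 11 on Linux).

```python
import subprocess


def build_command(image, tessdata_dir, oem):
    """Build the tesseract command line used in this thread for one --oem value."""
    return ["tesseract", image, "-", "--tessdata-dir", tessdata_dir, "--oem", str(oem)]


def classify(returncode):
    """Describe a subprocess return code; negative values mean death by signal."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        return f"killed by signal {-returncode}"
    return f"exit status {returncode}"


def run_all(image, tessdata_dir, oems=(0, 1, 2), runner=subprocess.run):
    """Run tesseract once per engine mode and collect a summary per mode."""
    results = {}
    for oem in oems:
        proc = runner(
            build_command(image, tessdata_dir, oem),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        results[oem] = classify(proc.returncode)
    return results


# Example call (requires tesseract on PATH and the image from this issue):
#   print(run_all("1601.png", "../tessdata"))
```

Passing a different `runner` makes it possible to exercise the reporting logic on a machine without tesseract installed.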
|
Yes, with |
unicharset and lstm-unicharset can be different in the same traineddata
file. It is possible that the program is using the wrong unicharset for the
dawg files when using --oem 2.
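One way to check this is to compare the two unicharsets directly. The sketch below assumes the components have already been unpacked from the traineddata file into text files (for example with `combine_tessdata -u`), and that each unicharset file starts with a count line followed by one entry per line whose first field is the unichar. Both the function names and that reading of the file layout are assumptions made for illustration. If the sets differ in size, an id that is valid in the larger one can be out of range in the smaller one, which would be consistent with the failed `contains_unichar_id` assertion, though this thread does not confirm that as the cause.

```python
def read_unichars(text):
    """Return the unichars in file order (index == unichar id).

    Assumes: first line is the entry count, every following non-empty
    line starts with the unichar, then whitespace-separated properties.
    """
    lines = text.splitlines()
    return [line.split()[0] for line in lines[1:] if line.strip()]


def compare_unicharsets(legacy, lstm):
    """Summarize how two unichar lists differ in content and id assignment."""
    legacy_ids = {ch: i for i, ch in enumerate(legacy)}
    lstm_ids = {ch: i for i, ch in enumerate(lstm)}
    shared = set(legacy_ids) & set(lstm_ids)
    return {
        "sizes": (len(legacy), len(lstm)),
        "only_legacy": sorted(set(legacy_ids) - set(lstm_ids)),
        "only_lstm": sorted(set(lstm_ids) - set(legacy_ids)),
        # Same unichar present in both sets but stored under a different id.
        "different_id": sorted(ch for ch in shared if legacy_ids[ch] != lstm_ids[ch]),
    }


# Example call (file names are hypothetical results of unpacking eng.traineddata):
#   with open("eng.unicharset", encoding="utf-8") as a, \
#        open("eng.lstm-unicharset", encoding="utf-8") as b:
#       print(compare_unicharsets(read_unichars(a.read()), read_unichars(b.read())))
```

A non-empty `different_id` list would mean the same character has different ids in the two sets, so an id produced with one set cannot safely be looked up in the other.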
|
Should we disable --oem 2? That would also take care of other issues related to multi-language processing, where one language may have only an LSTM model (Indic, Arabic) and others may have both, which leads to similar problems. --oem 2 does not necessarily give better or more accurate results. |
See #235 (comment) regarding --oem 2 issues with mix of languages. |
It indeed looks like the wrong unicharset is used ( The crash is not related to the cube removal. |
I wonder why we need more than one unicharset and more than one word list. Neither should depend on the OCR engine used, and it should be possible to always use a superset that fits both engines. That would also reduce the traineddata size. |
@stweil Yes, you are right. I was trying to recall from memory recent commits. Further testing indicated problem might be unicharsets.
If you can indicate what testing will help, I can do it.
My guess is that the language models depend on these, especially the LSTM model, which also uses a recoder/unicharcompressor for some languages. Using a different unicharset (even one with the same unichars in a different order in the file) leads to wrong results.
You could give it a try. Use |
I'm playing with tesseract, trying to process bad images: really dark or light ones, ones with very low contrast, and so on. I ran into a file that causes tesseract to die with a segmentation fault.
Environment
Linux <host-name> 4.15.0-22-generic #24-Ubuntu SMP Wed May 16 12:15:17 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux (Ubuntu 18.04)
Current Behavior: Segmentation fault:
Attachment: a file to reproduce the issue