Error opening data file ".traineddata" when two languages defined and first has a tessedit_load_sublangs param #4002

Closed
AndrewG10i opened this issue Jan 25, 2023 · 11 comments

AndrewG10i commented Jan 25, 2023

Basic Information

tesseract 5.3.0
leptonica-1.83.0
libgif 5.2.1 : libjpeg 6b (libjpeg-turbo 2.1.4) : libpng 1.6.39 : libtiff 4.5.0 : zlib 1.2.13 : libwebp 1.2.4 : libopenjp2 2.5.0
Found AVX2
Found AVX
Found FMA
Found SSE4.1

Operating System

RHEL 8

Other Operating System

CentOS 8 Stream x86_64 with all updates.

uname -a

Linux dev1.local 4.18.0-448.el8.x86_64 #1 SMP Wed Jan 18 15:02:46 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Compiler

N/A

Virtualization / Containers

VMWare Workstation 16

CPU

Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz

Current Behavior

When Tesseract is set to use two languages (e.g. -l chi_tra+eng) and the traineddata of the first language contains a tessedit_load_sublangs param (with a value other than 'eng'), the error Error opening data file /local/tessData/tessdata_best-4.1.0/.traineddata is shown as follows:

[root@dev1/tmp/tess]# ./tesseract sample01s.png sample01s.png.txt --oem 1 -l chi_tra+eng
Error opening data file /local/tessData/tessdata_best-4.1.0/.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language ''
Estimating resolution as 327

The sample file is not related to the issue, so any sample can be used.
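
The empty name in the Failed loading language '' line would explain the odd path: if the data file path is built as directory + language name + ".traineddata", an empty name leaves only the extension. The Python sketch below illustrates just that point. It is not Tesseract's C++ code; split_languages and traineddata_path are hypothetical helpers, and the trailing '+' is only a way to simulate an empty entry, since the thread does not show how the empty name actually arises.

```python
import posixpath


def split_languages(spec):
    """Split a '+'-separated language spec, keeping empty entries."""
    return spec.split("+")


def traineddata_path(datadir, lang):
    """Build the data file path for one language name."""
    return posixpath.join(datadir, lang + ".traineddata")


# Directory taken from the log above.
datadir = "/local/tessData/tessdata_best-4.1.0"

# The trailing '+' is only here to simulate an empty language name.
for lang in split_languages("chi_tra+eng+"):
    print(repr(lang), "->", traineddata_path(datadir, lang))
```

The last line printed ends in /.traineddata, which matches the path in the error message.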

Expected Behavior

The command should run without this warning, and both languages should load properly. The same case works fine (without the Error opening data file message) when the language order is swapped:

[root@dev1 tess]# ./tesseract sample01s.png sample01s.png.txt --oem 1 -l eng+chi_tra
Estimating resolution as 327
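
To compare the two language orders automatically, the captured output can be scanned for the Failed loading language message. This is a minimal sketch, assuming the message format shown in the logs above; failed_languages is a hypothetical helper, and the two sample logs are copied from this report.

```python
import re

# Matches lines such as: Failed loading language ''
FAILED_RE = re.compile(r"^Failed loading language '([^']*)'", re.MULTILINE)


def failed_languages(log_text):
    """Return the language names Tesseract reported as failed to load."""
    return FAILED_RE.findall(log_text)


log_chi_tra_first = """\
Error opening data file /local/tessData/tessdata_best-4.1.0/.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language ''
Estimating resolution as 327
"""
log_eng_first = "Estimating resolution as 327\n"

print(failed_languages(log_chi_tra_first))  # [''] (one failure, empty name)
print(failed_languages(log_eng_first))      # [] (no failures)
```

In a real run the log text could come from subprocess.run([...], capture_output=True, text=True).stderr, assuming these messages are written to stderr.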

Suggested Fix

Unfortunately I am not a C guy, but it seems something is wrong around the following line of code:
https://github.com/tesseract-ocr/tesseract/blob/5.3.0/src/ccmain/tessedit.cpp#L295

Other Information

I have noticed that there were quite a few issues related to language loading lately (just for reference):

This issue stems from another issue: Tesseract crash after glibc update on linux (when two languages selected)

Tested with versions 5.2 and 5.3, and the issue reproduces in both. Also tested with version 4.1.1, where it works properly.

@amitdo changed the title on Jan 25, 2023
from: Error opening data file ".traineddata" when two languages defined and first has a tessedit_load_sublangs param (tested with 5.2 and 5.3)
to: Error opening data file ".traineddata" when two languages defined and first has a tessedit_load_sublangs param

@AndrewG10i
Author

Hello team, if there is any possibility of including such a tiny fix in the next release, that would be highly appreciated, as this bug causes a critical issue (crash) in our usage scenario. Thank you so much in advance!

@amitdo
Collaborator

amitdo commented May 3, 2023

@stweil, I hope you can fix this regression.

@stweil self-assigned this on May 3, 2023

@AndrewG10i
Author

...great to see it assigned, hopefully it can be resolved soon! :)

@stweil
Contributor

stweil commented Oct 5, 2023

I don't see a tessedit_load_sublangs parameter for chi_tra from tessdata_fast, but tessdata_best has that parameter.
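
One quick way to check whether a given traineddata file carries this parameter is to scan its raw bytes for the parameter name. The sketch below assumes the config component is stored as uncompressed text inside the file, which may not hold for every build or file; find_sublangs is a hypothetical helper. Unpacking the file with the combine_tessdata tool and reading the extracted config would be the more reliable route.

```python
import re
import sys
from pathlib import Path

# Parameter name followed by a value made of typical language-name characters.
PARAM_RE = re.compile(rb"tessedit_load_sublangs[ \t]+([A-Za-z0-9_+~-]+)")


def find_sublangs(traineddata_path):
    """Return the tessedit_load_sublangs value found in the file, or None."""
    data = Path(traineddata_path).read_bytes()
    match = PARAM_RE.search(data)
    return match.group(1).decode("ascii") if match else None


if __name__ == "__main__":
    # Usage: python find_sublangs.py path/to/chi_tra.traineddata [more files...]
    for path in sys.argv[1:]:
        print(path, "->", find_sublangs(path))
```

Running it against the chi_tra files from tessdata_fast and tessdata_best should show the difference described above, if the storage assumption holds.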

@stweil
Contributor

stweil commented Oct 5, 2023

It looks like commit 9091055 tried to fix loading of sublangs, but instead broke it completely. So there was no longer a warning message, but the sublangs were simply not loaded.

This regression should affect 5.0.0-rc2 and all following releases. Therefore I wonder how 5.0.0 could produce the warning.

@stweil
Contributor

stweil commented Oct 5, 2023

The regression (and this issue here) should be fixed by pull request #4141.

@AndrewG10i
Author

AndrewG10i commented Oct 5, 2023

> I don't see a tessedit_load_sublangs parameter for chi_tra from tessdata_fast, but tessdata_best has that parameter.

Yes, I am using tessdata_best. I didn't mention that clearly, but it can be seen from the output log. Thanks!

> The regression (and this issue here) should be fixed by pull request #4141.

Great if it is so! Is there any way to confirm that? ;)

@stweil
Contributor

stweil commented Oct 5, 2023

I'm afraid the only way to confirm that currently is to use your own build of Tesseract with the patched code.

@tfmorris
Contributor

@AndrewG10i Now that 5.3.3 has been released, you should be able to verify the fix.
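
Before re-testing, it may help to confirm that the installed binary is at least 5.3.3. The sketch below parses a version banner like the one under Basic Information; parse_version and is_at_least are hypothetical helpers, and the assumption that the fix first shipped in 5.3.3 rests only on the comment above.

```python
import re

# Assumption: 5.3.3 is the first release that contains the fix.
FIXED_IN = (5, 3, 3)


def parse_version(banner):
    """Extract (major, minor, patch) from a 'tesseract X.Y.Z' banner."""
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)\.(\d+)", banner)
    if match is None:
        raise ValueError("no version found in banner")
    return tuple(int(part) for part in match.groups())


def is_at_least(banner, minimum=FIXED_IN):
    """Return True if the banner's version is at or above the minimum."""
    return parse_version(banner) >= minimum


print(is_at_least("tesseract 5.3.0\n leptonica-1.83.0"))  # False
print(is_at_least("tesseract 5.3.3"))                     # True
```

A False result for 4.1.1 does not mean that version is affected; it was reported earlier in this thread as working properly.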

@AndrewG10i
Author

> @AndrewG10i Now that 5.3.3 has been released, you should be able to verify the fix.

Thank you so much, guys! I will test it as soon as I can and reply back here!

@AndrewG10i
Author

I tested this and the issue is resolved now! Sorry for the delay in replying! Thank you!
