some images translated to text using Tesseract 4 throw an error regarding "contains_unichar_id" #1205
Comments
A couple thousand other images went through fine; only these images triggered the error. |
Any solution to that? Having similar issues. |
@sallyhill , @psinger
? |
I just tried it, and it works with both options. Any idea what's going on? |
Seems like a bug in combining the two OCR engines. |
Any way to track this down further? |
You can use GDB to see the function call chain. Frankly, I only use --oem 1 (or 3 with best/fast traineddata), so I'm not so motivated to invest time on this issue. Sorry. |
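Short of a debugger session, one way to narrow this down is to run the failing image under each engine mode and note which ones hit the assertion. Below is a minimal Python sketch of that idea, assuming pytesseract and Pillow are installed. `probe_engine_modes` and `run_ocr` are hypothetical helper names for illustration, not part of Tesseract or pytesseract; only the `--oem` flag and `pytesseract.image_to_string` come from the real tools.

```python
from typing import Callable, Dict, Optional

# Engine modes accepted by the Tesseract 4 command line via --oem.
OEM_MODES = {
    0: "legacy engine only",
    1: "LSTM engine only",
    2: "legacy + LSTM engines",
    3: "default, based on what the traineddata provides",
}


def run_ocr(image_path: str, config: str) -> str:
    """Default runner (assumption: pytesseract and Pillow are available)."""
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, config=config)


def probe_engine_modes(
    image_path: str,
    runner: Callable[[str, str], str] = run_ocr,
) -> Dict[int, Optional[str]]:
    """Map each --oem value to None on success, or to the error text on failure."""
    results: Dict[int, Optional[str]] = {}
    for oem in OEM_MODES:
        try:
            runner(image_path, f"--oem {oem}")
            results[oem] = None
        except Exception as exc:
            # pytesseract reports a crashed tesseract process as TesseractError;
            # catching Exception keeps the sketch independent of that import.
            results[oem] = str(exc)
    return results
```

Calling `probe_engine_modes("failing.png")` returns a dict keyed by engine mode; an entry containing `contains_unichar_id` points at the mode that trips the assert. The `runner` parameter exists only so the loop can be exercised without a Tesseract installation.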
👍 |
I get the reported assertion with the second image (all other images work for me) and will have a look. |
New report #1601 |
@zdenop Please label as bug. |
Have you found a solution for this? My PDF contains both Arabic and English, and I'm facing the same issue: contains_unichar_id(unichar_id):Error:Assert failed:in file c:\projects\github\tesseract-ocr\src\ccutil\unicharset.h, line 511 |
Yeah, I made a patch for it that removes this assert. It's kind of OK-ish. |
Thanks, syzer. Where can I get the patch? Please share. |
Please see #1286. It has not been merged yet. If you try it, please provide feedback. |
Please publish a standard jar file so that we can explore it. Could you also guide me in creating a traineddata file? Thanks. |
Hi. |
I can reproduce this and since I haven't seen a stack trace for this yet I will post the one I have:
Looks like the unicode point being provided to get_isdigit is not a valid digit and hits the assertion. Not sure how and why we end up there though. |
Please check the version of traineddata file that you are using. Also try with traineddata from tessdata_fast and tessdata_best. Do you get the same error? |
On Thu, Aug 09, 2018 at 07:34:40AM -0700, Shreeshrii wrote:
> Please check the version of traineddata file that you are using.
I used a roughly two-week-old version of the models in the tesseract-data
github repo.
> Also try with traineddata from tessdata_fast and tessdata_best. Do you
> get the same error?
Sadly I don't have access to the installation at the moment because I
am off work and will be going on holiday tomorrow. I will make a note
in my calendar to check this after I am back.
Cheers,
Silvan
|
The issue only occurs with models from |
That commit is 'Updated LSTM Models to integerized tessdata_best'. The earlier commit by Ray was on Nov 29, 2016. However, after that the format of traineddata files changed to include the recoder. If I remember correctly, those LSTM models do not work or produce accurate recognition results with current code. 2017-07-14 (dc8745e) Ray Smith: Move LSTM unicharset and recoder to traineddata with version string part1. Backwards compatible - maybe. |
I consider this to be one of the most important bugs, and I'd like to get it fixed for 4.0.0, even if it only occurs with models from https://github.com/tesseract-ocr/traineddata when both the old and new OCR engines are used (which is still the default). Several possible solutions exist:
|
Should be: |
It will be helpful if @jbreiden can check whether this error also happens with Google's version of tesseract. |
See discussion #1849 with some ideas for workaround solutions. |
@stweil, since we want to release 4.0.0 in the next 2-3 weeks and we still don't have a fix for this issue, I think we need to move to plan B (make a workaround). |
We don't. I found a fix today. See pull request #1954. |
Thanks! I assume it also solves the other similar reports, right? |
Yes, I assume so. @sallyhill, @psinger please test the new code. |
Unfortunately, this issue still persists with releases containing the above bugfix (4.0.0 on Arch Linux).
The bad news is that I cannot share the file causing it. |
Try using |
@ingwinlu, it would help to have a reproducible test case. Perhaps you can find a shareable image, or you can send me your image via e-mail. |
Your Tesseract version is very, very old. Use the latest code when reporting issues. |
you wrote:
|
Yes, that is the error I am getting. I could not find any instructions for installing Tesseract on Red Hat, so I used the instructions given on this blog: |
If you get that error, you are not using the latest code/version, and it is not a Tesseract issue. |
I uninstalled Tesseract and reinstalled it using the instructions given here: https://github.com/tesseract-ocr/tesseract/wiki |
I am getting the same error even when I use no config. Is this issue still closed? |
Please post tesseract version, which traineddata you used and the image giving error. |
Environment
Files affected:
Current Behavior:
Running image_to_string on the attached files throws a TesseractError that prints: (-6, 'contains_unichar_id(unichar_id):Error:Assert failed:in file ../ccutil/unicharset.h, line 513')
Expected Behavior:
No error.
Suggested Fix:
I am not sure. Right now I'm just running pytesseract.image_to_string in a try block
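For anyone else relying on the same stopgap, here is a minimal sketch of what the try block might look like for a batch run, so that one failing image does not abort the rest and the failing files are recorded for a later bug report. `ocr_all` is a hypothetical helper name; the OCR callable is passed in, and the assumption is that you would hand it `pytesseract.image_to_string`.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def ocr_all(
    paths: Iterable[str],
    ocr: Callable[[str], str],
) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Run `ocr` on every path; return (texts by path, list of (path, error))."""
    texts: Dict[str, str] = {}
    failures: List[Tuple[str, str]] = []
    for path in paths:
        try:
            texts[path] = ocr(path)
        except Exception as exc:
            # With pytesseract this is where TesseractError ends up when the
            # tesseract process dies on the contains_unichar_id assertion.
            failures.append((path, str(exc)))
    return texts, failures


# Usage sketch (assumes pytesseract is installed):
#   import pytesseract
#   texts, failures = ocr_all(image_paths, pytesseract.image_to_string)
#   for path, error in failures:
#       print(path, error)
```

This only hides the crash; it does not fix the underlying assert. Keeping the list of failing paths makes it easier to attach reproducible test images to the issue.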