Tesseract seemingly stuck #3377
fyi - this is a 179 MB image
Right, sorry for not mentioning that. I could share the original JPEG2000 image if that is preferred. We process a lot of images at this size (probably 100,000 at this point), and very few fail this way. (At least this one, potentially two more.)
It's hard to say where it's stuck or spends most of the time. Probably this could be profiled. Maybe it is just the big image.
Related issue: #3369. Tesseract shows that behaviour for images where it "detects" a huge number of boxes. Some parts of the layout detection seem to require time which increases with the square of that number. The critical code finds and inserts into an unordered set. We sometimes observe images which need more than an hour, too. Maybe the image here is a similar case. I'll run a test to see whether the OCR terminates.
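For illustration only (this is not Tesseract's actual code): a pattern of the shape described above, where results of pairwise box comparisons are looked up and inserted into a `std::unordered_set`, performs O(n²) find/insert operations for n boxes, so a page that "detects" hundreds of thousands of spurious boxes can dominate the runtime.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical box type, standing in for whatever the layout analysis uses.
struct Box { int x, y, w, h; };

// Illustrative only: every box is compared against every other box, and the
// result key is looked up and inserted into an unordered set. The number of
// find/insert operations grows with the square of the number of boxes.
std::size_t CountPairs(const std::vector<Box>& boxes) {
  std::unordered_set<std::uint64_t> seen;
  for (std::size_t i = 0; i < boxes.size(); ++i) {
    for (std::size_t j = i + 1; j < boxes.size(); ++j) {
      const std::uint64_t key = (static_cast<std::uint64_t>(i) << 32) | j;
      if (seen.find(key) == seen.end()) {
        seen.insert(key);
      }
    }
  }
  return seen.size();
}
```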
For this specific image, I believe I've let it run for about a day. There are a few images that precede this one, but they usually take 1.5 minutes, so the rest of the ~24 hours is for this one image. I believe the reason it dies is memory exhaustion - but that is a guess. Note that this run was not done with latest master, but using the 20201231 snapshot with one additional hOCR patch added.
I am not sure if it is helpful, but I could surface the other images that have similar problems.
You can use those to test a fix (as soon as we have one), but I don't need more images for this issue. My first test was killed by the Linux kernel after 75 minutes because Tesseract's memory usage increased continuously to more than 6 GiB (I had no swap space provided, and running three similar processes was simply too much for 16 GiB RAM). So the image here not only consumes much time (I still think OCR will finish eventually) but also much memory. Maybe in your case the OCR was also stopped because of out-of-memory. A 2nd test was running for 5 hours before it was again killed, using about 10 GB RAM:
I already tried that, and it does not change the performance. A simplified custom hash function (without the division) also had no effect on the performance. I also tried using a sorted set instead of the unordered one. That slightly increased the execution time.
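For reference, the kind of experiment described above looks roughly like this; the key type and hash below are purely illustrative stand-ins, not the real code:

```cpp
#include <cstddef>
#include <functional>
#include <set>
#include <unordered_set>

// Hypothetical key type, standing in for whatever the layout code stores.
struct Pt {
  int x, y;
  bool operator==(const Pt& o) const { return x == o.x && y == o.y; }
  bool operator<(const Pt& o) const { return x < o.x || (x == o.x && y < o.y); }
};

// Simplified custom hash: combine the fields without any extra division or
// modulo of our own (the container still maps the value to a bucket).
struct PtHash {
  std::size_t operator()(const Pt& p) const {
    return std::hash<int>()(p.x) ^ (std::hash<int>()(p.y) << 1);
  }
};

std::unordered_set<Pt, PtHash> unordered_points;  // custom-hash experiment
std::set<Pt> ordered_points;                      // sorted-set experiment
```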
This is from 2008.
The Tesseract OCR terminates after running several days and using 16 GB or more RAM, with a surprising result:
See also issue #3021, which reports full newspaper pages where Tesseract does not detect any text.
It says that using std::list instead of the (intrusive) C lists will result in much slower code. Ignore the part that rules out any use of the STL, which is outdated.
The Tesseract lists (
The first thing is to replace those list macros with templates.
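To make that suggestion concrete, here is a minimal sketch of what a templated intrusive list could look like in place of the per-class list macros; the names are illustrative, not an actual patch against Tesseract:

```cpp
#include <cassert>

// Illustrative intrusive singly linked list: the link lives inside the
// element, as in Tesseract's CLIST/ELIST design, but the per-class
// boilerplate that the macros generate is expressed once as a template.
template <typename T>
class IntrusiveList {
 public:
  struct Link {
    T* next = nullptr;
  };

  void push_front(T* elem) {
    elem->link.next = head_;
    head_ = elem;
  }
  T* front() const { return head_; }

 private:
  T* head_ = nullptr;
};

// Example element type; it only needs to embed the link, no macro expansion.
struct Blob {
  IntrusiveList<Blob>::Link link;
  int id = 0;
};

int main() {
  IntrusiveList<Blob> blobs;
  Blob a;
  a.id = 1;
  blobs.push_front(&a);
  assert(blobs.front()->id == 1);
  return 0;
}
```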
I found that using Sauvola thresholding solves the problem for this image - it's possible that the Otsu thresholding just makes such a mess of the image that the segmenter has tremendous trouble interpreting it. You can find the thresholded image here: https://archive.org/~merlijn/tesseract-images/sim_new-york-times_1900-01-11_49_15-603_0008_thresholded.png (1.6 MB) The runtime on my machine (Tesseract 4, stable) was just under four minutes:
I've ported a low-memory and fast Sauvola thresholding algorithm from this paper: https://arxiv.org/pdf/1905.13038.pdf and will start looking into making it possible for Tesseract to use that thresholding instead (per #3083). So perhaps once selectable binarisation is in place, this issue can be resolved.
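For context, the classic Sauvola rule thresholds each pixel at T = m·(1 + k·(s/R − 1)), where m and s are the local window mean and standard deviation. The sketch below implements that formula naively for an 8-bit grayscale buffer; it only shows the formula, not the low-memory computation of m and s that the paper above is about, and the window size and k value are just common defaults.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Naive Sauvola binarization sketch, O(w*h*window^2), illustration only.
// A pixel is foreground (0) if its value is below T = m * (1 + k*(s/R - 1)).
std::vector<std::uint8_t> SauvolaBinarize(const std::vector<std::uint8_t>& gray,
                                          int w, int h, int window = 25,
                                          double k = 0.35, double R = 128.0) {
  std::vector<std::uint8_t> out(gray.size(), 255);
  const int r = window / 2;
  for (int y = 0; y < h; ++y) {
    for (int x = 0; x < w; ++x) {
      double sum = 0.0, sum2 = 0.0;
      int n = 0;
      for (int dy = -r; dy <= r; ++dy) {
        for (int dx = -r; dx <= r; ++dx) {
          const int yy = std::clamp(y + dy, 0, h - 1);
          const int xx = std::clamp(x + dx, 0, w - 1);
          const double v = gray[yy * w + xx];
          sum += v;
          sum2 += v * v;
          ++n;
        }
      }
      const double mean = sum / n;
      const double stddev = std::sqrt(std::max(0.0, sum2 / n - mean * mean));
      const double threshold = mean * (1.0 + k * (stddev / R - 1.0));
      out[y * w + x] = gray[y * w + x] < threshold ? 0 : 255;
    }
  }
  return out;
}
```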
Did you try pixSauvolaBinarize from Leptonica?
Yes, I have experimented with that method too, but the binarisation step uses more RAM (3.3 GB vs 660 MB). Tesseract finished in about 5-6 minutes using the Leptonica Sauvola binarised image -- depending on the Sauvola parameters, of course.
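For anyone who wants to reproduce the Leptonica route, a minimal sketch along these lines should work (check your Leptonica version's headers for the exact pixSauvolaBinarize signature; the window half-width and k factor below are only example values):

```cpp
#include <leptonica/allheaders.h>

// Sketch: binarise an image with Leptonica's Sauvola implementation and
// write the result to a PNG that can then be fed to Tesseract.
int main(int argc, char** argv) {
  if (argc < 3) return 1;
  PIX* pixs = pixRead(argv[1]);
  if (!pixs) return 1;
  PIX* pixg = pixConvertTo8(pixs, 0);  // Sauvola needs 8 bpp grayscale
  PIX* pixb = nullptr;
  // Example parameters: window half-width 25, k = 0.35, add a border to
  // reduce artifacts at the image edges.
  pixSauvolaBinarize(pixg, 25, 0.35f, 1, nullptr, nullptr, nullptr, &pixb);
  if (pixb) pixWrite(argv[2], pixb, IFF_PNG);
  pixDestroy(&pixb);
  pixDestroy(&pixg);
  pixDestroy(&pixs);
  return 0;
}
```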
To be clear, my experiments are running Tesseract on an already binarised image (either made using the code I mentioned above, or using the Leptonica Sauvola binarisation). I know that is not ultimately how people should run Tesseract (for OCR quality reasons), but for the purpose of testing whether it fixes this bug, it was easier. I suspect that adding alternative binarisation to Tesseract (e.g. the Leptonica binarisation, or the one I wrote based on the paper) will also solve this problem on a non-binarised version of this image.
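The same "pre-binarised" experiment can also be reproduced through the C++ API by handing the binarised image straight to Tesseract; a sketch, assuming a default tessdata installation and the English model:

```cpp
#include <cstdio>

#include <leptonica/allheaders.h>
#include <tesseract/baseapi.h>

// Sketch: OCR an already-binarised image via the Tesseract API.
int main(int argc, char** argv) {
  if (argc < 2) return 1;
  PIX* pixb = pixRead(argv[1]);  // e.g. the Sauvola-binarised PNG from above
  if (!pixb) return 1;

  tesseract::TessBaseAPI api;
  if (api.Init(nullptr, "eng") != 0) return 1;  // default tessdata path
  api.SetImage(pixb);
  char* text = api.GetUTF8Text();
  if (text) {
    fputs(text, stdout);
    delete[] text;
  }
  api.End();
  pixDestroy(&pixb);
  return 0;
}
```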
IMO this is exactly how Tesseract should be run. The problem is that most users want to OCR colourful images and do not care about binarization, so Tesseract provides Otsu, which should work in most cases...
Understood, thanks. I remember reading (I don't know where) that the LSTM engine would potentially work better on grayscale images than on binarised images. I'll look into adding Sauvola binarisation using Leptonica's method to Tesseract, and then see if that opens up ways to add other binarisation methods.
Leptonica has other binarization methods. http://www.cvc.uab.es/icdar2009/papers/3725b375.pdf ICDAR 2009 Document Image Binarization Contest (DIBCO 2009)
33c - 7th place, 33b - 11th place
Cool - seems worth checking out when working on adding Sauvola. I went with Sauvola after experimenting with (and evaluating) all the thresholding algorithms present in scikit-image (https://scikit-image.org/docs/dev/api/skimage.filters.html), in particular because of this note (and the paper): "This algorithm is originally designed for text recognition." I didn't evaluate the methods for the purpose of OCRing, though, but rather for the purpose of creating masks of the text (and lines in photos/images) for MRC compression.
More methods with open source implementations: https://github.com/ocropus/ocropy/blob/master/ocropus-gpageseg
Gamera (a Python framework for building document analysis applications) also has a bunch of binarization implementations. ImageJ (a Java image processing program designed for scientific multidimensional images) has an Auto Threshold plugin with several other methods. Both projects use the GPL-3 licence, so we cannot copy&paste.
With the code from #3418, the processing ends after 4:30 minutes when Sauvola binarization is used. The output looks good. Note that the image size is equivalent to 7 A4 pages, so the processing time is 38 seconds per page. With adaptive Otsu I get 'Empty page!' after 36 seconds.
The legacy […] We need to limit the maximum image size in pixels (to […]).
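The proposed limit itself is cut off above, but a guard of that sort could be as simple as the following sketch; the threshold value here is a placeholder, not the number from the comment:

```cpp
#include <cstdio>

#include <leptonica/allheaders.h>

// Placeholder limit (100 megapixels); not the value proposed above.
constexpr long long kMaxPixels = 100LL * 1000 * 1000;

// Sketch: refuse to process images above a pixel-count limit instead of
// letting layout analysis run for hours on pathological inputs.
bool ImageWithinLimit(PIX* pix) {
  const long long pixels =
      static_cast<long long>(pixGetWidth(pix)) * pixGetHeight(pix);
  if (pixels > kMaxPixels) {
    fprintf(stderr, "Image too large: %lld pixels (limit %lld)\n", pixels,
            kMaxPixels);
    return false;
  }
  return true;
}
```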
Here is another image which absolutely wrecks Tesseract:
Environment
Tesseract version: master
Commit: 23ed59bd7bca777e4e104c4ee540843373aa9869
Platform: Linux gentoo-x13 5.11.7-gentoo-dist #1 SMP Wed Mar 17 21:03:41 -00 2021 x86_64 AMD Ryzen 7 PRO 4750U with Radeon Graphics AuthenticAMD GNU/Linux
Current Behavior:
Tesseract hangs, seemingly never finishes
Expected Behavior:
Tesseract doesn't hang and produces output normally
GDB backtrace (interrupted after more than 5 minutes):
Image: https://archive.org/~merlijn/tesseract-images/sim_new-york-times_1900-01-11_49_15-603_0008.ppm