Tesseract Empty Page #3021
Comments
The Link from send.firefox.com is active for 1 day. Afterwards it will disappear. |
Did you try to follow the documentation? |
@zdenop We're scanning from microfilm using the QuantumScan software and preprocess with QuantumProcess, which does a good job with contrast and deskewing. So I wonder why 999 images pass but a few, like this one, don't. Could you take a closer look at the image? What additional preprocessing would you suggest, if any? |
@M3ssman, the link is already inactive. Please attach a sample image to the issue report here. |
@zdenop Yes, many thanks, this works! |
@stweil Strange indeed. |
@zdenop Since the original image itself looks dark somehow, I called ImageMagick 6.9 to enhance contrast: |
I expect that the size of 0046-convert.tif is smaller. Right? |
@zdenop This is what exiftool outputs: |
@zdenop Did you also run Tesseract with the original file (without cropping or other types of preprocessing)? What was the outcome? |
Original finished quickly with "empty page" message. |
The original page triggers bugs which can be shown by adding
I think that the right solution would be to find out why Tesseract creates bad bounding boxes and fix that. |
@stweil Many Thanks! |
Please attach the image to this issue. |
The image is rather large, too large to be attached. It's available here: https://ub-backup.bib.uni-mannheim.de/~stweil/tesseract/issues/3021/0046.png. |
The bounding boxes with illegal coordinates come from rotation:
In this case The current code rotates top right and bottom left around the fixed point (0,0). Maybe this should be changed to use the top left corner as the fixed point. For small coordinates that does not make a large difference, but here it is essential. |
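Why the choice of fixed point matters can be illustrated with a few lines of plain Python. This is only a geometric sketch, not Tesseract's code: the function names, the 1 degree angle and the sample coordinates are assumptions chosen for illustration. It shows that a point far from the fixed point is displaced much more by the same small rotation than a point close to it.

```python
import math

def rotate_about(point, pivot, angle_rad):
    """Rotate `point` around `pivot` by `angle_rad` (counter-clockwise)."""
    px, py = point
    cx, cy = pivot
    dx, dy = px - cx, py - cy
    cos_a, sin_a = math.cos(angle_rad), math.sin(angle_rad)
    return (cx + dx * cos_a - dy * sin_a, cy + dx * sin_a + dy * cos_a)

def displacement(point, pivot, angle_rad):
    """Distance (in pixels) that the rotation moves `point`."""
    rx, ry = rotate_about(point, pivot, angle_rad)
    return math.hypot(rx - point[0], ry - point[1])

angle = math.radians(1.0)   # hypothetical small skew angle
origin = (0.0, 0.0)
near = (100, 100)           # corner close to the fixed point
far = (9000, 12000)         # hypothetical corner on a very large page

print("near corner moves by", displacement(near, origin, angle))
print("far corner moves by", displacement(far, origin, angle))
# With the corner itself as the fixed point, it does not move at all.
print("far corner, own pivot:", displacement(far, far, angle))
```

The displacement grows linearly with the distance from the fixed point, which is consistent with the observation that the effect only becomes essential on large pages.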
Another command that eliminated the issue:
|
It's also sufficient to convert the image to JPEG. The basic issue remains of course and can also result in less obvious problems, for example missing text from smaller parts of a page only. I'd expect that typically in the lower left and right parts of large pages. |
I now tried a modified
|
@M3ssman, we also get "Empty page" errors in our newspaper, see example. https://github.com/stweil/tesseract/tree/fix contains a patch which seems to fix the problem. Maybe it also gets more text from other large images, but I am still not sure. For images with large width and height, old and new code can get different results. It would help if you (and others) could try the new code and compare the results with the unpatched Tesseract. If the new code never makes things worse, we could apply it. |
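One rough way to compare the text output of the patched and the unpatched build is to count which words disappear and which appear. The sketch below is only an assumption about how such a comparison could be done; `word_diff` and `is_regression` are hypothetical helper names, and a plain word count is a crude stand-in for a proper CER/WER evaluation against ground truth.

```python
from collections import Counter

def word_diff(old_text, new_text):
    """Return (lost, gained) word counters between two OCR text outputs."""
    old_words = Counter(old_text.split())
    new_words = Counter(new_text.split())
    lost = old_words - new_words    # words the old output had more often
    gained = new_words - old_words  # words the new output has more often
    return lost, gained

def is_regression(old_text, new_text):
    """True if the new output lost more words than it gained."""
    lost, gained = word_diff(old_text, new_text)
    return sum(lost.values()) > sum(gained.values())

# Hypothetical example texts, not real OCR results.
old = "Zeitung vom Montag"
new = "Zeitung vom Montag Abend Ausgabe"
lost, gained = word_diff(old, new)
print("lost:", dict(lost), "gained:", dict(gained))
print("regression:", is_regression(old, new))
```

Run over a larger test set, a check like this would flag the pages where the new code yields less text than the old code, which are the ones worth inspecting by hand.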
@stweil Sorry for the delay! I'll try to do some more testing as it affects a remarkable amount of images and report back real soon™. The fix + |
Another thing that will make it work is binarization. |
For one of the problematic images I got:
I will skip this for now and move on. With many other "Problem-Bilder" (problem images), patched Tesseract yields:
|
These error messages are produced by Leptonica. They are triggered by a call to pixClipBoxToForeground() https://github.com/tesseract-ocr/tesseract/search?q=pixClipBoxToForeground |
I've run some larger tests with the patch @stweil provided, with the following results: From 133 images
I run the 6 problematic pages once more (v4.1.1-rc2-25-g9707 from alex-p with I'm uncertain how to deal with this. @stweil @amitdo @zdenop I'm fine if you close this issue, but if you'd like to, I can provide more testdata. |
The "empty page" message means that Tesseract dropped all text boxes because the internal checks decided that they had coordinates which are out of bounds. This might only be the extreme variant of a general problem: maybe Tesseract also drops parts of other pages where it recognizes text, but not all. That's why it would be important to run OCR on a larger test set with |
@stweil |
@stweil Sorry for the delay! I'd like to bring this issue to a close.
To deal with 1), I would appreciate it if Tesseract wrote no output at all and/or printed a warning to stdout. Number 2 seems to be a really big issue that cannot be fully solved right now. Thanks to @stweil, @zdenop and @amitdo for all the investigation! All your inspections lead (IMHO) to the |
With the code from #3418, when Sauvola binarization is used, I don't get "Empty page!!". |
I just finished OCR with Tesseract 5.0.0 for a huge number of newspaper scans.
So using a different binarization helps in most cases, but not always. |
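A fallback over the available binarization methods could be scripted roughly as follows. This is a sketch under assumptions: it relies on the `thresholding_method` config variable of Tesseract 5 (to my knowledge 0 is the legacy Otsu, 1 Leptonica Otsu, 2 Sauvola), it assumes the "Empty page" message ends up on stderr, and the function names are hypothetical.

```python
import subprocess

EMPTY_PAGE_MARKER = "Empty page"

def run_tesseract(image, outbase, method):
    """Run the tesseract CLI with one thresholding method and return its stderr."""
    cmd = ["tesseract", image, outbase, "-c", f"thresholding_method={method}"]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stderr

def first_working_method(image, outbase, methods=(0, 1, 2), runner=run_tesseract):
    """Return the first method that does not report an empty page, else None."""
    for method in methods:
        stderr = runner(image, outbase, method)
        if EMPTY_PAGE_MARKER not in stderr:
            return method
    return None  # every method reported an empty page
```

Calling `first_working_method("page.tif", "page")` (hypothetical file names) would try the methods in order. As noted above, there are images for which no method helps, so the `None` case has to be handled.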
Try to convert the jp2 to png. It does not fail for me with your example and method 2. |
Thank you, that's interesting. I can reproduce it, and it seems to be related to the image resolution: The original JP2 image has 300 dpi and fails:
Converting the JP2 to PNG with
Processing the original JP2 with an explicit resolution works, too:
|
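For batch runs, passing an explicit resolution can be wrapped in a small helper. The sketch below only builds the command line. `--dpi` is an existing option of the Tesseract command line tool, but the helper name, the file names and the resolution value in the example are assumptions and not the values used in the comment above.

```python
def tesseract_command(image, outbase, dpi=None, lang=None):
    """Build a tesseract command line, optionally with explicit resolution."""
    cmd = ["tesseract", image, outbase]
    if lang:
        cmd += ["-l", lang]
    if dpi is not None:
        cmd += ["--dpi", str(dpi)]  # overrides the resolution from the image
    return cmd

# Hypothetical file names and resolution value.
print(tesseract_command("page.jp2", "page", dpi=150))
# To execute it: subprocess.run(tesseract_command(...), check=True)
```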
Is it a JPX with mask layer like this https://archive.org/details/bub_gb_qmZyOar8UHwC/page/n71/mode/2up ? Then try the mask and negate. CER 14.23 % is not so bad for the quality of the scan. |
Where did you get CER 14.23 %? |
@stweil, GIMP reports '72 ppi' for your jp2, but as you said Tesseract sees it as 300 ppi. IIRC, when GIMP does not find the ppi in the image metadata, it is reported as 72 ppi. |
Good question;-) On logical page 47 of Galileo's book. My comment was meant as: If your jp2 has a mask layer, as jp2 allows many kinds of compressions, then try the mask layer. The book exists on archive.org in two versions, scanned from two different specimens in different bad conditions:
If I recorded correctly (should write a script for permutations and recording them):
|
Obviously GIMP ignores the EXIF metadata. GIMP has a menu entry which shows the metadata and also the EXIF part with x/y resolutions of 300 and the resolution unit "inch". |
AFAIK EXIF is the wrong place to specify ppi. |
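For PNG files the resolution lives in the file format itself, in the `pHYs` chunk, which stores pixels per unit for x and y plus a unit flag (1 means metre). The sketch below reads that chunk with the standard library only. It says nothing about JP2, and whether a given tool prefers this field over EXIF is an assumption that has to be checked per tool. `png_dpi` is a hypothetical helper name.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def png_dpi(data):
    """Return (x_dpi, y_dpi) from a PNG's pHYs chunk, or None if unavailable."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        # Each chunk: 4-byte length, 4-byte type, data, 4-byte CRC.
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        if ctype == b"pHYs" and length == 9:
            x_ppu, y_ppu, unit = struct.unpack(">IIB", body)
            if unit != 1:        # 0 = unit unknown, only the aspect ratio is given
                return None
            return (x_ppu * 0.0254, y_ppu * 0.0254)  # pixels/metre -> pixels/inch
        if ctype in (b"IDAT", b"IEND"):
            break                # pHYs has to appear before the image data
        pos += 12 + length
    return None
```

Usage would be `png_dpi(open("page.png", "rb").read())` with a hypothetical file name. The CRC is not verified here, so this is a quick inspection aid and not a validating parser.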
Try this code, @M3ssman: im = Image.open(r"C:\Users\user\Documents\Lightshot\stry5.png"). It finds blobs for all characters.
Environment
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found FMA
Found SSE
Found libarchive 3.2.2 zlib/1.2.11 liblzma/5.2.2 bz2lib/1.0.6 liblz4/1.7.1
frk, Fraktur (from tessdata_best), gt4hist_5000k (gt4hist model with 5000k iterations)
Current Behavior:
When using rather large uncompressed TIF files (ca. 80 MB) from the project "Digitalisierung historischer deutscher Zeitschriften", about 5 pages (or even fewer) out of 1000 images yield ALTO files without valid OCR data.
When run with
tesseract 0046.tif 0046 -l frk alto
it only reports "Empty page!!" and exits in < 20 seconds.
Attachments: 0046-alto.zip, 0046-tif.zip
Generated ALTO file and TIF image included.
Expected Behavior:
Produce ALTO-XML with contents.
Suggested Fix:
No idea.