Tesseract Empty Page #3021

M3ssman · 2020-06-16T08:35:59Z

Environment

Tesseract Version: tesseract 4.1.1-rc2-21-gf4ef
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found FMA
Found SSE
Found libarchive 3.2.2 zlib/1.2.11 liblzma/5.2.2 bz2lib/1.0.6 liblz4/1.7.1
Platform: Ubuntu 18.04 LTS
Model Configs tested: frk, Fraktur (from tessdata_best), gt4hist_5000k (gt4hist-Model with 5000k Iterations)

Current Behavior:

When processing rather large uncompressed TIF files (ca. 80 MB) from the project "Digitalisierung historischer deutscher Zeitschriften", about 5 pages (or even fewer) out of 1000 images yield ALTO files without valid OCR data.

When run with tesseract 0046.tif 0046 -l frk alto, it only reports Empty page!! and exits in < 20 seconds.
0046-alto.zip
0046-tif.zip

Generated ALTO-File and TIF-Image included.

Expected Behavior:

Produce ALTO-XML with contents.

Suggested Fix:

No idea.


M3ssman · 2020-06-16T08:45:41Z

The Link from send.firefox.com is active for 1 day. Afterwards it will disappear.

zdenop · 2020-06-16T09:52:27Z

Did you try to follow the documentation?

M3ssman · 2020-06-16T11:28:49Z

@zdenop We're scanning from microfilms using QuantumScan software and do the preprocessing with QuantumProcess, which does a good job with contrast and deskewing. Therefore I wonder why 999 images pass, but a few don't, like this one. Have you already been able to take a closer look at the image? What additional preprocessing would you suggest, if any?

stweil · 2020-06-16T11:45:42Z

@M3ssman, the link is already inactive. Please attach a sample image to the issue report here.

zdenop · 2020-06-16T11:47:22Z

I just made a quick test: when I removed the black border and the header, Tesseract produced a result (tested just with English, as I do not have frk installed at the moment).
So pre-processing of such images is a must, or you need to implement custom layout detection.
For the sake of small size, this is a thumbnail of my test image:

M3ssman · 2020-06-16T12:39:15Z

@zdenop Yes, many thanks, this works!
But anyway, I wonder why all the other images went fine. They were all produced the same way: never cropped, just plain TIF files with some tuning for contrast and rotation. Tesseract shouldn't be bothered by borders, as it (almost) never is.

M3ssman · 2020-06-16T12:44:19Z

@stweil Guess what! The original image works, too, if I scale it down 1:4!

Sorry, I didn't realize that the link also expires after a single download.

~~I'm uploading again right now.~~

Sorry for the delay, but I ran into trouble with uploads from home. Now, from the office, it is way faster:
0046.zip

M3ssman · 2020-06-17T08:04:15Z

@stweil Strange indeed.
I cropped only the header of the page and left the left, right and bottom margins as they are, and this version works fine.
0046-headless.zip

M3ssman · 2020-06-17T08:21:01Z

@zdenop Since the original image itself looks dark somehow, I called ImageMagick 6.9 to enhance contrast: convert 0046.tif -brightness-contrast 25x50 -compress none -colorspace Gray 0046-convert.tif
With this version, without any treatment of the borders, Tesseract produces not an empty page but quite reasonable OCR output.

zdenop · 2020-06-17T09:13:27Z

I expect that the size of 0046-convert.tif is lower. Right?

M3ssman · 2020-06-17T10:30:24Z

@zdenop This is what exiftool outputs:
original: Megapixels 79.1, Image size 7477x10584, 79151624 Byte filesize
convert: Megapixels 79.1, Image size 7477x10584, 79151434 Byte filesize
So the file sizes differ only slightly.
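As a quick cross-check of these exiftool figures, the following sketch recomputes the pixel count and the bytes per pixel. The interpretation (about one byte per pixel, as one would expect from an uncompressed 8-bit single-channel image) is an assumption; the bit depth is not part of the output quoted above.

```python
# Figures copied from the exiftool output quoted above.
width, height = 7477, 10584
original_size = 79151624   # bytes
converted_size = 79151434  # bytes

pixels = width * height
print("megapixels: %.1f" % (pixels / 1e6))
print("bytes per pixel (original):  %.4f" % (original_size / pixels))
print("bytes per pixel (converted): %.4f" % (converted_size / pixels))
print("size difference in bytes:", original_size - converted_size)
```

Both files carry essentially the same amount of pixel data, so the different Tesseract behavior is unlikely to be a matter of file size alone.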

M3ssman · 2020-06-17T10:33:33Z

@zdenop Did you also run Tesseract with the original file (without cropping or other types of preprocessing)? What was the outcome?
0046-convert.zip

zdenop · 2020-06-17T10:43:38Z

Original finished quickly with "empty page" message.

stweil · 2020-06-17T20:33:03Z

The original page triggers bugs which can be shown by adding -c textord_debug_bugs=1. Tesseract creates boxes (bounding_box_) with a right margin which exceeds the image dimensions (error message Made partition with bad right coord). Those boxes are therefore disregarded. With the following hack the boxes are processed, and text is recognized:

diff --git a/src/textord/colpartition.cpp b/src/textord/colpartition.cpp
index 74f1b1d9..465a1f57 100644
--- a/src/textord/colpartition.cpp
+++ b/src/textord/colpartition.cpp
@@ -353,7 +353,7 @@ bool ColPartition::IsLegal() {
       tprintf("Margins invalid\n");
       Print();
     }
-    return false;  // Margins invalid.
+//    return false;  // Margins invalid.
   }
   if (left_key_ > BoxLeftKey() || right_key_ < BoxRightKey()) {
     if (textord_debug_bugs) {

I think that the right solution would have to find out why Tesseract creates bad bounding boxes and fix that.
Maybe it would already help to enforce boxes with valid coordinates.
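The idea of enforcing valid coordinates could look roughly like the sketch below. It is written in Python for illustration only and is not Tesseract code: `Box` and `clamp_to_image` are hypothetical names, and the box values are merely illustrative of a right edge that lies beyond the image width.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box; a hypothetical stand-in for Tesseract's TBOX."""
    left: int
    bottom: int
    right: int
    top: int


def clamp_to_image(box: Box, width: int, height: int) -> Box:
    """Clip a box to [0, width] x [0, height] instead of discarding it."""
    def clip(value: int, upper: int) -> int:
        return max(0, min(value, upper))

    return Box(
        left=clip(box.left, width),
        bottom=clip(box.bottom, height),
        right=clip(box.right, width),
        top=clip(box.top, height),
    )


# Illustrative values: the right edge exceeds a 7477-pixel-wide image.
bad = Box(left=43, bottom=9671, right=7523, top=10551)
fixed = clamp_to_image(bad, width=7477, height=10584)
print(fixed)
```

Clamping only hides the symptom; it does not explain why the coordinates were out of range in the first place.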

M3ssman · 2020-06-18T07:00:54Z

@stweil Many Thanks!
So far I've already detected 200+ scans that Tesseract considers empty.
Therefore I'll try your suggestion in our ULB fork and report back, hopefully next week!

amitdo · 2020-06-23T00:18:28Z

Please attach the image to this issue.

https://help.github.com/en/github/managing-your-work-on-github/file-attachments-on-issues-and-pull-requests

stweil · 2020-06-23T05:26:09Z

The image is rather large, too large to be attached. It's available here: https://ub-backup.bib.uni-mannheim.de/~stweil/tesseract/issues/3021/0046.png.

stweil · 2020-06-23T07:08:12Z

The bounding boxes with illegal coordinates come from rotation:

(gdb)
#1  0x00000000006311f7 in TBOX::rotate (this=0x60e0006ba830, vec=...) at ../../../src/ccstruct/rect.h:206
206	      top_right.rotate (vec);
(gdb) p vec
$22 = (const FCOORD &) @0x7fffffff8ea0: {xcoord = 0.999990165, ycoord = -0.00443068426}
(gdb) p top_right 
$23 = {xcoord = 7523, ycoord = 10551}
(gdb) p bot_left
$24 = {xcoord = 43, ycoord = 9671}

In this case vec indicates that there is nearly no rotation at all, but because of the very large value of ycoord the function ICOORD::rotate calculates a new xcoord which is clearly outside of the image. It looks like ICOORD::rotate might be wrong and needs a better implementation.

The current code rotates top right and bottom left around the fixed point (0,0). Maybe this should be changed to the top left corner as the fixed point. For small coordinates that does not make a large difference, but here it is essential.
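The effect of the fixed point can be reproduced with the numbers from the gdb session. The sketch below assumes that the rotation is an ordinary 2D rotation with vec = (cos a, sin a); it works in floating point and does not model the integer rounding of the real ICOORD type. The `rotate` helper is a hypothetical name.

```python
def rotate(x, y, vx, vy, pivot=(0.0, 0.0)):
    """Rotate (x, y) by the unit vector (vx, vy) = (cos a, sin a) about pivot."""
    px, py = pivot
    dx, dy = x - px, y - py
    return (px + dx * vx - dy * vy, py + dy * vx + dx * vy)


# Values taken from the gdb session above.
vec = (0.999990165, -0.00443068426)
bot_left = (43, 9671)
top_right = (7523, 10551)
top_left = (bot_left[0], top_right[1])

about_origin = rotate(*top_right, *vec)
about_top_left = rotate(*top_right, *vec, pivot=top_left)

print("x after rotating about (0,0):     %.1f" % about_origin[0])
print("x after rotating about top left:  %.1f" % about_top_left[0])
```

With the origin as pivot, the large ycoord multiplied by the small sine component moves the right edge by several dozen pixels; with the top left corner as pivot, the right edge stays practically where it was.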

amitdo · 2020-06-23T17:54:54Z

Another command that eliminated the issue:

gm convert 3021.png -bordercolor Black -border 10x10 3021-borderb10.png

stweil · 2020-06-23T19:14:44Z

It's also sufficient to convert the image to JPEG. The basic issue remains of course and can also result in less obvious problems, for example missing text from smaller parts of a page only. I'd expect that typically in the lower left and right parts of large pages. -c textord_debug_bugs=1 should be the default until that problem is fixed.

stweil · 2020-06-23T21:15:13Z

I now tried a modified TBOX::rotate. This not only fixes the empty page problem, but also seems to increase the amount of text which is detected at all, so it would be worth trying it on other pages as well. The bad news is that the time for processing a page increases from 56 seconds to 219 seconds. Here is the code:

diff --git a/src/ccstruct/rect.h b/src/ccstruct/rect.h
index 58a867e9..e487c8c1 100644
--- a/src/ccstruct/rect.h
+++ b/src/ccstruct/rect.h
@@ -202,9 +202,13 @@ class DLLSYM TBOX  {  // bounding box
     // and top-right corners. Use rotate_large if you want to guarantee
     // that all content is contained within the rotated box.
     void rotate(const FCOORD& vec) {  // by vector
-      bot_left.rotate (vec);
+      ICOORD top_left(bot_left.x(), top_right.y());
+      bot_left -= top_left;
+      bot_left.rotate(vec);
+      bot_left += top_left;
+      top_right -= top_left;
       top_right.rotate (vec);
-      *this = TBOX (bot_left, top_right);
+      top_right += top_left;
     }
     // rotate_large constructs the containing bounding box of all 4
     // corners after rotating them. It therefore guarantees that all

stweil · 2020-07-02T13:22:27Z

@M3ssman, we also get "Empty page" errors in our newspaper, see example.

https://github.com/stweil/tesseract/tree/fix contains a patch which seems to fix the problem. Maybe it also gets more texts from other large images, but I am still not sure. For images with large width and height, old and new code can get different results. It would help if you (and others) could try the new code and compare the results with the unpatched Tesseract. If the new code never makes things worse, we could apply it.

M3ssman · 2020-07-06T08:21:04Z

@stweil Sorry for the delay!
I just took a quick shot at a single page and it did produce textlines, which is good per se, but forget about the quality. Tesseract is definitely not happy with this image.

I'll try to do some more testing as it affects a remarkable amount of images and report back real soon™.

The fix + textord_debug_bugs=1 produces quite a lot of output. I captured it in a file; maybe you can get some insights.
tesseract-5.0.0-image-1681877805_J_0112_0068.log

amitdo · 2020-07-06T19:47:21Z

Another thing that will make it work is binarization.

M3ssman · 2020-07-07T08:43:19Z

For one of the problematic images I got:

/data/ocr-staging/ocr/1667524704_J_0190/0655.tif => 1667524704_J_0190_0655 => /data/ocr-staging/ocr/empty-pages/1667524704_J_0190_0655
Tesseract Open Source OCR Engine v5.0.0-alpha-754-g0838 with Leptonica
Page 1
Detected 7102 diacritics
index >= 0 && index < line_count:Error:Assert failed:in file src/textord/makerow.cpp, line 802
./tesseract-empty-pages.sh: Zeile 34: 29848 Abgebrochen             (Speicherabzug geschrieben) ${TESS_BIN} "$tiff_path" "${outpath}" --dpi 470 -l frk alto

I will skip this by now and move on.

With many other problem images, the patched Tesseract yields:

Error in boxClipToRectangle: box outside rectangle
Error in pixScanForForeground: invalid box

amitdo · 2020-07-07T18:58:34Z

Error in boxClipToRectangle: box outside rectangle
Error in pixScanForForeground: invalid box

#427 #468 #1601

amitdo · 2020-07-07T19:01:40Z

These error messages are produced by Leptonica.

They are triggered by a call to pixClipBoxToForeground()

https://github.com/DanBloomberg/leptonica/blob/bbe289cf3f0fe368d5b9eac64df2ccd6e9b05c56/src/pix5.c#L1956

https://github.com/tesseract-ocr/tesseract/search?q=pixClipBoxToForeground

M3ssman · 2020-07-08T08:58:34Z

I've run some larger tests with the patch @stweil provided, with the following results:

From 133 images

6 images produce the mentioned assertion error (index >= 0 && index < line_count:Error:Assert failed:in file src/textord/makerow.cpp, line 802).
- 4 of them are rather complex announcement pages which don't contain German Fraktur letters; it's all Antiqua. But they were processed with the "frk" configuration.
- 1 page includes a train timetable (which takes up about 2/3 of the page). This table is rotated by 90° counter-clockwise.
- 1 page needs dewarping; it appears to have a heavily rippled surface.
In addition, 13 images still yield "empty page".

I ran the 6 problematic pages once more (v4.1.1-rc2-25-g9707 from alex-p with --dpi 470 -l frk alto).
This time I got no assertion errors, but 2 pages with textlines and still 4 empty pages.
After enhancing brightness (+50) and contrast (between +15 and +25), these 4 pages were also processed without errors.

I'm uncertain how to deal with this.
I don't think it's a good idea to silence warnings just to let bad material pass. Second, the assertion error occurs only in the patched version. This error seems to be really serious, since it even halts execution of my scripts.
For now I take the blame myself, given that Tesseract produces text with more advanced preprocessing. Our processes now watch out for those inglorious 844-byte files, but we didn't have this on our agenda before.

@stweil @amitdo @zdenop I'm fine if you close this issue, but if you'd like to, I can provide more testdata.

stweil · 2020-07-08T09:10:24Z

The "empty page" message means that Tesseract dropped all text boxes because the internal checks decided that they had coordinates which are out of bounds. This might only be the extreme variant of a general problem: maybe Tesseract also drops parts of other pages where it recognizes text, but not all.

That's why it would be important to run OCR on a larger test set with -c textord_debug_bugs=1 to see whether pages with OCR text also show error messages and whether these error messages correspond to missing text boxes on such pages.
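One way to evaluate such a test run is to count the relevant debug messages per page. The sketch below only parses log text; the message fragments are the ones quoted in this thread, while `summarize_debug_log` and `needs_review` are hypothetical helper names, and the sample log is made up for illustration.

```python
import re
from collections import Counter

# Message fragments as they appear in the logs quoted in this thread.
PATTERNS = {
    "bad_partition": re.compile(r"Made partition with bad (left|right) coord"),
    "margins_invalid": re.compile(r"Margins invalid"),
    "empty_page": re.compile(r"Empty page"),
}


def summarize_debug_log(log_text: str) -> Counter:
    """Count how often each known message starts a line of the log."""
    counts = Counter()
    for line in log_text.splitlines():
        stripped = line.strip()
        for name, pattern in PATTERNS.items():
            if pattern.match(stripped):
                counts[name] += 1
    return counts


def needs_review(counts: Counter) -> bool:
    """True for pages that produced text but also reported bad partitions."""
    return counts["empty_page"] == 0 and counts["bad_partition"] > 0


# Made-up sample log for illustration.
sample = """Made partition with bad right coords, 556 < 577
Some recognized text line
"""
counts = summarize_debug_log(sample)
print(dict(counts), "review:", needs_review(counts))
```

Pages flagged by `needs_review` are the candidates where text boxes may have been dropped silently even though some OCR text was produced.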

M3ssman · 2020-07-09T09:36:15Z

@stweil
I will run the patch on the test set of 130+ images and report back early next week.

M3ssman · 2020-12-21T14:03:26Z

@stweil Sorry for the delay!
Is the patched code in the master branch already?

I'd like to put this issue to an end.
By now (IMHO) there are 2 different problems that we're facing here:

1) Tesseract produces ALTO files missing page content.
   This is a problem for Tesseract users / apps that utilize Tesseract's output.
2) Tesseract makes wrong decisions about data validity. This may also affect the general detection algorithm.
   This is a problem both for OCR engineers and for any downstream users or applications.

To deal with 1), I would appreciate it if Tesseract wrote no output at all and/or printed a warning to stdout.
If these options are not worth the additional effort, please let me know. For now I'm checking the size of the ALTO XML. It works, but it feels like treating symptoms.
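Instead of relying on the file size, a downstream check could look for actual text content in the ALTO file. The following is a minimal sketch, assuming that recognized words are stored as String elements with a CONTENT attribute, as the ALTO schema defines; `alto_has_text` is a hypothetical helper, and the two sample documents are hand-written, not real Tesseract output.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip an XML namespace prefix such as '{http://...}String'."""
    return tag.rsplit("}", 1)[-1]


def alto_has_text(data: bytes) -> bool:
    """Return True if any String element carries non-blank CONTENT."""
    root = ET.fromstring(data)
    for elem in root.iter():
        if local_name(elem.tag) == "String" and elem.get("CONTENT", "").strip():
            return True
    return False


# Hand-written samples for illustration only.
NS = b'xmlns="http://www.loc.gov/standards/alto/ns-v3#"'
EMPTY = b"<alto " + NS + b"><Layout><Page><PrintSpace/></Page></Layout></alto>"
WITH_TEXT = (
    b"<alto " + NS + b"><Layout><Page><PrintSpace><TextBlock><TextLine>"
    b'<String CONTENT="Zeitung"/>'
    b"</TextLine></TextBlock></PrintSpace></Page></Layout></alto>"
)

print(alto_has_text(EMPTY), alto_has_text(WITH_TEXT))
```

Matching on the local element name keeps the check independent of the ALTO namespace version used in the file.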

Number 2 seems to be a really big issue that cannot be solved completely right now.

Thanks to @stweil, @zdenop and @amitdo for all the investigations! All your inspections lead (IMHO) to the
category data error, since tuning the image data (binarize, despeckle, etc.) improves Tesseract's analysis in all cases.
Therefore I consider this behavior not an intrinsic problem of Tesseract; it's the data.

amitdo · 2021-05-14T17:22:44Z

With the code from #3418, when Sauvola binarization is used, I don't get "Empty page!!".

stweil · 2021-12-06T11:21:55Z

I just finished OCR with Tesseract 5.0.0 for a huge number of newspaper scans.

6587 scans of 371629 finished with "Empty page" when Tesseract used the default binarization.
105 scans of those 6587 still were "Empty page" when Tesseract was used with -c thresholding_method=2.
6 scans of those 105 still were "Empty page" when Tesseract was used with -c thresholding_method=1 (example).
The remaining 6 scans could be processed with patched Tesseract code (https://github.com/stweil/tesseract/tree/fix).

So using a different binarization helps in most cases, but not always.
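The sequence of retries described above could be automated roughly as follows. This is a sketch under assumptions: the `-c thresholding_method=...` arguments are the ones named in this comment, the "Empty page" check is a plain substring test on the captured output, and `first_non_empty`, `build_command` and `run_tesseract` are hypothetical helper names. The patched build is not covered.

```python
import subprocess
from typing import Callable, List, Optional, Sequence

# Extra arguments per attempt, in the order described above.
ATTEMPTS: List[List[str]] = [
    [],                               # default binarization
    ["-c", "thresholding_method=2"],
    ["-c", "thresholding_method=1"],
]


def build_command(image: str, outbase: str, extra: Sequence[str]) -> List[str]:
    """Build a tesseract call that writes ALTO output."""
    return ["tesseract", image, outbase, *extra, "alto"]


def run_tesseract(cmd: List[str]) -> str:
    """Run the command and return everything it printed."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.stdout + proc.stderr


def first_non_empty(
    image: str,
    outbase: str,
    run: Callable[[List[str]], str] = run_tesseract,
) -> Optional[List[str]]:
    """Return the extra args of the first attempt without 'Empty page'."""
    for extra in ATTEMPTS:
        if "Empty page" not in run(build_command(image, outbase, extra)):
            return extra
    return None  # every attempt reported an empty page
```

The `run` parameter makes it possible to exercise the retry logic with a stub instead of a real Tesseract installation. Note that each retry overwrites the output of the previous attempt because the same output base name is reused.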

amitdo · 2021-12-06T14:58:13Z

Try to convert the jp2 to png. It does not fail for me with your example and method 2.

stweil · 2021-12-06T15:49:37Z

Thank you, that's interesting. I can reproduce it, and it seems to be related to the image resolution:

The original JP2 image has 300 dpi and fails:

tesseract 0312.jp2 - -l ubma/frak2021-09 -c thresholding_method=2 -c textord_debug_bugs=1
Made partition with bad left coords, 0 > -8
ColPart: (M0-B-8-B-8/-8,5017/5017)->(3599B-3599B-3607M/3599,5071/5071) w-ok=0, v-ok=0, type=0R1, fc=-1, lc=-1, boxes=1 ts=0 bs=0 ls=0 rs=0
Margins invalid
ColPart:E(M0-B-8-B-8/-8,5017/5017)->(3599B-3599B-3607M/3599,5071/5071) w-ok=0, v-ok=0, type=0T1, fc=-1, lc=-1, boxes=0 ts=0 bs=0 ls=0 rs=0
Empty page!!
Made partition with bad left coords, 0 > -8
ColPart: (M0-B-8-B-8/-8,5017/5017)->(3599B-3599B-3607M/3599,5071/5071) w-ok=0, v-ok=0, type=0R1, fc=-1, lc=-1, boxes=1 ts=0 bs=0 ls=0 rs=0
Margins invalid
ColPart:E(M0-B-8-B-8/-8,5017/5017)->(3599B-3599B-3607M/3599,5071/5071) w-ok=0, v-ok=0, type=0T1, fc=-1, lc=-1, boxes=0 ts=0 bs=0 ls=0 rs=0
Empty page!!

Converting the JP2 to PNG with convert removes the resolution information.
Tesseract therefore guesses a resolution of 367 dpi and can process the scan:

tesseract 0312.png - -l ubma/frak2021-09 -c thresholding_method=2 -c textord_debug_bugs=1
Estimating resolution as 367
Made partition with bad right coords, 556 < 577
ColPart: (M184-T204-B207/388,4146/4152)->(577B-1252T-556M/398,4194/4179) w-ok=1, v-ok=1, type=1T4, fc=-1, lc=-1, boxes=24 ts=0 bs=0 ls=0 rs=0
[...]
genommen werden. In Beantwortung verſchledener Anfragen erklärte Amts, Wirklichen Geheimen Rats Der nburg über koloniale J güch die vornebme Aufgabe bringt, ſich des Deutſchen Reiches alt
[...]

Processing the original JP2 with an explicit resolution works, too:

tesseract 0312.jp2 - -l ubma/frak2021-09 -c thresholding_method=2 -c textord_debug_bugs=1 --dpi 400
Made partition with bad right coords, 1232 < 1243
[...]
Zenommen werden. In Beantwortung verſchledener Anfragen erklärte i ĩ ü i
[...]

wollmers · 2021-12-06T16:45:07Z

> Thank you, that's interesting. I can reproduce it, and it seems to be related to the image resolution:
>
> The original JP2 image has 300 dpi and fails:

Is it a JPX with mask layer like this https://archive.org/details/bub_gb_qmZyOar8UHwC/page/n71/mode/2up ?

Then try the mask

and negate.

CER 14.23 % is not so bad for the quality of the scan.

stweil · 2021-12-06T17:43:43Z

Where did you get CER 14.23 %?

amitdo · 2021-12-06T18:58:49Z

@stweil, GIMP reports '72 ppi' for your jp2, but as you said Tesseract sees it as 300 ppi. IIRC, when GIMP does not find the ppi in the image metadata, it is reported as 72 ppi.

wollmers · 2021-12-07T07:06:19Z

> Where did you get CER 14.23 %?

Good question ;-) On logical page 47 of Galileo's book.

My comment was meant as: If your jp2 has a mask layer, as jp2 allows many kinds of compressions, then try the mask layer.

The book exists on archive.org in two versions, scanned from two different specimens in different bad conditions:

bub_gb_7sFnWGI31XcC
bub_gb_qmZyOar8UHwC

$ pdfimages -f 69 -l 69 -list bub_gb_7sFnWGI31XcC.pdf 
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
  69     0 image     600   834  rgb     3   8  jpx    no       338  0   200   201 4818B 0.3%
  69     1 image    1800  2500  rgb     3   8  jpx    no       339  0   600   600 13.7K 0.1%
  69     2 mask     1800  2500  -       1   1  jpx    no       339  0   600   600 13.7K 2.5%    <-- the mask

$ pdfimages -f 69 -l 69 bub_gb_7sFnWGI31XcC.pdf bub_gb_7sFnWGI31XcC.p0069

$ ls -la bub_gb_7sFnWGI31XcC.p0069*
bub_gb_7sFnWGI31XcC.p0069-000.ppm
bub_gb_7sFnWGI31XcC.p0069-001.ppm
bub_gb_7sFnWGI31XcC.p0069-002.pbm   <-- the mask

$ convert bub_gb_7sFnWGI31XcC.p0069-002.pbm -negate -density 600x600 -units PixelsPerInch bub_gb_7sFnWGI31XcC.p0069-002.pos.tiff

$ tesseract bub_gb_7sFnWGI31XcC.p0069-002.pos.tiff ...

If I recorded correctly (should write a script for permutations and recording them):

Latin = -l lat
GT4 = -l GT4Hist
ubma = -l ubma/frak2021_0.905_1587027_9141630

CER     variant
0.0567 bub_gb_7sFnWGI31XcC.p0069-002.pos.nopsm.ubma.txt
0.0841 bub_gb_7sFnWGI31XcC.p0069-002.pos.nopsm.GT4.txt
0.1227 bub_gb_7sFnWGI31XcC.p0069-002.pos.nopsm.Latin.txt

0.1618 bub_gb_qmZyOar8UHwC.p0058-002.pos.nopsm.ubma.txt

stweil · 2021-12-07T15:50:14Z

> GIMP reports '72 ppi'

Obviously GIMP ignores the EXIF metadata. GIMP has a menu entry which shows the metadata and also the EXIF part with x/y resolutions of 300 and the resolution unit "inch". exiftool shows resolutions of 118.1102 and the resolution unit "cm" which gives the same DPI value of 300 (118.1102 * 2.54).
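The unit conversion behind these two readings is a one-liner. The sketch below just redoes the arithmetic; `to_dpi` is a hypothetical helper, and the unit strings "inch" and "cm" are chosen for this example.

```python
CM_PER_INCH = 2.54


def to_dpi(resolution: float, unit: str) -> float:
    """Convert a resolution given per inch or per centimetre to dots per inch."""
    if unit == "inch":
        return resolution
    if unit == "cm":
        return resolution * CM_PER_INCH
    raise ValueError(f"unsupported resolution unit: {unit!r}")


# Values reported above: 300 per inch (GIMP metadata view), 118.1102 per cm (exiftool).
print(round(to_dpi(300, "inch")))
print(round(to_dpi(118.1102, "cm")))
```

Both readings describe the same resolution; only the unit differs.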

wollmers · 2021-12-08T07:28:34Z

AFAIK 72 ppi is the default in some image programs. In GIMP it's AFAIR default only in the GUI Image -> change resolution.

EXIF is the wrong place to specify ppi. convert ... -density 600x600 -units PixelsPerInch ... is reliable, but not all image formats can store it.

aved12 · 2022-01-05T13:17:41Z

Try this code, @M3ssman:

from PIL import Image, ImageEnhance

# Raw strings keep the backslashes in the Windows paths literal.
im = Image.open(r"C:\Users\user\Documents\Lightshot\stry5.png")
# Sharpen first (factor 2), then brighten (factor 3).
im = ImageEnhance.Sharpness(im).enhance(2)
im = ImageEnhance.Brightness(im).enhance(3)
im.show()
im.save(r"C:\Users\user\Documents\Lightshot\stry7.png")

It finds blobs for all characters.

stweil added the bug label Jun 17, 2020

stweil mentioned this issue Jun 30, 2020

Fix out of bounds array access #3045

Merged

stweil changed the title ~~Tesseract 4.1.1 Empty Page~~ Tesseract Empty Page Jul 2, 2020

stweil mentioned this issue Apr 6, 2021

Tesseract seemingly stuck #3377

Open

amitdo added the bounding box label Apr 6, 2021

amitdo added the binarization label May 15, 2021

stweil mentioned this issue Dec 6, 2021

layout: Empty Page output for default psm #3670

Open

This was referenced Aug 5, 2023

Tesseract creates hOCR output without text results #4112

Open

Orientation detection "asymmetrical" #4116

Open
