Tesseract Empty Page #3021

Open
M3ssman opened this issue Jun 16, 2020 · 42 comments

@M3ssman
Contributor

M3ssman commented Jun 16, 2020

Environment

  • Tesseract Version: tesseract 4.1.1-rc2-21-gf4ef
    leptonica-1.78.0
    libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
    Found AVX2
    Found AVX
    Found FMA
    Found SSE
    Found libarchive 3.2.2 zlib/1.2.11 liblzma/5.2.2 bz2lib/1.0.6 liblz4/1.7.1
  • Platform: Ubuntu 18.04 LTS
  • Model Configs tested: frk, Fraktur (from tessdata_best), gt4hist_5000k (gt4hist-Model with 5000k Iterations)

Current Behavior:

When using rather large uncompressed TIF files (ca. 80 MB) from the project "Digitalisierung historischer deutscher Zeitschriften", about 5 pages out of 1000 images (or even fewer) yield ALTO files without valid OCR data.

When run with tesseract 0046.tif 0046 -l frk alto, it only prints Empty page!! and exits in < 20 seconds.
0046-alto.zip
0046-tif.zip

Generated ALTO-File and TIF-Image included.

Expected Behavior:

Produce ALTO-XML with contents.

Suggested Fix:

No idea.

@M3ssman
Contributor Author

M3ssman commented Jun 16, 2020

The Link from send.firefox.com is active for 1 day. Afterwards it will disappear.

@zdenop
Contributor

zdenop commented Jun 16, 2020

Did you try to follow the documentation?

@M3ssman
Contributor Author

M3ssman commented Jun 16, 2020

@zdenop We're scanning from microfilm using the QuantumScan software and preprocess with QuantumProcess, which does a good job with contrast and deskewing. Therefore I wonder why 999 images pass but a few don't, like this one. Could you take a closer look at the image? What additional preprocessing would you suggest, if any?

@stweil
Contributor

stweil commented Jun 16, 2020

@M3ssman, the link is already inactive. Please attach a sample image to the issue report here.

@zdenop
Contributor

zdenop commented Jun 16, 2020

I just made a quick test: when I removed the black border and the header, Tesseract produced a result (tested just with English, as I do not have frk installed at the moment).
So pre-processing of such images is a must, or you need to implement custom layout detection.
For the sake of small size, here is a thumbnail of my test image:
image

@M3ssman
Contributor Author

M3ssman commented Jun 16, 2020

@zdenop Yes, many thanks, this works!
But anyway, I wonder why all the other images went fine. They were all produced the same way: never cropped, just plain TIF files with some tuning for contrast and rotation. Tesseract shouldn't be bothered by borders, as it (almost) never is.

@M3ssman
Contributor Author

M3ssman commented Jun 16, 2020

@stweil Guess what! The original image works, too, if I scale it down 1:4!

Sorry, I didn't realize that the link also expires after a single download.

-I'm uploading again right now.-

Sorry for the delay, but I ran into trouble with uploads from home. Now, from the office, it is way faster:
0046.zip

@M3ssman
Contributor Author

M3ssman commented Jun 17, 2020

@stweil Strange indeed.
I cropped only the header of the page and left the left, right and bottom margins as they are, and this version works fine.
0046-headless.zip

@M3ssman
Contributor Author

M3ssman commented Jun 17, 2020

@zdenop Since the original image itself looks rather dark, I called ImageMagick 6.9 to enhance the contrast: convert 0046.tif -brightness-contrast 25x50 -compress none -colorspace Gray 0046-convert.tif
With this version, without taking care of the borders, Tesseract no longer reports an empty page but produces quite reasonable OCR output.

@zdenop
Contributor

zdenop commented Jun 17, 2020

I expect that the size of 0046-convert.tif is smaller. Right?

@M3ssman
Contributor Author

M3ssman commented Jun 17, 2020

@zdenop This is what exiftool outputs:
original: Megapixels 79.1, Image size 7477x10584, 79151624 Byte filesize
convert: Megapixels 79.1, Image size 7477x10584, 79151434 Byte filesize
So the raw file sizes differ only slightly.

@M3ssman
Contributor Author

M3ssman commented Jun 17, 2020

@zdenop Did you also run Tesseract with the original file (without cropping or other types of preprocessing)? What was the outcome?
0046-convert.zip

@zdenop
Contributor

zdenop commented Jun 17, 2020

Original finished quickly with "empty page" message.

@stweil added the bug label Jun 17, 2020
@stweil
Contributor

stweil commented Jun 17, 2020

The original page triggers bugs which can be shown by adding -c textord_debug_bugs=1. Tesseract creates boxes (bounding_box_) with a right margin which exceeds the image dimensions (error message Made partition with bad right coord). Those boxes are therefore disregarded. With the following hack the boxes are processed, and text is recognized:

diff --git a/src/textord/colpartition.cpp b/src/textord/colpartition.cpp
index 74f1b1d9..465a1f57 100644
--- a/src/textord/colpartition.cpp
+++ b/src/textord/colpartition.cpp
@@ -353,7 +353,7 @@ bool ColPartition::IsLegal() {
       tprintf("Margins invalid\n");
       Print();
     }
-    return false;  // Margins invalid.
+//    return false;  // Margins invalid.
   }
   if (left_key_ > BoxLeftKey() || right_key_ < BoxRightKey()) {
     if (textord_debug_bugs) {

I think that the right solution would have to find out why Tesseract creates bad bounding boxes and fix that.
Maybe it would already help to enforce boxes with valid coordinates.
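
As a rough illustration of what "enforcing boxes with valid coordinates" could mean, the following Python sketch clips a box to the page. It is not Tesseract code: the `Box` type and `clamp_box` helper are hypothetical, the box values are illustrative, and the page size is the 7477x10584 reported by exiftool earlier in this thread.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Hypothetical bounding box in image coordinates."""
    left: int
    bottom: int
    right: int
    top: int


def clamp_box(box: Box, width: int, height: int) -> Box:
    """Clip every edge of the box into [0, width] x [0, height]."""
    left = min(max(box.left, 0), width)
    right = min(max(box.right, 0), width)
    bottom = min(max(box.bottom, 0), height)
    top = min(max(box.top, 0), height)
    return Box(left, bottom, right, top)


# Illustrative box whose right edge lies beyond the 7477 pixel page width.
clamped = clamp_box(Box(43, 9671, 7523, 10551), width=7477, height=10584)
print(clamped)
```

Clipping only hides the symptom. It keeps the box from being discarded, but it does not explain why the coordinates left the page in the first place.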

@M3ssman
Contributor Author

M3ssman commented Jun 18, 2020

@stweil Many Thanks!
By now I've detected already 200+ scans that are considered empty by Tesseract.
Therefore I'll try your suggestion in our ULB-Fork and report back hopefully next week!

@amitdo
Collaborator

amitdo commented Jun 23, 2020

@stweil
Contributor

stweil commented Jun 23, 2020

The image is rather large, too large to be attached. It's available here: https://ub-backup.bib.uni-mannheim.de/~stweil/tesseract/issues/3021/0046.png.

@stweil
Contributor

stweil commented Jun 23, 2020

The bounding boxes with illegal coordinates come from rotation:

(gdb)
#1  0x00000000006311f7 in TBOX::rotate (this=0x60e0006ba830, vec=...) at ../../../src/ccstruct/rect.h:206
206	      top_right.rotate (vec);
(gdb) p vec
$22 = (const FCOORD &) @0x7fffffff8ea0: {xcoord = 0.999990165, ycoord = -0.00443068426}
(gdb) p top_right 
$23 = {xcoord = 7523, ycoord = 10551}
(gdb) p bot_left
$24 = {xcoord = 43, ycoord = 9671}

In this case vec indicates that there is nearly no rotation at all, but because of the very large value of ycoord the function ICOORD::rotate calculates a new xcoord which is clearly outside of the image. It looks like ICOORD::rotate might be wrong and need a better implementation.

The current code rotates top right and bottom left around the fixed point (0,0). Maybe this should be changed to use the top left corner as the fixed point. For small coordinates that does not make a large difference, but here it is essential.
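
The effect of the pivot can be checked with the numbers from the gdb session. The Python sketch below is not the Tesseract implementation. It assumes the usual 2D rotation, with vec holding (cos, sin) and results rounded to the nearest integer, and the helper names are made up for this example. Under those assumptions, rotating top_right about (0,0) moves it 47 pixels to the right, while rotating it about the top left corner leaves its x coordinate unchanged.

```python
import math


def rotate(x: int, y: int, cos_a: float, sin_a: float) -> tuple[int, int]:
    """Rotate (x, y) about (0, 0) and round to the nearest integer."""
    new_x = math.floor(x * cos_a - y * sin_a + 0.5)
    new_y = math.floor(y * cos_a + x * sin_a + 0.5)
    return new_x, new_y


def rotate_about(x: int, y: int, px: int, py: int,
                 cos_a: float, sin_a: float) -> tuple[int, int]:
    """Rotate (x, y) about the pivot (px, py) instead of the origin."""
    rx, ry = rotate(x - px, y - py, cos_a, sin_a)
    return rx + px, ry + py


# Values taken from the gdb output in this thread.
cos_a, sin_a = 0.999990165, -0.00443068426
top_right = (7523, 10551)
top_left = (43, 10551)  # bot_left.x combined with top_right.y

about_origin = rotate(*top_right, cos_a, sin_a)
about_top_left = rotate_about(*top_right, *top_left, cos_a, sin_a)

print(about_origin[0] - top_right[0])    # horizontal drift with pivot (0, 0)
print(about_top_left[0] - top_right[0])  # horizontal drift with pivot top left
```

The drift grows with the y coordinate, which fits the observation that small images are unaffected.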

@amitdo
Collaborator

amitdo commented Jun 23, 2020

Another command that eliminated the issue:

gm convert 3021.png -bordercolor Black -border 10x10 3021-borderb10.png

@stweil
Contributor

stweil commented Jun 23, 2020

It's also sufficient to convert the image to JPEG. The basic issue remains of course and can also result in less obvious problems, for example missing text from smaller parts of a page only. I'd expect that typically in the lower left and right parts of large pages. -c textord_debug_bugs=1 should be the default until that problem is fixed.

@stweil
Contributor

stweil commented Jun 23, 2020

I now tried a modified TBOX::rotate. This not only fixes the empty page problem but also seems to increase the amount of text which is detected at all, so it would be worth trying it on other pages, too. The bad news is that the time for processing a page increases from 56 seconds to 219 seconds. Here is the code:

diff --git a/src/ccstruct/rect.h b/src/ccstruct/rect.h
index 58a867e9..e487c8c1 100644
--- a/src/ccstruct/rect.h
+++ b/src/ccstruct/rect.h
@@ -202,9 +202,13 @@ class DLLSYM TBOX  {  // bounding box
     // and top-right corners. Use rotate_large if you want to guarantee
     // that all content is contained within the rotated box.
     void rotate(const FCOORD& vec) {  // by vector
-      bot_left.rotate (vec);
+      ICOORD top_left(bot_left.x(), top_right.y());
+      bot_left -= top_left;
+      bot_left.rotate(vec);
+      bot_left += top_left;
+      top_right -= top_left;
       top_right.rotate (vec);
-      *this = TBOX (bot_left, top_right);
+      top_right += top_left;
     }
     // rotate_large constructs the containing bounding box of all 4
     // corners after rotating them. It therefore guarantees that all

@stweil changed the title from "Tesseract 4.1.1 Empty Page" to "Tesseract Empty Page" Jul 2, 2020
@stweil
Contributor

stweil commented Jul 2, 2020

@M3ssman, we also get "Empty page" errors in our newspaper, see example.

https://github.com/stweil/tesseract/tree/fix contains a patch which seems to fix the problem. Maybe it also gets more texts from other large images, but I am still not sure. For images with large width and height, old and new code can get different results. It would help if you (and others) could try the new code and compare the results with the unpatched Tesseract. If the new code never makes things worse, we could apply it.

@M3ssman
Contributor Author

M3ssman commented Jul 6, 2020

@stweil Sorry for the delay!
I just took a quick shot at a single page and it did produce text lines, which is good per se, but forget about the quality. Tesseract is definitely not happy with this image.

I'll try to do some more testing, as it affects a remarkable number of images, and report back real soon™.

The fix + textord_debug_bugs=1 produces quite a lot of output. I captured it in a file; maybe you can get some insights.
tesseract-5.0.0-image-1681877805_J_0112_0068.log

@amitdo
Collaborator

amitdo commented Jul 6, 2020

Another thing that will make it work is binarization.

@M3ssman
Contributor Author

M3ssman commented Jul 7, 2020

For one of the problematic images I got:

/data/ocr-staging/ocr/1667524704_J_0190/0655.tif => 1667524704_J_0190_0655 => /data/ocr-staging/ocr/empty-pages/1667524704_J_0190_0655
Tesseract Open Source OCR Engine v5.0.0-alpha-754-g0838 with Leptonica
Page 1
Detected 7102 diacritics
index >= 0 && index < line_count:Error:Assert failed:in file src/textord/makerow.cpp, line 802
./tesseract-empty-pages.sh: line 34: 29848 Aborted                 (core dumped) ${TESS_BIN} "$tiff_path" "${outpath}" --dpi 470 -l frk alto

I will skip this for now and move on.

With many other problem images, the patched Tesseract yields:

Error in boxClipToRectangle: box outside rectangle
Error in pixScanForForeground: invalid box

@amitdo
Collaborator

amitdo commented Jul 7, 2020

Error in boxClipToRectangle: box outside rectangle
Error in pixScanForForeground: invalid box

#427 #468 #1601

@amitdo
Collaborator

amitdo commented Jul 7, 2020

These error messages are produced by Leptonica.

They are triggered by a call to pixClipBoxToForeground()

https://github.com/DanBloomberg/leptonica/blob/bbe289cf3f0fe368d5b9eac64df2ccd6e9b05c56/src/pix5.c#L1956

https://github.com/tesseract-ocr/tesseract/search?q=pixClipBoxToForeground

@M3ssman
Contributor Author

M3ssman commented Jul 8, 2020

I've run some larger tests with the patch @stweil provided, with the following results:

Out of 133 images:

  • 6 images produce the mentioned assertion error (index >= 0 && index < line_count:Error:Assert failed:in file src/textord/makerow.cpp, line 802).
    • 4 of them are rather complex announcement pages which don't contain German Fraktur letters; it's all Antiqua. But they were processed with the "frk" configuration.
    • 1 page includes a train timetable (which takes up about 2/3 of the page). This table is rotated by 90° counter-clockwise.
    • 1 page needs dewarping; it appears to have a heavily rippled surface.
  • in addition, 13 images still yield "empty page"

I ran the 6 problematic pages once more (v4.1.1-rc2-25-g9707 from alex-p with --dpi 470 -l frk alto).
This time I got no assertion errors, but 2 pages with text lines and still 4 empty pages.
After enhancing brightness (+50) and contrast (between +15 and +25), these 4 pages were also processed without errors.

I'm uncertain how to deal with this.
I don't think it's a good idea to silence warnings just to let bad material pass. Second, the assertion error only occurs in the patched version. This error seems to be really serious, since it even halts execution of my scripts.
For now I take the blame myself, given that Tesseract produces text after more advanced preprocessing. Our processes now watch out for those inglorious 844-byte files, but we didn't have this on our agenda before.

@stweil @amitdo @zdenop I'm fine if you close this issue, but if you'd like to, I can provide more testdata.

@stweil
Contributor

stweil commented Jul 8, 2020

The "empty page" message means that Tesseract dropped all text boxes because the internal checks decided that they had coordinates which are out of bounds. This might only be the extreme variant of a general problem: maybe Tesseract also drops parts of other pages where it recognizes text, but not all.

That's why it would be important to run OCR on a larger test set with -c textord_debug_bugs=1 to see whether pages with OCR text also show error messages and whether these error messages correspond to missing text boxes on such pages.
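
A test run like this could be scripted roughly as follows. This is only a sketch: it assumes that the diagnostics arrive on stderr, and the function names `run_with_debug` and `summarize_log` are invented for the example. The command line uses only options that appear in this thread, and the sample lines are excerpts from a log posted here.

```python
import subprocess
from collections import Counter


def run_with_debug(image_path: str, lang: str = "frk") -> str:
    """Run tesseract with textord_debug_bugs=1 and return the stderr text."""
    cmd = ["tesseract", image_path, "-", "-l", lang,
           "-c", "textord_debug_bugs=1"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stderr


def summarize_log(log_text: str) -> Counter:
    """Count the layout diagnostics that were quoted in this thread."""
    counts = Counter()
    for line in log_text.splitlines():
        line = line.strip()
        if line.startswith("Made partition with bad"):
            counts["bad_partition"] += 1
        elif line == "Margins invalid":
            counts["margins_invalid"] += 1
        elif line.startswith("Empty page"):
            counts["empty_page"] += 1
    return counts


# Excerpt of a log posted in this thread, used here as sample input.
sample_log = """Made partition with bad left coords, 0 > -8
Margins invalid
Empty page!!
"""
summary = summarize_log(sample_log)
print(dict(summary))
```

Pages with a non-zero "bad_partition" count but no "empty_page" entry would be the interesting ones, since they may have lost only part of their text.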

@M3ssman
Contributor Author

M3ssman commented Jul 9, 2020

@stweil
I will run the patch with the 130+ images testset and report back early next week.

@M3ssman
Contributor Author

M3ssman commented Dec 21, 2020

@stweil Sorry for the delay!
Is the patched code in master branch already?

I'd like to bring this issue to a close.
As of now (IMHO), we're facing 2 different problems here:

  1. Tesseract produces ALTO files missing page content.
    This is a problem for Tesseract users / apps that utilize Tesseract's output.
  2. Tesseract makes wrong decisions about data validity. This may also affect the general detection algorithm.
    This is a problem both for OCR engineers and for any downstream users or applications.

To deal with 1), I would appreciate it if Tesseract wrote no output at all and/or printed a warning to stdout.
If these options are not worth the additional effort, please let me know. For now I'm checking the size of the ALTO XML. It works, but it feels like treating symptoms.
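
A slightly more robust check than the file size might be to count the words in the ALTO output. The sketch below assumes that recognized words appear as String elements with a CONTENT attribute, as in the ALTO schema. The function name and the two miniature documents are made up for illustration and are not real Tesseract output.

```python
import xml.etree.ElementTree as ET


def count_alto_strings(alto_xml: str) -> int:
    """Count String elements that carry a non-empty CONTENT attribute."""
    root = ET.fromstring(alto_xml)
    count = 0
    for elem in root.iter():
        local_name = elem.tag.rsplit("}", 1)[-1]  # drop the namespace part
        if local_name == "String" and elem.get("CONTENT", "").strip():
            count += 1
    return count


# Hypothetical miniature ALTO documents, only for demonstration.
EMPTY_PAGE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page><PrintSpace/></Page></Layout>
</alto>"""

PAGE_WITH_TEXT = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Anfragen"/><String CONTENT="Amts"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""

print(count_alto_strings(EMPTY_PAGE), count_alto_strings(PAGE_WITH_TEXT))
```

A count of zero flags an empty page regardless of how large the surrounding XML boilerplate happens to be.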

Number 2 seems to be a really big issue that cannot be solved completely right now.

Thanks to @stweil, @zdenop and @amitdo for all the investigation! All your findings point (IMHO) to the
category "data error", since tuning the image data (binarize, despeckle, etc.) improves Tesseract's analysis in all cases.
Therefore I consider this behavior not an intrinsic problem of Tesseract; it's the data.

@amitdo
Collaborator

amitdo commented May 14, 2021

With the code from #3418, when Sauvola binarization is used, I don't get "Empty page!!".

@stweil
Contributor

stweil commented Dec 6, 2021

I just finished OCR with Tesseract 5.0.0 for a huge number of newspaper scans.

  • 6587 scans of 371629 finished with "Empty page" when Tesseract used the default binarization.
  • 105 scans of those 6587 still were "Empty page" when Tesseract was used with -c thresholding_method=2.
  • 6 scans of those 105 still were "Empty page" when Tesseract was used with -c thresholding_method=1 (example).
  • The remaining 6 scans could be processed with patched Tesseract code (https://github.com/stweil/tesseract/tree/fix).

So using a different binarization helps in most cases, but not always.
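
The retry order described above (default binarization, then thresholding_method=2, then thresholding_method=1) could be automated along these lines. This is a sketch under assumptions, not a tested workflow: it assumes that "Empty page" shows up in the diagnostic output on stderr, and `run_tesseract`, `ocr_with_fallback` and the stand-in runner are hypothetical names.

```python
import subprocess
from typing import Callable, Optional, Sequence, Tuple

Runner = Callable[[Sequence[str]], Tuple[str, str]]


def run_tesseract(args: Sequence[str]) -> Tuple[str, str]:
    """Call the tesseract CLI and return (stdout, stderr)."""
    result = subprocess.run(["tesseract", *args], capture_output=True,
                            text=True, check=False)
    return result.stdout, result.stderr


def ocr_with_fallback(image_path: str, lang: str = "frk",
                      methods: Sequence[Optional[int]] = (None, 2, 1),
                      run: Runner = run_tesseract,
                      ) -> Tuple[Optional[str], Optional[int]]:
    """Try each thresholding method until the page is no longer empty.

    Returns (text, method); method None means the default binarization.
    Returns (None, None) if every attempt ends with an empty page.
    """
    for method in methods:
        args = [image_path, "-", "-l", lang]
        if method is not None:
            args += ["-c", f"thresholding_method={method}"]
        text, log = run(args)
        if "Empty page" not in log and text.strip():
            return text, method
    return None, None


def fake_runner(args: Sequence[str]) -> Tuple[str, str]:
    """Stand-in for tesseract: only thresholding_method=2 finds text."""
    if "thresholding_method=2" in args:
        return "genommen werden.\n", ""
    return "", "Empty page!!\n"


text, used_method = ocr_with_fallback("0312.jp2", run=fake_runner)
print(used_method)
```

Passing the runner as a parameter keeps the retry logic testable without a Tesseract installation. In real use the default `run_tesseract` would be called instead of the stand-in.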

@amitdo
Collaborator

amitdo commented Dec 6, 2021

Try to convert the jp2 to png. It does not fail for me with your example and method 2.

@stweil
Contributor

stweil commented Dec 6, 2021

Thank you, that's interesting. I can reproduce it, and it seems to be related to the image resolution:

The original JP2 image has 300 dpi and fails:

tesseract 0312.jp2 - -l ubma/frak2021-09 -c thresholding_method=2 -c textord_debug_bugs=1
Made partition with bad left coords, 0 > -8
ColPart: (M0-B-8-B-8/-8,5017/5017)->(3599B-3599B-3607M/3599,5071/5071) w-ok=0, v-ok=0, type=0R1, fc=-1, lc=-1, boxes=1 ts=0 bs=0 ls=0 rs=0
Margins invalid
ColPart:E(M0-B-8-B-8/-8,5017/5017)->(3599B-3599B-3607M/3599,5071/5071) w-ok=0, v-ok=0, type=0T1, fc=-1, lc=-1, boxes=0 ts=0 bs=0 ls=0 rs=0
Empty page!!
Made partition with bad left coords, 0 > -8
ColPart: (M0-B-8-B-8/-8,5017/5017)->(3599B-3599B-3607M/3599,5071/5071) w-ok=0, v-ok=0, type=0R1, fc=-1, lc=-1, boxes=1 ts=0 bs=0 ls=0 rs=0
Margins invalid
ColPart:E(M0-B-8-B-8/-8,5017/5017)->(3599B-3599B-3607M/3599,5071/5071) w-ok=0, v-ok=0, type=0T1, fc=-1, lc=-1, boxes=0 ts=0 bs=0 ls=0 rs=0
Empty page!!

Converting the JP2 to PNG with convert removes the resolution information.
Tesseract therefore guesses a resolution of 367 dpi and can process the scan:

tesseract 0312.png - -l ubma/frak2021-09 -c thresholding_method=2 -c textord_debug_bugs=1
Estimating resolution as 367
Made partition with bad right coords, 556 < 577
ColPart: (M184-T204-B207/388,4146/4152)->(577B-1252T-556M/398,4194/4179) w-ok=1, v-ok=1, type=1T4, fc=-1, lc=-1, boxes=24 ts=0 bs=0 ls=0 rs=0
[...]
genommen werden. In Beantwortung verſchledener Anfragen erklärte Amts, Wirklichen Geheimen Rats Der nburg über koloniale J güch die vornebme Aufgabe bringt, ſich des Deutſchen Reiches alt
[...]

Processing the original JP2 with an explicit resolution works, too:

tesseract 0312.jp2 - -l ubma/frak2021-09 -c thresholding_method=2 -c textord_debug_bugs=1 --dpi 400
Made partition with bad right coords, 1232 < 1243
[...]
Zenommen werden. In Beantwortung verſchledener Anfragen erklärte i ĩ ü i
[...]

@wollmers

wollmers commented Dec 6, 2021

Thank you, that's interesting. I can reproduce it, and it seems to be related to the image resolution:

The original JP2 image has 300 dpi and fails:

Is it a JPX with mask layer like this https://archive.org/details/bub_gb_qmZyOar8UHwC/page/n71/mode/2up ?

Then try the mask

bub_gb_7sFnWGI31XcC p0069-002

and negate.

CER 14.23 % is not so bad for the quality of the scan.

@stweil
Contributor

stweil commented Dec 6, 2021

Where did you get CER 14.23 %?

@amitdo
Collaborator

amitdo commented Dec 6, 2021

@stweil, GIMP reports '72 ppi' for your jp2, but as you said, Tesseract sees it as 300 ppi. IIRC, when GIMP does not find the ppi in the image metadata, it is reported as 72 ppi.

@wollmers

wollmers commented Dec 7, 2021

Where did you get CER 14.23 %?

Good question;-) On logical page 47 of Galileo's book.

My comment was meant as: If your jp2 has a mask layer, as jp2 allows many kinds of compressions, then try the mask layer.

The book exists on archive.org in two versions, scanned from two different specimens in different bad conditions:

  • bub_gb_7sFnWGI31XcC
  • bub_gb_qmZyOar8UHwC
$ pdfimages -f 69 -l 69 -list bub_gb_7sFnWGI31XcC.pdf 
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
  69     0 image     600   834  rgb     3   8  jpx    no       338  0   200   201 4818B 0.3%
  69     1 image    1800  2500  rgb     3   8  jpx    no       339  0   600   600 13.7K 0.1%
  69     2 mask     1800  2500  -       1   1  jpx    no       339  0   600   600 13.7K 2.5%    <-- the mask

$ pdfimages -f 69 -l 69 bub_gb_7sFnWGI31XcC.pdf bub_gb_7sFnWGI31XcC.p0069

$ ls -la bub_gb_7sFnWGI31XcC.p0069*
bub_gb_7sFnWGI31XcC.p0069-000.ppm
bub_gb_7sFnWGI31XcC.p0069-001.ppm
bub_gb_7sFnWGI31XcC.p0069-002.pbm   <-- the mask

$ convert bub_gb_7sFnWGI31XcC.p0069-002.pbm -negate -density 600x600 -units PixelsPerInch bub_gb_7sFnWGI31XcC.p0069-002.pos.tiff

$ tesseract bub_gb_7sFnWGI31XcC.p0069-002.pos.tiff ...

If I recorded correctly (should write a script for permutations and recording them):

Latin = -l lat
GT4 = -l GT4Hist
ubma = -l ubma/frak2021_0.905_1587027_9141630

CER     variant
0.0567 bub_gb_7sFnWGI31XcC.p0069-002.pos.nopsm.ubma.txt
0.0841 bub_gb_7sFnWGI31XcC.p0069-002.pos.nopsm.GT4.txt
0.1227 bub_gb_7sFnWGI31XcC.p0069-002.pos.nopsm.Latin.txt

0.1618 bub_gb_qmZyOar8UHwC.p0058-002.pos.nopsm.ubma.txt

@stweil
Contributor

stweil commented Dec 7, 2021

GIMP reports '72 ppi'

Obviously GIMP ignores the EXIF metadata. GIMP has a menu entry which shows the metadata and also the EXIF part with x/y resolutions of 300 and the resolution unit "inch". exiftool shows resolutions of 118.1102 and the resolution unit "cm" which gives the same DPI value of 300 (118.1102 * 2.54).
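
The unit conversion can be written out as a small Python check. The constant and function names are chosen only for this example.

```python
CM_PER_INCH = 2.54


def pixels_per_cm_to_dpi(pixels_per_cm: float) -> float:
    """Convert a resolution given in pixels per centimetre to dots per inch."""
    return pixels_per_cm * CM_PER_INCH


# Resolution reported by exiftool for the JP2 in this thread.
dpi = pixels_per_cm_to_dpi(118.1102)
print(round(dpi))
```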

@wollmers

wollmers commented Dec 8, 2021

AFAIK 72 ppi is the default in some image programs. In GIMP it's AFAIR default only in the GUI Image -> change resolution.

EXIF is the wrong place to specify ppi. convert ... -density 600x600 -units PixelsPerInch ... is reliable, but not all image formats can store it.

@aved12

aved12 commented Jan 5, 2022

try this code @M3ssman

from PIL import Image, ImageEnhance

# Raw strings keep the backslashes in the Windows paths literal.
im = Image.open(r"C:\Users\user\Documents\Lightshot\stry5.png")
# Sharpen by a factor of 2.
cness = ImageEnhance.Sharpness(im)
cFactor = 2
im = cness.enhance(cFactor)
# Brighten by a factor of 3.
cness = ImageEnhance.Brightness(im)
cFactor = 3
im = cness.enhance(cFactor)
im.show()
im.save(r"C:\Users\user\Documents\Lightshot\stry7.png", quality=95)

it finds blobs for all characters
