layout: Empty Page output for default psm #3670

Open
Shreeshrii opened this issue Dec 2, 2021 · 14 comments

@Shreeshrii
Collaborator

For certain images, the default page segmentation mode (--psm 3) outputs "Empty page!!", while --psm 6 and several other modes give the correct result.

Suggestion: when the default psm results in an empty page, Tesseract could automatically retry recognition with --psm 6 and print a DEBUG message saying so.

$ tesseract -v
tesseract 5.0.0-1-g4abb
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found NEON
 Found OpenMP 201511
 Found libarchive 3.2.2 zlib/1.2.11 liblzma/5.2.2 bz2lib/1.0.6 liblz4/1.7.1
 Found libcurl/7.58.0 NSS/3.35 zlib/1.2.11 libidn2/2.0.4 libpsl/0.19.1 (+libidn2/2.0.4) nghttp2/1.30.0 librtmp/2.3

Example image:
eng Charis_SIL_Italic exp0_27

$ tesseract eng.Charis_SIL_Italic.exp0_27.png - --tessdata-dir ~/tessdata
Empty page!!
Empty page!!
$ tesseract eng.Charis_SIL_Italic.exp0_27.png - --tessdata-dir ~/tessdata_best
Empty page!!
Empty page!!
$ tesseract eng.Charis_SIL_Italic.exp0_27.png - --tessdata-dir ~/tessdata_fast
Empty page!!
Empty page!!
$ tesseract eng.Charis_SIL_Italic.exp0_27.png - --psm 0
Too few characters. Skipping this page
Too few characters. Skipping this page
Error during processing.
$ tesseract eng.Charis_SIL_Italic.exp0_27.png - --psm 1
Too few characters. Skipping this page
OSD: Weak margin (0.00) for 4 blob text block, but using orientation anyway: 0
Empty page!!
Too few characters. Skipping this page
OSD: Weak margin (0.00) for 4 blob text block, but using orientation anyway: 0
Empty page!!
$ tesseract eng.Charis_SIL_Italic.exp0_27.png - --psm 2
Empty page!!
$ tesseract eng.Charis_SIL_Italic.exp0_27.png - --psm 3
Empty page!!
Empty page!!
$ tesseract eng.Charis_SIL_Italic.exp0_27.png - --psm 4
Empty page!!
Empty page!!
$ tesseract eng.Charis_SIL_Italic.exp0_27.png - --psm 5
oy
0
0
O
$ tesseract eng.Charis_SIL_Italic.exp0_27.png - --psm 6
6881
$ tesseract eng.Charis_SIL_Italic.exp0_27.png - --psm 7
6881
$ tesseract eng.Charis_SIL_Italic.exp0_27.png - --psm 8
6881
$ tesseract eng.Charis_SIL_Italic.exp0_27.png - --psm 9
6881
$ tesseract eng.Charis_SIL_Italic.exp0_27.png - --psm 10
6881
$ tesseract eng.Charis_SIL_Italic.exp0_27.png - --psm 11
6881
$ tesseract eng.Charis_SIL_Italic.exp0_27.png - --psm 12
Too few characters. Skipping this page
OSD: Weak margin (0.00) for 4 blob text block, but using orientation anyway: 0
6881
$ tesseract eng.Charis_SIL_Italic.exp0_27.png - --psm 13
6881
$
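
The survey above can be reproduced with a small script. This is a minimal sketch, not part of Tesseract: `run_tesseract`, `sweep_psm` and `working_modes` are hypothetical helper names, and the sketch assumes that diagnostics such as "Empty page!!" are written to stderr, so an empty stdout means that no text was recognized.

```python
import subprocess


def run_tesseract(image_path, extra_args=()):
    """Run the tesseract CLI on one image and return (stdout, stderr) as text."""
    cmd = ["tesseract", image_path, "-", *extra_args]
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.stdout, proc.stderr


def sweep_psm(image_path, modes=range(14), run=run_tesseract):
    """Map each page segmentation mode to the stripped text it produced."""
    results = {}
    for psm in modes:
        stdout, _stderr = run(image_path, ("--psm", str(psm)))
        results[psm] = stdout.strip()
    return results


def working_modes(results):
    """Return the modes that produced any text at all, in ascending order."""
    return sorted(psm for psm, text in results.items() if text)
```

Calling `working_modes(sweep_psm("eng.Charis_SIL_Italic.exp0_27.png"))` would list the modes that return text for the example image. The `run` parameter is injectable so that the same loop can be pointed at a different tessdata directory or exercised without Tesseract installed.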
Contributor

stweil commented Dec 2, 2021

At least for newspapers (where we also see "empty" pages), using --psm 6 is not the correct solution. In those cases, a different binarization usually helps.

Collaborator Author

Shreeshrii commented Dec 2, 2021

You are right. --psm 6 will work only if the input is a single line image.

I am seeing the issue in about 1% of the images that tesseract unpack extracts from lstmf files, which were themselves generated by text2image. Shouldn't all these files have the same binarization?

Collaborator Author

Shreeshrii commented Dec 2, 2021

Here is a zip file with some images which have this problem. A few ok images are also included.

EmptyPage.zip.zip
In most cases, the affected images contain a single word or number in a large font size. I hope this helps in isolating the cause.

Collaborator

amitdo commented Dec 2, 2021

Empty page!!
Empty page!!

Why is this message printed twice?

Collaborator

amitdo commented Dec 2, 2021

Does this also happen with --oem 0?

Collaborator Author

Shreeshrii commented Dec 3, 2021

Yes, it is also happening with --oem 0.

(base) ubuntu@tesseract-ocr-1:~/TEST$ tesseract eng.Charis_SIL_Italic.exp0_27.png - --oem 0 --tessdata-dir ../tessdata
Empty page!!
Empty page!!

The problem seems to be related to dpi.

(base) ubuntu@tesseract-ocr-1:~/TEST$ tesseract eng.Charis_SIL_Italic.exp0_27.png - --dpi 600
Empty page!!
Empty page!!
(base) ubuntu@tesseract-ocr-1:~/TEST$ tesseract eng.Charis_SIL_Italic.exp0_27.png - --dpi 300
Empty page!!
Empty page!!
(base) ubuntu@tesseract-ocr-1:~/TEST$ tesseract eng.Charis_SIL_Italic.exp0_27.png - --dpi 200
6881
(base) ubuntu@tesseract-ocr-1:~/TEST$ tesseract eng.Charis_SIL_Italic.exp0_27.png - --dpi 250
Empty page!!
Empty page!!
(base) ubuntu@tesseract-ocr-1:~/TEST$ tesseract eng.Charis_SIL_Italic.exp0_27.png - --dpi 150
6881

The image is recognized when I set the dpi to 200 or 150.

I tried to display the earlier messages about the dpi being used, but they seem to be suppressed now, even with --loglevel ALL.

(base) ubuntu@tesseract-ocr-1:~/TEST$ tesseract eng.Charis_SIL_Italic.exp0_27.png -  --loglevel ALL
Empty page!!
Empty page!!

Collaborator

amitdo commented Dec 3, 2021

In general, if you know in advance that the input is one line, then you should use --psm 7.

Collaborator

amitdo commented Dec 3, 2021

The dpi stored in this image (301) is within the valid range, so Tesseract respects it and does not try to estimate it. That is why there is no warning.
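
To check which resolution a PNG file carries, the pHYs chunk can be read directly. The following is a minimal sketch that uses only the Python standard library; `png_dpi` is a hypothetical helper name, chunk CRCs are not verified, and this is not how Tesseract or Leptonica read the value.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dpi(data):
    """Return (x_dpi, y_dpi) from a PNG's pHYs chunk, or None if unavailable."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    # Each chunk is: 4-byte big-endian length, 4-byte type, body, 4-byte CRC.
    while pos + 8 <= len(data):
        length, chunk_type = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        if chunk_type == b"pHYs":
            x_ppu, y_ppu, unit = struct.unpack(">IIB", body)
            if unit != 1:  # 1 = pixels per metre; 0 = unit unknown
                return None
            return round(x_ppu * 0.0254), round(y_ppu * 0.0254)
        if chunk_type in (b"IDAT", b"IEND"):
            return None  # pHYs has to appear before the image data
        pos += 12 + length
    return None
```

For a file on disk, `png_dpi(open(path, "rb").read())` returns the stored resolution converted to dots per inch, which can then be compared with the `--dpi` values tried above.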

Collaborator

amitdo commented Dec 3, 2021

Your suggestion to make Tesseract do a second try can be improved by taking into account the image height and the number of blobs. For example, if the image height is below 60 pixels and the image has fewer than 100 blobs, Tesseract can try psm 6, and if that also fails it can then try psm 7.
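
The heuristic described above could look roughly like this when applied from outside Tesseract. It is only a sketch under stated assumptions: `fallback_modes` and `recognize_with_retry` are hypothetical names, `run` is a caller-supplied function taking `(image_path, psm)` and returning the recognized text (with `psm=None` meaning the default mode), and the blob count has to be provided by the caller because this sketch does not compute it.

```python
def fallback_modes(image_height, blob_count, max_height=60, max_blobs=100):
    """Return the psm values to retry with, using the thresholds from the comment."""
    if image_height < max_height and blob_count < max_blobs:
        return [6, 7]
    return []


def recognize_with_retry(image_path, image_height, blob_count, run):
    """Try the default mode first, then each fallback mode until text appears.

    Returns (text, psm_used); psm_used is None when the default mode
    succeeded or when every attempt came back empty.
    """
    text = run(image_path, None).strip()
    if text:
        return text, None
    for psm in fallback_modes(image_height, blob_count):
        text = run(image_path, psm).strip()
        if text:
            return text, psm
    return "", None
```

Keeping the threshold check in its own function makes it easy to tune the two limits separately from the retry loop.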

Collaborator

amitdo commented Dec 3, 2021

Using the API, you can give Tesseract an alternative config file and if recognition fails, Tesseract will do a second try using this config file.

Collaborator

amitdo commented Dec 3, 2021

tesseract/src/api/baseapi.cpp

Lines 1268 to 1283 in b649222

  if (failed && retry_config != nullptr && retry_config[0] != '\0') {
    // Save current config variables before switching modes.
    FILE *fp = fopen(kOldVarsFile, "wb");
    if (fp == nullptr) {
      tprintf("Error, failed to open file \"%s\"\n", kOldVarsFile);
    } else {
      PrintVariables(fp);
      fclose(fp);
    }
    // Switch to alternate mode for retry.
    ReadConfigFile(retry_config);
    SetImage(pix);
    Recognize(nullptr);
    // Restore saved config variables.
    ReadConfigFile(kOldVarsFile);
  }
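
A retry config is an ordinary Tesseract config file with one "name value" pair per line. The sketch below writes one that switches the retry to single-line mode through the `tessedit_pageseg_mode` variable; `config_text` and `write_retry_config` are hypothetical helper names. The resulting path would be passed as the `retry_config` argument of `TessBaseAPI::ProcessPages`. Whether an "Empty page!!" result sets `failed` in the excerpt above, and therefore triggers the retry, is not established in this thread.

```python
from pathlib import Path


def config_text(variables):
    """Render variables as 'name value' lines, one per variable."""
    return "".join(f"{name} {value}\n" for name, value in variables.items())


def write_retry_config(path, psm=7):
    """Write a config file that sets the page segmentation mode for the retry."""
    Path(path).write_text(config_text({"tessedit_pageseg_mode": psm}), encoding="utf-8")
    return str(path)
```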

@Shreeshrii
Collaborator Author

> In general, if you know in advance that the input is one line, then you should use --psm 7.

I am looking for alternative ways to evaluate recognition by different models, since lstmeval does not give accurate results. So I am running OCR on the single-line images that the tesstrain makefile uses for training and eval, and then evaluating the OCR results with ocrevalUAtion and the ISRI tools. I could use --psm 7 for this. Would that be considered ok as a basis for evaluation?
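
As a rough illustration of this kind of evaluation, a character error rate over (ground truth, OCR output) pairs can be computed with a plain Levenshtein distance. This is a minimal sketch with hypothetical function names; it is not what ocrevalUAtion or the ISRI tools implement, and their normalization and reporting differ.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(pairs):
    """Total edit distance divided by total ground-truth length."""
    errors = total = 0
    for truth, ocr in pairs:
        errors += edit_distance(truth, ocr)
        total += len(truth)
    return errors / total if total else 0.0
```

With this measure, an image that comes back as an empty page counts every ground-truth character as an error, so empty-page failures show up directly in the score for each psm setting.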

Collaborator

amitdo commented Dec 3, 2021

I don't know, you can try and see...

Contributor

stweil commented Dec 6, 2021

Empty page output for complex newspaper pages is handled in issue #3021.
