Regression in extracting text from Excel TIF image #4014

Closed
jwmepiq opened this issue Feb 6, 2023 · 8 comments
Comments

@jwmepiq

jwmepiq commented Feb 6, 2023

Environment

ExcelTest_Bug.zip
ExcelTest_text_TesseractV4.txt

Tesseract Version: 5.2.0 vs. 4.1.1.-rc2-37-gcla5
Ubuntu 20.04.3 LTS

Current Behavior:

With the attached TIF image of an Excel file (in the zip), Tesseract 5.2.0 extracts only a single line of text ("hiding rows 15 through 20"). Earlier versions, namely the 4.1.1 build noted above and likely others, extract significantly more from the same image: multiple lines across multiple pages, approximately 1K of text. I have attached a separate text file with the V4.x output.

Expected Behavior:

Expecting version 5.2+ of Tesseract to at least replicate the behavior of prior versions in extracting text from this sample TIF.

Suggested Fix:

Correct the text extraction to match the output from previous Tesseract versions. I am concerned about the regression in Tesseract's ability to extract text from Excel files.

@vamsiyadavmolli

This is also an issue in version 5.3. We tried both the best and the fast trained data and still see the problem.

@dhairyagupta2603

Hey, just wanted to know the status of this issue. Can I take this up?

@stweil
Contributor

stweil commented Oct 4, 2023

I can confirm the issue with the latest code. What about releases between 4.1.1 (working) and 5.2.0 (not working)? Can we narrow down which release introduced the regression?

@stweil
Contributor

stweil commented Oct 4, 2023

According to git bisect, the regression was introduced by commit 842cca1. So release 5.0.0-beta-20210916 was the last one without this issue.

stweil added a commit to stweil/tesseract that referenced this issue Oct 4, 2023
…-ocr#4014)

"auto" resulted in unsigned numbers, but htext_score and vtest_score
can be negative.

Fixes: 842cca1 ("Fix more signed/unsigned compiler warnings")
Signed-off-by: Stefan Weil <sw@weilnetz.de>
@stweil
Contributor

stweil commented Oct 4, 2023

Pull request #4136 fixes the regression.
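
The commit message attributes the regression to `auto` deducing an unsigned type for scores that can be negative. Below is a minimal sketch of that pitfall. It is not the actual Tesseract code: the function names, parameter names, and the comparison itself are hypothetical and only illustrate how an unsigned difference wraps around instead of going negative.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-in for the buggy pattern: both operands are unsigned,
// so `auto` deduces unsigned and a "negative" score wraps to a huge value.
bool PrefersHorizontalBuggy(unsigned h_support, unsigned h_penalty,
                            unsigned v_support, unsigned v_penalty) {
  auto h_score = h_support - h_penalty;  // deduced as unsigned
  auto v_score = v_support - v_penalty;  // deduced as unsigned
  static_assert(std::is_unsigned<decltype(h_score)>::value,
                "auto follows the operand type, which is unsigned here");
  return h_score > v_score;
}

// Hypothetical stand-in for the fixed pattern: spell out a signed type
// and convert the operands before subtracting, so the score can be negative.
bool PrefersHorizontalFixed(unsigned h_support, unsigned h_penalty,
                            unsigned v_support, unsigned v_penalty) {
  int h_score = static_cast<int>(h_support) - static_cast<int>(h_penalty);
  int v_score = static_cast<int>(v_support) - static_cast<int>(v_penalty);
  return h_score > v_score;
}

// Self-check: the horizontal score should be 3 - 5 = -2 and the vertical
// score 4 - 1 = 3, so horizontal must not win.
void DemoScores() {
  assert(PrefersHorizontalBuggy(3, 5, 4, 1));   // wrapped value wins wrongly
  assert(!PrefersHorizontalFixed(3, 5, 4, 1));  // signed comparison is right
}
```

Calling `DemoScores()` from a `main` function exercises both variants. With the same inputs, the two functions return opposite answers, which is the kind of silent behavior change that a warning-only cleanup can introduce.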

@stweil
Contributor

stweil commented Oct 4, 2023

@jwmepiq, thank you for reporting this nasty regression.

stweil added a commit that referenced this issue Oct 5, 2023
"auto" resulted in unsigned numbers, but htext_score and vtest_score
can be negative.

Fixes: 842cca1 ("Fix more signed/unsigned compiler warnings")
Signed-off-by: Stefan Weil <sw@weilnetz.de>
@tfmorris
Contributor

Now that #4136 has been merged and 5.3.3 released, I assume this can be closed.

@stweil stweil closed this as completed Oct 23, 2023
@stweil
Contributor

stweil commented Oct 23, 2023

Yes, thanks.


5 participants