Modify OCR for inverted text #3141

Merged
merged 2 commits into from
Oct 27, 2020

Conversation

stweil
Contributor

@stweil stweil commented Oct 26, 2020

The old code tries a 2nd OCR on lines when a word confidence is below 50 %, so a single word with low confidence triggers it. This happens rather often, mostly for lines without inverted text, and costs performance.

The new code checks the mean confidence instead of the minimum. This typically improves the performance significantly and still works for lines with much inverted text.

In addition, the OCR result of the inverted image is now accepted if its mean confidence is better than the original one. In my test this improves the OCR result, especially for lines which only have some words with inverted text.

The algorithm still does not handle lines with both normal and inverted text optimally. OCR for such lines could be improved by inverting word by word or maybe by using the Leptonica function pixAutoPhotoinvert.
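
The difference between the old and the new trigger can be sketched in a few lines of Python. This is only an illustration of the logic described above, not the Tesseract C++ code: the function names, the 0.0 to 1.0 confidence scale and the sample confidences are assumptions made for the example.

```python
from statistics import mean

# Assumed to correspond to the 50 % limit mentioned above.
INVERT_THRESHOLD = 0.5


def old_needs_second_pass(word_confs, threshold=INVERT_THRESHOLD):
    """Old trigger: one low-confidence word is enough for a 2nd OCR."""
    return min(word_confs) < threshold


def new_needs_second_pass(word_confs, threshold=INVERT_THRESHOLD):
    """New trigger: only a low mean confidence causes a 2nd OCR."""
    return mean(word_confs) < threshold


def pick_result(original, inverted):
    """Keep the inverted result only if its mean confidence is better.

    Both arguments are (text, word_confs) pairs.
    """
    return inverted if mean(inverted[1]) > mean(original[1]) else original


# A line with a single weak word (made-up confidences):
confs = [0.95, 0.90, 0.30, 0.92]
print(old_needs_second_pass(confs))  # True, the minimum is 0.30
print(new_needs_second_pass(confs))  # False, the mean is about 0.77
```

With the old trigger this line would be recognized twice; with the new one it is processed only once, which is where the performance gain comes from.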

The old code looked for the minimum confidence, which very often
triggered a 2nd OCR without improving the result.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
@stweil
Contributor Author

stweil commented Oct 27, 2020

The time for OCR of a double page from a historical newspaper was reduced from 185 s to 136 s by these modifications.

@zdenop zdenop merged commit 5761880 into tesseract-ocr:master Oct 27, 2020
@zdenop
Contributor

zdenop commented Oct 27, 2020

thanks.

@Shreeshrii
Collaborator

@zdenop Are you also applying these improvements to the 4.1 branch?

@stweil
Contributor Author

stweil commented Oct 27, 2020

Are you also applying these improvements to the 4.1 branch?

I am not sure whether that would be correct for 4.1, because it is not a bug fix, but a modification which changes OCR results. So it would require at least a new 4.2.

@Shreeshrii
Collaborator

Thanks, @stweil. Yes, 4.2 will be appropriate.

@zdenop You could include other recent improvements also which are suitable for 4.x branches.

@stweil
Contributor Author

stweil commented Oct 27, 2020

I got some test images from the Internet Archive. The new code gives better results for many lines in those images, but some lines which were inverted with the old code are now no longer inverted and are not recognized.

Changing the threshold from 0.5 to 0.7 still got all improvements, but fixed the lines with regressions. If that is confirmed in more tests, it should be changed in the code. Maybe the threshold could be a parameter.

Old and new results differ here:

diff --rec -u old/sim_japanese-economy_fall-1989_18_1_0099.txt new/sim_japanese-economy_fall-1989_18_1_0099.txt
--- old/sim_japanese-economy_fall-1989_18_1_0099.txt	2020-10-27 09:49:56.105843034 +0100
+++ new/sim_japanese-economy_fall-1989_18_1_0099.txt	2020-10-27 14:35:58.858059925 +0100
@@ -2,7 +2,7 @@
 agement material from Japanese sources, primarily scholarly journals and
 books. The selections are intended to reflect developments in the Japanese
 economy and to be of interest to those professionally concerned with this
-i H
+field.
 
 Editor: Kazuo Sato, Rutgers University
 Assistant Editor: Anita M. O’Brien, M. E. Sharpe, Inc.
diff --rec -u old/sim_journal-of-burn-care-research_1983-02_4_1_0000.txt new/sim_journal-of-burn-care-research_1983-02_4_1_0000.txt
--- old/sim_journal-of-burn-care-research_1983-02_4_1_0000.txt	2020-10-27 09:50:07.041917640 +0100
+++ new/sim_journal-of-burn-care-research_1983-02_4_1_0000.txt	2020-10-27 14:36:01.022075792 +0100
@@ -10,10 +10,10 @@
 \dult Learning Principles—Basis for Educaling the Burn Nurse
 Opportunities and Expanded Educational Roles in Burn Nursing
 ducational Approaches to Burn Nuising Orientation
-Development of Competencv-Based STt R IR Tt \m"sin:
+Development of Competencv-Based Orientation for Burn \m"sin:
 Developing Outreach Programs in Community Hospitals
 
-The |’|'i|n;u'.\ A TR Svstemn
+The |’|'i|n;u'.\ Nursing Svstemn
 
 Characteristics of Burn Orientation Programs
 
diff --rec -u old/sim_journal-of-burn-care-research_1983-02_4_1_0006.txt new/sim_journal-of-burn-care-research_1983-02_4_1_0006.txt
--- old/sim_journal-of-burn-care-research_1983-02_4_1_0006.txt	2020-10-27 09:50:09.589935013 +0100
+++ new/sim_journal-of-burn-care-research_1983-02_4_1_0006.txt	2020-10-27 14:36:02.006083007 +0100
@@ -5,11 +5,11 @@
 
 LR
 
-bk
+112
 
 =y
 
  
 
-£
+I
diff --rec -u old/sim_journal-of-burn-care-research_2007-04_28_2_0000.txt new/sim_journal-of-burn-care-research_2007-04_28_2_0000.txt
--- old/sim_journal-of-burn-care-research_2007-04_28_2_0000.txt	2020-10-27 09:50:39.058135718 +0100
+++ new/sim_journal-of-burn-care-research_2007-04_28_2_0000.txt	2020-10-27 14:36:06.314114591 +0100
@@ -10,9 +10,9 @@
 
 2
 e
-5
-2
-O
+)
+b
+(4]
 =z
 
 jiation
diff --rec -u old/sim_television-broadcast-tvb_2003-04_26_4_0000.txt new/sim_television-broadcast-tvb_2003-04_26_4_0000.txt
--- old/sim_television-broadcast-tvb_2003-04_26_4_0000.txt	2020-10-27 09:50:44.414172153 +0100
+++ new/sim_television-broadcast-tvb_2003-04_26_4_0000.txt	2020-10-27 14:36:07.990126878 +0100
@@ -1,4 +1,4 @@
-A T D R T OF TELEVlSlON
+& THE SCIENCE, ART & BUSINESS OF TELEVlSlON
 
 NABZUUS
 
@@ -11,18 +11,18 @@
 1553
 4 el sl L, 1. n A
 
-Il Congess ik
+Will Congess Put The
 
 2
-- £ 4
+h £ 4
 L] . i »
-- { bl
+. { bl
 ¢ § ias P
 ' i
 A L I | b+ Pe. .:v
 5  je
 v. = !
-- b 5 52
+. b : 52
 R 4
 ~ R
 
diff --rec -u old/sim_television-broadcast-tvb_2003-04_26_4_0003.txt new/sim_television-broadcast-tvb_2003-04_26_4_0003.txt
--- old/sim_television-broadcast-tvb_2003-04_26_4_0003.txt	2020-10-27 09:50:58.062264937 +0100
+++ new/sim_television-broadcast-tvb_2003-04_26_4_0003.txt	2020-10-27 14:36:11.038149223 +0100
@@ -7,7 +7,7 @@
  
 
 | ‘ DM-3000
-Single or Multi-Channel Composite; Component, TRy a0
+Single or Multi-Channel Composite; Component, Svnchronous AES/EBU
 SDI Master Control Switcher SDI, ASI, HDTV Routers ar\:gcAr:glnooguZudio/Routers
 
 System Configuration, Operation,
diff --rec -u old/sim_television-broadcast-tvb_2003-04_26_4_0006.txt new/sim_television-broadcast-tvb_2003-04_26_4_0006.txt
--- old/sim_television-broadcast-tvb_2003-04_26_4_0006.txt	2020-10-27 09:51:04.978311923 +0100
+++ new/sim_television-broadcast-tvb_2003-04_26_4_0006.txt	2020-10-27 14:36:12.558160366 +0100
@@ -5,13 +5,13 @@
       
   
 
-L/
+/,
 
-See Us AT NAB-Booth #C-2660 TRE ol a1 = Ao aa!
+See Us AT NAB-Booth #C-2660 www.sachtler.com
 
-YT ]
+sachtler,
 corporation of america
-F 55. North Main Stréet, Freeport, N.Y. 11520
+& 55. North Main Stréet, Freeport, N.Y. 11520
 f"‘&w Phone: 516 867 4900  Fax: 516 623 6844
 Saalag W email: sachtlerUS@aol.com
 S— sachtler
diff --rec -u old/sim_television-broadcast-tvb_2003-04_26_4_0048.txt new/sim_television-broadcast-tvb_2003-04_26_4_0048.txt
--- old/sim_television-broadcast-tvb_2003-04_26_4_0048.txt	2020-10-27 09:51:47.066597411 +0100
+++ new/sim_television-broadcast-tvb_2003-04_26_4_0048.txt	2020-10-27 14:36:16.834191712 +0100
@@ -1,8 +1,8 @@
-e
+"
 
  
 
-|
+3
 "
 
  
@@ -63,15 +63,15 @@
 
  
 
-e g
+o
 
-A\ ol B AY
+"I\ ™ o AY
 
         
 
 THE PROFESSIONAL OPTICAL
 DISC SYSTEM WILL ENHANCE
-QY= 115 HO 27520
+EVERY STEP OF VIDEO
 WORKFLOW, FROM ACQUISITION
 TO POSFPRODUCTION
 
@@ -84,14 +84,14 @@
 duction, will be greatly improved, says
 Theresa Alesso, director of marketing
 for the optical and network products
-1118
+group.
 
 “Our professional optical disc sys-
 tem will offer dramatically faster edit-
 ing, faster transfer from field to stu-
 dio, far easier identification of record-
 ed assets, and lower operating costs,”
-LE R E TN
+says Alesso.
 
 Sony'’s two professional optical disc
 camcorders will enable users to mark
diff --rec -u old/sim_television-broadcast-tvb_2004-05_27_5_0014.txt new/sim_television-broadcast-tvb_2004-05_27_5_0014.txt
--- old/sim_television-broadcast-tvb_2004-05_27_5_0014.txt	2020-10-27 14:30:45.231753729 +0100
+++ new/sim_television-broadcast-tvb_2004-05_27_5_0014.txt	2020-10-27 14:36:19.318209920 +0100
@@ -3,7 +3,7 @@
  Register Today
 
 | FOR THE EAST COAST'S
-Y\ [ 33 AT DI W YD
+' LARGEST VIDEO AND AUDIO
 | CONFERENCE AND EXPOSITION!
 
  
@@ -41,7 +41,7 @@
 
 GovernmentVideo  videography
 
-FIALH
+FEATURING

@MerlijnWajer
Contributor

Thanks for this!

I'm happy to run a set of tests for some Internet Archive items. It would be easier for me if the threshold is a parameter, but I can also build Tesseract locally. Did the higher threshold have a significant impact on the processing speed? I assume it probably doesn't have a significant impact, since only a small portion will be inverted & processed again?

@zdenop
Contributor

zdenop commented Oct 27, 2020

@Shreeshrii: I do not plan a 4.x release at the moment:

  • it is quite difficult to make a release with breaking API/ABI... ;-)
  • changes in master are quite significant (code modernisation, source reorganisation), so backporting interesting commits is becoming time-consuming
  • personally I do not have a lot of free time

@stweil
Contributor Author

stweil commented Oct 27, 2020

I also see no urgent need for a new Tesseract 4 release.

We might consider tagging new releases from Git master as the last one was 5.0.0-alpha. I suggest naming them 5.0.0-alpha-YYYYMMDD, for example 5.0.0-alpha-20201027, or a little bit longer 5.0.0-alpha-2020-10-27.

@stweil
Contributor Author

stweil commented Oct 27, 2020

Did the higher threshold have a significant impact on the processing speed? I assume it probably doesn't have a significant impact, since only a small portion will be inverted & processed again?

That's right. The most significant impact on the performance comes from testing the mean confidence value instead of the minimum confidence value. A lot of lines contain some part with low word confidence, and all those lines were processed twice with the old code.

@stweil
Contributor Author

stweil commented Dec 16, 2020

@zdenop, @egorpugin, @Shreeshrii, @amitdo: Would it be okay if I tag a new release as I described above? Or are there other suggestions?

@egorpugin
Contributor

egorpugin commented Dec 16, 2020

If you set a pre-release tag, I think it's ok.

Upd.:
Do we have a roadmap for tess-5.0.0?
I think it would be cool if we switch to std::{string,vector etc.} types before it.

@stweil
Contributor Author

stweil commented Dec 16, 2020

I think it would be cool if we switch to std::{string,vector etc.} types before it.

Yes, that's why I think it is worth waiting with 5.0.0.

@amitdo
Collaborator

amitdo commented Dec 16, 2020

5.0.0-alpha-YYYYMMDD

+1

@egorpugin
Contributor

According to https://semver.org/ we can do 5.0.0-alpha.YYYYMMDD (dot instead of second dash).
But I'm ok with both.

@Shreeshrii
Collaborator

Thanks @stweil. I like the idea of having the date as part of the tag.

What will the version strings for later commits look like?

@M3ssman
Contributor

M3ssman commented Dec 17, 2020

What about switching from semantic versioning to calendar versioning?

The next official version could be 2021.03 (example).

This scheme (used by Ubuntu, Eclipse, IntelliJ, Windows, Docker ... and OCR-D/all) signals ongoing development and encourages regular releases. Usually, it expresses versions on monthly or even daily schedules. Most importantly, it decouples release management from semantic sophistries about what sort of changes are included in the next version and how they might affect application users. These decisions are quite often not so straightforward:

What are the reasons for tagging a new major version?

Semantically, this is widely used to communicate changes to the external API that might break client applications (https://en.wikipedia.org/wiki/Software_versioning#Sequence-based_identifiers). If there are only changes in behavior, even clear improvements (like with this PR), this is something desired, but it will not affect client applications (like https://github.com/sirfz/tesserocr).
Changes of the underlying code base should be noted as refactoring, which means that they don't even change existing behavior, because the new implementation should produce the same results as the old one did.
From a developer's point of view I can understand the psychological demand for a new major version, just to underline that so many things were changed internally and so much effort has been spent to improve or refactor, but from an interface perspective I cannot agree.

Mixing versioning schemes is not considered good practice (https://mitchdenny.com/dates-in-version-numbers/)

@Shreeshrii
Collaborator

@M3ssman
Contributor

M3ssman commented Dec 17, 2020

@Shreeshrii Thanks, this is indeed very informative. I didn't know this service before. It even extracts a change log!

@stweil
Contributor Author

stweil commented Dec 17, 2020

According to https://semver.org/ we can do 5.0.0-alpha.YYYYMMDD (dot instead of second dash).

We switched to semver versioning some time ago, so of course the new release tag must match the semver rules.

5.0.0-alpha-YYYYMMDD is compatible with semver: the patch version is 5.0.0, and the pre-release identifier is alpha-YYYYMMDD. 5.0.0-alpha.YYYYMMDD would be possible, too, and has two pre-release identifiers alpha and YYYYMMDD.
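
The distinction between one and two pre-release identifiers can be checked with a small Python sketch. The regular expression is a simplified reading of the semver grammar (it does not enforce every rule, for example the ban on leading zeros in numeric identifiers), and the function name is made up for this example.

```python
import re

# Simplified form of: <version core> "-" <pre-release> "+" <build>
# Identifiers may contain digits, letters and "-"; "." separates them.
SEMVER = re.compile(
    r"^(\d+)\.(\d+)\.(\d+)"
    r"(?:-([0-9A-Za-z-]+(?:\.[0-9A-Za-z-]+)*))?"
    r"(?:\+([0-9A-Za-z-]+(?:\.[0-9A-Za-z-]+)*))?$"
)


def prerelease_identifiers(version):
    """Return the dot-separated pre-release identifiers of a version."""
    match = SEMVER.match(version)
    if match is None:
        raise ValueError(f"not a semver string: {version!r}")
    pre = match.group(4)
    return pre.split(".") if pre else []


print(prerelease_identifiers("5.0.0-alpha-20201224"))  # ['alpha-20201224']
print(prerelease_identifiers("5.0.0-alpha.20201224"))  # ['alpha', '20201224']
print(prerelease_identifiers("5.0.0"))                 # []
```

Both spellings parse; they differ only in how many identifiers the pre-release part contains.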

@egorpugin
Contributor

egorpugin commented Dec 17, 2020

Check grammar https://semver.org/

<version core> "-" <pre-release> "+" <build>

As I understand it, there are no other "-" dash symbols except the one splitting core and pre-release.

@stweil
Contributor Author

stweil commented Dec 17, 2020

What will the version strings for later commits look like?

We would use the pattern 5.0.0-alpha-YYYYMMDD until we have release candidates like 5.0.0-rc1 or the final 5.0.0. A Tesseract version string would look like 5.0.0-alpha-20201217-2-g58aa for example.
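
To make the example version string easier to read, here is a small Python sketch that splits it into its parts. It assumes the string follows the `git describe` layout `<tag>-<commits since tag>-g<abbreviated hash>`; the function name is hypothetical and this is not how Tesseract itself builds or parses the string.

```python
import re

# Assumed layout: <tag>-<number of commits since the tag>-g<short hash>
DESCRIBE = re.compile(r"^(?P<tag>.+)-(?P<commits>\d+)-g(?P<sha>[0-9a-f]+)$")


def split_version(version):
    """Split a version string into (tag, commits since tag, short hash)."""
    match = DESCRIBE.match(version)
    if match is None:
        # A build from a tagged release carries no commit suffix.
        return version, 0, None
    return match.group("tag"), int(match.group("commits")), match.group("sha")


print(split_version("5.0.0-alpha-20201217-2-g58aa"))
# ('5.0.0-alpha-20201217', 2, '58aa')
print(split_version("5.0.0-alpha-20201224"))
# ('5.0.0-alpha-20201224', 0, None)
```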

@Shreeshrii
Collaborator

@stweil
Contributor Author

stweil commented Dec 17, 2020

As I understand it, there are no other "-" dash symbols except the one splitting core and pre-release.

Dash can be part of the pre-release identifier:

<non-digit> ::= <letter>
              | "-"

@egorpugin
Contributor

Ah, I see, yes.

@Shreeshrii
Collaborator

Heads up for tag for proposed new 5.0.0-alpha-YYYYMMDD release
@lvc for updating abi-tracker
@AlexanderP for PPA etc

@linuxhw

linuxhw commented Dec 21, 2020

@stweil
Contributor Author

stweil commented Dec 22, 2020

Thanks for the reminder. As 4.1.1 already exists and there is no urgent need to fix those issues in 4.1.x, I moved the milestones to 5.0.0 now. One of the issues could also be closed.

As long as there are only a few people working on Tesseract, I see no chance for full support of more than a single major release. That means only critical bugs will be fixed in 4 by backporting changes from 5. But this is open source development, so anybody can send pull requests to backport compatible changes to 4.

Regarding the release tag, we don't mix. We use semantic versioning which allows pre-release identifiers with a date part. We also don't add a commit hash to the tag. Such commit hashes are added automatically to the Tesseract version string for builds which are not based on a tagged release.

@amitdo
Collaborator

amitdo commented Dec 22, 2020

It's a somewhat complicated issue.

We have two kinds of users:

  • tesseract command line users (regular users).
  • libtesseract users (developers).

The reason we did not release 5.0.0 yet is that we still plan to break libtesseract ABI, and we don't want to release 5.0.0 and then to release 6.0.0 with a new ABI 6/12 months later.

But that leaves the regular users with the one year old 4.1.1, and we don't know when we will release 5.0.0.

When people report an issue they have with 4.1.1, we often tell them: "Try 5.0.0...".

There are also the LTS users with their old/very old versions of tesseract.

Many users won't use an alpha/beta version in production, so they want a new stable release.

I don't know what we can do about this issue in a way that will please most users. You should remember that we have limited resources. Ignore the "by Google" we still advertise. They don't participate in this project anymore.

@stweil
Contributor Author

stweil commented Dec 24, 2020

The new pre-release 5.0.0-alpha-20201224 is available now. Merry Christmas and thank you to everybody.

@Shreeshrii
Collaborator

Merry Christmas @stweil. Thank you and all the contributors.

@MerlijnWajer
Contributor

JFYI as a follow up, this is what happened to archive.org's OCR speed once we switched from 4.1.1 to 5.0.0-alpha-20201231.

[Image: ocr-perf]

Seems like the 20-30% estimated speedup holds even at 3 million pages per day. Awesome. :-)

@amitdo
Collaborator

amitdo commented Dec 27, 2021

I got some test images from the Internet Archive. The new code gives better results for many lines in those images, but some lines which were inverted with the old code are now no longer inverted and are not recognized.

Changing the threshold from 0.5 to 0.7 still got all improvements, but fixed the lines with regressions. If that is confirmed in more tests, it should be changed in the code. Maybe the threshold could be a parameter.

@stweil,

Maybe we should change the threshold to 0.7 or add an invert_threshold parameter for 5.0.1?

@stweil
Contributor Author

stweil commented Dec 29, 2021

This should have been done for the 5.0.0 release, but I missed it because the issue was neither assigned to a project nor to a milestone.

@amitdo
Collaborator

amitdo commented Jun 23, 2022

Can this change be done in 5.2.0, or do you prefer to change it in 6.0?

@stweil
Contributor Author

stweil commented Jun 24, 2022

Thanks for the reminder. I think it can be done in 5.2.0.

@stweil
Contributor Author

stweil commented Jun 24, 2022

Maybe we should change the threshold to 0.7 or add an invert_threshold parameter for 5.0.1?

I am just trying to implement such a parameter. It could make the existing parameter tessedit_do_invert redundant, because a threshold greater than 0.0 could be equivalent to tessedit_do_invert == true.
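
Why a single threshold could replace the boolean switch can be seen in a minimal Python sketch. It assumes mean confidences in the range 0.0 to 1.0; the function name is hypothetical and the code is not taken from the Tesseract sources.

```python
def needs_inverted_pass(mean_conf, invert_threshold):
    """Return True if a line should be recognized again on the inverted image.

    Both values are assumed to lie in the range 0.0 to 1.0.
    """
    return mean_conf < invert_threshold


# A threshold of 0.0 can never be undercut, so inversion is effectively off.
print(needs_inverted_pass(0.0, 0.0))  # False
# A line with mean confidence 0.6 is skipped at 0.5 but retried at 0.7.
print(needs_inverted_pass(0.6, 0.5))  # False
print(needs_inverted_pass(0.6, 0.7))  # True
```

Under these assumptions no separate on/off flag is needed: any threshold above 0.0 enables the second pass for sufficiently poor lines, and 0.0 disables it.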

Should the new parameter be called invert_threshold as suggested or is there a better name (tessedit_invert_threshold)?

@amitdo
Collaborator

amitdo commented Jun 25, 2022

Choose your preferred option.

@rmast

rmast commented Jun 25, 2022

I bumped into this discussion. I have an image with both normal and inverted text that isn't recognized well at the bottom left with language nld. However, when I invert it, all text seems to be recognized correctly, inverted or not...

internetarchive/archive-pdf-tools#55

tesseract --version
tesseract 5.1.0
leptonica-1.79.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
Found AVX
Found FMA
Found SSE4.1
Found OpenMP 201511

@stweil
Contributor Author

stweil commented Jun 26, 2022

That image is recognized much better with model script/Latin instead of nld. But the results for nld also improve with parameter -c invert_threshold=0.9 in my new code (see pull request #3852).

@rmast

rmast commented Jun 26, 2022 via email

@stweil
Contributor Author

stweil commented Jun 26, 2022

Is there double content? Then that's wrong.

@MerlijnWajer
Contributor

@rmast - I would test with just Tesseract, not other tools like OCRMyPDF that use Tesseract, and then see if you still get double content. I have never seen this - the only issue I've seen is diplopia (for which there is an open MR), but that is only per character.

@rmast

rmast commented Jun 26, 2022

tesseract 175789293-f39ddfdb-6f3e-4598-8d16-80a1f4a88b36.jpg out -l nld -c invert_threshold=0.9 hocr

For example, "wis - clear" appears twice at nearly the same spot, both times with sufficient but slightly different confidence.

   <div class='ocr_carea' id='block_1_27' title="bbox 2164 361 2399 410">
    <p class='ocr_par' id='par_1_37' lang='nld' title="bbox 2149 361 2399 410">
     <span class='ocr_line' id='line_1_76' title="bbox 2164 361 2399 410; baseline -0.008 -14.121; x_size 28.333334; x_descenders 5.3333335; x_ascenders 7">
      <span class='ocrx_word' id='word_1_308' title='bbox 2164 375 2175 396; x_wconf 49'>&gt;</span>
      <span class='ocrx_word' id='word_1_309' title='bbox 2194 361 2341 410; x_wconf 92'>wis-clear</span>
      <span class='ocrx_word' id='word_1_310' title='bbox 2373 361 2399 410; x_wconf 46'>|</span>
      <span class='ocrx_word' id='word_1_311' title='bbox 2194 372 2237 395; x_wconf 90'>wis</span>
      <span class='ocrx_word' id='word_1_312' title='bbox 2249 384 2257 388; x_wconf 85'>-</span>
      <span class='ocrx_word' id='word_1_313' title='bbox 2269 372 2335 395; x_wconf 96'>clear</span>
     </span>
    </p>
   </div>
tesseract 5.1.0-71-g7c0c
 leptonica-1.79.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
 Found AVX512BW
 Found AVX512F
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found OpenMP 201511
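
The double content in the hOCR fragment above can be spotted mechanically by looking for word boxes that lie completely inside another word box. The following Python sketch does this for the six word spans quoted above. The regular expression only fits the single-quoted attribute layout shown here, and the function names are made up for the example.

```python
import re

# Matches the single-quoted ocrx_word spans of the fragment above.
WORD = re.compile(
    r"<span class='ocrx_word' id='(?P<id>[^']+)' "
    r"title='bbox (?P<x0>\d+) (?P<y0>\d+) (?P<x1>\d+) (?P<y1>\d+); "
    r"x_wconf (?P<conf>\d+)'>(?P<text>.*?)</span>"
)

SAMPLE = """
<span class='ocrx_word' id='word_1_308' title='bbox 2164 375 2175 396; x_wconf 49'>&gt;</span>
<span class='ocrx_word' id='word_1_309' title='bbox 2194 361 2341 410; x_wconf 92'>wis-clear</span>
<span class='ocrx_word' id='word_1_310' title='bbox 2373 361 2399 410; x_wconf 46'>|</span>
<span class='ocrx_word' id='word_1_311' title='bbox 2194 372 2237 395; x_wconf 90'>wis</span>
<span class='ocrx_word' id='word_1_312' title='bbox 2249 384 2257 388; x_wconf 85'>-</span>
<span class='ocrx_word' id='word_1_313' title='bbox 2269 372 2335 395; x_wconf 96'>clear</span>
"""


def parse_words(hocr):
    """Return (id, (x0, y0, x1, y1), text) for every ocrx_word span."""
    words = []
    for m in WORD.finditer(hocr):
        box = tuple(int(m.group(k)) for k in ("x0", "y0", "x1", "y1"))
        words.append((m.group("id"), box, m.group("text")))
    return words


def contains(outer, inner):
    """True if box inner lies completely inside box outer."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def nested_words(words):
    """Return (outer_id, inner_id) pairs of words with nested boxes."""
    return [(a[0], b[0]) for a in words for b in words
            if a[0] != b[0] and contains(a[1], b[1])]


for outer, inner in nested_words(parse_words(SAMPLE)):
    print(outer, "contains", inner)
# word_1_309 contains word_1_311
# word_1_309 contains word_1_312
# word_1_309 contains word_1_313
```

The box of "wis-clear" encloses the boxes of "wis", "-" and "clear", which matches the double content described above.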

@amitdo
Collaborator

amitdo commented Jun 26, 2022

@rmast, maybe #3489 is related to your issue.

@rmast

rmast commented Jun 26, 2022 via email

@stweil
Contributor Author

stweil commented Jun 26, 2022

git bisect shows that before commit 2252936 that example of double content just returned "EEEN wis - clear". Reverting that commit in the latest code indeed removes the double content with the nld model, but with script/Latin the double content is still there. So the bisect result only shows that the issue is sensitive to small changes.

Running git bisect with script/Latin gives a different and more plausible result. It finds that commit eaf72ac is causing the double content. I'll have a look whether this can be fixed.

@amitdo
Collaborator

amitdo commented Jun 28, 2022

@stweil, do you consider this issue a blocker for releasing 5.2.0?

@rmast

rmast commented Aug 19, 2022

@rmast, maybe #3489 is related to your issue.

I tried to understand #3489 to look whether my fix would apply, but I guess #3489 has to do with the difference between automatic segmentation and 'straight' -psm 4 processing of a line that really isn't that straight. My issue was about inversion which has implementations in Tesseract on different levels.

@rmast

rmast commented Oct 11, 2022 via email
