Modify OCR for inverted text #3141
The old code looked for the minimum confidence, which very often triggered a second OCR pass without improving the result. Signed-off-by: Stefan Weil <sw@weilnetz.de>
The time for OCR of a double page from a historical newspaper was reduced from 185 s to 136 s by these modifications. |
thanks. |
@zdenop Are you also applying these improvements to the 4.1 branch? |
I am not sure whether that would be correct for 4.1, because it is not a bug fix, but a modification which changes OCR results. So it would require at least a new 4.2. |
I got some test images from the Internet Archive. The new code gives better results for many lines in those images, but some lines which were inverted with the old code are now no longer inverted and are not recognized. Changing the threshold from 0.5 to 0.7 still got all improvements, but fixed the lines with regressions. If that is confirmed in more tests, it should be changed in the code. Maybe the threshold could be a parameter. Old and new results differ here:
|
Thanks for this! I'm happy to run a set of tests for some Internet Archive items. It would be easier for me if the threshold were a parameter, but I can also build Tesseract locally. Did the higher threshold have a significant impact on the processing speed? I assume it probably doesn't, since only a small portion of the lines will be inverted and processed again? |
@Shreeshrii : I do not plan 4.x release at the moment:
|
I also see no urgent need for a new Tesseract 4 release. We might consider tagging new releases from Git master as the last one was |
That's right. The most significant impact on the performance comes from testing the mean confidence value instead of the minimum confidence value. A lot of lines contain some part with low word confidence, and all those lines were processed twice with the old code. |
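To make the difference between the two trigger rules concrete, here is a minimal sketch in Python. It is not the Tesseract implementation: the function names and the per-line word confidences are invented for illustration, and only the 0.5 threshold and the minimum-versus-mean distinction come from this thread.

```python
from statistics import mean


def triggers_old(word_confidences, threshold=0.5):
    """Old rule as described in this thread: one low-confidence word is enough."""
    return min(word_confidences) < threshold


def triggers_new(word_confidences, threshold=0.5):
    """New rule as described in this thread: the mean word confidence must be low."""
    return mean(word_confidences) < threshold


# Hypothetical word confidences (0..1) for three text lines, invented for illustration.
lines = [
    [0.95, 0.91, 0.42, 0.97],  # normal text with one poorly recognized word
    [0.88, 0.93, 0.90],        # normal text
    [0.21, 0.30, 0.18, 0.35],  # mostly inverted text
]

old_count = sum(triggers_old(confs) for confs in lines)
new_count = sum(triggers_new(confs) for confs in lines)
print(f"second OCR pass: old rule {old_count} lines, new rule {new_count} lines")
```

With these made-up numbers, the first line is processed twice under the old rule only because of a single weak word, which is the kind of wasted work described above.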
@zdenop, @egorpugin, @Shreeshrii, @amitdo: Would it be okay if I tag a new release like I described it above? Or are there other suggestions? |
If you set pre-release tag, I think it's ok. Upd.: |
Yes, that's why I think it is worth waiting with 5.0.0. |
+1 |
According to https://semver.org/ we can do |
Thanks @stweil. I like the idea of having the date as part of the tag. What will the version strings for later commits look like? |
What about switching from semantic versioning to calendar versioning? The next official version would be This scheme (used by Ubuntu, Eclipse, IntelliJ, Windows, Docker ... and OCR-D/all) signals ongoing development and promotes regular releases. Usually, it expresses versions on monthly or even daily schedules. And most important: it decouples release management from semantic sophistry about what sort of changes are included in the next version and how they might affect application users. These decisions are quite often not so straightforward: What are the reasons for tagging a new major version? Semantically, this is widely used to communicate changes to the external API that might break client applications (https://en.wikipedia.org/wiki/Software_versioning#Sequence-based_identifiers). If there are only changes in behavior, even clear improvements (like with this PR), this is something desired, but it will not affect client applications (like https://github.com/sirfz/tesserocr). Mixing versioning schemes is not considered good practice (https://mitchdenny.com/dates-in-version-numbers/) |
@Shreeshrii Thanks, this is indeed very informative. I didn't know this service before. It even extracts a change log! |
We switched to semver versioning some time ago, so of course the new release tag must match the semver rules.
|
Check grammar https://semver.org/
As I understand it, there are no other "-" dash symbols except the one separating the core version from the pre-release. |
We would use the pattern |
Dash can be part of the pre-release identifier:
|
Ah, I see, yes. |
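The point about dashes can be checked against the regular expression suggested on https://semver.org/. The sketch below uses that pattern from Python; the helper name `prerelease_identifiers` is made up for this example, and the tag tested is the one used for the pre-release in this thread.

```python
import re

# Regular expression suggested on semver.org (named-group variant).
SEMVER = re.compile(
    r"^(?P<major>0|[1-9]\d*)\.(?P<minor>0|[1-9]\d*)\.(?P<patch>0|[1-9]\d*)"
    r"(?:-(?P<prerelease>(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)"
    r"(?:\.(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?"
    r"(?:\+(?P<buildmetadata>[0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?$"
)


def prerelease_identifiers(tag):
    """Return the dot-separated pre-release identifiers, or None if tag is not valid semver."""
    match = SEMVER.match(tag)
    if match is None:
        return None
    prerelease = match.group("prerelease")
    return prerelease.split(".") if prerelease else []


# The dash after "alpha" is part of a single pre-release identifier.
print(prerelease_identifiers("5.0.0-alpha-20201224"))
# A two-component version is rejected by the pattern.
print(prerelease_identifiers("5.0"))
```

Only the first dash separates the version core from the pre-release part; any further dashes belong to the identifiers themselves, which are split on dots.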
Heads up for tag for proposed new 5.0.0-alpha-YYYYMMDD release |
Thanks for the reminder. As 4.1.1 already exists and there is no urgent need to fix those issues in 4.1.x, I moved the milestones to 5.0.0 now. One of the issues could also be closed. As long as there are only a few people working on Tesseract, I see no chance for full support of more than a single major release. That means only critical bugs will be fixed in 4 by backporting changes from 5. But this is open source development, so anybody can send pull requests to backport compatible changes to 4. Regarding the release tag, we don't mix. We use semantic versioning, which allows pre-release identifiers with a date part. We also don't add a commit hash to the tag. Such commit hashes are added automatically to the Tesseract version string for builds which are not based on a tagged release. |
It's a bit complicated issue. We have two kinds of users:
The reason we did not release 5.0.0 yet is that we still plan to break the libtesseract ABI, and we don't want to release 5.0.0 and then release 6.0.0 with a new ABI 6/12 months later. But that leaves the regular users with the one-year-old 4.1.1, and we don't know when we will release 5.0.0. When people report an issue they have with 4.1.1, many times we tell them: "Try 5.0.0...". There are also the LTS users with their old/very old versions of Tesseract. Many users won't use an alpha/beta version in production, so they want a new stable release. I don't know what we can do about this issue in a way that will please most users. You should remember that we have limited resources. Ignore the "by Google" we still advertise. They don't participate in this project anymore. |
The new pre-release 5.0.0-alpha-20201224 is available now. Merry Christmas and thank you to everybody. |
Merry Christmas @stweil. Thank you and all the contributors. |
Maybe we should change the threshold to |
This should have been done for the 5.0.0 release, but I missed it because the issue was neither assigned to a project nor to a milestone. |
Can this change be done in 5.2.0, or do you prefer to change it in 6.0 ? |
Thanks for the reminder. I think it can be done in 5.2.0. |
I am just trying to implement such a parameter. It could make the existing parameter Should the new parameter be called |
Choose your preferred option. |
I bumped into this discussion. I have an image with both normal and inverted text that isn't recognized well at the bottom left with language nld. However, when I invert it, all text seems to be recognized correctly, inverted or not... internetarchive/archive-pdf-tools#55 tesseract --version |
That image is recognized much better with model |
Is the double content in the resulting hocr by design? Should the receiving party pick the best option?
|
Is there double content? Then that's wrong. |
@rmast - I would test with just Tesseract, not other tools like OCRMyPDF that use Tesseract, and then see if you still get double content. I have never seen this - the only issue I've seen is diplopia (for which there is an open MR), but that is only per character. |
tesseract 175789293-f39ddfdb-6f3e-4598-8d16-80a1f4a88b36.jpg out -l nld -c invert_threshold=0.9 hocr
For example, "wis - clear" appears twice at nearly the same spot, both times with sufficient but slightly different confidence.
|
Yes, I focused on that one instance by accident, the rest of the file doesn’t contain double content, so I agree.
|
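For anyone who wants to check a whole hOCR file for this kind of double content instead of spotting single instances by eye, here is a rough Python sketch. It assumes the usual hOCR layout where each word is a `span` with class `ocrx_word` and a `bbox x0 y0 x1 y1` entry in its `title` attribute. The sample markup, the coordinates, the 0.5 overlap cutoff and all names are invented for illustration, and word spans that contain nested spans are not handled.

```python
import re
from html.parser import HTMLParser
from itertools import combinations

BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class WordCollector(HTMLParser):
    """Collect (text, bbox) for every span with class ocrx_word."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX.search(attributes.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    width = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    height = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = width * height
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0


def duplicate_words(hocr, min_iou=0.5):
    """Return pairs of words with identical text and strongly overlapping boxes."""
    collector = WordCollector()
    collector.feed(hocr)
    collector.close()
    return [
        (first, second)
        for first, second in combinations(collector.words, 2)
        if first[0] == second[0] and iou(first[1], second[1]) >= min_iou
    ]


# Invented hOCR fragment: "wis" appears twice at nearly the same position.
sample = """
<span class='ocr_line' title='bbox 10 10 300 40'>
 <span class='ocrx_word' title='bbox 10 10 60 40; x_wconf 91'>wis</span>
 <span class='ocrx_word' title='bbox 12 11 61 40; x_wconf 88'>wis</span>
 <span class='ocrx_word' title='bbox 70 10 140 40; x_wconf 95'>clear</span>
</span>
"""
for first, second in duplicate_words(sample):
    print("possible duplicate:", first, second)
```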
Running |
@stweil, do you consider this issue a blocker for releasing 5.2.0? |
I tried to understand #3489 to look whether my fix would apply, but I guess #3489 has to do with the difference between automatic segmentation and 'straight' -psm 4 processing of a line that really isn't that straight. My issue was about inversion which has implementations in Tesseract on different levels. |
I’m reverting to your last commit before that commit, as that first older commit doesn’t compile.
4366d81
I want to confirm your git bisect-finding. Never heard of it…
I’ll use the old tessedit_do_invert=True.
|
The old code tries a 2nd OCR on lines when a word confidence is below 50 %, so a single word with low confidence triggers it. This happens rather often, mostly for lines without inverted text, and costs performance.
The new code checks the mean confidence instead of the minimum. This typically improves the performance significantly and still works for lines with much inverted text.
In addition, the OCR result of the inverted image is now accepted if its mean confidence is better than the original one. In my test this improves the OCR result, especially for lines which only have some words with inverted text.
The algorithm still does not handle lines with both normal and inverted text optimally. OCR for such lines could be improved by inverting word-wise or maybe by using the Leptonica function pixAutoPhotoinvert.
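The per-line decision described above can be sketched as follows. This is an illustration in Python, not the C++ code from this pull request: `recognize` and `invert` are hypothetical callables standing in for the recognizer and the image inversion, the line "images" are plain strings, and the canned results are invented.

```python
from statistics import mean


def mean_confidence(words):
    """Mean confidence of a list of (text, confidence) pairs; 0.0 for an empty result."""
    return mean(conf for _, conf in words) if words else 0.0


def recognize_line(image, recognize, invert, threshold=0.5):
    """Retry on the inverted image only if the mean confidence is low; keep the better result."""
    original = recognize(image)
    if mean_confidence(original) >= threshold:
        return original  # good enough, no second OCR pass
    inverted = recognize(invert(image))
    if mean_confidence(inverted) > mean_confidence(original):
        return inverted
    return original


# Stand-ins for illustration only: strings as images and canned recognition results.
fake_results = {
    "normal-line": [("Tesseract", 0.93), ("OCR", 0.41)],
    "inverted-line": [("~~", 0.22), ("#", 0.31)],
    "inverted-line (inverted)": [("white", 0.90), ("text", 0.94)],
}
calls = []


def fake_recognize(image):
    calls.append(image)
    return fake_results[image]


def fake_invert(image):
    return image + " (inverted)"


for name in ("normal-line", "inverted-line"):
    result = recognize_line(name, fake_recognize, fake_invert)
    print(name, "->", [text for text, _ in result])
print("recognizer calls:", len(calls))
```

In this toy run the normal line keeps its first result despite one weak word, and only the line with a low mean confidence pays for a second pass.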