"ocr-transform page alto ... ...": loosing text #123
ocr-transform page alto ... ...
losing text
no But calling |
I've checked the docs of the most recent JPageConverter:
Perhaps a duplicate of PRImA-Research-Lab/prima-page-converter#13
Indeed, PAGE-ALTO conversion requires word segmentation. @maxnth Can you think of any sensible workaround?
Did a quick-and-dirty script: https://gist.github.com/jbarth-ubhd/0e867c20008639145386a7978fdb27a4
Great, but maybe we can integrate pseudo-word creation on-the-fly directly into the converter, with a cmdline flag.
Word-level PAGE XML output for calamari has been planned for some time now, but sadly we haven't gotten around to implementing it yet.
Seems not to be fixed in v0.4.0.
ocrd_calamari is at 1.0.0 and calamari at 1.0.5, but word-level PAGE output is indeed not implemented yet in calamari AFAICT.
ocrd_calamari (but AFAIK not Calamari yet) can produce word and glyph level segmentation since a year ago, it just does not do so by default. Sorry I didn't speak up earlier, I just didn't know about this issue here. @jbarth-ubhd You need to set ocrd_calamari's parameter Quoting ocrd_calamari's README:
ocrd_calamari does more than Calamari here because we wanted to include Calamari's glyph level infos, i.e. character positions and alternative (less probable) character predictions; and as PAGE XML has a strict line>word>glyph hierarchy, we needed to include a word segmentation. This word segmentation is inferred from the text, e.g. "Lorem ipsum dolor sit amet" becomes "Lorem| |ipsum| |dolor| |sit| |amet", strictly on spaces as expected by OCR-D's validation.
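To illustrate the splitting rule quoted above, here is a minimal sketch in Python. It is not ocrd_calamari's actual code; the function names `split_on_spaces` and `words_only` are made up for this example, and it only shows the "strictly on spaces" tokenization of a line's text, not how coordinates are assigned to each word.

```python
import re


def split_on_spaces(line_text):
    """Split a line into word tokens and single-space tokens, in order.

    Hypothetical helper: only U+0020 counts as a separator, so tabs or
    non-breaking spaces stay inside a word token.
    """
    return re.findall(r" |[^ ]+", line_text)


def words_only(line_text):
    """Return just the word tokens, dropping the space tokens."""
    return [token for token in split_on_spaces(line_text) if token != " "]


tokens = split_on_spaces("Lorem ipsum dolor sit amet")
print("|".join(tokens))  # Lorem| |ipsum| |dolor| |sit| |amet
print(words_only("Lorem ipsum dolor sit amet"))
```

Keeping the space tokens makes it possible to rebuild the original line text exactly by concatenating the tokens again.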
I wasn't aware of that until now, good to know! And good it's already in ocrd_calamari, albeit originally for an entirely different reason. 😀
What prima-page-converter/ocr-fileformat could do, as far as I can tell from this issue: give a user-friendly warning that there are no words in the PAGE document, so that ALTO conversion is not possible.
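A rough sketch of such a pre-flight check, in Python, assuming it runs as a separate step before the converter is called. Nothing here is part of prima-page-converter or ocr-fileformat; the function names and the warning text are invented for illustration, and the sample document is a made-up, stripped-down PAGE file. Elements are matched by local name so that the check does not depend on a particular PAGE namespace version.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def count_elements(page_xml, name):
    """Count elements with the given local name anywhere in the document."""
    root = ET.fromstring(page_xml)
    return sum(1 for el in root.iter() if local_name(el.tag) == name)


def alto_conversion_warning(page_xml):
    """Return a warning string if the PAGE document has lines but no words.

    Hypothetical check; returns None when nothing looks wrong.
    """
    lines = count_elements(page_xml, "TextLine")
    words = count_elements(page_xml, "Word")
    if lines > 0 and words == 0:
        return (
            "PAGE document has %d TextLine element(s) but no Word elements; "
            "ALTO output would contain no text." % lines
        )
    return None


# Made-up minimal PAGE document with line-level text only.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="example.png" imageWidth="100" imageHeight="50">
    <TextRegion id="r1">
      <TextLine id="l1">
        <TextEquiv><Unicode>Lorem ipsum</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

print(alto_conversion_warning(SAMPLE))
```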
None of this is needed anymore, since we have been using https://github.com/kba/page-to-alto for this purpose since #134. I suggest closing (I cannot do it myself).
Example page generated with OCR-D ocrd-calamari-recognize
OCR_0007.zip
ocr-transform page hocr ... ... && ocr-transform hocr alto2.0 ... ...
instead loses the page size.