"ocr-transform page alto ... ...": loosing text #123

jbarth-ubhd · 2020-02-28T11:36:56Z

Example page generated with OCR-D ocrd-calamari-recognize
OCR_0007.zip

ocr-transform page hocr ... ... && ocr-transform hocr alto2.0 ... ... instead loses the page size.

jbarth-ubhd · 2020-02-28T12:02:22Z

no open() syscall on any /usr/local/share/ocr-fileformat/xslt/* when doing strace -f.

But calling execve("/usr/bin/java", ["java", "-jar", "/usr/local/share/ocr-fileformat/vendor/JPageConverter/PageConverter.jar", "-neg-coords", "toZero", "-source-xml", "OCR_0007.xml", "-target-xml", "xxx", "-convert-to", "ALTO"], 0x5614283d4a10 /* 24 vars */) = 0

jbarth-ubhd · 2020-02-28T12:03:51Z

I've checked the docs of the most recent JPageConverter: -convert-to available versions:

LATEST
2013-07-15
2010-03-19
but not: ALTO ???

jbarth-ubhd · 2020-02-28T13:07:15Z

Perhaps duplicate of PRImA-Research-Lab/prima-page-converter#13

kba · 2020-02-28T13:46:10Z

> Perhaps duplicate of PRImA-Research-Lab/prima-page-converter#13

Indeed, PAGE-ALTO conversion requires word segmentation. @maxnth Can you think of any sensible workaround?

jbarth-ubhd · 2020-02-28T14:06:27Z

Did a quick-and-dirty script: https://gist.github.com/jbarth-ubhd/0e867c20008639145386a7978fdb27a4

kba · 2020-02-28T14:10:57Z

Great but maybe we can integrate pseudo-word creation on-the-fly directly into the converter, with a cmdline flag.
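A minimal sketch of what such on-the-fly pseudo-word creation could look like. It assumes a line is given as its text plus an axis-aligned bounding box, and that every character is equally wide, which is a crude approximation. The function name `pseudo_words` and the box format are hypothetical; this is neither the gist above nor anything that exists in ocr-fileformat.

```python
from typing import List, Tuple

# Hypothetical box format for this sketch: (x0, y0, x1, y1)
Box = Tuple[int, int, int, int]


def pseudo_words(text: str, box: Box) -> List[Tuple[str, Box]]:
    """Split a line on spaces and give each word a proportional slice of the line box."""
    x0, y0, x1, y1 = box
    if not text:
        return []
    # Assumption: all characters (including spaces) have the same width.
    char_width = (x1 - x0) / len(text)
    words = []
    pos = 0  # character offset of the current token within the line
    for token in text.split(" "):
        if token:
            start = x0 + round(pos * char_width)
            end = x0 + round((pos + len(token)) * char_width)
            words.append((token, (start, y0, end, y1)))
        pos += len(token) + 1  # skip the token and the space that follows it
    return words
```

The resulting boxes are only as good as the equal-width assumption, so they may be adequate for carrying text into ALTO but probably not for anything that depends on accurate word coordinates.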

maxnth · 2020-02-28T18:31:20Z

Word-level PAGE XML output for Calamari has been planned for some time, but sadly we haven't gotten around to implementing it yet.
It's one of my next tasks, though, and will hopefully be included in Calamari within the coming month.
I don't know whether that's too late for this specific case, but maybe it helps to know that the feature is being worked on.

jbarth-ubhd · 2020-12-21T11:13:54Z

seems not to be fixed in v0.4.0.

kba · 2020-12-21T11:42:07Z

> seems not to be fixed in v0.4.0.

ocrd_calamari is at 1.0.0 and Calamari at 1.0.5, but word-level PAGE output is indeed not implemented in Calamari yet, AFAICT.

mikegerber · 2021-02-05T02:01:55Z

ocrd_calamari (but AFAIK not Calamari yet) has been able to produce word- and glyph-level segmentation for about a year now; it just does not do so by default. Sorry I didn't speak up earlier, I just didn't know about this issue.

@jbarth-ubhd You need to set ocrd_calamari's parameter -P textequiv_level word.

Quoting ocrd_calamari's README:

> In addition to the line text it may also output word and glyph segmentation including per-glyph confidence values and per-glyph alternative predictions as provided by the Calamari OCR engine, using a textequiv_level of word or glyph. Note that while Calamari does not provide word segmentation, this processor produces word segmentation inferred from text segmentation and the glyph positions. The provided glyph and word segmentation can be used for text extraction and highlighting, but is probably not useful for further image-based processing.

ocrd_calamari does more than Calamari here because we wanted to include Calamari's glyph-level information, i.e. character positions and alternative (less probable) character predictions; and as PAGE XML has a strict line>word>glyph hierarchy, we needed to include a word segmentation. This word segmentation is inferred from the text, e.g. "Lorem ipsum dolor sit amet" becomes "Lorem| |ipsum| |dolor| |sit| |amet", strictly on spaces as expected by OCR-D's validation.
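The splitting on spaces described here can be illustrated with a few lines of Python. This is only a sketch of the idea, not ocrd_calamari's actual code; the function name `split_on_spaces` is made up for the example.

```python
import re
from typing import List


def split_on_spaces(line: str) -> List[str]:
    """Split a line strictly on single spaces, keeping each space as its own token."""
    # The capturing group makes re.split() return the separators as well;
    # empty strings (e.g. between two consecutive spaces) are dropped.
    return [token for token in re.split(r"( )", line) if token]


print("|".join(split_on_spaces("Lorem ipsum dolor sit amet")))
# prints: Lorem| |ipsum| |dolor| |sit| |amet
```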

mikegerber · 2021-02-05T02:11:32Z

> Indeed, PAGE-ALTO conversion requires word segmentation.

I wasn't aware of that until now, good to know! And good it's already in ocrd_calamari, albeit originally for an entirely different reason. 😀

mikegerber · 2021-02-05T12:22:52Z

What prima-page-converter/ocr-fileformat could do, as far as I can tell from this issue: give a user-friendly warning that the PAGE document contains no words and that ALTO conversion is therefore not possible.
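A rough sketch of such a check, written as a standalone Python pre-flight step rather than as part of either tool. The function names and the warning text are hypothetical; the only assumption about PAGE XML is that word segmentation appears as `Word` elements, matched here by local name so that the PAGE schema version does not matter.

```python
import sys
import xml.etree.ElementTree as ET


def count_words(page_xml: str) -> int:
    """Count Word elements in a PAGE XML string, ignoring the namespace."""
    root = ET.fromstring(page_xml)
    # Tags look like "{namespace}Word"; compare only the local name.
    return sum(1 for el in root.iter() if el.tag.rsplit("}", 1)[-1] == "Word")


def check_has_words(page_xml: str) -> bool:
    """Return True if the document has words, otherwise warn on stderr and return False."""
    if count_words(page_xml) == 0:
        print(
            "warning: PAGE document contains no Word elements; "
            "ALTO output would lose the text",
            file=sys.stderr,
        )
        return False
    return True
```

A wrapper script could call `check_has_words()` on the input file's contents before invoking the converter and stop early with the warning instead of silently producing an ALTO file without text.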

bertsky · 2023-06-06T14:46:37Z

None of this is needed anymore, since we have been using https://github.com/kba/page-to-alto for this purpose since #134.

I suggest closing (cannot do it myself).

jbarth-ubhd changed the title ~~ocr-transform page alto ... ... loosing text~~ ocr-transform page alto ... ... loosing text Feb 28, 2020

jbarth-ubhd changed the title ~~ocr-transform page alto ... ... loosing text~~ "ocr-transform page alto ... ...": loosing text Feb 28, 2020

jbarth-ubhd closed this as completed Jun 6, 2023
