No text lines in ALTO output #13

stweil · 2020-01-11T21:40:46Z

The conversion of an example file from PAGE to ALTO creates an ALTO file without text lines:

java -jar PATH/PageConverter.jar -source-xml FILE_0063_OCR-D-OCR-TESSEROCR.xml -target-xml FILE_0063_OCR-D-OCR-TESSEROCR-ALTO.xml -convert-to ALTO

Is ALTO support still incomplete?

chris1010010 · 2020-01-12T14:24:05Z

Hi, what format is the input file and what does it contain?

stweil · 2020-01-12T14:51:32Z

The input is the PAGE XML file https://digi.bib.uni-mannheim.de/~stweil/FILE_0063_OCR-D-OCR-TESSEROCR.xml (link was also given above). It contains the layout information and OCR results for a single page from one of our books.

chris1010010 · 2020-01-13T08:36:47Z

Okay, that explains it. Words (STRINGs) are mandatory in ALTO. So because there are no words in the PAGE XML, the ALTO exporter cannot add the text lines.
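
A quick way to check whether a PAGE XML file is affected is to count its TextLine and Word elements before converting. The following is a minimal sketch, not part of PageConverter or any OCR-D tool: the function name `count_page_elements` and the inline sample document are made up for illustration, and the sample is deliberately simplified (it omits Coords and is not schema-valid). Element names are matched by local name so that the PAGE namespace version does not matter.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def count_page_elements(xml_text):
    """Count TextRegion, TextLine and Word elements in a PAGE XML string."""
    root = ET.fromstring(xml_text)
    counts = {"TextRegion": 0, "TextLine": 0, "Word": 0}
    for element in root.iter():
        name = local_name(element.tag)
        if name in counts:
            counts[name] += 1
    return counts


# Hypothetical, simplified sample: text is annotated at line level only,
# so there are no Word elements below the TextLine.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="example.png" imageWidth="100" imageHeight="100">
    <TextRegion id="r1">
      <TextLine id="l1">
        <TextEquiv><Unicode>line-level text only</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

if __name__ == "__main__":
    counts = count_page_elements(SAMPLE)
    print(counts)
    if counts["TextLine"] > 0 and counts["Word"] == 0:
        print("Lines but no words: an ALTO export may come out without text lines.")
```

To check a real file, read it and pass its contents to `count_page_elements`. A result with lines but zero words matches the situation described in this thread.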

stweil · 2020-01-13T09:26:31Z

@kba, @bertsky, does that mean that we have to change ocrd-tesserocr-recognize to produce PAGE XML with words?

stweil · 2020-01-13T09:28:36Z

@chris1010010, thank you for your explanation.

wrznr · 2020-01-13T09:36:45Z

@stweil No. You can set textequiv_level to word.

bertsky · 2023-06-06T14:44:15Z

For anyone who runs into this pitfall from another angle (beyond the special case of ocrd-tesserocr-recognize):
Consider using https://github.com/kba/page-to-alto, which supports many more conversions than the PRImA library, including a dummy print space, dummy lines, and dummy regions where needed.

chris1010010 closed this as completed Jan 13, 2020

jbarth-ubhd mentioned this issue Feb 28, 2020

"ocr-transform page alto ... ...": loosing text UB-Mannheim/ocr-fileformat#123

Closed
