No text lines in ALTO output #13

Closed
stweil opened this issue Jan 11, 2020 · 7 comments

Comments

@stweil

stweil commented Jan 11, 2020

The conversion of an example file from PAGE to ALTO creates an ALTO file without text lines:

java -jar PATH/PageConverter.jar -source-xml FILE_0063_OCR-D-OCR-TESSEROCR.xml -target-xml FILE_0063_OCR-D-OCR-TESSEROCR-ALTO.xml -convert-to ALTO

Is ALTO support still incomplete?

@chris1010010
Contributor

Hi, what format is the input file and what does it contain?

@stweil
Author

stweil commented Jan 12, 2020

The input is the PAGE XML file https://digi.bib.uni-mannheim.de/~stweil/FILE_0063_OCR-D-OCR-TESSEROCR.xml (link was also given above). It contains the layout information and OCR results for a single page from one of our books.

@chris1010010
Contributor

Okay, that explains it. Words (STRINGs) are mandatory in ALTO. So because there are no words in the PAGE XML, the ALTO exporter cannot add the text lines.
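
To check whether a given PAGE XML file is affected, the element counts can be inspected before converting. The following is a minimal sketch, not part of the PRImA converter: the function names and the tiny sample document are made up for illustration, and tags are matched by local name so that the PAGE schema version does not matter.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def count_page_elements(xml_text):
    """Count TextRegion, TextLine and Word elements in a PAGE XML string."""
    root = ET.fromstring(xml_text)
    counts = Counter(
        local_name(el.tag) for el in root.iter() if isinstance(el.tag, str)
    )
    return {name: counts[name] for name in ("TextRegion", "TextLine", "Word")}


# Hand-written sample (hypothetical): one text line with text, but no Word elements.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="sample.png" imageWidth="100" imageHeight="100">
    <TextRegion id="r1">
      <TextLine id="l1">
        <TextEquiv><Unicode>Hello world</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

counts = count_page_elements(SAMPLE)
print(counts)
if counts["TextLine"] and not counts["Word"]:
    print("Text lines but no words: expect an ALTO file without text lines.")
```

For a real file, read its contents and pass them to `count_page_elements`; a non-zero `TextLine` count together with a zero `Word` count matches the situation described here.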

@stweil
Author

stweil commented Jan 13, 2020

@kba, @bertsky, does that mean that we have to change ocrd-tesserocr-recognize to produce PAGE XML with words?

@stweil
Author

stweil commented Jan 13, 2020

@chris1010010, thank you for your explanation.

@wrznr

wrznr commented Jan 13, 2020

@stweil No. You can set textequiv_level to word.
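
A rough sketch of how that parameter might be passed, assuming the usual OCR-D command line options `-I` (input file group), `-O` (output file group) and `-p` (JSON parameters). The helper name and the input file group `OCR-D-SEG-LINE` are placeholders, and any other parameters a real run needs (such as the model) are left out.

```python
import json
import shlex


def recognize_command(input_grp, output_grp, level="word"):
    """Build an ocrd-tesserocr-recognize call that sets textequiv_level."""
    params = {"textequiv_level": level}
    return [
        "ocrd-tesserocr-recognize",
        "-I", input_grp,
        "-O", output_grp,
        "-p", json.dumps(params),
    ]


# File group names are placeholders; adjust them to the workspace at hand.
cmd = recognize_command("OCR-D-SEG-LINE", "OCR-D-OCR-TESSEROCR")
print(shlex.join(cmd))
```

With `textequiv_level` set to `word`, the resulting PAGE XML should contain Word elements, which the ALTO export needs.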

@bertsky

bertsky commented Jun 6, 2023

For anyone who runs into this pitfall from another angle (beyond the special case of ocrd-tesserocr-recognize):
Consider using https://github.com/kba/page-to-alto, which supports many more conversions than the PRImA library, including a dummy print space, dummy lines and dummy regions, if needed.
