TableRegion should become ComposedBlock #1

Open
bertsky opened this issue Mar 31, 2021 · 7 comments

@bertsky
Collaborator

bertsky commented Mar 31, 2021

https://github.com/kba/page-to-alto/blob/46a8cc2fb74ce327e9d195f1095699cbae946cce/ocrd_page_to_alto/convert.py#L158

I think it's not enough to just map the lower levels here. There might not be any cell segmentation yet, only a detected table. And even if there is structure below that level, it's worthwhile mapping the recursive structure 1:1.

For that, there's the equivalent ComposedBlock in ALTO.

@kba
Member

kba commented Apr 1, 2021

It is a ComposedBlock:

https://github.com/kba/page-to-alto/blob/46a8cc2fb74ce327e9d195f1095699cbae946cce/ocrd_page_to_alto/convert.py#L25

Since you're working with invoices and such, can you please share some samples for tables in PAGE-XML, then I can improve and test the table conversion.

@bertsky
Collaborator Author

bertsky commented Apr 1, 2021

> It is a ComposedBlock:

Sorry, I was reading too sloppily.

> Since you're working with invoices and such, can you please share some samples for tables in PAGE-XML, then I can improve and test the table conversion.

Sure. How about assets/data/gutachten/data?

@kba
Member

kba commented Apr 7, 2021

For the sample gutachten/data/TEMP1/PAGE_TEMP1.xml, the current behavior seems to be correct:

<TableRegion>
  <TextRegion>
    <TextLine>

in PAGE becomes in ALTO:

<ComposedBlock>
  <TextBlock>
    <TextLine>

I couldn't find a sample of a more complex table with a nesting depth greater than 1.

@kba
Member

kba commented Apr 7, 2021

f138114 should support arbitrarily deep nesting in tables if I got the recursion right.

@bertsky
Collaborator Author

bertsky commented Apr 12, 2021

> f138114 should support arbitrarily deep nesting in tables if I got the recursion right.

Yes, I think you did. But there are more cases: in PAGE, TextRegion can itself contain both nested TextRegions and immediate TextLines. And all region types are recursive, not just tables.

The problem is that in ALTO, TextBlock is not recursive, only ComposedBlock is. And ComposedBlock is not allowed to have TextLines directly.

So you could (and probably need to) generalize the current pattern. But we would need to split up PAGE's "typed recursion" into ALTO's "pure recursion".

For example, if you have a GraphicRegion with embedded TextRegions, that would need to become a ComposedBlock comprised of an equally located/sized Illustration (which also maps its @type) followed by a list of TextBlocks for each embedded region.

Or if you have a TextRegion with immediate TextLines as well as embedded TextRegions, that would need to become a ComposedBlock comprised of an equally located TextBlock (with all the TextLines and its @type and @primaryLanguage), followed by a list of TextBlocks for the embedded regions.
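The two cases above could be sketched roughly as follows. This is only an illustration under simplifying assumptions, not the converter's actual code: the elements are un-namespaced, coordinates and IDs are ignored, and the names `LEAF_BLOCK`, `convert_region` and `make_leaf` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified tag mapping; real PAGE/ALTO files are namespaced
# and carry coordinates, IDs and many more attributes.
LEAF_BLOCK = {
    "TextRegion": "TextBlock",
    "GraphicRegion": "Illustration",
}


def make_leaf(region, lines):
    """Build the non-recursive ALTO block that stands for the region itself."""
    # Regions without a leaf counterpart (e.g. a bare TableRegion) fall back
    # to an empty ComposedBlock.
    block = ET.Element(LEAF_BLOCK.get(region.tag, "ComposedBlock"))
    # Only Illustration gets @type copied here; how TextRegion/@type should
    # be carried over is left open in this sketch.
    if block.tag == "Illustration" and "type" in region.attrib:
        block.set("TYPE", region.get("type"))
    for _ in lines:
        ET.SubElement(block, "TextLine")
    return block


def convert_region(region):
    """Turn PAGE's typed recursion into ALTO's ComposedBlock recursion."""
    children = [c for c in region if c.tag.endswith("Region")]
    lines = [c for c in region if c.tag == "TextLine"]
    if not children:
        # No embedded regions: a single leaf block is enough.
        return make_leaf(region, lines)
    composed = ET.Element("ComposedBlock")
    # The region's own content becomes an equally located leaf block.
    # A TextRegion without immediate lines has no own content to keep.
    if region.tag in LEAF_BLOCK and (lines or region.tag != "TextRegion"):
        composed.append(make_leaf(region, lines))
    # The embedded regions follow, converted recursively.
    for child in children:
        composed.append(convert_region(child))
    return composed
```

With this sketch, a GraphicRegion containing one TextRegion yields a ComposedBlock holding an Illustration followed by a TextBlock, and a TableRegion without any cells yields an empty ComposedBlock.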

It's unclear though, what to do with the TextEquiv at the region level (esp. if there's no line level below it) and other PAGE-specific info under TextRegion (like @leading / @align / @indented or @primaryScript or the order/direction attributes).

@kba
Member

kba commented Apr 12, 2021

I'll try to implement basic and mixed-lines/regions recursion with ComposedBlock.

> It's unclear though, what to do with the TextEquiv at the region level

There is nothing we can do, I think. ALTO only allows content for String.

@leading could be mapped to @LINESPACE, @align is implemented via ParagraphStyle. @indented could be mapped to either @LEFT or @FIRSTLINE?
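A rough sketch of that attribute mapping, building one ALTO ParagraphStyle from a TextRegion's attributes. Several things here are assumptions: the function name `paragraph_style`, the value table `ALIGN`, and the `indent_width` parameter (PAGE's @indented is only a flag, so the width has to come from somewhere else). Unit conversion between PAGE pixels and the ALTO measurement unit is ignored.

```python
import xml.etree.ElementTree as ET

# Assumed value mapping from PAGE @align to ALTO ParagraphStyle/@ALIGN.
ALIGN = {"left": "Left", "right": "Right", "centre": "Center", "justify": "Block"}


def paragraph_style(region_attrs, style_id, indent_width=None):
    """Build an ALTO ParagraphStyle from a dict of PAGE TextRegion attributes."""
    style = ET.Element("ParagraphStyle", ID=style_id)
    align = ALIGN.get(region_attrs.get("align"))
    if align is not None:
        style.set("ALIGN", align)
    if "leading" in region_attrs:
        # @leading is copied as-is; no unit conversion in this sketch.
        style.set("LINESPACE", str(region_attrs["leading"]))
    # @indented is boolean in PAGE, so FIRSTLINE is only written when the
    # caller supplies a width (hypothetical parameter).
    if region_attrs.get("indented") == "true" and indent_width is not None:
        style.set("FIRSTLINE", str(indent_width))
    return style
```

The sketch picks @FIRSTLINE for @indented; using @LEFT instead would only change the last `set` call.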

@kba
Member

kba commented Mar 20, 2024

The behavior is buggy: each TextRegion within a TableRegion in PAGE is duplicated, ending up as both a ComposedBlock and a TextBlock on the same level.
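One way to catch this kind of duplication in a test might be to look for block IDs that occur more than once in the ALTO output. This assumes the converter carries the PAGE region ID over into the ALTO block's @ID, which the thread does not confirm; the helper name `duplicated_block_ids` is hypothetical and namespaces are ignored.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# ALTO block-level elements (un-namespaced for this sketch).
BLOCK_TAGS = {"ComposedBlock", "TextBlock", "Illustration", "GraphicalElement"}


def duplicated_block_ids(alto_root):
    """Return the sorted IDs that are used by more than one block element."""
    counts = Counter(
        elem.get("ID")
        for elem in alto_root.iter()
        if elem.tag in BLOCK_TAGS and elem.get("ID") is not None
    )
    return sorted(block_id for block_id, n in counts.items() if n > 1)
```

An empty result would not prove the output is correct, but a non-empty one points at regions that were emitted twice.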

kba self-assigned this Mar 20, 2024