Table caption in DOCX attached to wrong table #9358

Closed
rgaiacs opened this issue Jan 22, 2024 · 2 comments

Comments

@rgaiacs
Contributor

rgaiacs commented Jan 22, 2024

Consider table-mwe.docx that has two tables:

[Image: Screenshot 2024-01-22 175759]

Let's convert the .docx to Markdown with:

pandoc --from docx --to markdown+table_captions table-mwe.docx

Observed Output

The caption is connected to the first table instead of the second table.

Lorem ipsum

  -----------------------------------------------------------------------
  A                                   B
  ----------------------------------- -----------------------------------
  C                                   D

  -----------------------------------------------------------------------

  : Numbers from 1 to 4

Lorem ipsum

  -----------------------------------------------------------------------
  1                                   2
  ----------------------------------- -----------------------------------
  3                                   4

  -----------------------------------------------------------------------

Lorem ipsum

Expected Output

Lorem ipsum

  -----------------------------------------------------------------------
  A                                   B
  ----------------------------------- -----------------------------------
  C                                   D

  -----------------------------------------------------------------------

Lorem ipsum

  -----------------------------------------------------------------------
  1                                   2
  ----------------------------------- -----------------------------------
  3                                   4

  -----------------------------------------------------------------------

  : Numbers from 1 to 4

Lorem ipsum

Environment

pandoc --version returns

pandoc 3.1.11.1
Features: +server +lua
Scripting engine: Lua 5.4
User data directory: /home/raniere/.local/share/pandoc
Copyright (C) 2006-2023 John MacFarlane. Web: https://pandoc.org
This is free software; see the source for copying conditions. There is no
warranty, not even for merchantability or fitness for a particular purpose.

rgaiacs added the bug label Jan 22, 2024

@jgm
Owner

jgm commented Jan 22, 2024

The xml structure is:

w:tbl
w:p (Lorem ipsum)
w:p (with pStyle "Caption" and w:keepNext)
w:tbl
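
For anyone who wants to check this structure on their own file, here is a minimal sketch in Python (standard library only). It relies on a .docx being a zip archive whose main part is word/document.xml. The function name body_outline and the label format are made up for illustration, and it only looks at direct children of w:body, so tables or paragraphs nested inside other elements are not listed.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (the "w:" prefix in the outline above)
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def body_outline(docx_path):
    """Return one short label per top-level w:tbl / w:p child of w:body.

    Hypothetical helper for inspection only; not part of pandoc.
    """
    with zipfile.ZipFile(docx_path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    body = root.find(f"{{{W}}}body")
    outline = []
    for child in body:
        tag = child.tag.split("}")[-1]
        if tag == "tbl":
            outline.append("w:tbl")
        elif tag == "p":
            # Paragraph style, if any, lives in w:pPr/w:pStyle/@w:val
            style = child.find(f"{{{W}}}pPr/{{{W}}}pStyle")
            name = style.get(f"{{{W}}}val") if style is not None else None
            text = "".join(t.text or "" for t in child.iter(f"{{{W}}}t"))
            outline.append(f"w:p [{name}] {text}" if name else f"w:p {text}")
        # Other children (e.g. w:sectPr) are skipped
    return outline
```

Calling body_outline("table-mwe.docx") should give a list comparable to the outline above, with the caption paragraph marked by its style name.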

@jgm
Owner

jgm commented Jan 22, 2024

OK, it looks like the docx reader does the following:

  • in bodyToOutput: looks for all the captions among the body paragraphs and puts a list of them in state
  • in bodyPartToBlocks for Tbl: gets the list of captions from state, takes the first one, and modifies state to contain the rest

So the captions are assigned to tables in the order they occur, no matter their proximity to the tables. Obviously that's giving bad results in this case, but it is a bit tricky to devise better heuristics.
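
To make the behaviour concrete, here is a simplified model in Python. It is not the docx reader's actual Haskell code: the body is reduced to a list of (kind, value) pairs, and both function names are made up for illustration. assign_in_order mimics the two steps described above. assign_adjacent is just one possible heuristic (attach a caption only to a table directly after it, else directly before it), offered as an assumption about what "proximity" could mean, not as a proposed fix.

```python
from collections import deque


def assign_in_order(body):
    """Model of the described behaviour: captions go to tables in document order."""
    # Step 1: collect every caption paragraph up front
    pending = deque(value for kind, value in body if kind == "caption")
    out = []
    # Step 2: each table takes the next pending caption, wherever it came from
    for kind, value in body:
        if kind == "tbl":
            out.append(("tbl", value, pending.popleft() if pending else None))
        elif kind == "p":
            out.append(("p", value))
        # caption paragraphs are consumed, not emitted as paragraphs
    return out


def assign_adjacent(body):
    """Hypothetical heuristic: a caption attaches only to a neighbouring table."""
    captions = {}     # index of table in body -> caption text
    consumed = set()  # indices of captions that found a table
    for i, (kind, value) in enumerate(body):
        if kind != "caption":
            continue
        if i + 1 < len(body) and body[i + 1][0] == "tbl" and (i + 1) not in captions:
            captions[i + 1] = value  # caption directly above a table
            consumed.add(i)
        elif i > 0 and body[i - 1][0] == "tbl" and (i - 1) not in captions:
            captions[i - 1] = value  # caption directly below a table
            consumed.add(i)
    out = []
    for i, (kind, value) in enumerate(body):
        if kind == "tbl":
            out.append(("tbl", value, captions.get(i)))
        elif i not in consumed:
            out.append(("p", value))  # unmatched captions stay as paragraphs
    return out


# The structure from this issue: tbl, p, caption, tbl
body = [
    ("tbl", "A-D"),
    ("p", "Lorem ipsum"),
    ("caption", "Numbers from 1 to 4"),
    ("tbl", "1-4"),
]
print(assign_in_order(body))  # caption lands on the first table (observed output)
print(assign_adjacent(body))  # caption lands on the second table (expected output)
```

The adjacency rule handles this file, but it leaves open the harder cases, such as a caption separated from its table by an empty paragraph, which is part of why a general heuristic is tricky.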
