Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docx comments on tracked-changes insertions not handled properly #9833

Closed
steinbro opened this issue Jun 1, 2024 · 6 comments
Closed

docx comments on tracked-changes insertions not handled properly #9833

steinbro opened this issue Jun 1, 2024 · 6 comments
Labels

Comments

@steinbro
Copy link

steinbro commented Jun 1, 2024

When a Word comment applies to text that is an unaccepted track-changes insertion, that comment is not included in the pandoc output.

Here is a minimal example file, with two paragraphs, each with a single comment. One paragraph is normal text while the other paragraph is an insertion using track changes.

Converting to markdown (for example), only the comment from the first paragraph is preserved. Oddly, there are still two comment-end spans.

$ pandoc comments_test.docx -t markdown --track-changes=all

This is [This is a comment on normal text]{.comment-start id="1"
author="Daniel Steinbrook" date="2024-05-31T19:15:00Z"}some
[]{.comment-end id="1"} normal text[]{.paragraph-insertion
author="Daniel Steinbrook" date="2024-05-31T19:14:00Z"}

[This is some added]{.insertion author="Daniel Steinbrook"
date="2024-05-31T19:14:00Z"} [[]{.insertion author="Daniel Steinbrook"
date="2024-05-31T19:15:00Z"}]{.comment-end id="4"}[text]{.insertion
author="Daniel Steinbrook" date="2024-05-31T19:14:00Z"}

When going roundtrip from docx to json and back again, Word reports errors when opening the resulting docx.

$ pandoc -f json -t docx -o comments_test_roundtrip.docx <(pandoc comments_test.docx -t json --track-changes=all)

"Word found unreadable content in comments_test_roundtrip.docx. Do you want to recover the contents of this document? If you trust the source of this document, click Yes."

If you click Yes, the resulting document has two comments, but the second comment has no content or metadata.

Using pandoc 3.2 on macOS Sonoma 14.5.

@steinbro steinbro added the bug label Jun 1, 2024
@jgm
Copy link
Owner

jgm commented Jun 1, 2024

Here we have

      <w:ins w:id="3" w:author="Daniel Steinbrook" w:date="2024-05-31T19:14:00Z">
        <w:r>
          <w:t xml:space="preserve">This is some
          </w:t>
        </w:r>
        <w:commentRangeStart w:id="4"/>
        <w:r>
          <w:t xml:space="preserve">added
          </w:t>
        </w:r>
      </w:ins>
      <w:commentRangeEnd w:id="4"/>

As you can see, the w:commentRangeStart appears inside a closed w:ins element. I suspect that's the problem, but I'd have to look more closely at the docx parser. (I'm not familiar with the comment parsing code and the original author is no longer active.)

@steinbro
Copy link
Author

steinbro commented Jun 1, 2024

Hmm, maybe there's something weird about that particular document with the comment spanning the insertion boundary. But, the broader problem is that comments within the insertion do not show up in the converted output. Here's another examplewith two comments, where only one comment (the one not in the tracked-changes insertion) is preserved after pandoc conversion. I was expecting that both comments would be included when --track-changes=all is specified, since that mode should include both insertions and comments.

@jgm
Copy link
Owner

jgm commented Jun 1, 2024

Here's some relevant code, I think:
https://github.com/jgm/pandoc/blob/main/src/Text/Pandoc/Readers/Docx/Parse.hs#L1030-L1033
The body of the ChangedRuns is derived from the children of the w:ins element via the function elemToRun -- and this function seems to ignore comment-starts.

@jgm
Copy link
Owner

jgm commented Jun 2, 2024

With this diff

diff --git a/src/Text/Pandoc/Readers/Docx.hs b/src/Text/Pandoc/Readers/Docx.hs
index 2737bb0c8..14e88fb07 100644
--- a/src/Text/Pandoc/Readers/Docx.hs
+++ b/src/Text/Pandoc/Readers/Docx.hs
@@ -385,22 +385,22 @@ parPartToInlines parPart =
 
 parPartToInlines' :: PandocMonad m => ParPart -> DocxContext m Inlines
 parPartToInlines' (PlainRun r) = runToInlines r
-parPartToInlines' (ChangedRuns (TrackedChange Insertion (ChangeInfo _ author date)) runs) = do
+parPartToInlines' (ChangedRuns (TrackedChange Insertion (ChangeInfo _ author date)) pparts) = do
   opts <- asks docxOptions
   case readerTrackChanges opts of
-    AcceptChanges -> smushInlines <$> mapM runToInlines runs
+    AcceptChanges -> smushInlines <$> mapM parPartToInlines pparts
     RejectChanges -> return mempty
     AllChanges    -> do
-      ils <- smushInlines <$> mapM runToInlines runs
+      ils <- smushInlines <$> mapM parPartToInlines pparts
       let attr = ("", ["insertion"], addAuthorAndDate author date)
       return $ spanWith attr ils
-parPartToInlines' (ChangedRuns (TrackedChange Deletion (ChangeInfo _ author date)) runs) = do
+parPartToInlines' (ChangedRuns (TrackedChange Deletion (ChangeInfo _ author date)) pparts) = do
   opts <- asks docxOptions
   case readerTrackChanges opts of
     AcceptChanges -> return mempty
-    RejectChanges -> smushInlines <$> mapM runToInlines runs
+    RejectChanges -> smushInlines <$> mapM parPartToInlines pparts
     AllChanges    -> do
-      ils <- smushInlines <$> mapM runToInlines runs
+      ils <- smushInlines <$> mapM parPartToInlines pparts
       let attr = ("", ["deletion"], addAuthorAndDate author date)
       return $ spanWith attr ils
 parPartToInlines' (CommentStart cmtId author date bodyParts) = do
diff --git a/src/Text/Pandoc/Readers/Docx/Parse.hs b/src/Text/Pandoc/Readers/Docx/Parse.hs
index f432bb64a..c8a4561ab 100644
--- a/src/Text/Pandoc/Readers/Docx/Parse.hs
+++ b/src/Text/Pandoc/Readers/Docx/Parse.hs
@@ -359,7 +359,7 @@ leftBiasedMergeRunStyle a b = RunStyle
 type Extent = Maybe (Double, Double)
 
 data ParPart = PlainRun Run
-             | ChangedRuns TrackedChange [Run]
+             | ChangedRuns TrackedChange [ParPart]
              | CommentStart CommentId Author (Maybe CommentDate) [BodyPart]
              | CommentEnd CommentId
              | BookMark BookMarkId Anchor
@@ -1027,7 +1027,7 @@ elemToParPart' ns element
     return $ map PlainRun runs
 elemToParPart' ns element
   | Just change <- getTrackedChange ns element = do
-      runs <- mconcat <$> mapD (elemToRun ns) (elChildren element)
+      runs <- mconcat <$> mapD (elemToParPart' ns) (elChildren element)
       return [ChangedRuns change runs]
 elemToParPart' ns element
   | isElem ns "w" "bookmarkStart" element

I get the following native output from your sample. Can you confirm that this is correct?

[ Para
    [ Str "This"
    , Space
    , Str "is"
    , Space
    , Span
        ( ""
        , [ "comment-start" ]
        , [ ( "id" , "1" )
          , ( "author" , "Daniel Steinbrook" )
          , ( "date" , "2024-05-31T19:15:00Z" )
          ]
        )
        [ Str "This"
        , Space
        , Str "is"
        , Space
        , Str "a"
        , Space
        , Str "comment"
        , Space
        , Str "on"
        , Space
        , Str "normal"
        , Space
        , Str "text"
        ]
    , Str "some"
    , Space
    , Span ( "" , [ "comment-end" ] , [ ( "id" , "1" ) ] ) []
    , Space
    , Str "normal"
    , Space
    , Str "text"
    , Span
        ( ""
        , [ "paragraph-insertion" ]
        , [ ( "author" , "Daniel Steinbrook" )
          , ( "date" , "2024-05-31T19:14:00Z" )
          ]
        )
        []
    ]
, Para
    [ Span
        ( ""
        , [ "insertion" ]
        , [ ( "author" , "Daniel Steinbrook" )
          , ( "date" , "2024-05-31T19:14:00Z" )
          ]
        )
        [ Str "This"
        , Space
        , Str "is"
        , Space
        , Str "some"
        , Space
        , Span
            ( ""
            , [ "comment-start" ]
            , [ ( "id" , "4" )
              , ( "author" , "Daniel Steinbrook" )
              , ( "date" , "2024-05-31T19:15:00Z" )
              ]
            )
            [ Str "This"
            , Space
            , Str "is"
            , Space
            , Str "a"
            , Space
            , Str "comment"
            , Space
            , Str "on"
            , Space
            , Str "added"
            , Space
            , Str "text"
            ]
        , Str "added"
        ]
    , Space
    , Span
        ( "" , [ "comment-end" ] , [ ( "id" , "4" ) ] )
        [ Span
            ( ""
            , [ "insertion" ]
            , [ ( "author" , "Daniel Steinbrook" )
              , ( "date" , "2024-05-31T19:15:00Z" )
              ]
            )
            []
        ]
    , Span
        ( ""
        , [ "insertion" ]
        , [ ( "author" , "Daniel Steinbrook" )
          , ( "date" , "2024-05-31T19:14:00Z" )
          ]
        )
        [ Str "text" ]
    ]
]

@steinbro
Copy link
Author

steinbro commented Jun 2, 2024

That is potentially correct, to the best of my limited knowledge of how the insertion spans should be arranged. It certainly includes the full comment text within the insertion. Thanks for the impressively quick patch! When might this become part of an official pandoc build?

@jgm jgm closed this as completed in 9a7fd6e Jun 2, 2024
@jgm
Copy link
Owner

jgm commented Jun 2, 2024

It will be in the next release. Not sure when that will be, but if you're desperate you can compile from source or use a nightly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants