Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix fixtures for MS Word, and properly handle newlines #17459

Closed
wants to merge 1 commit into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
11 changes: 11 additions & 0 deletions packages/blocks/src/api/raw-handling/ms-newline-normaliser.js
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
export default function( node ) {
if ( node.nodeName !== 'P' ) {
return;
}

if ( ! [ ...node.classList ].find( ( className ) => className.match( /^Mso[A-Z].*/ ) ) ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think this is a unique MS Word problem? More specifically a paragraph problem? In my opinion, we should be writing a general line break (\n) collapser, as line breaks (\n) in HTML are not visible in any circumstance. The are only used to format the HTML.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are, for sure, other tools that do weird things with line breaks. However, this PR (and the related issue) is specific to pasting content from MS Word, where one gets additional line breaks that have been injected by MS Word itself.

It might be true that HTML, or rather the browser, doesn't care about newlines. But they will also end up in content.raw in the REST API response. And this is where things can break.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there are other sources that may inject line breaks. That's why I'm suggesting to look for a more general fix. See #17470.

return;
}

node.innerHTML = node.innerHTML.replace( /(?<!<br>)\n/g, ' ' );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we trying to strip line breaks (\n)? Why should it not be preceded by a line break element?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, yes! 🙂 This is exactly what this is about.

Please refer to the initial comment of the related issue.

This behavior, as well as the fix/handling for it, is specific to pasting content from MS Word. As you can see in the mentioned issue, MS Word copies content formatted with a maximum line-length of 80 characters. So there are newlines injected by MS Word itself. Any actual user-defined linebreak is actually a <br>, followed by a newline. And we do not want to replace that.

Does that make sense?

}
2 changes: 2 additions & 0 deletions packages/blocks/src/api/raw-handling/paste-handler.js
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,7 @@ import isInlineContent from './is-inline-content';
import phrasingContentReducer from './phrasing-content-reducer';
import headRemover from './head-remover';
import msListConverter from './ms-list-converter';
import msNewlineNormaliser from './ms-newline-normaliser';
import listReducer from './list-reducer';
import imageCorrector from './image-corrector';
import blockquoteNormaliser from './blockquote-normaliser';
Expand Down Expand Up @@ -201,6 +202,7 @@ export function pasteHandler( { HTML = '', plainText = '', mode = 'AUTO', tagNam
const filters = [
googleDocsUIDRemover,
msListConverter,
msNewlineNormaliser,
headRemover,
listReducer,
imageCorrector,
Expand Down
16 changes: 12 additions & 4 deletions test/integration/fixtures/ms-word-in.html
Original file line number Diff line number Diff line change
@@ -1,11 +1,9 @@

<p class=MsoTitle><span lang=EN-US style='mso-ansi-language:EN-US'>This is a
title<o:p></o:p></span></p>
<p class=MsoTitle><span lang=EN-US style='mso-ansi-language:EN-US'>This is a title<o:p></o:p></span></p>

<p class=MsoNormal><span lang=EN-US style='mso-ansi-language:EN-US'><o:p>&nbsp;</o:p></span></p>

<p class=MsoSubtitle><span lang=EN-US style='mso-ansi-language:EN-US'>This is a
subtitle<o:p></o:p></span></p>
<p class=MsoSubtitle><span lang=EN-US style='mso-ansi-language:EN-US'>This is a subtitle<o:p></o:p></span></p>

<p class=MsoNormal><span lang=EN-US style='mso-ansi-language:EN-US'><o:p>&nbsp;</o:p></span></p>

Expand All @@ -21,6 +19,16 @@ <h2><span lang=EN-US style='mso-ansi-language:EN-US'>This is a heading level 2<o
style='mso-bidi-font-weight:normal'>paragraph</b> with a <a
href="https://w.org/">link</a>.<o:p></o:p></span></p>

<p class=MsoNormal><span lang=EN-US style='mso-ansi-language:EN-US'>This is a <b
style='mso-bidi-font-weight:normal'>longer paragraph</b> with quite a few lines
of text to be able to check the correct handling of all the newlines, which are
automatically injected by Microsoft Word, but which we do not want to land in the
editor.<o:p></o:p></span></p>

<p class=MsoNormal><span lang=EN-US style='mso-ansi-language:EN-US'>This is a <b
style='mso-bidi-font-weight:normal'>paragraph</b> with a manual<br>
newline.<o:p></o:p></span></p>

<p class=MsoNormal><span lang=EN-US style='mso-ansi-language:EN-US'><o:p>&nbsp;</o:p></span></p>

<p class=MsoListParagraphCxSpFirst style='text-indent:-18.0pt;mso-list:l0 level1 lfo1'><![if !supportLists]><span
Expand Down
15 changes: 11 additions & 4 deletions test/integration/fixtures/ms-word-out.html
Original file line number Diff line number Diff line change
@@ -1,11 +1,9 @@
<!-- wp:paragraph -->
<p>This is a
title</p>
<p>This is a title</p>
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
<p>This is a
subtitle</p>
<p>This is a subtitle</p>
<!-- /wp:paragraph -->

<!-- wp:heading {"level":1} -->
Expand All @@ -20,6 +18,15 @@ <h2>This is a heading level 2</h2>
<p>This is a <strong>paragraph</strong> with a <a href="https://w.org/">link</a>.</p>
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
<p>This is a <strong>longer paragraph</strong> with quite a few lines of text to be able to check the correct handling of all the newlines, which are automatically injected by Microsoft Word, but which we do not want to land in the editor.</p>
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
<p>This is a <strong>paragraph</strong> with a manual<br>
newline.</p>
<!-- /wp:paragraph -->

<!-- wp:list -->
<ul><li>A</li><li>Bulleted<ul><li>Indented</li></ul></li><li>List</li></ul>
<!-- /wp:list -->
Expand Down