-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Decode HTML entities created by block parser matcher reliance on innerHTML #25120
Conversation
Hmmm...looks like we have failing unit tests so this could well be unsuitable. |
Size Change: +48 B (0%) Total Size: 1.26 MB
ℹ️ View Unchanged
|
These unit tests are failing
This is because the |
Is the problem specific to reusable blocks? Are there problems anywhere else (with steps to reproduce)? |
@ellatrix Sorry for the delay here.
No, on
If you add as a reusable block the same thing happens but it's worse because the Reusable Blocks are parsed in the background and then there are errors in the console because you have a Reusable block saved.
I'm yet to witness any. We do have integration tests failing though so let me look into those and come back to you. |
TestingTest ExpectationsThe serialized form of the block's content (ie: that which is persisted) should be:
The parsed form of the block (ie: that displayed in the editor) should be:
Testing results for other blocks
|
0cdd361
to
bbc2590
Compare
bbc2590
to
12f0370
Compare
Update: i've alter the PR to only convert the specific HTML entities that are not converted by Element.innerHTML namely:
There's a simple lookup which converts them to their correct form. This preserves existing functionality (all unit tests should now pass - we'll see!) and also fixes the test cases shown above. The reason for this is because now all of the entities have been converted the parsed block content matches the saved block content and thus the block validation passes. The best thing is that it also fixes another bug with I'll write this up, add test cases and unit tests and then ask for code review. |
I've just noticed this PR introduces a regression with the
My understanding is that it's best to sanitise data on output (rendering) not on input (saving to DB). I'll dig deeper shortly... |
@dmsnell I think you've come across something similar recently? |
056b585
to
8261b03
Compare
8261b03
to
bc5b16d
Compare
It looks like the The reason
...is precisely to avoid it being rendered as HTML in order that it can display in raw form. |
@@ -38,6 +52,6 @@ export function html( selector, multilineTag ) { | |||
return value; | |||
} | |||
|
|||
return match.innerHTML; | |||
return replaceInnerHTMLEntities( match.innerHTML ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we can change this behaviour at this point. That's why I added a new type. Otherwise we could just return the raw html value here instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I played with this a bit more today. You're correct we cannot change the behaviour. It even breaks Core block let alone any third party blocks.
I'm now more convinced than ever that having a raw
source type such as you propose is the best solution.
Closing this based on discussion here. IMHO the proposal to add a new |
This PR fixes an issue with Block parsing as evidenced by #24282 whereby
<
characters are converted into the HTML entity equivalent<
thus causing a block validation error in the HTML block.In addition it fixes the issue whereby
&
and>
characters in the HTML block are converted into their entity forms.See also #23509.
Description
Essentially on
trunk
right now, if you have a HTML block which contains the literal<
character then it will cause a block validation error when the block is parsed.Try it now...
3 < 4
.Screen.Capture.on.2021-09-14.at.17-55-00.mov
Why does this happen?
It's complex...but the TL;DR is
save
content is different from that in the post body.3 < 4
(correct)3 < 4
(does not match!)<
which is<
.html
matcher which is used to parse block's identified as containing HTML utilises.innerHTML
to read the content from the match.Element.innerHTML
returns the following characters as HTML entities:&
,<
, or>
.So why does the validation error only happen for
<
?A good question!
The reason appears to be that
isEquivalentHTML()
method used to determine whether blocks are valid utilisesgetHTMLTokens()
which relies onTokenizer
from thenpm
package `'simple-html-tokenizer'.<
from3 < 4
as being an opening tag. See Fails on < inside a <pre> tag tildeio/simple-html-tokenizer#20 for more details.isEquivalentHTML
will returnfalse
.<
character the other two characters (&
and>
) do not cause validation errors.What bugs do & and > cause?
&
and>
are still converted to their entity forms there is an inconsistency in the HTML block as the original content&
becomes&
and>
becomes>
. This is not what the user wants nor expects.Screen.Capture.on.2021-09-14.at.17-53-01.mp4
Solution - how does this PR solve these issues?
We need to ensure that
&
,<
and>
are converted into their string form. However, we do not want to convert all HTML entities as this would cause things such as
to be unintentionally converted.Therefore our fix acts in a limited way and only transforms those characters that are specifically returned as HTML entities from
Element.innerHTML
.Because the chars are converted at the base "matcher" level of the block parsers all the block validation issues are resolved. Moreover, the content the user entered is preserved (ie: not converted into entity form).
How has this been tested?
On Master
3 < 4
.This PR
3 < 4
.Types of changes
Bug fix (non-breaking change which fixes an issue)
Checklist:
Original Issue Content
Essentially if you have a HTML block which contains
3 < 4
it will cause validation errors because the parsed version of the block'ssave
content is different from that in the post body.3 < 4
The serialized block content is actually
3 < 4
. It is only when this content is parsed that it is converted into a string containing HTML entities.Note I believe this occurs because the
html()
matcher passed tohpqParse
relies on using.innerHTML
to retrieve the content of the parsed HTML.For the
match
(DOM node) which is passed intohtml
, some light debugging reveals:match.innerHTML
=>3 < 4
match.innerText
=>3 < 4
I understand this difference is due to the way
innerHTML
works - indeed the MDN reference forinnerHTML
says:I believe this is why the
<
gets encoded to<
.However, in the matcher implementation
match.innerHTML
is returned which means the string with encoded entities is returned.Wrapping this string in
@wordpess/html-entities
'sdecodeEntities()
ensures that any entities are converted back to their string form.Note for @ellatrix: I could be way off on this one. I just narrowed down the cause of the specific issue but that doesn't mean it doesn't have lots of knock-on effects. I'm hopeful that as we're just fixing the
innerHTML
returned it should be ok, but this will needs lots of testing with content variations.Detailed Explanation
The block parser works roughly as follows:
createBlockWithFallback
is called with the block node in question. In our use case it looks like this:Notice how
innerHTML
andinnerContent
are correct as3 < 4
but the block attributes have yet to be parsed.createBlockWithFallback
thecreateBlock
function is calledgutenberg/packages/blocks/src/api/parser.js
Lines 530 to 534 in 0d2771b
This in turn calls
getBlockAttributes
which callsgetBlockAttribute
to map all the block attributes:gutenberg/packages/blocks/src/api/parser.js
Lines 291 to 301 in 0d2771b
getBlockAttribute
is passed anattributeSchema
argument that has the following shape/content:...which it uses to determine how it should parse the attributes.
Depending on the source type (in our case
html
),parseWithAttributeSchema
is then called with theattrubuteSchema
above.This in turn hands off to
hpqParse
from thehpq
library.gutenberg/packages/blocks/src/api/parser.js
Lines 219 to 221 in 0d2771b
This takes a 2nd argument of the
matcher
function to use inhpqParse
. This is determine by a call tomatcherFromSource
which uses thesource
type (in our casehtml
) to retrieve the appropriatematcher
.gutenberg/packages/blocks/src/api/parser.js
Lines 184 to 185 in 0d2771b
In our case this is the
html
matcher.html
matcher usesinnerHTML
to retrieve the content for use in the parsed attribute:gutenberg/packages/blocks/src/api/matchers.js
Lines 40 to 41 in 0d2771b
This is where the issue described above occurs. Namely:
I believe this is why the
<
gets encoded to<
. But this only happens for blocks which have a source type ofhtml
. As far as I'm aware that is just thecore/html
block.createBlockWithFallback
function,createBlock
has now returned and we pass the result to thegetBlockContentValidationResult
function to determine whether the block is considered valid.gutenberg/packages/blocks/src/api/parser.js
Lines 530 to 548 in 0d2771b
At this point the
block
object returned fromcreateBlock
looks like thisand the
innerHTML
is"3 < 4"
.getBlockContentValidationResult
function is called the result is that the block is determined to be invalid due to the mismatch between the saved block content and the value ofblock.attributes.content
:gutenberg/packages/blocks/src/api/validation/index.js
Lines 690 to 694 in cecabd0
In the code above:
originalBlockContent
is3 < 4
generatedBlockContent
is3 < 4
The reason why this fails validation is likely due to the reliance on
simple-html-tokenizer
. It appears that it will "see" the<
from3 < 4
as being an opening tag. See tildeio/simple-html-tokenizer#20 for more details.We can avoid this by correctly escaping entities.
The validation of a block checks whether the following two items are "equivalent HTML"
block.originalContent
.generatedBlockContent
- returned by callinggetSaveContent( blockType, block.attributes );
.This is achieved via the
isEquivalentHTML
method and if this does not returntrue
then the block is considered invalid.Description