HTML API: Allow additional fragment contexts. #7141
Previously, the fragment parser in WP_HTML_Processor has only allowed creating a fragment with the `<body>` context. In this patch, any context node is allowed.
41dc4aa to 691d39e
@sirreal it looks like we may have a lot of work to do before we can open this up to allowing self-contained elements as the fragment context. there are multiple challenges:
The fragment parser takes an Element. An HTML string seems like a poor representation for that 🙂 I wonder if the fragment parser could take a token (to create a fragment from an existing document) or some other representation that's not just a string in order to create fragments without needing to create a full document first.
As you asked me, here are my thoughts.
One additional one: do you want to add a blocklist of forbidden subnodes, or do you think this is the responsibility of the developer?
For example, preventing `<html>` or `<body>` from being added to body, or pretty much everything from being added to title.
Not sure whether this would be the right place to do so, but I had that thought and wanted to share it.
return null;
}

$context_processor = new WP_HTML_Tag_Processor( $context );
Not sure how deep you want to comment, but a comment here could help understanding what you were up to.
$context_tag = $context_processor->get_tag();
$context_attributes = array();
foreach ( $context_processor->get_attribute_names_with_prefix( '' ) as $name ) {
Having an empty prefix here looks strange.
If it were my code, I'd either make the prefix optional (defaulting to empty) or add a `get_attribute_names()` which would do exactly the same as `get_attribute_names_with_prefix( '' )`. It might be redundant in a way, but for me it would look a little more readable.
thanks @apermo. if this looks strange, it's supposed to 🙃
when we added this method we wanted it to be a bit awkward in the hopes of communicating that it involves additional costs. we wanted the default behavior to be looking for a subset of attributes (such as `get_attribute_names_with_prefix( 'data-' )`) so that the API itself guides people to learn in the safest, most performant manner.
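To make the point about the empty prefix concrete, here is a minimal sketch of prefix filtering over a plain list of attribute names. The function `sketch_attribute_names_with_prefix()` is a hypothetical stand-in written for illustration; it is not the WordPress implementation and it does not parse HTML. It only shows why an empty prefix behaves as "return every attribute", while a prefix such as `data-` narrows the result.

```php
<?php
/**
 * Hypothetical stand-in (not the WordPress method): filters a plain list of
 * attribute names by prefix, comparing case-insensitively.
 */
function sketch_attribute_names_with_prefix( array $names, string $prefix ): array {
	$prefix  = strtolower( $prefix );
	$matches = array();

	foreach ( $names as $name ) {
		$name = strtolower( $name );

		// An empty prefix matches every name, so '' means "all attributes".
		if ( '' === $prefix || 0 === strpos( $name, $prefix ) ) {
			$matches[] = $name;
		}
	}

	return $matches;
}

$names     = array( 'class', 'data-id', 'data-role', 'href' );
$data_only = sketch_attribute_names_with_prefix( $names, 'data-' );
$all       = sketch_attribute_names_with_prefix( $names, '' );
```

Here `$data_only` holds only the two `data-` names, while `$all` holds the full list, which is the more expensive case the awkward spelling is meant to flag.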
* @param string $context Context element for the fragment, must be default of `<body>`.
* @param string $encoding Text encoding of the document; must be default of 'UTF-8'.
"must be default of ..."?
My internal autocomplete expected an "or" here. That is likely equivalent, but I was expecting it.
And I think you need to update it at least for `$context`, since your change is about allowing contexts other than body.
thanks for noticing.
`must be default of UTF-8` and `must be default or UTF-8` are very different, whereas only the default value has been allowed (and the default is UTF-8). some day we might open it up to other values, but this is there to communicate intentionally that this is a UTF-8-only interface at the moment.
  return null;
  }

  $processor = new static( $html, self::CONSTRUCTOR_UNLOCK_CODE );
- $processor->state->context_node = array( 'BODY', array() );
+ $processor->state->context_node = array( $context_tag, $context_attributes );
  $processor->state->insertion_mode = WP_HTML_Processor_State::INSERTION_MODE_IN_BODY;
If I understood your PR correctly, it's about allowing insertion into anything other than body, or is "body" ambiguous in this case? So content body vs. `<body>`?
Anyway, I'm uncertain whether this is intentional or whether you forgot to touch this.
the fragment parser is synonymous with `node.innerHTML = newHtml` in JavaScript or with a DOM. in this case, `<body>` has been the default as a reasonably safe default for existing code, which likely is getting only a small chunk of the actual site HTML and then trying to process it.
so if you knew you were inside a `<li>` you would create the fragment parser in the `<li>` context and then the parse would change based on that. I probably don't fully understand this, because it's hard for me to come up with situations that lead to different parses, but it comes into play when inside SVGs or MathML elements, and when resetting some internals (the insertion mode).
basically this is not something most people will need to use, but it will be used by `set_inner_html()` to ensure appropriate parsing.
This cleans up the actual parsing and makes it easier to accommodate self-contained context elements.
97288b3 to 4c652ad
@sirreal: upon further review I feel like this is not appropriate for self-contained and void elements, so I am rejecting them. there's never a reason, I think, to create a fragment parser on an element which can contain no child nodes (other than text, potentially). One case I was specifically thinking through is a […]
So I think that long-term the only use for the fragment parser outside of the usual is when setting inner HTML, and in that case, it seems more fitting for that function to call out something special for elements like […]
Apart from this I don't like exposing a token as the context node because that's cumbersome for people to write, and I don't expect people to want to go through the hassle of what that implies. Our tokens don't even contain attribute references, so this is not currently workable. However, it does potentially solve some issues around namespacing. There's no way in this function's argument to specify a context node in the […]
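As a rough illustration of rejecting void and self-contained elements as fragment contexts, a check could look like the sketch below. Everything here is an assumption made for illustration: `sketch_is_allowed_fragment_context()` and both constants are hypothetical names, not code from this patch. The void list follows the HTML standard; the self-contained list is deliberately short and incomplete, naming only a few elements whose contents are handled as raw text rather than child nodes.

```php
<?php
// Void elements per the HTML standard: these can never contain child nodes.
const SKETCH_VOID_ELEMENTS = array(
	'AREA', 'BASE', 'BR', 'COL', 'EMBED', 'HR', 'IMG',
	'INPUT', 'LINK', 'META', 'SOURCE', 'TRACK', 'WBR',
);

// Illustrative and incomplete: a few elements whose contents are text only.
const SKETCH_SELF_CONTAINED_ELEMENTS = array( 'SCRIPT', 'STYLE', 'TEXTAREA', 'TITLE' );

/**
 * Hypothetical helper: reports whether a tag name would be accepted as a
 * fragment context under the "no void, no self-contained" rule.
 */
function sketch_is_allowed_fragment_context( string $tag_name ): bool {
	$tag_name = strtoupper( $tag_name );

	return ! in_array( $tag_name, SKETCH_VOID_ELEMENTS, true )
		&& ! in_array( $tag_name, SKETCH_SELF_CONTAINED_ELEMENTS, true );
}

$li_allowed       = sketch_is_allowed_fragment_context( 'li' );
$br_allowed       = sketch_is_allowed_fragment_context( 'br' );
$textarea_allowed = sketch_is_allowed_fragment_context( 'TEXTAREA' );
```

With this rule a `<li>` context is accepted, while `<br>` and `<textarea>` are rejected before any parsing starts.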
I'll share some context from a conversation. As mentioned, it likely doesn't make sense to create a fragment parser in order to modify these "self-contained" tags, and better interfaces can be provided. In the specification, fragment parsers are always created from another document. This should be easy to do in the HTML API. However, most of the time WordPress is inspecting or modifying HTML snippets without considering their context; full HTML documents are rarely considered. To continue supporting this behavior, we need to be able to create fragments without a parent document. This is where most of the difficulty arises. Most of the time, […] The context element is used to establish the correct insertion mode, and its namespace and possibly its attributes determine how tokens should be handled (using the rules for foreign content or one of the HTML insertion modes). It may make sense to maintain this interface as-is.
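To illustrate how a context element could establish the insertion mode, here is a partial sketch based on my reading of the HTML standard's "reset the insertion mode appropriately" steps in the fragment case, where the context element is the only node considered. The function name and the returned strings are hypothetical labels, not WordPress constants, and SELECT, TEMPLATE, and HTML contexts are left out because they depend on additional parser state.

```php
<?php
/**
 * Hypothetical, partial mapping from a fragment context element to an
 * insertion mode label. Returns null for contexts this sketch does not cover.
 */
function sketch_insertion_mode_for_context( string $context_tag ): ?string {
	switch ( strtoupper( $context_tag ) ) {
		case 'SELECT':
		case 'TEMPLATE':
		case 'HTML':
			// Needs extra state (ancestors, template modes, head pointer); omitted here.
			return null;
		case 'TR':
			return 'in row';
		case 'TBODY':
		case 'THEAD':
		case 'TFOOT':
			return 'in table body';
		case 'CAPTION':
			return 'in caption';
		case 'COLGROUP':
			return 'in column group';
		case 'TABLE':
			return 'in table';
		case 'FRAMESET':
			return 'in frameset';
		default:
			// Other contexts, including BODY, LI, TD, and TH, fall back to "in body".
			return 'in body';
	}
}

$li_mode    = sketch_insertion_mode_for_context( 'li' );
$tbody_mode = sketch_insertion_mode_for_context( 'tbody' );
```

This is why a hard-coded "in body" mode is only correct for some contexts: a `<tbody>` context, for example, would start in a table-related mode instead.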
I had a thought that may be helpful. I'd like to add an instance method to the HTML processor like […] Once we have this method, advanced fragment parser creation could be handled by that method:
$full_parser = WP_HTML_Processor::create_full_parser( '<!DOCTYPE html><math><annotation-xml encoding="text/html">' );
$full_parser->next_tag( 'ANNOTATION-XML' );
$fragment_parser = $full_parser->create_fragment_parser_at_node( '<h1>Who knows what happens here?' );
$fragment_parser->next_tag( 'H1' );
// …
I think internally this method would be used for things like `set_inner_html` that require a fragment parser.
See #7348
Inspired by WordPress#7141 Co-authored-by: Dennis Snell <dennis.snell@automattic.com>
#7777 is open for review. That would supersede this work.
https://core.trac.wordpress.org/changeset/59467 and #7777 have landed. I think this PR can be closed.
Trac-ticket: Core-61576.
Status
Add unit tests covering significant contexts, e.g. SCRIPT, TEXTAREA, SVG.
[…] `set_inner_html()` which will be an alias in those cases for `set_inner_text()`.
[…] `<body>` is the only supported context.
Description
Previously, the fragment parser in WP_HTML_Processor has only allowed creating a fragment with the `<body>` context. In this patch, any context node is allowed.