HTML API: Add set_modifiable_text() for replacing text nodes. #7007
This patch introduces a new method, `set_modifiable_text()`, to the Tag Processor, which makes it possible and safe to replace text nodes within an HTML document while performing the appropriate escaping. This method can be used in conjunction with other code to modify the text content of a document, and can be used for transforming HTML in a streaming fashion. Fixes Core-61617.
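As a rough illustration of the streaming use described above, the following sketch walks a document token by token and rewrites only its text nodes. It assumes WordPress 6.7 or later is loaded, so that `WP_HTML_Tag_Processor` provides `set_modifiable_text()`; the function name `replace_word_in_text_nodes()` is hypothetical and not part of this patch.

```php
<?php
/**
 * Replaces a word in every text node of an HTML fragment.
 *
 * Assumption: WordPress 6.7+ is loaded, so WP_HTML_Tag_Processor exists
 * and offers set_modifiable_text(). The function name is hypothetical.
 */
function replace_word_in_text_nodes( string $html, string $search, string $replace ): string {
	$processor = new WP_HTML_Tag_Processor( $html );

	// next_token() visits tags, comments, and text nodes alike.
	while ( $processor->next_token() ) {
		if ( '#text' !== $processor->get_token_type() ) {
			continue;
		}

		// get_modifiable_text() returns the decoded text of the node.
		$text    = $processor->get_modifiable_text();
		$updated = str_replace( $search, $replace, $text );

		if ( $updated !== $text ) {
			// The processor escapes the new text before writing it into the document.
			$processor->set_modifiable_text( $updated );
		}
	}

	return $processor->get_updated_html();
}
```

Because the replacement is handed over as plain text, the caller does not escape it by hand. Text inside elements such as `SCRIPT` or `STYLE` is not reported as a `#text` token, so this sketch leaves it untouched.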
Adding support for replacing text nodes doesn't look that complicated. This is very exciting!
There are some edge cases covered here that I don't have enough insight to review quickly, but I'm looking forward to seeing these changes land in core.
 *
 * @return array[]
 */
private static function data_tokens_with_basic_modifiable_text_updates() {
Yeah, I've done this before by accident. Made the data provider public in 0340ee3.
It all looks great to me.
 * properly escape these things, but this could mask regex patterns
 * that previously worked. Resolve this by not sending `</script`
 */
if ( false !== stripos( $plaintext_content, '</script' ) ) {
I can’t locate a unit test covering this edge case. Is it included?
Added in 944c8cb
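For readers following this thread, the guard under review can be sketched in isolation. The contents of a `SCRIPT` element are raw text, so character references are not decoded there and a closing-tag sequence cannot be neutralized by escaping; the only safe choice is to reject such input. The helper name `is_safe_script_text()` below is hypothetical and only mirrors the `stripos()` check quoted above, not the full method in the patch.

```php
<?php
/**
 * Hypothetical helper mirroring the check quoted in the review thread:
 * reports whether plaintext may be placed inside a SCRIPT element.
 */
function is_safe_script_text( string $plaintext_content ): bool {
	// stripos() is case-insensitive, so `</SCRIPT` is rejected as well.
	return false === stripos( $plaintext_content, '</script' );
}

// Usage: refuse the update rather than emit a script that closes early.
$candidate = 'console.log("</script>");';
if ( ! is_safe_script_text( $candidate ) ) {
	echo "Refusing to set script contents.\n";
}
```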
Without it, replacing a longer string with a shorter one can make
Thanks @adamziel - this level of review is so helpful! Unfortunately, when I added a test to try and trigger the condition, I couldn't. What do you think? Do you see any problem with this?
This patch introduces a new method, `set_modifiable_text()` to the Tag Processor, which makes it possible and safe to replace text nodes within an HTML document, performing the appropriate escaping. This method can be used in conjunction with other code to modify the text content of a document, and can be used for transforming HTML in a streaming fashion.

Developed in #7007
Discussed in https://core.trac.wordpress.org/ticket/61617

Props: dmsnell, gziolo, zieladam.
Fixes #61617.

git-svn-id: https://develop.svn.wordpress.org/trunk@58829 602fd350-edb4-49c9-b593-d223f7449a82

This patch introduces a new method, `set_modifiable_text()` to the Tag Processor, which makes it possible and safe to replace text nodes within an HTML document, performing the appropriate escaping. This method can be used in conjunction with other code to modify the text content of a document, and can be used for transforming HTML in a streaming fashion.

Developed in WordPress/wordpress-develop#7007
Discussed in https://core.trac.wordpress.org/ticket/61617

Props: dmsnell, gziolo, zieladam.
Fixes #61617.

Built from https://develop.svn.wordpress.org/trunk@58829

git-svn-id: http://core.svn.wordpress.org/trunk@58225 1a063a9b-81f0-0310-95a4-ce76da25c4cd
That test looks good. I'd also check for regular tags like
@adamziel if you ever find this problem again I'll be happy to fix it ASAP
I just isolated the problem @dmsnell:

    $p = new WP_HTML_Tag_Processor('Hello there');
    $p->next_token();
    $p->set_modifiable_text('Short');
    echo $p->get_updated_html();

It's not occurring with this patch:

    - $this->bytes_already_parsed = $before_current_tag;
    + if ($this->get_token_type() === '#tag') {
    +     $this->bytes_already_parsed = $before_current_tag;
    + }
@adamziel, thank you for the report. I was able to create 3 failing tests where it doesn't work as expected:
I will work on the patch and improve the tests I quickly drafted:
Prototypes a `wp_rewrite_urls()` URL rewriter for block markup to migrate the content from, say, `<a href="https://adamadam.blog">` to `<a href="https://adamziel.com/blog">`.

* URL rewriting works to perhaps the greatest extent it ever did in WordPress migrations.
* The URL parser requires PHP 8.1. This is fine for some Playground applications, but we'll need PHP 7.2+ compatibility to get it into WordPress core.
* This PR features `WP_HTML_Tag_Processor` and `WP_HTML_Processor` to enable usage outside of WordPress core.

### Details

This PR consists of code ported from https://github.com/adamziel/site-transfer-protocol. It uses a cascade of parsers to pierce through the structured data in a WordPress post and replace the URLs matching the requested domain.

The data flow is as follows:

Parse HTML -> Parse block comments -> Parse attributes JSON -> Parse URLs

On a high level, this parsing cascade is handled by the `WP_Block_Markup_Url_Processor` class:

```php
$p = new WP_Block_Markup_Url_Processor( $block_markup, $base_url );
while ( $p->next_url() ) {
    $parsed_matched_url = $p->get_parsed_url();
    // .. do processing
    $p->set_raw_url($new_raw_url);
}
```

In more detail, the `WP_Block_Markup_Url_Processor` extends the `WP_HTML_Tag_Processor` class and walks the block markup token by token. It then drills down into:

* Text nodes – where it matches URLs using regexps. This part can be improved to avoid regular expressions.
* Block comments – where it parses the block attributes and iterates through them, looking for ones that contain valid URLs.
* HTML tag attributes – where it looks for ones that are reserved for URLs (such as `<a href="">`) and that contain valid URLs.

The `next_url()` method moves through the stream of tokens, looking for the next match in one of the above contexts, and `set_raw_url()` knows how to update each node type, e.g. block attribute updates are `json_encode()`-d.

### Processing tricky inputs

When this code is fed into the migrator:

```html
<!-- wp:paragraph -->
<p>
    <!-- Inline URLs are migrated -->
    🚀-science.com/science has the best scientific articles on the internet! We're also
    available via the punycode URL:

    <!-- No problem handling HTML-encoded punycode URLs with urlencoded characters in the path -->
    https://xn---science-7f85g.com/%73%63ience/.

    <!-- Correctly ignores similar–but–different URLs -->
    This isn't migrated: https://🚀-science.comcast/science <br>
    Or this: super-🚀-science.com/science
</p>
<!-- /wp:paragraph -->

<!-- Block attributes are migrated without any issue -->
<!-- wp:image {"src": "https:\/\/\ud83d\ude80-\u0073\u0063ience.com/%73%63ience/wp-content/image.png"} -->
<!-- As are URI HTML attributes -->
<img src="https://xn---science-7f85g.com/science/wp-content/image.png">
<!-- /wp:image -->

<!-- Classes are not migrated. -->
<span class="https://🚀-science.com/science"></span>
```

This actual output is produced:

```html
<!-- wp:paragraph -->
<p>
    <!-- Inline URLs are migrated -->
    science.wordpress.com has the best scientific articles on the internet! We're also
    available via the punycode URL:

    <!-- No problem handling HTML-encoded punycode URLs with urlencoded characters in the path -->
    https://science.wordpress.com/.

    <!-- Correctly ignores similar–but–different URLs -->
    This isn't migrated: https://🚀-science.comcast/science <br>
    Or this: super-🚀-science.com/science
</p>
<!-- /wp:paragraph -->

<!-- Block attributes are migrated without any issue -->
<!-- wp:image {"src":"https:\/\/science.wordpress.com\/wp-content\/image.png"} -->
<!-- As are URI HTML attributes -->
<img src="https://science.wordpress.com/wp-content/image.png">
<!-- /wp:image -->

<!-- Classes are not migrated. -->
<span class="https://🚀-science.com/science"></span>
```

## Remaining work

- [x] Add PHPCBF
- [x] Get to zero CBF errors
- [x] Get the unit tests to run in CI (e.g. run `composer install`)
- [x] Add relevant unit tests coverage

## Follow-up work

- [x] Patch `WP_HTML_Tag_Processor` in WordPress core, see WordPress/wordpress-develop#7007 (comment)
- [ ] Package our copy of `WP_HTML_Tag_Processor` as a "WordPress polyfill" for standalone usage.
- [ ] Make it compatible with PHP 7.2+

## Testing Instructions (or ideally a Blueprint)

CI runs the PHP unit tests. To run this on your local machine, do this:

```sh
cd packages/playground/data-liberation
composer install
cd ../../../
nx test:watch playground-data-liberation
```