HTML API: Add set_modifiable_text() for replacing text nodes. #7007

Closed

@dmsnell dmsnell commented Jul 10, 2024

Trac ticket: Core-61617

This patch introduces a new method, `set_modifiable_text()`, to the Tag Processor. It makes it possible to safely replace text nodes within an HTML document, with the appropriate escaping performed by the processor.

This method can be used in conjunction with other code to modify the text content of a document, and can be used for transforming HTML in a streaming fashion.
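
As an illustration of the streaming use described here, the following is a minimal sketch rather than code from the patch. It assumes WordPress is loaded with a Tag Processor that includes `set_modifiable_text()`; the helper name `transform_text_nodes()` is hypothetical.

```php
<?php
/**
 * Applies a callback to every text node and returns the updated HTML.
 *
 * Hypothetical helper for illustration; it is not part of this patch.
 * Assumes WordPress is loaded and WP_HTML_Tag_Processor provides
 * set_modifiable_text().
 */
function transform_text_nodes( string $html, callable $transform ): string {
	$processor = new WP_HTML_Tag_Processor( $html );

	// next_token() visits tags, comments, and text nodes alike.
	while ( $processor->next_token() ) {
		if ( '#text' !== $processor->get_token_type() ) {
			continue;
		}

		// Read the decoded text and write plain text back; escaping
		// is left to the processor.
		$processor->set_modifiable_text( $transform( $processor->get_modifiable_text() ) );
	}

	return $processor->get_updated_html();
}
```

A call such as `transform_text_nodes( $html, 'strtoupper' )` would upper-case the text nodes while leaving tags and attributes untouched.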

Fixes Core-61617

github-actions bot commented Jul 10, 2024

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props dmsnell, gziolo, zieladam.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

Member

@gziolo gziolo left a comment

Adding support for replacing text nodes doesn't look that complicated. This is very exciting!

Some of the edge cases covered here are ones I don't have enough insight to review quickly, but I'm looking forward to seeing these changes land in core.

*
* @return array[]
*/
private static function data_tokens_with_basic_modifiable_text_updates() {

Member

There are some CI failures:

Screenshot 2024-07-12 at 09 36 20

Does it have to be a public method maybe?

Member Author

Yeah, I've done this before by accident. I made the data provider public in 0340ee3.

Member

@gziolo gziolo left a comment

It all looks great to me.

* properly escape these things, but this could mask regex patterns
* that previously worked. Resolve this by not sending `</script`
*/
if ( false !== stripos( $plaintext_content, '</script' ) ) {

Member

I can’t locate a unit test covering this edge case. Is it included?

Member Author

Added in 944c8cb
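
The guard under review can be illustrated in isolation. This is a minimal sketch, not the patch's code: the function name `can_set_script_text()` is hypothetical, and it only mirrors the `stripos()` check quoted in this thread. The check is case-insensitive because an HTML parser closes a SCRIPT element at the first `</script`, whatever its letter case and whatever JavaScript string it appears in.

```php
<?php
/**
 * Reports whether text could be placed inside a SCRIPT element
 * without closing the element early.
 *
 * Hypothetical helper for illustration; it only mirrors the
 * stripos() check quoted in the review thread.
 */
function can_set_script_text( string $plaintext_content ): bool {
	// stripos() is case-insensitive, so "</SCRIPT" is caught as well.
	return false === stripos( $plaintext_content, '</script' );
}
```

With this check, `can_set_script_text( 'console.log( 1 );' )` is accepted, while any content containing `</script` or `</SCRIPT` is rejected.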

@adamziel
Contributor

get_updated_html() might need this adjustment:

		if ($this->get_token_type() === '#tag') {
			$this->bytes_already_parsed = $before_current_tag;
		}

Without it, replacing a longer string with a shorter one can make $this->bytes_already_parsed negative.

@dmsnell
Member Author

dmsnell commented Jul 26, 2024

> get_updated_html() might need this adjustment:

Thanks @adamziel - this level of review is so helpful! Unfortunately, when I added a test to try and trigger the condition, I couldn't. What do you think? Do you see any problem with this?

https://github.com/WordPress/wordpress-develop/pull/7007/files#diff-ef817e9b61dcdc18d5e1a84bfbb55370916b179cfed01eab03121456d7ce847cR42-R86

pento pushed a commit that referenced this pull request Jul 29, 2024
This patch introduces a new method, `set_modifiable_text()` to the
Tag Processor, which makes it possible and safe to replace text nodes
within an HTML document, performing the appropriate escaping.

This method can be used in conjunction with other code to modify the
text content of a document, and can be used for transforming HTML
in a streaming fashion.

Developed in #7007
Discussed in https://core.trac.wordpress.org/ticket/61617

Props: dmsnell, gziolo, zieladam.
Fixes #61617.



git-svn-id: https://develop.svn.wordpress.org/trunk@58829 602fd350-edb4-49c9-b593-d223f7449a82
@dmsnell
Member Author

dmsnell commented Jul 29, 2024

Merged in [58829]
c6aaf0a

@dmsnell dmsnell closed this Jul 29, 2024
@dmsnell dmsnell deleted the html-api/add-set-modifiable-text branch July 29, 2024 17:58
markjaquith pushed a commit to markjaquith/WordPress that referenced this pull request Jul 29, 2024

Built from https://develop.svn.wordpress.org/trunk@58829


git-svn-id: http://core.svn.wordpress.org/trunk@58225 1a063a9b-81f0-0310-95a4-ce76da25c4cd
github-actions bot pushed a commit to platformsh/wordpress-performance that referenced this pull request Jul 29, 2024

Built from https://develop.svn.wordpress.org/trunk@58829


git-svn-id: https://core.svn.wordpress.org/trunk@58225 1a063a9b-81f0-0310-95a4-ce76da25c4cd
@adamziel
Contributor

adamziel commented Jul 31, 2024

> Thanks @adamziel - this level of review is so helpful! Unfortunately, when I added a test to try and trigger the condition, I couldn't. What do you think? Do you see any problem with this?

That test looks good. I'd also check regular tags like DIV, and then HTML comments. I ran into this in https://github.com/adamziel/site-transfer-protocol, although I can't find the commit now. I eventually switched to a different method of setting modifiable text, and I'm happy to migrate to this one now that it's merged. Lovely work, and thank you!

@dmsnell
Member Author

dmsnell commented Jul 31, 2024

@adamziel if you ever find this problem again I'll be happy to fix it ASAP

@adamziel
Contributor

> Thanks @adamziel - this level of review is so helpful! Unfortunately, when I added a test to try and trigger the condition, I couldn't. What do you think? Do you see any problem with this?

I just isolated the problem @dmsnell:

$p = new WP_HTML_Tag_Processor('Hello there');
$p->next_token();
$p->set_modifiable_text('Short');
echo $p->get_updated_html();

It doesn't occur with this patch applied:

-		$this->bytes_already_parsed = $before_current_tag;
+		if ($this->get_token_type() === '#tag') {
+			$this->bytes_already_parsed = $before_current_tag;
+		}
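
The reproduction above could be turned into a small regression check. This is a minimal sketch under stated assumptions: WordPress is loaded, the document starts with a text node, and the helper name `text_update_is_consistent()` is hypothetical. It is not one of the tests written for this PR.

```php
<?php
/**
 * Checks that a replaced text node reads back unchanged after the
 * pending update has been flushed with get_updated_html().
 *
 * Hypothetical helper for illustration; assumes WordPress is loaded
 * and that the document starts with a text node.
 */
function text_update_is_consistent( string $html, string $replacement ): bool {
	$processor = new WP_HTML_Tag_Processor( $html );

	if ( ! $processor->next_token() || '#text' !== $processor->get_token_type() ) {
		return false; // This sketch only covers documents that start with text.
	}

	$processor->set_modifiable_text( $replacement );

	// Flush the pending update, as in the reproduction above.
	$processor->get_updated_html();

	// The current token should still report the full replacement text.
	return $replacement === $processor->get_modifiable_text();
}
```

With a correct implementation, `text_update_is_consistent( 'Hello there', 'Short' )` is expected to return `true`.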

@gziolo
Member

gziolo commented Oct 16, 2024

@adamziel, thank you for the report. I was able to create 3 failing tests where it doesn't work as expected:

There were 3 failures:

1) Tests_HtmlApi_WpHtmlTagProcessorModifiableText::test_get_modifiable_text_is_consistent_after_writes_when_text_shorter
Should have found updated text.
Failed asserting that two strings are identical.
--- Expected
+++ Actual
@@ @@
-'shorter text'
+'xt'

/var/www/tests/phpunit/tests/html-api/wpHtmlTagProcessorModifiableText.php:116
phpvfscomposer:///var/www/vendor/phpunit/phpunit/phpunit:106

2) Tests_HtmlApi_WpHtmlTagProcessorModifiableText::test_get_modifiable_text_is_consistent_after_writes_when_text_longer
Should have found updated text.
Failed asserting that two strings are identical.
--- Expected
+++ Actual
@@ @@
-'a bit longer text'
+'it longer text'

/var/www/tests/phpunit/tests/html-api/wpHtmlTagProcessorModifiableText.php:155
phpvfscomposer:///var/www/vendor/phpunit/phpunit/phpunit:106

3) Tests_HtmlApi_WpHtmlTagProcessorModifiableText::test_get_modifiable_text_is_consistent_after_writes_when_text_after_closed_tag_element
Should have found updated text.
Failed asserting that two strings are identical.
--- Expected
+++ Actual
@@ @@
-'<p>some content</p>a bit longer text'
+'it longer text'

/var/www/tests/phpunit/tests/html-api/wpHtmlTagProcessorModifiableText.php:197
phpvfscomposer:///var/www/vendor/phpunit/phpunit/phpunit:106

I will work on the patch and improve the tests I quickly drafted:

adamziel added a commit to WordPress/wordpress-playground that referenced this pull request Oct 28, 2024
Prototypes a `wp_rewrite_urls()` URL rewriter for block markup to
migrate the content from, say, `<a href="https://adamadam.blog">` to `<a
href="https://adamziel.com/blog">`.

* URL rewriting works to perhaps the greatest extent it ever did in
WordPress migrations.
* The URL parser requires PHP 8.1. This is fine for some Playground
applications, but we'll need PHP 7.2+ compatibility to get it into
WordPress core.
* This PR features `WP_HTML_Tag_Processor` and `WP_HTML_Processor` to
enable usage outside of WordPress core.

### Details

This PR consists of code ported from
https://github.com/adamziel/site-transfer-protocol. It uses a cascade of
parsers to pierce through the structured data in a WordPress post and
replace the URLs matching the requested domain.

The data flow is as follows:

Parse HTML -> Parse block comments -> Parse attributes JSON -> Parse
URLs

On a high level, this parsing cascade is handled by the
`WP_Block_Markup_Url_Processor` class:

```php
$p = new WP_Block_Markup_Url_Processor( $block_markup, $base_url );
while ( $p->next_url() ) {
	$parsed_matched_url = $p->get_parsed_url();
	// .. do processing
	$p->set_raw_url($new_raw_url);
}
```

Getting more into details, the `WP_Block_Markup_Url_Processor` extends
the `WP_HTML_Tag_Processor` class and walks the block markup token by
token. It then drills down into:

* Text nodes – where it matches URLs using regexps. This part can be
improved to avoid regular expressions.
* Block comments – where it parses the block attributes and iterates
through them, looking for ones that contain valid URLs.
* HTML tag attributes – where it looks for ones that are reserved for
URLs (such as `<a href="">`) and contain valid URLs.

The `next_url()` method moves through the stream of tokens, looking for
the next match in one of the above contexts, and the `set_raw_url()`
method knows how to update each node type, e.g. updated block attributes
are re-serialized with `json_encode()`.
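
The block comment step of this cascade can be sketched on its own. The following is a simplified illustration, not the `WP_Block_Markup_Url_Processor` implementation: the helper name `rewrite_block_comment_urls()` is hypothetical, it expects trimmed comment text, it handles only top-level string attributes, and it matches hosts by exact string comparison.

```php
<?php
/**
 * Rewrites top-level URL attributes in block comment text such as
 * `wp:image {"src":"https:\/\/old.example\/a.png"}`.
 *
 * Hypothetical helper for illustration only; not code from the PR.
 */
function rewrite_block_comment_urls( string $comment_text, string $from_host, string $to_host ): string {
	// Split "wp:name {json}" into the block name and its JSON attributes.
	if ( ! preg_match( '/^\s*(wp:\S+)\s+(\{.*\})\s*$/s', $comment_text, $matches ) ) {
		return $comment_text; // No attributes to rewrite.
	}

	$attributes = json_decode( $matches[2], true );
	if ( ! is_array( $attributes ) ) {
		return $comment_text; // Malformed JSON: leave it untouched.
	}

	$changed = false;
	$needle  = '//' . $from_host;
	foreach ( $attributes as $name => $value ) {
		if ( ! is_string( $value ) || parse_url( $value, PHP_URL_HOST ) !== $from_host ) {
			continue;
		}
		$position = strpos( $value, $needle );
		if ( false === $position ) {
			continue; // E.g. URLs with credentials before the host.
		}
		$attributes[ $name ] = substr_replace( $value, '//' . $to_host, $position, strlen( $needle ) );
		$changed             = true;
	}

	// json_encode() escapes "/" as "\/" by default.
	return $changed ? $matches[1] . ' ' . json_encode( $attributes ) : $comment_text;
}
```

Returning the original text when nothing matched avoids re-serializing attributes that did not change.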

### Processing tricky inputs

When this code is fed into the migrator:

```html
<!-- wp:paragraph -->
<p>
	<!-- Inline URLs are migrated -->
	🚀-science.com/science has the best scientific articles on the internet! We're also
	available via the punycode URL:
	
	<!-- No problem handling HTML-encoded punycode URLs with urlencoded characters in the path -->
	&#104;ttps://xn---&#115;&#99;ience-7f85g.com/%73%63ience/.
	
	<!-- Correctly ignores similar–but–different URLs -->
	This isn't migrated: https://🚀-science.comcast/science <br>
	Or this: super-🚀-science.com/science
</p>
<!-- /wp:paragraph -->

<!-- Block attributes are migrated without any issue -->
<!-- wp:image {"src": "https:\/\/\ud83d\ude80-\u0073\u0063ience.com/%73%63ience/wp-content/image.png"} -->
<!-- As are URI HTML attributes -->
<img src="&#104;ttps://xn---&#115;&#99;ience-7f85g.com/science/wp-content/image.png">
<!-- /wp:image -->

<!-- Classes are not migrated. -->
<span class="https://🚀-science.com/science"></span>
```

This actual output is produced:

```html
<!-- wp:paragraph -->
<p>
	<!-- Inline URLs are migrated -->
	science.wordpress.com has the best scientific articles on the internet! We're also
	available via the punycode URL:
	
	<!-- No problem handling HTML-encoded punycode URLs with urlencoded characters in the path -->
	https://science.wordpress.com/.
	
	<!-- Correctly ignores similar–but–different URLs -->
	This isn't migrated: https://🚀-science.comcast/science <br>
	Or this: super-🚀-science.com/science
</p>
<!-- /wp:paragraph -->

<!-- Block attributes are migrated without any issue -->
<!-- wp:image {"src":"https:\/\/science.wordpress.com\/wp-content\/image.png"} -->
<!-- As are URI HTML attributes -->
<img src="https://science.wordpress.com/wp-content/image.png">
<!-- /wp:image -->

<!-- Classes are not migrated. -->
<span class="https://🚀-science.com/science"></span>
```

## Remaining work

- [x] Add PHPCBF
- [x] Get to zero CBF errors
- [x] Get the unit tests to run in CI (e.g. run `composer install`)
- [x] Add relevant unit tests coverage

## Follow-up work

- [x] Patch `WP_HTML_Tag_Processor` in WordPress core, see
WordPress/wordpress-develop#7007 (comment)
- [ ] Package our copy of `WP_HTML_Tag_Processor` as a "WordPress
polyfill" for standalone usage.
- [ ] Make it compatible with PHP 7.2+

## Testing Instructions (or ideally a Blueprint)

CI runs the PHP unit tests. To run this on your local machine, do this:

```sh
cd packages/playground/data-liberation
composer install
cd ../../../
nx test:watch playground-data-liberation
```