[Data Liberation] wp_rewrite_urls() #1893

Merged
adamziel merged 32 commits into trunk from data-liberation-bring-in-php-parsers
Oct 28, 2024

@adamziel adamziel commented Oct 14, 2024

Motivation for the change, related issues

A part of #1894.

Prototypes a wp_rewrite_urls() URL rewriter for block markup to migrate the content from, say, <a href="https://adamadam.blog"> to <a href="https://adamziel.com/blog">.

Status:

  • URL rewriting works to perhaps the greatest extent it ever did in WordPress migrations.
  • The URL parser requires PHP 8.1. This is fine for some Playground applications, but we'll need PHP 7.2+ compatibility to get it into WordPress core.
  • This PR ships its own copies of WP_HTML_Tag_Processor and WP_HTML_Processor so the rewriter can be used outside of WordPress core.

Details

This PR consists of code ported from https://github.com/adamziel/site-transfer-protocol. It uses a cascade of parsers to pierce through the structured data in a WordPress post and replace the URLs matching the requested domain.

The data flow is as follows:

Parse HTML -> Parse block comments -> Parse attributes JSON -> Parse URLs

On a high level, this parsing cascade is handled by the WP_Block_Markup_Url_Processor class:

$p = new WP_Block_Markup_Url_Processor( $block_markup, $base_url );
while ( $p->next_url() ) {
	$parsed_matched_url = $p->get_parsed_url();
	// .. do processing
	$p->set_raw_url($new_raw_url);
}

Getting more into details, the WP_Block_Markup_Url_Processor extends the WP_HTML_Tag_Processor class and walks the block markup token by token. It then drills down into:

  • Text nodes – where it matches URLs using regular expressions. This part can be improved to avoid regular expressions.
  • Block comments – where it parses the block attributes and iterates through them, looking for ones that contain valid URLs.
  • HTML tag attributes – where it looks at the ones reserved for URLs (such as <a href="">) and picks those that contain valid URLs.

The next_url() method moves through the stream of tokens, looking for the next match in one of the above contexts. The set_raw_url() method knows how to update each node type; for example, updated block attributes are re-serialized with json_encode().
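To make the block-comment branch of the cascade concrete, here is a minimal, self-contained sketch. It is not the code from this PR: the function names (sketch_rewrite_attrs(), sketch_rewrite_block_comments()) and the old.example / new.example hosts are made up for illustration, and a plain regular expression stands in for the token-by-token walk that WP_Block_Markup_Url_Processor performs. It only shows the idea of decoding the attributes JSON, rewriting the strings that start with the old URL, and re-encoding with json_encode().

```php
<?php
// Hypothetical helper, not from the PR: swaps $from for $to at the start of
// every string found inside decoded block attributes.
function sketch_rewrite_attrs( $value, string $from, string $to ) {
	if ( is_string( $value ) ) {
		if ( 0 !== strpos( $value, $from ) ) {
			return $value;
		}
		$rest = substr( $value, strlen( $from ) );
		// Only accept a match that ends at a URL boundary, so that
		// https://old.example.evil/ is not mistaken for https://old.example.
		if ( '' !== $rest && false === strpos( '/?#', $rest[0] ) ) {
			return $value;
		}
		return $to . $rest;
	}
	if ( is_array( $value ) ) {
		foreach ( $value as $key => $item ) {
			$value[ $key ] = sketch_rewrite_attrs( $item, $from, $to );
		}
	} elseif ( is_object( $value ) ) {
		foreach ( get_object_vars( $value ) as $key => $item ) {
			$value->$key = sketch_rewrite_attrs( $item, $from, $to );
		}
	}
	return $value;
}

// Hypothetical helper: finds opening block comments that carry JSON attributes
// and re-serializes them after rewriting. Comments without valid JSON are kept.
function sketch_rewrite_block_comments( string $markup, string $from, string $to ): string {
	$result = preg_replace_callback(
		'~<!--\s+wp:([a-z][a-z0-9/-]*)\s+(\{.*?\})\s+-->~s',
		function ( array $m ) use ( $from, $to ) {
			// Decode to objects (not arrays) so an empty {} stays an object.
			$attrs = json_decode( $m[2] );
			if ( ! is_object( $attrs ) ) {
				return $m[0];
			}
			$attrs = sketch_rewrite_attrs( $attrs, $from, $to );
			return '<!-- wp:' . $m[1] . ' ' . json_encode( $attrs ) . ' -->';
		},
		$markup
	);
	return null === $result ? $markup : $result;
}

$in  = '<!-- wp:image {"src": "https:\/\/old.example\/wp-content\/image.png"} -->';
$out = sketch_rewrite_block_comments( $in, 'https://old.example', 'https://new.example' );
echo $out, "\n";
```

Note that json_encode() escapes forward slashes by default and drops insignificant whitespace, so matched comments come back in a normalized form even when only one attribute changed.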

Processing tricky inputs

When this code is fed into the migrator:

<!-- wp:paragraph -->
<p>
	<!-- Inline URLs are migrated -->
	🚀-science.com/science has the best scientific articles on the internet! We're also
	available via the punycode URL:
	
	<!-- No problem handling HTML-encoded punycode URLs with urlencoded characters in the path -->
	&#104;ttps://xn---&#115;&#99;ience-7f85g.com/%73%63ience/.
	
	<!-- Correctly ignores similar–but–different URLs -->
	This isn't migrated: https://🚀-science.comcast/science <br>
	Or this: super-🚀-science.com/science
</p>
<!-- /wp:paragraph -->

<!-- Block attributes are migrated without any issue -->
<!-- wp:image {"src": "https:\/\/\ud83d\ude80-\u0073\u0063ience.com/%73%63ience/wp-content/image.png"} -->
<!-- As are URI HTML attributes -->
<img src="&#104;ttps://xn---&#115;&#99;ience-7f85g.com/science/wp-content/image.png">
<!-- /wp:image -->

<!-- Classes are not migrated. -->
<span class="https://🚀-science.com/science"></span>

This is the actual output:

<!-- wp:paragraph -->
<p>
	<!-- Inline URLs are migrated -->
	science.wordpress.com has the best scientific articles on the internet! We're also
	available via the punycode URL:
	
	<!-- No problem handling HTML-encoded punycode URLs with urlencoded characters in the path -->
	https://science.wordpress.com/.
	
	<!-- Correctly ignores similar–but–different URLs -->
	This isn't migrated: https://🚀-science.comcast/science <br>
	Or this: super-🚀-science.com/science
</p>
<!-- /wp:paragraph -->

<!-- Block attributes are migrated without any issue -->
<!-- wp:image {"src":"https:\/\/science.wordpress.com\/wp-content\/image.png"} -->
<!-- As are URI HTML attributes -->
<img src="https://science.wordpress.com/wp-content/image.png">
<!-- /wp:image -->

<!-- Classes are not migrated. -->
<span class="https://🚀-science.com/science"></span>
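The inputs above are tricky because the same URL can hide behind several encodings at once. The following sketch peels those layers one by one using only PHP built-ins. It is an illustration, not the PR's implementation: example.com is a placeholder host, sketch_host_matches() is a hypothetical helper, parse_url() stands in for the URL parser used in the PR, and punycode/IDN handling is left out entirely.

```php
<?php
// Layer 1: HTML character references, as found in text nodes and attribute values.
$from_html = html_entity_decode( '&#104;ttps://example.com/%73%63ience/', ENT_QUOTES | ENT_HTML5, 'UTF-8' );

// Layer 2: JSON string escapes, as found in block comment attributes.
$from_json = json_decode( '"https:\/\/example.com\/\u0073\u0063ience\/"' );

// Layer 3: percent-encoding inside the URL path.
$path = rawurldecode( parse_url( $from_html, PHP_URL_PATH ) );

// Hypothetical helper: compare whole hostnames instead of searching for a
// substring, so look-alike hosts are not treated as matches.
function sketch_host_matches( string $url, string $host ): bool {
	$candidate = parse_url( $url, PHP_URL_HOST );
	return is_string( $candidate ) && strtolower( $candidate ) === strtolower( $host );
}

var_dump( $from_html, $from_json, $path );
var_dump( sketch_host_matches( 'https://example.com/science', 'example.com' ) );
var_dump( sketch_host_matches( 'https://example.comcast/science', 'example.com' ) );
var_dump( sketch_host_matches( 'https://super-example.com/science', 'example.com' ) );
```

Comparing parsed hostnames is what keeps look-alikes such as the .comcast and super- variants in the example above out of the migration; a plain substring search would have rewritten them.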

Remaining work

  • Add PHPCBF
  • Get to zero CBF errors
  • Get the unit tests to run in CI (e.g. run composer install)
  • Add relevant unit tests coverage
  • Review the API shape

Follow-up work

Testing Instructions (or ideally a Blueprint)

CI runs the PHP unit tests. To run them on your local machine:

cd packages/playground/data-liberation
composer install
cd ../../../
nx test:watch playground-data-liberation

@adamziel adamziel added the [Type] Enhancement New feature or request label Oct 14, 2024
@adamziel adamziel requested a review from a team as a code owner October 14, 2024 17:55
@adamziel adamziel changed the title [Data liberation] Prototype wp_rewrite_urls() [Data liberation] wp_rewrite_urls() Oct 14, 2024
@adamziel
Collaborator Author

I thought this wouldn't be ready for some time, but today I landed enough unit tests to comfortably merge this PR as v1 of wp_rewrite_urls(). The API shape will likely change. This is all new code, not yet used anywhere in Playground. Let's keep building on top of it.

@adamziel adamziel merged commit e5813df into trunk Oct 28, 2024
10 checks passed
@adamziel adamziel deleted the data-liberation-bring-in-php-parsers branch October 28, 2024 23:14
adamziel added a commit that referenced this pull request Oct 28, 2024
A part of #1894.
Follows up on
#1893.

This PR brings in a few more PHP APIs that were initially explored
outside of Playground so that they can be incubated in Playground. See
the linked descriptions for more details about each API:

* XML Processor from
WordPress/wordpress-develop#6713
* Stream chain from adamziel/wxr-normalize#1
* A draft of a WXR URL Rewriter class capable of rewriting URLs in WXR
files

## Testing instructions

* Confirm the PHPUnit tests pass in CI
* Confirm the test suite looks reasonable
* That's it for now! It's all new code that's not actually used anywhere
in Playground yet. I just want to merge it to keep iterating and
improving.
Member

@brandonpayton brandonpayton left a comment

Hi, I am catching up on reviewing merged PRs and am leaving comments in case they are valuable.

Comment on lines +201 to +209
while ( true ) {
	$this->block_attributes_iterator->next();
	if ( ! $this->block_attributes_iterator->valid() ) {
		break;
	}
	return true;
}

return false;

What is the purpose of this while-true loop? It looks like we might be able to simplify this to:

		$this->block_attributes_iterator->next();
		if ( $this->block_attributes_iterator->valid() ) {
			return true;
		}

		return false;
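For what it's worth, the two forms can be checked against each other outside of the class. This is a small sketch under assumptions: an ArrayIterator stands in for $this->block_attributes_iterator, and the function names advance_with_loop() and advance_simplified() are made up for the comparison.

```php
<?php
// The original shape: the loop body either breaks or returns on its first
// pass, so it never actually iterates.
function advance_with_loop( Iterator $it ): bool {
	while ( true ) {
		$it->next();
		if ( ! $it->valid() ) {
			break;
		}
		return true;
	}
	return false;
}

// The suggested simplification, collapsed one step further.
function advance_simplified( Iterator $it ): bool {
	$it->next();
	return $it->valid();
}

$a = new ArrayIterator( array( 'src' => 'x', 'href' => 'y' ) );
$b = new ArrayIterator( array( 'src' => 'x', 'href' => 'y' ) );

$results_a = array();
$results_b = array();
for ( $i = 0; $i < 3; $i++ ) {
	$results_a[] = advance_with_loop( $a );
	$results_b[] = advance_simplified( $b );
}

var_dump( $results_a === $results_b );
```

Both helpers advance the iterator exactly once per call, which is why the while ( true ) wrapper adds nothing here.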

* base URL.
* When a base URL is missing, the string must start with a protocol to
* be considered a URL.
*/

I like this comment.

Thinking about it led to thinking about subdirectory-based multisites and this question:

Should we be concerned about cases where a subdir multisite is moved to a different base subdir? For example, if http://earth.com/old-multisite/<blog> is moved to http://moon.com/new-multisite/<blog>, would we want to handle rewriting /old-multisite to /new-multisite?

Such URLs may or may not include the hostname.

If we support path rewriting, it may need to be an optional rewrite feature, and maybe even a separate facility because, as this comment implies, it's conceivable that there may be false positives.

Collaborator Author

@adamziel adamziel Dec 3, 2024

Good spot! The good news is path rewriting is supported :) not sure if in this PR, but for sure in trunk. It won't catch everything, e.g. host-less paths in text content, but it will catch a lot.
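To illustrate the kind of path rewriting discussed here, below is a rough sketch of moving a URL from one base (host plus subdirectory) to another. It is not the implementation in trunk: sketch_rewrite_base() is a hypothetical helper, parse_url() stands in for the real URL parser, and scheme, port, and credentials are deliberately ignored. Like the real thing, it does nothing for host-less paths.

```php
<?php
// Hypothetical helper: moves $url from $from_base to $to_base when the host
// matches and the path sits under the old base path.
function sketch_rewrite_base( string $url, string $from_base, string $to_base ): string {
	$url_parts  = parse_url( $url );
	$from_parts = parse_url( $from_base );
	if ( ! is_array( $url_parts ) || ! is_array( $from_parts ) ) {
		return $url;
	}
	if ( ! isset( $url_parts['host'], $from_parts['host'] ) ) {
		return $url; // Host-less paths are out of scope for this sketch.
	}
	if ( strtolower( $url_parts['host'] ) !== strtolower( $from_parts['host'] ) ) {
		return $url;
	}

	$from_path = rtrim( $from_parts['path'] ?? '', '/' );
	$url_path  = $url_parts['path'] ?? '';

	// Require a path segment boundary so that /old-multisite-archive is not
	// treated as living under /old-multisite.
	$is_under_base = '' === $from_path
		|| $url_path === $from_path
		|| 0 === strpos( $url_path, $from_path . '/' );
	if ( ! $is_under_base ) {
		return $url;
	}

	$rewritten = rtrim( $to_base, '/' ) . substr( $url_path, strlen( $from_path ) );
	if ( isset( $url_parts['query'] ) ) {
		$rewritten .= '?' . $url_parts['query'];
	}
	if ( isset( $url_parts['fragment'] ) ) {
		$rewritten .= '#' . $url_parts['fragment'];
	}
	return $rewritten;
}

$from = 'http://earth.com/old-multisite';
$to   = 'http://moon.com/new-multisite';

echo sketch_rewrite_base( 'http://earth.com/old-multisite/blog/post?x=1', $from, $to ), "\n";
echo sketch_rewrite_base( 'http://earth.com/old-multisite-archive/post', $from, $to ), "\n";
```

The segment-boundary check is one way to limit the false positives mentioned above; whether that is enough for real sites is an open question.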

$this->did_prepend_protocol = false;
while ( true ) {
/**
* Thick sieve – eagerly match things that look like URLs but turn out to not be URLs in the end.

❤️
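For readers skimming the thread, the "thick sieve" idea can be sketched in a few lines: first grab anything that loosely looks like a URL, then discard candidates that fail a stricter check. This is only an approximation of the approach, not the PR's code. The function name sketch_find_urls_in_text(), the regular expression, and the example.com host are assumptions, and the rule that a protocol is required when no base host is given mirrors the code comment quoted earlier in this review.

```php
<?php
// Hypothetical helper: returns URL-like substrings of $text that survive the
// strict check. With a $base_host, only URLs on that host are kept.
function sketch_find_urls_in_text( string $text, ?string $base_host = null ): array {
	// Thick sieve: [protocol://]host.tld[/path], deliberately over-eager.
	preg_match_all( '~(?:https?://)?[a-z0-9-]+(?:\.[a-z0-9-]+)+(?:/[^\s<>"]*)?~i', $text, $matches );

	$urls = array();
	foreach ( $matches[0] as $candidate ) {
		// Trailing punctuation usually belongs to the sentence, not the URL.
		$candidate    = rtrim( $candidate, '.,;:!?)' );
		$has_protocol = 1 === preg_match( '~^https?://~i', $candidate );

		// Without a base host, a bare "foo.bar" is too ambiguous to accept.
		if ( ! $has_protocol && null === $base_host ) {
			continue;
		}

		// Fine sieve: parse the candidate and compare whole hostnames.
		$host = parse_url( $has_protocol ? $candidate : 'https://' . $candidate, PHP_URL_HOST );
		if ( ! is_string( $host ) ) {
			continue;
		}
		if ( null !== $base_host && strtolower( $host ) !== strtolower( $base_host ) ) {
			continue;
		}
		$urls[] = $candidate;
	}
	return $urls;
}

$text = 'Read example.com/science or https://example.com/a. '
	. 'Not these: https://example.comcast/x, super-example.com/y, 1.2.3';

print_r( sketch_find_urls_in_text( $text, 'example.com' ) );
print_r( sketch_find_urls_in_text( $text ) );
```

The first call keeps only the two URLs on example.com; the second, with no base host, keeps only candidates that carry an explicit protocol.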

@adamziel adamziel changed the title [Data liberation] wp_rewrite_urls() [Data Liberation] wp_rewrite_urls() Dec 4, 2024
Labels
[Type] Enhancement New feature or request
2 participants