-
Notifications
You must be signed in to change notification settings - Fork 269
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Data Liberation] wp_rewrite_urls() #1893
Conversation
I thought it won't be ready for some more time but I today landed a comfortable enough amount of unit tests to merge this PR as v1 of |
A part of #1894. Follows up on #1893. This PR brings in a few more PHP APIs that were initially explored outside of Playground so that they can be incubated in Playground. See the linked descriptions for more details about each API: * XML Processor from WordPress/wordpress-develop#6713 * Stream chain from adamziel/wxr-normalize#1 * A draft of a WXR URL Rewriter class capable of rewriting URLs in WXR files ## Testing instructions * Confirm the PHPUnit tests pass in CI * Confirm the test suite looks reasonabel * That's it for now! It's all new code that's not actually used anywhere in Playground yet. I just want to merge it to keep iterating and improving.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, I am catching up on reviewing merged PRs and am leaving comments in case they are valuable.
packages/playground/data-liberation/bin/regenerate_public_suffix_list.php
Show resolved
Hide resolved
while ( true ) { | ||
$this->block_attributes_iterator->next(); | ||
if ( ! $this->block_attributes_iterator->valid() ) { | ||
break; | ||
} | ||
return true; | ||
} | ||
|
||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the purpose of this while-true loop? It looks like we might be able to simplify this to:
$this->block_attributes_iterator->next();
if ( $this->block_attributes_iterator->valid() ) {
return true;
}
return false;
* base URL. | ||
* When a base URL is missing, the string must start with a protocol to | ||
* be considered a URL. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this comment.
Thinking about it led to thinking about subdirectory-based multisites and this question:
Should we have any concern for cases where a subdir multisite is moved to a different base subdir. For example, if http://earth.com/old-multisite/<blog>
is moved to http://moon.com/new-multisite/<blog>
, would we want to handle rewriting /old-multisite
to /new-multisite
?
Such URLs may or may not include the hostname.
Maybe if we support path rewriting, it will need to be an optional rewrite feature, and maybe even a separate facility because, as this comment implies, it's conceivable that there may be false-positives.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good spot! The good news is path rewriting is supported :) not sure if in this PR, but for sure in trunk. It won't catch everything, e.g. host-less paths in text content, but it will catch a lot.
$this->did_prepend_protocol = false; | ||
while ( true ) { | ||
/** | ||
* Thick sieve – eagerly match things that look like URLs but turn out to not be URLs in the end. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
Motivation for the change, related issues
A part of #1894.
Prototypes a
wp_rewrite_urls()
URL rewriter for block markup to migrate the content from, say,<a href="https://adamadam.blog">
to<a href="https://adamziel.com/blog">
.Status:
WP_HTML_Tag_Processor
andWP_HTML_Processor
to enable usage outside of WordPress core.Details
This PR consists of a code ported from https://github.com/adamziel/site-transfer-protocol. It uses a cascade of parsers to pierce through the structured data in a WordPress post and replace the URLs matching the requested domain.
The data flow is as follows:
Parse HTML -> Parse block comments -> Parse attributes JSON -> Parse URLs
On a high level, this parsing cascade is handled by the
WP_Block_Markup_Url_Processor
class:Getting more into details, the
WP_Block_Markup_Url_Processor
extends theWP_HTML_Tag_Processor
class and walks the block markup token by token. It then drills down into:<a href="">
, looking for ones that contain valid URLsThe
next_url()
method moves through the stream of tokens, looking for the next match in one of the above contexts, and theset_raw_url()
knows how to update each node type, e.g. block attributes updates arejson_encode()
-d.Processing tricky inputs
When this code is fed into the migrator:
This actual output is produced:
Remaining work
composer install
)Follow-up work
WP_HTML_Tag_Processor
in WordPress core, see HTML API: Add set_modifiable_text() for replacing text nodes. wordpress-develop#7007 (comment)WP_HTML_Tag_Processor
as a "WordPress polyfill" for standalone usage.Testing Instructions (or ideally a Blueprint)
CI runs the PHP unit tests. To run this on your local machine, do this: