-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: Introduce HTML processor for server-side block modification on render #42507
Conversation
…ender Currently provides some data structures necessary for tracking and modifying an HTML document. No implementation code exists yet and this is mainly here to establish system design and documentation.
* equivalence of those names (lower-cased, Unicode-normalized, etc...). | ||
* If we're looking for "any tag" then this property will be `null`. | ||
* | ||
* h1...h6 are special since they are variations of the same base tag. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you considered using the list of HTML tags instead of a single value for headings? Example:
$processor->find( 'h1,h2,h3,h4' );
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
considered it yes, though I would have gone ->find( [ 'h1', 'h2', 'h3', 'h4' ] )
so far this has been more of an exercise in discovering how little functionality we can expose while being useful. we discussed the idea of a list and wondered if there was any other HTML tag like the heading tags, where multiple tag names all basically relate to the same tag with a different attribute value. if there are not other cases like that I'd almost prefer not exposing a list until we know for sure we need one.
I can see where it could be useful to say "give me the div
or the p
with the most-important
class, but then again if we limit our search for that class maybe that's enough and we don't have to check the tag.
what do you think?
performance wise I don't think it should be that much of an impact with this current design of scanning the full HTML source, since we have to traverse every tag linearly as we come to it (vs. being able to search for something like <div
, which we can't do without opening up vulnerabilities for common cases of HTML produced by the editor).
still, every bit of flexibility makes me nervous unless I can see a clear need for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is also ul
vs ol
where those tags are used for different styling.
Less related to that would be all the tags with semantic meaning used instead of div
like header
, main
, section
, article
, aside
or footer
.
It's also hard to tell how this API would be used with custom blocks and how flexible people expect it to be. If the primary goal is to allow changing the root element of the block content, then using the most-important
class would be sufficient in most of cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think most cases can be handled appropriately with a class. The outermost and first tag can be found with $processor->find( [] )
as well so that's not an obstacle.
I considered ol
and ul
but thought they generally are different enough in meaning that they don't overlap the same way h1
through h6
do (for example, there was plenty of debate about replacing h1
through h6
with just h
and ignoring the numbered heading level when constructing document structure, something the specification allowed/mandated? but was never adopted). The same goes for the other elements; other tags have related semantics, which is often captured in the placement of an element into its content category, but these are distinct elements with distinct semantics. The confusion over whether a certain sub-sub-section should be h3
or h4
and the mess of properly styling them isn't something we face with article
and main
and aside
I believe; I think they are qualitatively different, and again, nobody is confused by whether we should use ol
or ul
to represent an ordered list, because they are related but different (Gutenberg possibly is the exception in merging both kinds of lists into one block and using an attribute to distinguish them).
So maybe eventually we expand this but I'd like to try and see how far we can get without it. I haven't supported the h
exception yet and maybe there's a case to make that we never do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gziolo the point about alternative markup depending on the attributes is spot-on! Here's an excerpt from the list block:
const tagName = ordered ? 'ol' : 'ul';
Also, the heading block may use either of the h-tags.
Already at this point we'd need to either introduce special tokens for "ol/ul" and "h1-6", or simply allow an array of tags as an argument. If we do the former, we'll introduce a "silent" dependency between the HTML parser and the block library. I vote for the latter even if it opens other possibilities.
In addition, the following blocks have a tagName
attribute:
- template-part
- comment
- group
- query
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point @dmsnell. I was about to counter that it would be more convenient to just query for ['ol','ul'], but it's not very future-proof. I like matching the exact tag as you proposed. In fact, we could do the same thing in the heading block since level
is a block attribute as well. Even if it wasn't, we could probably just match the first tag in the markup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we necessarily need to support an option to pass an array of tags. @dmsnell made a great point that when rendering the block, attributes help to make an exact match for which tag should be processed.
I left my initial question came from the fact that I didn't fully understand why there is a need for the special h
tag. It feels like we can live without both h
and [ 'h1', 'h2' ]
expressions for now.
Aside. If someone really needs to process all six heading types they can do it with:
$p = new WP_Tag_Thing( $html );
for ( $i = 1; $i <= 6; $i++ ) {
while ( $p->find( [ 'tag_name' => 'h' + $i ] ) ) {
$p->add_class( 'todo' );
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gziolo not quite, this is "pull" parser which means it only moves forward. find
means "Find me the next tag, starting where I finished last". In other words, this would only process the div
:
$p = new WP_Tag_Thing('<span><div>');
$p->find('div')->add_class('todo');
$p->find('span')->add_class('todo');
While this would process both tags:
$p = new WP_Tag_Thing('<span><div>');
$p->find('span')->add_class('todo');
$p->find('div')->add_class('todo');
I'd like to name that method find_next
, next_tag
, next
, or something to that effect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about the following:
for ( $i = 1; $i <= 6; $i++ ) {
$p = new WP_Tag_Thing( $html );
while ( $p->find( [ 'tag_name' => 'h' + $i ] ) ) {
$p->add_class( 'todo' );
}
$html = (string) $p;
}
It's mainly about the general idea. I didn't run the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why there is a need for the special h tag
it's not felt yet, but I anticipated it because the h
tags are unique in HTML in their purpose and realization. I don't think any other tags work the same way they do where their semantics are contextual (based on the surrounding content vs. based on the tag itself).
If someone really needs to process all six heading types they can do it with:
This isn't actually possible with the interface we've exposed here because we only go once and linearly through the document, so for example, if we look for h1
and there are none in the document, we will never find an h2
if it's there because we would have already scanned to the end.
To find any of the h
tags we'd have to re-process the entire document from the beginning until we find one.
for ( $i = 1; $i <= 6; $i++ ) {
$p = new WP_Tag_Walker( $html );
if ( $p->find( [ 'tag_name' => "h{$i}" ] ) ) {
return $p->add_class( 'todo' );
}
}
for small blocks this may not be too bad, but note that if we want chain these together we'd have to sequence them.
$output = $html;
foreach ( array( 'ol', 'li' ) as $tag_name ) {
$p = new WP_Html_Tag_Modifier( $output );
while ( $p->find( array( 'tag_name' => $tag_name ) ) ) {
$p->find( array( 'tag_name' => 'li' ) );
$p->add_class( 'first-item' );
}
$output .= $p;
}
return $output.
$this->modifications->replacements[] = array( | ||
$insert_at, | ||
$insert_at, | ||
' ' . self::serialize_attribute( $name, $value ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went with esc_attr
, but this one should be faster while doing the job just as good. Good call!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I plan on adding esc_attr
- serialize_attribute
here is somewhat of a placeholder with a name so we can go back and refine that concept. I wanted to also make some checks about URL fields and other encoding issues, for example, all of which I think would go under serialize_attribute
|
||
|
||
public function add_class( $class_name ) { | ||
$this->modifications->class_changes[ $class_name ] = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used a similar data structure in #42485! :-)
} | ||
|
||
|
||
public function set_attribute( $name, $value ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I intentionally removed the distinction between set and add attribute in #42485 – this distinction seems more confusing than useful the consumer of this API. I'd remove the add_attribute
method or at least make it private.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't intend for people to use it, but I generally avoid creating private
methods and properties unless I know for certain that there's a compelling reason to do so. just practice I've learned over time of almost always regretting making things private.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I normally don't like making things private as well. However, in this case we're talking about new public APIs and I'd rather avoid locking ourselves into any specific implementation. The more public methods we expose, the more constraints we'll have when iterating on it in the future.
} | ||
|
||
|
||
class WP_HTML_Scanner { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd merge this and WP_HTML_Processor into a single class.
True, scanning is conceptually a different operation than updating the HTML. However, they are related closely enough to require public access to the properties of this class.
A single class = minimal public API. And if we need a separate Scanner one day, it would be easy to carve one out from the monolith.
/** @var WP_HTML_Attribute_Token $existing */ | ||
$existing = $this->scanner->attributes[ $name ]; | ||
|
||
$this->modifications->replacements[] = array( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
$this->modifications->replacements
reads nice, but demands introducing a WP_Tag_Modifications
class only to hold two variables. We don't pass $this->modifications
around, only refer to it inside of the WP_HTML_Processor
. Therefore, a separate class doesn't seem that useful. I'd make $replacements
and $class_changes
a part of WP_HTML_Processor
.
|
||
|
||
public function remove_class( $class_name ) { | ||
$this->modifications->class_changes[ $class_name ] = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
|
||
|
||
public function remove_attribute( $name ) { | ||
if ( ! array_key_exists( $name, $this->scanner->attributes ) ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This always consumes all the attributes for the tag. I made #42485 only consume as many attributes as needed. That being said, the speed gains are likely negligible and your approach makes for a more readable code 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah I want to add that lazy behavior back in at some point. when I was building this I took the shortcut to make several of the functions easier.
$this->commit_class_changes(); | ||
$this->descriptor = WP_Tag_Find_Descriptor::parse( $query ); | ||
if ( false === $this->scanner->scan( $this->descriptor ) ) { | ||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return false
here breaks the fluid interface idea and forces the consumer to always check the return value. I'm a bit bummed to see calls like $p->find()->set_attribute()->find()...
go, but I guess it's the only reasonable approach. As a consumer, I wouldn't want to chain a few find()
calls only to have them all fail because the first one didn't find anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not completely attached to the fluid interface. I could give it up if we need because I think it's valuable to use a find()
inside a loop as the termination indicator.
$number = 1;
while ( $p->find( [ 'tag_name' => 'section' ] ) ) {
$p->set_attribute( 'data-section-id', $number++ );
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is our choice:
- A succinct fluid interface that will sometimes cause unexpected errors
- A slightly more verbose imperative interface that won't have cause unexpected errors
And in this case, I feel comfortable letting go of the fluid interface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel comfortable letting go of the fluid interface.
That seems fine, and while fluid interfaces have their own love, they also have their own controversy. At least for what it's worth I feel like a lot of PHP code follows this idiomatic pattern which looks like a fluid interface without being one.
$x = 'some value';
$x = some_tranform( $x );
$x = another_thing( $x, $y );
$x = with_more_intensity( $x );
$p = new GoGetter();
$p->go_get_ore();
$p->refine_it();
$p->shake_it();
$p->schmelt_it();
$p->dump_it();
(the pattern being an obvious consecutive repetition of assignment-to or calling from the same variable)
} | ||
|
||
|
||
public function find( $query ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel strongly about calling this find_next
, otherwise the name suggests I could do this:
$p = new WP_HTML_Processor('<div/><span/>');
$p->find(['tag_name'=>'span'])->set_attribute('id', 'span_id');
$p->find(['tag_name'=>'div'])->set_attribute('id', 'div_id');
Which is explicitly not supported. The documentation may explain it, but suggesting these nuances in the method names will save a lot of confusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused because both of those finds will fail. Are you saying that to you, the name find()
implies that each time it starts it will rewind to the beginning of the document?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would the first find
fail? It finds the first span
:
$p = new WP_HTML_Processor('<div/><span/>');
$p->find(['tag_name'=>'span'])->set_attribute('id', 'span_id');
echo $p;
// <div/><span id="span_id"/>%
Are you saying that to you, the name find() implies that each time it starts it will rewind to the beginning of the document?
Yes, to me it does imply that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would the first find fail? It finds the first span:
because in my head I read </div></span>
and not what you wrote 🤦♂️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, to me it does imply that.
Let's maybe not settle on this yet? I agree mostly with you, but also find find()
quite reasonable for this purpose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could keep it open for now, start the public API proposal, and see what others say. Maybe we'll even find a third option.
Out of curiosity I googled for "XML pull parser," which is related to what we do, and found xmlpull.org that has a method called next()
.
Perhaps next_tag( $query )
wouldn't be a bad compromise?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
next()
sounds pretty good actually based on how we use it, and I don't think we need _tag
in there especially if we end up with tag
in the class name, which I think seems more and more appropriate as we go since this is so focused on tag openers alone
} | ||
|
||
|
||
public function commit_class_changes() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public function commit_class_changes() { | |
private function commit_class_changes() { |
This doesn't have to be exposed as a public method.
} | ||
|
||
|
||
public function apply() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public function apply() { | |
private function apply() { |
This doesn't have to be exposed as a public method, our public API is __toString()
.
$existing_class = array_key_exists( 'class', $this->scanner->attributes ) | ||
? substr( | ||
$this->scanner->document, | ||
$this->scanner->attributes['class']->value->start, | ||
$this->scanner->attributes['class']->value->length | ||
) | ||
: ''; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went for storing values in #42485, but that's only useful in case of class names so storing just indices sounds better 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to get rid of the value token and instead just store start and end of the attribute value itself as well as the location where that's found (including the quotes)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and I've considered that for all tokens as well to avoid allocating more objects. I'm still puzzled by why we're getting any out of memory exceptions when running on big objects
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Weirdly, #42485 uses less memory despite doing pretty much the same thing as this PR. Perhaps the memory errors here are due to some very specific weird quirk? If you can share reproduction steps, I'll see if I can pinpoint it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for historical reference: the memory use came from the fact that I was storing modifications as a three-array of [int, int, string]
. when I converted to a class with start
, end
, and text
properties the memory use dropped to 5% of what it was with the array.
array_key_exists( $comparable_name, $this->modifications->class_changes ) && | ||
false === $this->modifications->class_changes[ $comparable_name ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
array_key_exists( $comparable_name, $this->modifications->class_changes ) && | |
false === $this->modifications->class_changes[ $comparable_name ] | |
array_key_exists( $comparable_name, $this->modifications->class_changes ) |
We can remove all the classes mentioned in class_changes
. This way, the code that adds the new classes a few lines later won't introduce duplicate names. Duplicates are fine, but why not avoid them if we can?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you raise an interesting point. that's very true and a good idea to do, but also, I'd like to avoid changing any HTML we don't have to. that means this code as written does neither.
but, we could remember the names we scan when removing and then only add in the ones which weren't there.
$existing_classes = array();
$without_removed_classes = preg_replace_callback(
…,
function ( $matches ) use ( &$existing_classes ) {
…
$existing_classes[ $comparable_name ] = true;
…
}
);
foreach ( $this->modifications->class_changes as $class => $should_include ) {
if ( $should_include && ! array_key_exists( $existing_classes, $class ) ) {
$new_class .= " {$class}";
}
}
I think either case (removing all classes preemptively or only adding the missing ones) would work. From a consumer point of view I like the idea of leaving existing classes where they are if all we care about is adding them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like leaving the existing classes where they are – let's remember the names we scan then!
Ah – and it would also be great to remove the class
attribute entirely if there are no classes left.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah – and it would also be great to remove the class attribute entirely if there are no classes left.
an even better idea!
public function apply() { | ||
$document = $this->scanner->document; | ||
|
||
$this->commit_class_changes(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: commit
sounds like applying class changes to the HTML. How about transform_class_changes_to_modifications
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I started with collapse
but then favored commit
since I was trying to say, "at this point no more class-based changes are allowed"
transform_class_changes_to_modifications
is indeed descriptive, but also very long and verbose 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Long and verbose is fine if it tells you what it does. It's called just once so it won't unnecessarily pollute the codebase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Long and verbose is fine if it tells you what it does.
Except when it's so long and verbose it's distracting :)
if ( $a[0] === $b[0] ) { | ||
return 0; | ||
} | ||
|
||
return $a[0] < $b[0] ? -1 : 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if ( $a[0] === $b[0] ) { | |
return 0; | |
} | |
return $a[0] < $b[0] ? -1 : 1; | |
return $b[0] - $a[0]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very good! should have seen that.
|
||
|
||
public function __toString() { | ||
return $this->apply(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd cache this result. From there, we could do two things:
- Check if there were any modifications requested since the last caching
- Introduce a "locked"/"finished" state in which you cannot apply any further updates anymore
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not something I want to do yet for the sake of simplicity of the code, but definitely something I thought about and planned on introducing at some point (if it still makes sense then)
* } | ||
* @return WP_Tag_Find_Descriptor Used by WP_HTML_Processor when scanning HTML. | ||
*/ | ||
public static function parse( $query ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this be a constructor instead of a static method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably, but at the time I wasn't sure if we'd want to create them through other means. I think there are still some questions I have about the role here. for instance, I do think if this were a construction I'd want to still call it like so
public function __construct( $query = array() ) {
$parsed = self::parse( $query );
$this->tag_name = $parsed->tag_name;
…
}
and at that point I don't see what the value is in having the constructor. my original intent was to make the parse
function failable, and maybe there's still a point in doing that, but for now it fails to safe defaults silently.
if this is the only way we'd want to create a descriptor then maybe we just convert it to a constructor, but I think for some purposes we might imagine creating one manually
$d = new Descriptor();
$d->tag_name = '<invalid';
* | ||
* @var array<string|callable> | ||
*/ | ||
public $data_attribute; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if this is an actual use-case. We may be able to remove the data attribute matching – let's keep an eye on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's right. part of me still thinks it might be more common than we anticipate, but we should try and do some discovery and see how common this is in practice with existing plugins, if we can
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, I'm vouching for adding features later instead of removing them – it's easier, especially with WordPress BC policy. A discovery is a good idea 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah that's a good plan. I'll remove references to it.
eventually I thought it could make sense if we have to go beyond tag name and class simply to add a function for the visitor. every day this looks more and more like a traditional visitor pattern and we could simply allow for that idiom. if we have to scan the entire document up until a termination we wouldn't be losing much by calling a callback on every tag anyway
function render_my_block( $html, $block ) {
return WP_Tag_Visitor( $html, function ( $tag ) use ( $block ) {
if ( ! $tag->is( 'div' ) ) {
return;
}
$tag->add_class( 'special-image-wrapper' );
if ( $block['attrs']['isDisabled'] ) {
$tag->set_attribute( 'disabled', '' );
}
return 'done';
} );
}
that's far less declarative but also way more flexible. come to think of it, this could be a way to "parse HTML" without parsing HTML, because eventually we could do something like expose opening and closing tag events, plus depth. I think that's like how XmlReader
and XmlWriter
work as a streaming parser. If someone wants to track DOM structure they can do so in the application code space and, for example, only track as much of it as they want. for instance, maybe they only want to find the first img
inside a div
- they can do that while ignoring all other document structure 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great, let's explore that!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dmsnell, the idea you shared with all the opportunities that WP_Tag_Visitor
could open is excellent. It would give maximum flexibility with minimal changes in the current implementation.
public $tag_name; | ||
|
||
|
||
/** | ||
* We're looking for a tag also containing this CSS class name, up to | ||
* the comparable equivalence of those names. If we're not looking for | ||
* a class name this property will be `null`. | ||
* | ||
* @var string|null | ||
*/ | ||
public $class_name; | ||
|
||
|
||
/** | ||
* We're looking for a tag that also has some data-attribute values. | ||
* If all we care about is the presence of one of those values this | ||
* will contain the name of that attribute, up to the comparable | ||
* equivalence of that name. If we also want to constrain the value | ||
* of that attribute we will have a pair containing the name of the | ||
* attribute (to equivalence) with a predicate function. If we're | ||
* not looking for a data-attribute this property will be `null`. | ||
* | ||
* Example: | ||
* <code> | ||
* // has a `data-wp-container` attribute | ||
* 'data-wp-container' | ||
* | ||
* // has a `data-wp-container` attribute with the value `core/group` | ||
* ['data-wp-block', 'core/group'] | ||
* | ||
* // has a `data-test-group` attribute with an even-valued id | ||
* ['data-test-group', function ( $group ) { return 0 === (int) $group % 2; }] | ||
* </code> | ||
* | ||
* @var array<string|callable> | ||
*/ | ||
public $data_attribute; | ||
|
||
|
||
/** | ||
* Used to skip matches in case we expect more than one to exist. | ||
* This constraint applies after all other constraints have held. | ||
* For example, to find the third `<div>` tag containing the | ||
* `wp-block` class name set `match_offset = 3`. If not provided | ||
* the default indication is to find the first match. | ||
* | ||
* @default 1 | ||
* @var int | ||
*/ | ||
public $match_offset; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public $tag_name; | |
/** | |
* We're looking for a tag also containing this CSS class name, up to | |
* the comparable equivalence of those names. If we're not looking for | |
* a class name this property will be `null`. | |
* | |
* @var string|null | |
*/ | |
public $class_name; | |
/** | |
* We're looking for a tag that also has some data-attribute values. | |
* If all we care about is the presence of one of those values this | |
* will contain the name of that attribute, up to the comparable | |
* equivalence of that name. If we also want to constrain the value | |
* of that attribute we will have a pair containing the name of the | |
* attribute (to equivalence) with a predicate function. If we're | |
* not looking for a data-attribute this property will be `null`. | |
* | |
* Example: | |
* <code> | |
* // has a `data-wp-container` attribute | |
* 'data-wp-container' | |
* | |
* // has a `data-wp-container` attribute with the value `core/group` | |
* ['data-wp-block', 'core/group'] | |
* | |
* // has a `data-test-group` attribute with an even-valued id | |
* ['data-test-group', function ( $group ) { return 0 === (int) $group % 2; }] | |
* </code> | |
* | |
* @var array<string|callable> | |
*/ | |
public $data_attribute; | |
/** | |
* Used to skip matches in case we expect more than one to exist. | |
* This constraint applies after all other constraints have held. | |
* For example, to find the third `<div>` tag containing the | |
* `wp-block` class name set `match_offset = 3`. If not provided | |
* the default indication is to find the first match. | |
* | |
* @default 1 | |
* @var int | |
*/ | |
public $match_offset; | |
private $tag_name; | |
/** | |
* We're looking for a tag also containing this CSS class name, up to | |
* the comparable equivalence of those names. If we're not looking for | |
* a class name this property will be `null`. | |
* | |
* @var string|null | |
*/ | |
private $class_name; | |
/** | |
* We're looking for a tag that also has some data-attribute values. | |
* If all we care about is the presence of one of those values this | |
* will contain the name of that attribute, up to the comparable | |
* equivalence of that name. If we also want to constrain the value | |
* of that attribute we will have a pair containing the name of the | |
* attribute (to equivalence) with a predicate function. If we're | |
* not looking for a data-attribute this property will be `null`. | |
* | |
* Example: | |
* <code> | |
* // has a `data-wp-container` attribute | |
* 'data-wp-container' | |
* | |
* // has a `data-wp-container` attribute with the value `core/group` | |
* ['data-wp-block', 'core/group'] | |
* | |
* // has a `data-test-group` attribute with an even-valued id | |
* ['data-test-group', function ( $group ) { return 0 === (int) $group % 2; }] | |
* </code> | |
* | |
* @var array<string|callable> | |
*/ | |
private $data_attribute; | |
/** | |
* Used to skip matches in case we expect more than one to exist. | |
* This constraint applies after all other constraints have held. | |
* For example, to find the third `<div>` tag containing the | |
* `wp-block` class name set `match_offset = 3`. If not provided | |
* the default indication is to find the first match. | |
* | |
* @default 1 | |
* @var int | |
*/ | |
private $match_offset; |
I'd make all the properties private to limit the public API surface.
* @param $preg_match | ||
* @return self | ||
*/ | ||
public static function from_preg_match( $preg_match ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great shorthand
PREG_OFFSET_CAPTURE, | ||
$this->start_at | ||
) ) { | ||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to reset the state if there's no match:
return false; | |
$this->start_at = null; | |
$this->tag_name = null; | |
$this->tag_start = null; | |
$this->attributes = array(); | |
return false; |
Otherwise the following code adds two attributes to the span:
$p = new WP_HTML_Processor( '<span/>' );
$p->find( ['tag_name'=>'span'] );
$p->set_attribute( 'attr-1', 'a' );
$p->find( ['tag_name'=>'div'] );
$p->set_attribute( 'attr-2', 'a' );
echo $p;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very good catch. I will update this
switch ( $this->document[ $this->start_at ] ) { | ||
case '"': | ||
$this->start_at += 1; | ||
$pattern = '~(?P<VALUE>[^"]*)"~Smiu'; | ||
break; | ||
|
||
case "'": | ||
$this->start_at += 1; | ||
$pattern = "~(?P<VALUE>[^']*)'~Smiu"; | ||
break; | ||
|
||
default: | ||
$pattern = '~(?P<VALUE>[^\t\x{0a}\x{0c}\x{0d} >]*)~Smiu'; | ||
break; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: We can avoid the repetition:
switch ( $this->document[ $this->start_at ] ) { | |
case '"': | |
$this->start_at += 1; | |
$pattern = '~(?P<VALUE>[^"]*)"~Smiu'; | |
break; | |
case "'": | |
$this->start_at += 1; | |
$pattern = "~(?P<VALUE>[^']*)'~Smiu"; | |
break; | |
default: | |
$pattern = '~(?P<VALUE>[^\t\x{0a}\x{0c}\x{0d} >]*)~Smiu'; | |
break; | |
} | |
$quote_maybe = $this->document[ $this->start_at ]; | |
switch ( $quote_maybe ) { | |
case '"': | |
case "'": | |
$this->start_at += 1; | |
$pattern = "~(?P<VALUE>[^$quote_maybe]*)$quote_maybe~Smiu"; | |
break; | |
default: | |
$pattern = '~(?P<VALUE>[^\t\x{0a}\x{0c}\x{0d} >]*)~Smiu'; | |
break; | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was comfortable with the repetition for the purpose of searchability within the code and also of having a static PCRE pattern. I've also considered taking different branching here instead of PCRE searches but didn't get to that yet.
if ( 1 !== preg_match( | ||
'~[\t\x{0a}\x{0c}\x{0d} ]*(?P<NAME>=?[^=/>\t\x{0C} ]*)~Smiu', | ||
$this->document, | ||
$attribute_match, | ||
PREG_OFFSET_CAPTURE, | ||
$this->start_at | ||
) ) { | ||
return false; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With three occurences of preg_match
this is mostly a matter of taste, but FYI in #42485 I used a shorthand:
$attribute_match = $this->match( '~[\t\x{0a}\x{0c}\x{0d} ]*(?P<NAME>=?[^=/>\t\x{0C} ]*)~Smiu' );
It's possible because every call to preg_match
uses the exact same of arguments aside of the regexp. $this->match
also throws a No_Match
exception on failure. Feel free to either ignore this or use it as an inspiration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah as noted in #42485 I was preferring to leave the duplication in to keep the preg_
functions concrete; for searchability and comprehension within the code as well as for better IDE integration.
return false; | ||
} | ||
|
||
list( $full_match, $start_at ) = $attribute_match[0]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick – this could take advantage of the token logic:
list( $full_match, $start_at ) = $attribute_match[0]; | |
$full_match = WP_HTML_Scanner_Token::from_preg_match( $attribute_match[0] ); | |
// ... | |
$this->start_at = $full_match->end; // end would have to be added |
|
||
$this->start_at = $start_at + strlen( $full_match ); | ||
if ( '=' !== $this->document[ $this->start_at ] ) { | ||
return new WP_HTML_Attribute_Token( $name_token, null, $name_token->start, $start_at + strlen( $full_match ) ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar nitpick as above – it would be nice to have an end
property:
return new WP_HTML_Attribute_Token( $name_token, null, $name_token->start, $start_at + strlen( $full_match ) ); | |
return new WP_HTML_Attribute_Token( $name_token, null, $name_token->start, $name_token->end ); |
* lead us to skip over other tags and lose track of our place. So we need to search for | ||
* _every_ tag and then check after we find one if it's the one we are looking for. | ||
*/ | ||
"~<!--(?>.*?-->)|<!\[CDATA\[(?>.*?>)|<\?(?>.*?)>|<(?P<TAG>[a-z][^\t\x{0A}\x{0C} /?>]*)~Smui", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doctype seems to be the only other Markup declaration open state case that we didn't account for yet – which is a good news, I worried we'd have to handle many more of these.
* lead us to skip over other tags and lose track of our place. So we need to search for | ||
* _every_ tag and then check after we find one if it's the one we are looking for. | ||
*/ | ||
"~<!--(?>.*?-->)|<!\[CDATA\[(?>.*?>)|<\?(?>.*?)>|<(?P<TAG>[a-z][^\t\x{0A}\x{0C} /?>]*)~Smui", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"~<!--(?>.*?-->)|<!\[CDATA\[(?>.*?>)|<\?(?>.*?)>|<(?P<TAG>[a-z][^\t\x{0A}\x{0C} /?>]*)~Smui", | |
"~<!--(?>.*?-->)|<!\[CDATA\[(?>.*?\]\]>)|<\?(?>.*?)>|<(?P<TAG>[a-z][^\t\x{0A}\x{0C} /?>]*)~Smui", |
public function find_next_attribute() { | ||
// Find the attribute name | ||
if ( 1 !== preg_match( | ||
'~[\t\x{0a}\x{0c}\x{0d} ]*(?P<NAME>=?[^=/>\t\x{0C} ]*)~Smiu', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'~[\t\x{0a}\x{0c}\x{0d} ]*(?P<NAME>=?[^=/>\t\x{0C} ]*)~Smiu', | |
'~[\t\x{0a}\x{0c} ]*(?P<NAME>=?[^=/>\t\x{0C} ]*)~Smiu', |
The Before attribute name state ignores the following characters:
- U+0009 CHARACTER TABULATION (tab)
- U+000A LINE FEED (LF)
- U+000C FORM FEED (FF)
- U+0020 SPACE
But it does not mention U+000D
public function find_next_attribute() { | ||
// Find the attribute name | ||
if ( 1 !== preg_match( | ||
'~[\t\x{0a}\x{0c}\x{0d} ]*(?P<NAME>=?[^=/>\t\x{0C} ]*)~Smiu', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'~[\t\x{0a}\x{0c}\x{0d} ]*(?P<NAME>=?[^=/>\t\x{0C} ]*)~Smiu', | |
'~[\t\x{0a}\x{0c}\x{0d} ]*(?P<NAME>=?[^=/>\t\x{0A}x{0C} ]*)~Smiu', |
The Attribute Name State switches to after attribute name
on any of the following:
- U+0009 CHARACTER TABULATION (tab)
- U+000A LINE FEED (LF)
- U+000C FORM FEED (FF)
- U+0020 SPACE
- U+002F SOLIDUS (/)
- U+003E GREATER-THAN SIGN (>)
- EOF
So I added the missing U+000A
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated in a6b7d67
|
||
$name_token = WP_HTML_Scanner_Token::from_preg_match( $attribute_match['NAME'] ); | ||
|
||
$this->start_at = $start_at + strlen( $full_match ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to ignore the following whitespace characters around the =
just as the After attribute name state
and Before attribute value state specify:
- U+0009 CHARACTER TABULATION (tab)
- U+000A LINE FEED (LF)
- U+000C FORM FEED (FF)
- U+0020 SPACE
This markup correctly defines an attribute attr
with a value value
:
<div attr= "value" />
However, the processor currently doesn't deal with that as expected:
$p = new WP_HTML_Processor( '<div attr = "old" />' );
$p->find( ['tag_name'=>'span'] );
$p->set_attribute( 'attr', 'updated' );
echo $p;
// <div attr="updated" = "old" />
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't remember why we had to alternations for the leading attribute name =
and so I removed it, but now I think this is the very reason, to swallow the post-attribute-equal whitespace. Will update the pattern 👍
break; | ||
|
||
default: | ||
$pattern = '~(?P<VALUE>[^\t\x{0a}\x{0c}\x{0d} >]*)~Smiu'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
$pattern = '~(?P<VALUE>[^\t\x{0a}\x{0c}\x{0d} >]*)~Smiu'; | |
$pattern = '~(?P<VALUE>[^\t\x{0a}\x{0c} >]*)~Smiu'; |
U+000D
is not on the list of characters ignored by the Attribute value (unquoted) state. These are:
- U+0009 CHARACTER TABULATION (tab)
- U+000A LINE FEED (LF)
- U+000C FORM FEED (FF)
- U+0020 SPACE
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated in a6b7d67
Co-authored by: @adamziel
The modifications are only ever class names and textual patching and the intermediate class wasn't as useful as it first seemed. Co-authored by: @adamziel
A class ends up _significantly_ more efficient in terms of both memory use and (to a lesser extent) runtime performance. In testing with the HTML5 spec (single-page.html) adding a class and an attribute to every tag the following was observed on my laptop at the time of writing the patch: | | Runtime | MB before toString | |-------|---------|--------------------| | array | 2400ms | 158 | | class | 2100ms | 8 | | diff | 88% | 5% | While not a scientific experiment I repeated this multiple times and the results stayed within the approximate range in the table above.
…ments. This avoids doing a full sort at the end of the processing run, which for large operations substantially reduces runtime. On the `single-page.html` document, adding a class and attribute to every tag, this took the runtime from 2100ms to 1400ms, or to around 67% of the previous runtime.
In testing on `single-page.html` the replacements array went from using 121 MB to 98 MB, down to 81% of its pre-patch usage. Runtime performance wasn't notablly different.
Closing in favor of #42485 |
What
Currently provides some data structures necessary for tracking and modifying
an HTML document. No implementation code exists yet and this is mainly here
to establish system design and documentation.
Alternative implementation to associated #42485
Why?
There exists too many fragile and duplicate backend RegExp matchers to replace
individual HTML elements within block code.
How?
Provides a standard and highly-restricted HTML modification API to separate the
act of finding HTML tokens from replacing or updating HTML values.
Testing Instructions
There is nothing yet to test.