Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Introduce HTML processor for server-side block modification on render #42507

Closed
wants to merge 18 commits into from

Conversation

dmsnell
Copy link
Member

@dmsnell dmsnell commented Jul 18, 2022

What

Currently provides some data structures necessary for tracking and modifying
an HTML document. No implementation code exists yet and this is mainly here
to establish system design and documentation.

Alternative implementation to associated #42485

Why?

There exists too many fragile and duplicate backend RegExp matchers to replace
individual HTML elements within block code.

How?

Provides a standard and highly-restricted HTML modification API to separate the
act of finding HTML tokens from replacing or updating HTML values.

Testing Instructions

There is nothing yet to test.

…ender

Currently provides some data structures necessary for tracking and modifying
an HTML document. No implementation code exists yet and this is mainly here
to establish system design and documentation.
@dmsnell dmsnell requested review from adamziel and gziolo July 18, 2022 14:02
* equivalence of those names (lower-cased, Unicode-normalized, etc...).
* If we're looking for "any tag" then this property will be `null`.
*
* h1...h6 are special since they are variations of the same base tag.
Copy link
Member

@gziolo gziolo Jul 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you considered using the list of HTML tags instead of a single value for headings? Example:

$processor->find( 'h1,h2,h3,h4' );

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

considered it yes, though I would have gone ->find( [ 'h1', 'h2', 'h3', 'h4' ] )

so far this has been more of an exercise in discovering how little functionality we can expose while being useful. we discussed the idea of a list and wondered if there was any other HTML tag like the heading tags, where multiple tag names all basically relate to the same tag with a different attribute value. if there are not other cases like that I'd almost prefer not exposing a list until we know for sure we need one.

I can see where it could be useful to say "give me the div or the p with the most-important class, but then again if we limit our search for that class maybe that's enough and we don't have to check the tag.

what do you think?

performance wise I don't think it should be that much of an impact with this current design of scanning the full HTML source, since we have to traverse every tag linearly as we come to it (vs. being able to search for something like <div, which we can't do without opening up vulnerabilities for common cases of HTML produced by the editor).

still, every bit of flexibility makes me nervous unless I can see a clear need for it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is also ul vs ol where those tags are used for different styling.

Less related to that would be all the tags with semantic meaning used instead of div like header, main, section, article, aside or footer.

It's also hard to tell how this API would be used with custom blocks and how flexible people expect it to be. If the primary goal is to allow changing the root element of the block content, then using the most-important class would be sufficient in most of cases.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think most cases can be handled appropriately with a class. The outermost and first tag can be found with $processor->find( [] ) as well so that's not an obstacle.

I considered ol and ul but thought they generally are different enough in meaning that they don't overlap the same way h1 through h6 do (for example, there was plenty of debate about replacing h1 through h6 with just h and ignoring the numbered heading level when constructing document structure, something the specification allowed/mandated? but was never adopted). The same goes for the other elements; other tags have related semantics, which is often captured in the placement of an element into its content category, but these are distinct elements with distinct semantics. The confusion over whether a certain sub-sub-section should be h3 or h4 and the mess of properly styling them isn't something we face with article and main and aside I believe; I think they are qualitatively different, and again, nobody is confused by whether we should use ol or ul to represent an ordered list, because they are related but different (Gutenberg possibly is the exception in merging both kinds of lists into one block and using an attribute to distinguish them).

So maybe eventually we expand this but I'd like to try and see how far we can get without it. I haven't supported the h exception yet and maybe there's a case to make that we never do.

Copy link
Contributor

@adamziel adamziel Jul 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gziolo the point about alternative markup depending on the attributes is spot-on! Here's an excerpt from the list block:

const tagName = ordered ? 'ol' : 'ul';

Also, the heading block may use either of the h-tags.

Already at this point we'd need to either introduce special tokens for "ol/ul" and "h1-6", or simply allow an array of tags as an argument. If we do the former, we'll introduce a "silent" dependency between the HTML parser and the block library. I vote for the latter even if it opens other possibilities.

In addition, the following blocks have a tagName attribute:

  • template-part
  • comment
  • group
  • query

Copy link
Contributor

@adamziel adamziel Jul 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point @dmsnell. I was about to counter that it would be more convenient to just query for ['ol','ul'], but it's not very future-proof. I like matching the exact tag as you proposed. In fact, we could do the same thing in the heading block since level is a block attribute as well. Even if it wasn't, we could probably just match the first tag in the markup.

Copy link
Member

@gziolo gziolo Jul 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we necessarily need to support an option to pass an array of tags. @dmsnell made a great point that when rendering the block, attributes help to make an exact match for which tag should be processed.

I left my initial question came from the fact that I didn't fully understand why there is a need for the special h tag. It feels like we can live without both h and [ 'h1', 'h2' ] expressions for now.

Aside. If someone really needs to process all six heading types they can do it with:

$p = new WP_Tag_Thing( $html );
for ( $i = 1; $i <= 6; $i++ ) {
    while ( $p->find( [ 'tag_name' => 'h' + $i  ] ) ) {
        $p->add_class( 'todo' );
    }
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gziolo not quite, this is "pull" parser which means it only moves forward. find means "Find me the next tag, starting where I finished last". In other words, this would only process the div:

$p = new WP_Tag_Thing('<span><div>');
$p->find('div')->add_class('todo');
$p->find('span')->add_class('todo');

While this would process both tags:

$p = new WP_Tag_Thing('<span><div>');
$p->find('span')->add_class('todo');
$p->find('div')->add_class('todo');

I'd like to name that method find_next, next_tag, next, or something to that effect.

Copy link
Member

@gziolo gziolo Jul 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about the following:

for ( $i = 1; $i <= 6; $i++ ) {
    $p = new WP_Tag_Thing( $html );
    while ( $p->find( [ 'tag_name' => 'h' + $i  ] ) ) {
        $p->add_class( 'todo' );
    }
    $html = (string) $p;
}

It's mainly about the general idea. I didn't run the code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why there is a need for the special h tag

it's not felt yet, but I anticipated it because the h tags are unique in HTML in their purpose and realization. I don't think any other tags work the same way they do where their semantics are contextual (based on the surrounding content vs. based on the tag itself).

If someone really needs to process all six heading types they can do it with:

This isn't actually possible with the interface we've exposed here because we only go once and linearly through the document, so for example, if we look for h1 and there are none in the document, we will never find an h2 if it's there because we would have already scanned to the end.

To find any of the h tags we'd have to re-process the entire document from the beginning until we find one.

for ( $i = 1; $i <= 6; $i++ ) {
	$p = new WP_Tag_Walker( $html );
	if ( $p->find( [ 'tag_name' => "h{$i}" ] ) ) {
		return $p->add_class( 'todo' );
	}
}

for small blocks this may not be too bad, but note that if we want chain these together we'd have to sequence them.

$output = $html;
foreach ( array( 'ol', 'li' ) as $tag_name ) {
	$p = new WP_Html_Tag_Modifier( $output );
	while ( $p->find( array( 'tag_name' => $tag_name ) ) ) {
		$p->find( array( 'tag_name' => 'li' ) );
		$p->add_class( 'first-item' );
	}
	$output .= $p;
}
return $output.

@gziolo gziolo added the [Type] Technical Prototype Offers a technical exploration into an idea as an example of what's possible label Jul 21, 2022
$this->modifications->replacements[] = array(
$insert_at,
$insert_at,
' ' . self::serialize_attribute( $name, $value )
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with esc_attr, but this one should be faster while doing the job just as good. Good call!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I plan on adding esc_attr - serialize_attribute here is somewhat of a placeholder with a name so we can go back and refine that concept. I wanted to also make some checks about URL fields and other encoding issues, for example, all of which I think would go under serialize_attribute



public function add_class( $class_name ) {
$this->modifications->class_changes[ $class_name ] = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used a similar data structure in #42485! :-)

}


public function set_attribute( $name, $value ) {
Copy link
Contributor

@adamziel adamziel Jul 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I intentionally removed the distinction between set and add attribute in #42485 – this distinction seems more confusing than useful the consumer of this API. I'd remove the add_attribute method or at least make it private.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't intend for people to use it, but I generally avoid creating private methods and properties unless I know for certain that there's a compelling reason to do so. just practice I've learned over time of almost always regretting making things private.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I normally don't like making things private as well. However, in this case we're talking about new public APIs and I'd rather avoid locking ourselves into any specific implementation. The more public methods we expose, the more constraints we'll have when iterating on it in the future.

}


class WP_HTML_Scanner {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd merge this and WP_HTML_Processor into a single class.

True, scanning is conceptually a different operation than updating the HTML. However, they are related closely enough to require public access to the properties of this class.

A single class = minimal public API. And if we need a separate Scanner one day, it would be easy to carve one out from the monolith.

/** @var WP_HTML_Attribute_Token $existing */
$existing = $this->scanner->attributes[ $name ];

$this->modifications->replacements[] = array(
Copy link
Contributor

@adamziel adamziel Jul 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$this->modifications->replacements reads nice, but demands introducing a WP_Tag_Modifications class only to hold two variables. We don't pass $this->modifications around, only refer to it inside of the WP_HTML_Processor. Therefore, a separate class doesn't seem that useful. I'd make $replacements and $class_changes a part of WP_HTML_Processor.



public function remove_class( $class_name ) {
$this->modifications->class_changes[ $class_name ] = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍



public function remove_attribute( $name ) {
if ( ! array_key_exists( $name, $this->scanner->attributes ) ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This always consumes all the attributes for the tag. I made #42485 only consume as many attributes as needed. That being said, the speed gains are likely negligible and your approach makes for a more readable code 👍

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I want to add that lazy behavior back in at some point. when I was building this I took the shortcut to make several of the functions easier.

$this->commit_class_changes();
$this->descriptor = WP_Tag_Find_Descriptor::parse( $query );
if ( false === $this->scanner->scan( $this->descriptor ) ) {
return false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return false here breaks the fluid interface idea and forces the consumer to always check the return value. I'm a bit bummed to see calls like $p->find()->set_attribute()->find()... go, but I guess it's the only reasonable approach. As a consumer, I wouldn't want to chain a few find() calls only to have them all fail because the first one didn't find anything.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not completely attached to the fluid interface. I could give it up if we need because I think it's valuable to use a find() inside a loop as the termination indicator.

$number = 1;
while ( $p->find( [ 'tag_name' => 'section' ] ) ) {
    $p->set_attribute( 'data-section-id', $number++ );
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is our choice:

  • A succinct fluid interface that will sometimes cause unexpected errors
  • A slightly more verbose imperative interface that won't have cause unexpected errors

And in this case, I feel comfortable letting go of the fluid interface.

Copy link
Member Author

@dmsnell dmsnell Jul 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel comfortable letting go of the fluid interface.

That seems fine, and while fluid interfaces have their own love, they also have their own controversy. At least for what it's worth I feel like a lot of PHP code follows this idiomatic pattern which looks like a fluid interface without being one.

$x = 'some value';
$x = some_tranform( $x );
$x = another_thing( $x, $y );
$x = with_more_intensity( $x );

$p = new GoGetter();
$p->go_get_ore();
$p->refine_it();
$p->shake_it();
$p->schmelt_it();
$p->dump_it();

(the pattern being an obvious consecutive repetition of assignment-to or calling from the same variable)

}


public function find( $query ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel strongly about calling this find_next, otherwise the name suggests I could do this:

$p = new WP_HTML_Processor('<div/><span/>');
$p->find(['tag_name'=>'span'])->set_attribute('id', 'span_id');
$p->find(['tag_name'=>'div'])->set_attribute('id', 'div_id');

Which is explicitly not supported. The documentation may explain it, but suggesting these nuances in the method names will save a lot of confusion.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused because both of those finds will fail. Are you saying that to you, the name find() implies that each time it starts it will rewind to the beginning of the document?

Copy link
Contributor

@adamziel adamziel Jul 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would the first find fail? It finds the first span:

$p = new WP_HTML_Processor('<div/><span/>');
$p->find(['tag_name'=>'span'])->set_attribute('id', 'span_id');
echo $p;

// <div/><span id="span_id"/>%

Are you saying that to you, the name find() implies that each time it starts it will rewind to the beginning of the document?

Yes, to me it does imply that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would the first find fail? It finds the first span:

because in my head I read </div></span> and not what you wrote 🤦‍♂️

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, to me it does imply that.

Let's maybe not settle on this yet? I agree mostly with you, but also find find() quite reasonable for this purpose.

Copy link
Contributor

@adamziel adamziel Jul 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could keep it open for now, start the public API proposal, and see what others say. Maybe we'll even find a third option.

Out of curiosity I googled for "XML pull parser," which is related to what we do, and found xmlpull.org that has a method called next().

Perhaps next_tag( $query ) wouldn't be a bad compromise?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

next() sounds pretty good actually based on how we use it, and I don't think we need _tag in there especially if we end up with tag in the class name, which I think seems more and more appropriate as we go since this is so focused on tag openers alone

}


public function commit_class_changes() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public function commit_class_changes() {
private function commit_class_changes() {

This doesn't have to be exposed as a public method.

}


public function apply() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public function apply() {
private function apply() {

This doesn't have to be exposed as a public method, our public API is __toString().

Comment on lines +162 to +168
$existing_class = array_key_exists( 'class', $this->scanner->attributes )
? substr(
$this->scanner->document,
$this->scanner->attributes['class']->value->start,
$this->scanner->attributes['class']->value->length
)
: '';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went for storing values in #42485, but that's only useful in case of class names so storing just indices sounds better 👍

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to get rid of the value token and instead just store start and end of the attribute value itself as well as the location where that's found (including the quotes)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and I've considered that for all tokens as well to avoid allocating more objects. I'm still puzzled by why we're getting any out of memory exceptions when running on big objects

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Weirdly, #42485 uses less memory despite doing pretty much the same thing as this PR. Perhaps the memory errors here are due to some very specific weird quirk? If you can share reproduction steps, I'll see if I can pinpoint it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for historical reference: the memory use came from the fact that I was storing modifications as a three-array of [int, int, string]. when I converted to a class with start, end, and text properties the memory use dropped to 5% of what it was with the array.

Comment on lines 178 to 179
array_key_exists( $comparable_name, $this->modifications->class_changes ) &&
false === $this->modifications->class_changes[ $comparable_name ]
Copy link
Contributor

@adamziel adamziel Jul 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
array_key_exists( $comparable_name, $this->modifications->class_changes ) &&
false === $this->modifications->class_changes[ $comparable_name ]
array_key_exists( $comparable_name, $this->modifications->class_changes )

We can remove all the classes mentioned in class_changes. This way, the code that adds the new classes a few lines later won't introduce duplicate names. Duplicates are fine, but why not avoid them if we can?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you raise an interesting point. that's very true and a good idea to do, but also, I'd like to avoid changing any HTML we don't have to. that means this code as written does neither.

but, we could remember the names we scan when removing and then only add in the ones which weren't there.

$existing_classes = array();
$without_removed_classes = preg_replace_callback(
    …,
    function ( $matches ) use ( &$existing_classes ) {
        …

        $existing_classes[ $comparable_name ] = true;

        …
    }
);

foreach ( $this->modifications->class_changes as $class => $should_include ) {
	if ( $should_include && ! array_key_exists( $existing_classes, $class ) ) {
		$new_class .= " {$class}";
	}
}

I think either case (removing all classes preemptively or only adding the missing ones) would work. From a consumer point of view I like the idea of leaving existing classes where they are if all we care about is adding them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like leaving the existing classes where they are – let's remember the names we scan then!

Ah – and it would also be great to remove the class attribute entirely if there are no classes left.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah – and it would also be great to remove the class attribute entirely if there are no classes left.
an even better idea!

public function apply() {
$document = $this->scanner->document;

$this->commit_class_changes();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: commit sounds like applying class changes to the HTML. How about transform_class_changes_to_modifications?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started with collapse but then favored commit since I was trying to say, "at this point no more class-based changes are allowed"

transform_class_changes_to_modifications is indeed descriptive, but also very long and verbose 🤔

Copy link
Contributor

@adamziel adamziel Jul 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Long and verbose is fine if it tells you what it does. It's called just once so it won't unnecessarily pollute the codebase.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Long and verbose is fine if it tells you what it does.

Except when it's so long and verbose it's distracting :)

Comment on lines 213 to 217
if ( $a[0] === $b[0] ) {
return 0;
}

return $a[0] < $b[0] ? -1 : 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if ( $a[0] === $b[0] ) {
return 0;
}
return $a[0] < $b[0] ? -1 : 1;
return $b[0] - $a[0];

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very good! should have seen that.



public function __toString() {
return $this->apply();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd cache this result. From there, we could do two things:

  • Check if there were any modifications requested since the last caching
  • Introduce a "locked"/"finished" state in which you cannot apply any further updates anymore

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not something I want to do yet for the sake of simplicity of the code, but definitely something I thought about and planned on introducing at some point (if it still makes sense then)

* }
* @return WP_Tag_Find_Descriptor Used by WP_HTML_Processor when scanning HTML.
*/
public static function parse( $query ) {
Copy link
Contributor

@adamziel adamziel Jul 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be a constructor instead of a static method?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably, but at the time I wasn't sure if we'd want to create them through other means. I think there are still some questions I have about the role here. for instance, I do think if this were a construction I'd want to still call it like so

public function __construct( $query = array() ) {
    $parsed = self::parse( $query );
    $this->tag_name = $parsed->tag_name;
    …
}

and at that point I don't see what the value is in having the constructor. my original intent was to make the parse function failable, and maybe there's still a point in doing that, but for now it fails to safe defaults silently.

if this is the only way we'd want to create a descriptor then maybe we just convert it to a constructor, but I think for some purposes we might imagine creating one manually

$d = new Descriptor();
$d->tag_name = '<invalid';

*
* @var array<string|callable>
*/
public $data_attribute;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this is an actual use-case. We may be able to remove the data attribute matching – let's keep an eye on it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's right. part of me still thinks it might be more common than we anticipate, but we should try and do some discovery and see how common this is in practice with existing plugins, if we can

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, I'm vouching for adding features later instead of removing them – it's easier, especially with WordPress BC policy. A discovery is a good idea 👍

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah that's a good plan. I'll remove references to it.

eventually I thought it could make sense if we have to go beyond tag name and class simply to add a function for the visitor. every day this looks more and more like a traditional visitor pattern and we could simply allow for that idiom. if we have to scan the entire document up until a termination we wouldn't be losing much by calling a callback on every tag anyway

function render_my_block( $html, $block ) {
    return WP_Tag_Visitor( $html, function ( $tag ) use ( $block ) {
        if ( ! $tag->is( 'div' ) ) {
            return;
        }

        $tag->add_class( 'special-image-wrapper' );
        if ( $block['attrs']['isDisabled'] ) {
            $tag->set_attribute( 'disabled', '' );
        }

        return 'done';
    } );
}

that's far less declarative but also way more flexible. come to think of it, this could be a way to "parse HTML" without parsing HTML, because eventually we could do something like expose opening and closing tag events, plus depth. I think that's like how XmlReader and XmlWriter work as a streaming parser. If someone wants to track DOM structure they can do so in the application code space and, for example, only track as much of it as they want. for instance, maybe they only want to find the first img inside a div - they can do that while ignoring all other document structure 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, let's explore that!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dmsnell, the idea you shared with all the opportunities that WP_Tag_Visitor could open is excellent. It would give maximum flexibility with minimal changes in the current implementation.

Comment on lines +258 to +307
public $tag_name;


/**
* We're looking for a tag also containing this CSS class name, up to
* the comparable equivalence of those names. If we're not looking for
* a class name this property will be `null`.
*
* @var string|null
*/
public $class_name;


/**
* We're looking for a tag that also has some data-attribute values.
* If all we care about is the presence of one of those values this
* will contain the name of that attribute, up to the comparable
* equivalence of that name. If we also want to constrain the value
* of that attribute we will have a pair containing the name of the
* attribute (to equivalence) with a predicate function. If we're
* not looking for a data-attribute this property will be `null`.
*
* Example:
* <code>
* // has a `data-wp-container` attribute
* 'data-wp-container'
*
* // has a `data-wp-container` attribute with the value `core/group`
* ['data-wp-block', 'core/group']
*
* // has a `data-test-group` attribute with an even-valued id
* ['data-test-group', function ( $group ) { return 0 === (int) $group % 2; }]
* </code>
*
* @var array<string|callable>
*/
public $data_attribute;


/**
* Used to skip matches in case we expect more than one to exist.
* This constraint applies after all other constraints have held.
* For example, to find the third `<div>` tag containing the
* `wp-block` class name set `match_offset = 3`. If not provided
* the default indication is to find the first match.
*
* @default 1
* @var int
*/
public $match_offset;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public $tag_name;
/**
* We're looking for a tag also containing this CSS class name, up to
* the comparable equivalence of those names. If we're not looking for
* a class name this property will be `null`.
*
* @var string|null
*/
public $class_name;
/**
* We're looking for a tag that also has some data-attribute values.
* If all we care about is the presence of one of those values this
* will contain the name of that attribute, up to the comparable
* equivalence of that name. If we also want to constrain the value
* of that attribute we will have a pair containing the name of the
* attribute (to equivalence) with a predicate function. If we're
* not looking for a data-attribute this property will be `null`.
*
* Example:
* <code>
* // has a `data-wp-container` attribute
* 'data-wp-container'
*
* // has a `data-wp-container` attribute with the value `core/group`
* ['data-wp-block', 'core/group']
*
* // has a `data-test-group` attribute with an even-valued id
* ['data-test-group', function ( $group ) { return 0 === (int) $group % 2; }]
* </code>
*
* @var array<string|callable>
*/
public $data_attribute;
/**
* Used to skip matches in case we expect more than one to exist.
* This constraint applies after all other constraints have held.
* For example, to find the third `<div>` tag containing the
* `wp-block` class name set `match_offset = 3`. If not provided
* the default indication is to find the first match.
*
* @default 1
* @var int
*/
public $match_offset;
private $tag_name;
/**
* We're looking for a tag also containing this CSS class name, up to
* the comparable equivalence of those names. If we're not looking for
* a class name this property will be `null`.
*
* @var string|null
*/
private $class_name;
/**
* We're looking for a tag that also has some data-attribute values.
* If all we care about is the presence of one of those values this
* will contain the name of that attribute, up to the comparable
* equivalence of that name. If we also want to constrain the value
* of that attribute we will have a pair containing the name of the
* attribute (to equivalence) with a predicate function. If we're
* not looking for a data-attribute this property will be `null`.
*
* Example:
* <code>
* // has a `data-wp-container` attribute
* 'data-wp-container'
*
* // has a `data-wp-container` attribute with the value `core/group`
* ['data-wp-block', 'core/group']
*
* // has a `data-test-group` attribute with an even-valued id
* ['data-test-group', function ( $group ) { return 0 === (int) $group % 2; }]
* </code>
*
* @var array<string|callable>
*/
private $data_attribute;
/**
* Used to skip matches in case we expect more than one to exist.
* This constraint applies after all other constraints have held.
* For example, to find the third `<div>` tag containing the
* `wp-block` class name set `match_offset = 3`. If not provided
* the default indication is to find the first match.
*
* @default 1
* @var int
*/
private $match_offset;

I'd make all the properties private to limit the public API surface.

* @param $preg_match
* @return self
*/
public static function from_preg_match( $preg_match ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great shorthand

PREG_OFFSET_CAPTURE,
$this->start_at
) ) {
return false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to reset the state if there's no match:

Suggested change
return false;
$this->start_at = null;
$this->tag_name = null;
$this->tag_start = null;
$this->attributes = array();
return false;

Otherwise the following code adds two attributes to the span:

$p = new WP_HTML_Processor( '<span/>' );
$p->find( ['tag_name'=>'span'] );
$p->set_attribute( 'attr-1', 'a' );
$p->find( ['tag_name'=>'div'] );
$p->set_attribute( 'attr-2', 'a' );
echo $p;

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very good catch. I will update this

Comment on lines 625 to 639
switch ( $this->document[ $this->start_at ] ) {
case '"':
$this->start_at += 1;
$pattern = '~(?P<VALUE>[^"]*)"~Smiu';
break;

case "'":
$this->start_at += 1;
$pattern = "~(?P<VALUE>[^']*)'~Smiu";
break;

default:
$pattern = '~(?P<VALUE>[^\t\x{0a}\x{0c}\x{0d} >]*)~Smiu';
break;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: We can avoid the repetition:

Suggested change
switch ( $this->document[ $this->start_at ] ) {
case '"':
$this->start_at += 1;
$pattern = '~(?P<VALUE>[^"]*)"~Smiu';
break;
case "'":
$this->start_at += 1;
$pattern = "~(?P<VALUE>[^']*)'~Smiu";
break;
default:
$pattern = '~(?P<VALUE>[^\t\x{0a}\x{0c}\x{0d} >]*)~Smiu';
break;
}
$quote_maybe = $this->document[ $this->start_at ];
switch ( $quote_maybe ) {
case '"':
case "'":
$this->start_at += 1;
$pattern = "~(?P<VALUE>[^$quote_maybe]*)$quote_maybe~Smiu";
break;
default:
$pattern = '~(?P<VALUE>[^\t\x{0a}\x{0c}\x{0d} >]*)~Smiu';
break;
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was comfortable with the repetition for the purpose of searchability within the code and also of having a static PCRE pattern. I've also considered taking different branching here instead of PCRE searches but didn't get to that yet.

Comment on lines 599 to 607
if ( 1 !== preg_match(
'~[\t\x{0a}\x{0c}\x{0d} ]*(?P<NAME>=?[^=/>\t\x{0C} ]*)~Smiu',
$this->document,
$attribute_match,
PREG_OFFSET_CAPTURE,
$this->start_at
) ) {
return false;
}
Copy link
Contributor

@adamziel adamziel Jul 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With three occurences of preg_match this is mostly a matter of taste, but FYI in #42485 I used a shorthand:

$attribute_match = $this->match( '~[\t\x{0a}\x{0c}\x{0d} ]*(?P<NAME>=?[^=/>\t\x{0C} ]*)~Smiu' );

It's possible because every call to preg_match uses the exact same of arguments aside of the regexp. $this->match also throws a No_Match exception on failure. Feel free to either ignore this or use it as an inspiration.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah as noted in #42485 I was preferring to leave the duplication in to keep the preg_ functions concrete; for searchability and comprehension within the code as well as for better IDE integration.

return false;
}

list( $full_match, $start_at ) = $attribute_match[0];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick – this could take advantage of the token logic:

Suggested change
list( $full_match, $start_at ) = $attribute_match[0];
$full_match = WP_HTML_Scanner_Token::from_preg_match( $attribute_match[0] );
// ...
$this->start_at = $full_match->end; // end would have to be added


$this->start_at = $start_at + strlen( $full_match );
if ( '=' !== $this->document[ $this->start_at ] ) {
return new WP_HTML_Attribute_Token( $name_token, null, $name_token->start, $start_at + strlen( $full_match ) );
Copy link
Contributor

@adamziel adamziel Jul 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar nitpick as above – it would be nice to have an end property:

Suggested change
return new WP_HTML_Attribute_Token( $name_token, null, $name_token->start, $start_at + strlen( $full_match ) );
return new WP_HTML_Attribute_Token( $name_token, null, $name_token->start, $name_token->end );

* lead us to skip over other tags and lose track of our place. So we need to search for
* _every_ tag and then check after we find one if it's the one we are looking for.
*/
"~<!--(?>.*?-->)|<!\[CDATA\[(?>.*?>)|<\?(?>.*?)>|<(?P<TAG>[a-z][^\t\x{0A}\x{0C} /?>]*)~Smui",
Copy link
Contributor

@adamziel adamziel Jul 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doctype seems to be the only other Markup declaration open state case that we didn't account for yet – which is a good news, I worried we'd have to handle many more of these.

* lead us to skip over other tags and lose track of our place. So we need to search for
* _every_ tag and then check after we find one if it's the one we are looking for.
*/
"~<!--(?>.*?-->)|<!\[CDATA\[(?>.*?>)|<\?(?>.*?)>|<(?P<TAG>[a-z][^\t\x{0A}\x{0C} /?>]*)~Smui",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"~<!--(?>.*?-->)|<!\[CDATA\[(?>.*?>)|<\?(?>.*?)>|<(?P<TAG>[a-z][^\t\x{0A}\x{0C} /?>]*)~Smui",
"~<!--(?>.*?-->)|<!\[CDATA\[(?>.*?\]\]>)|<\?(?>.*?)>|<(?P<TAG>[a-z][^\t\x{0A}\x{0C} /?>]*)~Smui",

CDATA section ends with ]]>, not just >.

public function find_next_attribute() {
// Find the attribute name
if ( 1 !== preg_match(
'~[\t\x{0a}\x{0c}\x{0d} ]*(?P<NAME>=?[^=/>\t\x{0C} ]*)~Smiu',
Copy link
Contributor

@adamziel adamziel Jul 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
'~[\t\x{0a}\x{0c}\x{0d} ]*(?P<NAME>=?[^=/>\t\x{0C} ]*)~Smiu',
'~[\t\x{0a}\x{0c} ]*(?P<NAME>=?[^=/>\t\x{0C} ]*)~Smiu',

The Before attribute name state ignores the following characters:

  • U+0009 CHARACTER TABULATION (tab)
  • U+000A LINE FEED (LF)
  • U+000C FORM FEED (FF)
  • U+0020 SPACE

But it does not mention U+000D

public function find_next_attribute() {
// Find the attribute name
if ( 1 !== preg_match(
'~[\t\x{0a}\x{0c}\x{0d} ]*(?P<NAME>=?[^=/>\t\x{0C} ]*)~Smiu',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
'~[\t\x{0a}\x{0c}\x{0d} ]*(?P<NAME>=?[^=/>\t\x{0C} ]*)~Smiu',
'~[\t\x{0a}\x{0c}\x{0d} ]*(?P<NAME>=?[^=/>\t\x{0A}x{0C} ]*)~Smiu',

The Attribute Name State switches to after attribute name on any of the following:

  • U+0009 CHARACTER TABULATION (tab)
  • U+000A LINE FEED (LF)
  • U+000C FORM FEED (FF)
  • U+0020 SPACE
  • U+002F SOLIDUS (/)
  • U+003E GREATER-THAN SIGN (>)
  • EOF

So I added the missing U+000A

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated in a6b7d67


$name_token = WP_HTML_Scanner_Token::from_preg_match( $attribute_match['NAME'] );

$this->start_at = $start_at + strlen( $full_match );
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to ignore the following whitespace characters around the = just as the After attribute name state
and Before attribute value state specify:

  • U+0009 CHARACTER TABULATION (tab)
  • U+000A LINE FEED (LF)
  • U+000C FORM FEED (FF)
  • U+0020 SPACE

This markup correctly defines an attribute attr with a value value:

<div attr= "value" />

However, the processor currently doesn't deal with that as expected:

$p = new WP_HTML_Processor( '<div attr = "old" />' );
$p->find( ['tag_name'=>'span'] );
$p->set_attribute( 'attr', 'updated' );
echo $p;
// <div attr="updated" = "old" />

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't remember why we had to alternations for the leading attribute name = and so I removed it, but now I think this is the very reason, to swallow the post-attribute-equal whitespace. Will update the pattern 👍

break;

default:
$pattern = '~(?P<VALUE>[^\t\x{0a}\x{0c}\x{0d} >]*)~Smiu';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
$pattern = '~(?P<VALUE>[^\t\x{0a}\x{0c}\x{0d} >]*)~Smiu';
$pattern = '~(?P<VALUE>[^\t\x{0a}\x{0c} >]*)~Smiu';

U+000D is not on the list of characters ignored by the Attribute value (unquoted) state. These are:

  • U+0009 CHARACTER TABULATION (tab)
  • U+000A LINE FEED (LF)
  • U+000C FORM FEED (FF)
  • U+0020 SPACE

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated in a6b7d67

dmsnell added 7 commits July 26, 2022 13:38
The modifications are only ever class names and textual patching and
the intermediate class wasn't as useful as it first seemed.

Co-authored by: @adamziel
A class ends up _significantly_ more efficient in terms of both memory
use and (to a lesser extent) runtime performance. In testing with the
HTML5 spec (single-page.html) adding a class and an attribute to every
tag the following was observed on my laptop at the time of writing the
patch:

|       | Runtime | MB before toString |
|-------|---------|--------------------|
| array |  2400ms |                158 |
| class |  2100ms |                  8 |
|  diff |     88% |                 5% |

While not a scientific experiment I repeated this multiple times and
the results stayed within the approximate range in the table above.
…ments.

This avoids doing a full sort at the end of the processing run,
which for large operations substantially reduces runtime.

On the `single-page.html` document, adding a class and attribute
to every tag, this took the runtime from 2100ms to 1400ms, or to
around 67% of the previous runtime.
In testing on `single-page.html` the replacements array went from
using 121 MB to 98 MB, down to 81% of its pre-patch usage. Runtime
performance wasn't notablly different.
@dmsnell
Copy link
Member Author

dmsnell commented Aug 5, 2022

Closing in favor of #42485

@dmsnell dmsnell closed this Aug 5, 2022
@dmsnell dmsnell deleted the add/html-processor branch August 5, 2022 15:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
[Type] Technical Prototype Offers a technical exploration into an idea as an example of what's possible
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants