WIP: Introduce HTML processor for server-side block modification on render #42507

dmsnell · 2022-07-18T14:02:04Z

What

Currently provides some data structures necessary for tracking and modifying
an HTML document. No implementation code exists yet and this is mainly here
to establish system design and documentation.

Alternative implementation to associated #42485

Why?

There exists too many fragile and duplicate backend RegExp matchers to replace
individual HTML elements within block code.

How?

Provides a standard and highly-restricted HTML modification API to separate the
act of finding HTML tokens from replacing or updating HTML values.

Testing Instructions

There is nothing yet to test.

…ender Currently provides some data structures necessary for tracking and modifying an HTML document. No implementation code exists yet and this is mainly here to establish system design and documentation.

…5 spec

gziolo · 2022-07-21T13:30:59Z

lib/experimental/class-wp-html-processor.php

+	 * equivalence of those names (lower-cased, Unicode-normalized, etc...).
+	 * If we're looking for "any tag" then this property will be `null`.
+	 *
+	 * h1...h6 are special since they are variations of the same base tag.


Have you considered using the list of HTML tags instead of a single value for headings? Example:

$processor->find( 'h1,h2,h3,h4' );

considered it yes, though I would have gone ->find( [ 'h1', 'h2', 'h3', 'h4' ] )

so far this has been more of an exercise in discovering how little functionality we can expose while being useful. we discussed the idea of a list and wondered if there was any other HTML tag like the heading tags, where multiple tag names all basically relate to the same tag with a different attribute value. if there are not other cases like that I'd almost prefer not exposing a list until we know for sure we need one.

I can see where it could be useful to say "give me the div or the p with the most-important class, but then again if we limit our search for that class maybe that's enough and we don't have to check the tag.

what do you think?

performance wise I don't think it should be that much of an impact with this current design of scanning the full HTML source, since we have to traverse every tag linearly as we come to it (vs. being able to search for something like <div, which we can't do without opening up vulnerabilities for common cases of HTML produced by the editor).

still, every bit of flexibility makes me nervous unless I can see a clear need for it.

There is also ul vs ol where those tags are used for different styling.

Less related to that would be all the tags with semantic meaning used instead of div like header, main, section, article, aside or footer.

It's also hard to tell how this API would be used with custom blocks and how flexible people expect it to be. If the primary goal is to allow changing the root element of the block content, then using the most-important class would be sufficient in most of cases.

Yeah I think most cases can be handled appropriately with a class. The outermost and first tag can be found with $processor->find( [] ) as well so that's not an obstacle.

I considered ol and ul but thought they generally are different enough in meaning that they don't overlap the same way h1 through h6 do (for example, there was plenty of debate about replacing h1 through h6 with just h and ignoring the numbered heading level when constructing document structure, something the specification allowed/mandated? but was never adopted). The same goes for the other elements; other tags have related semantics, which is often captured in the placement of an element into its content category, but these are distinct elements with distinct semantics. The confusion over whether a certain sub-sub-section should be h3 or h4 and the mess of properly styling them isn't something we face with article and main and aside I believe; I think they are qualitatively different, and again, nobody is confused by whether we should use ol or ul to represent an ordered list, because they are related but different (Gutenberg possibly is the exception in merging both kinds of lists into one block and using an attribute to distinguish them).

So maybe eventually we expand this but I'd like to try and see how far we can get without it. I haven't supported the h exception yet and maybe there's a case to make that we never do.

@gziolo the point about alternative markup depending on the attributes is spot-on! Here's an excerpt from the list block:

const tagName = ordered ? 'ol' : 'ul';

Also, the heading block may use either of the h-tags.

Already at this point we'd need to either introduce special tokens for "ol/ul" and "h1-6", or simply allow an array of tags as an argument. If we do the former, we'll introduce a "silent" dependency between the HTML parser and the block library. I vote for the latter even if it opens other possibilities.

In addition, the following blocks have a tagName attribute:

template-part

comment

group

query

Good point @dmsnell. I was about to counter that it would be more convenient to just query for ['ol','ul'], but it's not very future-proof. I like matching the exact tag as you proposed. In fact, we could do the same thing in the heading block since level is a block attribute as well. Even if it wasn't, we could probably just match the first tag in the markup.

I don't think we necessarily need to support an option to pass an array of tags. @dmsnell made a great point that when rendering the block, attributes help to make an exact match for which tag should be processed.

I left my initial question came from the fact that I didn't fully understand why there is a need for the special h tag. It feels like we can live without both h and [ 'h1', 'h2' ] expressions for now.

Aside. If someone really needs to process all six heading types they can do it with:

$p = new WP_Tag_Thing( $html ); for ( $i = 1; $i <= 6; $i++ ) { while ( $p->find( [ 'tag_name' => 'h' + $i ] ) ) { $p->add_class( 'todo' ); } }

@gziolo not quite, this is "pull" parser which means it only moves forward. find means "Find me the next tag, starting where I finished last". In other words, this would only process the div:

$p = new WP_Tag_Thing('<span><div>'); $p->find('div')->add_class('todo'); $p->find('span')->add_class('todo');

While this would process both tags:

$p = new WP_Tag_Thing('<span><div>'); $p->find('span')->add_class('todo'); $p->find('div')->add_class('todo');

I'd like to name that method find_next, next_tag, next, or something to that effect.

How about the following:

for ( $i = 1; $i <= 6; $i++ ) { $p = new WP_Tag_Thing( $html ); while ( $p->find( [ 'tag_name' => 'h' + $i ] ) ) { $p->add_class( 'todo' ); } $html = (string) $p; }

It's mainly about the general idea. I didn't run the code.

why there is a need for the special h tag

it's not felt yet, but I anticipated it because the h tags are unique in HTML in their purpose and realization. I don't think any other tags work the same way they do where their semantics are contextual (based on the surrounding content vs. based on the tag itself).

If someone really needs to process all six heading types they can do it with:

This isn't actually possible with the interface we've exposed here because we only go once and linearly through the document, so for example, if we look for h1 and there are none in the document, we will never find an h2 if it's there because we would have already scanned to the end.

To find any of the h tags we'd have to re-process the entire document from the beginning until we find one.

for ( $i = 1; $i <= 6; $i++ ) { $p = new WP_Tag_Walker( $html ); if ( $p->find( [ 'tag_name' => "h{$i}" ] ) ) { return $p->add_class( 'todo' ); } }

for small blocks this may not be too bad, but note that if we want chain these together we'd have to sequence them.

$output = $html; foreach ( array( 'ol', 'li' ) as $tag_name ) { $p = new WP_Html_Tag_Modifier( $output ); while ( $p->find( array( 'tag_name' => $tag_name ) ) ) { $p->find( array( 'tag_name' => 'li' ) ); $p->add_class( 'first-item' ); } $output .= $p; } return $output.

adamziel · 2022-07-26T08:24:18Z

lib/experimental/class-wp-html-processor.php

+		$this->modifications->replacements[] = array(
+			$insert_at,
+			$insert_at,
+			' ' . self::serialize_attribute( $name, $value )


I went with esc_attr, but this one should be faster while doing the job just as good. Good call!

I plan on adding esc_attr - serialize_attribute here is somewhat of a placeholder with a name so we can go back and refine that concept. I wanted to also make some checks about URL fields and other encoding issues, for example, all of which I think would go under serialize_attribute

adamziel · 2022-07-26T08:24:56Z

lib/experimental/class-wp-html-processor.php

+
+
+	public function add_class( $class_name ) {
+		$this->modifications->class_changes[ $class_name ] = true;


I used a similar data structure in #42485! :-)

adamziel · 2022-07-26T08:26:52Z

lib/experimental/class-wp-html-processor.php

+	}
+
+
+	public function set_attribute( $name, $value ) {


I intentionally removed the distinction between set and add attribute in #42485 – this distinction seems more confusing than useful the consumer of this API. I'd remove the add_attribute method or at least make it private.

I don't intend for people to use it, but I generally avoid creating private methods and properties unless I know for certain that there's a compelling reason to do so. just practice I've learned over time of almost always regretting making things private.

I normally don't like making things private as well. However, in this case we're talking about new public APIs and I'd rather avoid locking ourselves into any specific implementation. The more public methods we expose, the more constraints we'll have when iterating on it in the future.

adamziel · 2022-07-26T08:36:09Z

lib/experimental/class-wp-html-processor.php

+}
+
+
+class WP_HTML_Scanner {


I'd merge this and WP_HTML_Processor into a single class.

True, scanning is conceptually a different operation than updating the HTML. However, they are related closely enough to require public access to the properties of this class.

A single class = minimal public API. And if we need a separate Scanner one day, it would be easy to carve one out from the monolith.

adamziel · 2022-07-26T08:40:09Z

lib/experimental/class-wp-html-processor.php

+		/** @var WP_HTML_Attribute_Token $existing */
+		$existing = $this->scanner->attributes[ $name ];
+
+		$this->modifications->replacements[] = array(


$this->modifications->replacements reads nice, but demands introducing a WP_Tag_Modifications class only to hold two variables. We don't pass $this->modifications around, only refer to it inside of the WP_HTML_Processor. Therefore, a separate class doesn't seem that useful. I'd make $replacements and $class_changes a part of WP_HTML_Processor.

adamziel · 2022-07-26T08:41:44Z

lib/experimental/class-wp-html-processor.php

+
+
+	public function remove_class( $class_name ) {
+		$this->modifications->class_changes[ $class_name ] = false;


adamziel · 2022-07-26T08:51:58Z

lib/experimental/class-wp-html-processor.php

+
+
+	public function remove_attribute( $name ) {
+		if ( ! array_key_exists( $name, $this->scanner->attributes ) ) {


This always consumes all the attributes for the tag. I made #42485 only consume as many attributes as needed. That being said, the speed gains are likely negligible and your approach makes for a more readable code 👍

yeah I want to add that lazy behavior back in at some point. when I was building this I took the shortcut to make several of the functions easier.

adamziel · 2022-07-26T08:55:48Z

lib/experimental/class-wp-html-processor.php

+		$this->commit_class_changes();
+		$this->descriptor = WP_Tag_Find_Descriptor::parse( $query );
+		if ( false === $this->scanner->scan( $this->descriptor ) ) {
+			return false;


return false here breaks the fluid interface idea and forces the consumer to always check the return value. I'm a bit bummed to see calls like $p->find()->set_attribute()->find()... go, but I guess it's the only reasonable approach. As a consumer, I wouldn't want to chain a few find() calls only to have them all fail because the first one didn't find anything.

I'm not completely attached to the fluid interface. I could give it up if we need because I think it's valuable to use a find() inside a loop as the termination indicator.

$number = 1; while ( $p->find( [ 'tag_name' => 'section' ] ) ) { $p->set_attribute( 'data-section-id', $number++ ); }

I believe this is our choice:

A succinct fluid interface that will sometimes cause unexpected errors

A slightly more verbose imperative interface that won't have cause unexpected errors

And in this case, I feel comfortable letting go of the fluid interface.

I feel comfortable letting go of the fluid interface.

That seems fine, and while fluid interfaces have their own love, they also have their own controversy. At least for what it's worth I feel like a lot of PHP code follows this idiomatic pattern which looks like a fluid interface without being one.

$x = 'some value'; $x = some_tranform( $x ); $x = another_thing( $x, $y ); $x = with_more_intensity( $x ); $p = new GoGetter(); $p->go_get_ore(); $p->refine_it(); $p->shake_it(); $p->schmelt_it(); $p->dump_it();

(the pattern being an obvious consecutive repetition of assignment-to or calling from the same variable)

adamziel · 2022-07-26T08:58:52Z

lib/experimental/class-wp-html-processor.php

+	}
+
+
+	public function find( $query ) {


I feel strongly about calling this find_next, otherwise the name suggests I could do this:

$p = new WP_HTML_Processor('<div/><span/>'); $p->find(['tag_name'=>'span'])->set_attribute('id', 'span_id'); $p->find(['tag_name'=>'div'])->set_attribute('id', 'div_id');

Which is explicitly not supported. The documentation may explain it, but suggesting these nuances in the method names will save a lot of confusion.

I'm confused because both of those finds will fail. Are you saying that to you, the name find() implies that each time it starts it will rewind to the beginning of the document?

Why would the first find fail? It finds the first span:

$p = new WP_HTML_Processor('<div/><span/>'); $p->find(['tag_name'=>'span'])->set_attribute('id', 'span_id'); echo $p; // <div/><span id="span_id"/>%

Are you saying that to you, the name find() implies that each time it starts it will rewind to the beginning of the document?

Yes, to me it does imply that.

Why would the first find fail? It finds the first span:

because in my head I read </div></span> and not what you wrote 🤦‍♂️

Yes, to me it does imply that.

Let's maybe not settle on this yet? I agree mostly with you, but also find find() quite reasonable for this purpose.

We could keep it open for now, start the public API proposal, and see what others say. Maybe we'll even find a third option.

Out of curiosity I googled for "XML pull parser," which is related to what we do, and found xmlpull.org that has a method called next().

Perhaps next_tag( $query ) wouldn't be a bad compromise?

next() sounds pretty good actually based on how we use it, and I don't think we need _tag in there especially if we end up with tag in the class name, which I think seems more and more appropriate as we go since this is so focused on tag openers alone

adamziel · 2022-07-26T09:00:17Z

lib/experimental/class-wp-html-processor.php

+	}
+
+
+	public function commit_class_changes() {


Suggested change

public function commit_class_changes() {

private function commit_class_changes() {

This doesn't have to be exposed as a public method.

adamziel · 2022-07-26T09:01:07Z

lib/experimental/class-wp-html-processor.php

+	}
+
+
+	public function apply() {


Suggested change

public function apply() {

private function apply() {

This doesn't have to be exposed as a public method, our public API is __toString().

adamziel · 2022-07-26T09:05:51Z

lib/experimental/class-wp-html-processor.php

+		$existing_class = array_key_exists( 'class', $this->scanner->attributes )
+			? substr(
+				$this->scanner->document,
+				$this->scanner->attributes['class']->value->start,
+				$this->scanner->attributes['class']->value->length
+			)
+			: '';


I went for storing values in #42485, but that's only useful in case of class names so storing just indices sounds better 👍

I want to get rid of the value token and instead just store start and end of the attribute value itself as well as the location where that's found (including the quotes)

and I've considered that for all tokens as well to avoid allocating more objects. I'm still puzzled by why we're getting any out of memory exceptions when running on big objects

Weirdly, #42485 uses less memory despite doing pretty much the same thing as this PR. Perhaps the memory errors here are due to some very specific weird quirk? If you can share reproduction steps, I'll see if I can pinpoint it.

for historical reference: the memory use came from the fact that I was storing modifications as a three-array of [int, int, string]. when I converted to a class with start, end, and text properties the memory use dropped to 5% of what it was with the array.

adamziel · 2022-07-26T09:14:20Z

lib/experimental/class-wp-html-processor.php

+					 array_key_exists( $comparable_name, $this->modifications->class_changes ) &&
+					 false === $this->modifications->class_changes[ $comparable_name ]


Suggested change

array_key_exists( $comparable_name, $this->modifications->class_changes ) &&

false === $this->modifications->class_changes[ $comparable_name ]

array_key_exists( $comparable_name, $this->modifications->class_changes )

We can remove all the classes mentioned in class_changes. This way, the code that adds the new classes a few lines later won't introduce duplicate names. Duplicates are fine, but why not avoid them if we can?

you raise an interesting point. that's very true and a good idea to do, but also, I'd like to avoid changing any HTML we don't have to. that means this code as written does neither.

but, we could remember the names we scan when removing and then only add in the ones which weren't there.

$existing_classes = array(); $without_removed_classes = preg_replace_callback( …, function ( $matches ) use ( &$existing_classes ) { … $existing_classes[ $comparable_name ] = true; … } ); foreach ( $this->modifications->class_changes as $class => $should_include ) { if ( $should_include && ! array_key_exists( $existing_classes, $class ) ) { $new_class .= " {$class}"; } }

I think either case (removing all classes preemptively or only adding the missing ones) would work. From a consumer point of view I like the idea of leaving existing classes where they are if all we care about is adding them.

I like leaving the existing classes where they are – let's remember the names we scan then!

Ah – and it would also be great to remove the class attribute entirely if there are no classes left.

Ah – and it would also be great to remove the class attribute entirely if there are no classes left.
an even better idea!

adamziel · 2022-07-26T09:16:44Z

lib/experimental/class-wp-html-processor.php

+	public function apply() {
+		$document = $this->scanner->document;
+
+		$this->commit_class_changes();


Nitpick: commit sounds like applying class changes to the HTML. How about transform_class_changes_to_modifications?

I started with collapse but then favored commit since I was trying to say, "at this point no more class-based changes are allowed"

transform_class_changes_to_modifications is indeed descriptive, but also very long and verbose 🤔

Long and verbose is fine if it tells you what it does. It's called just once so it won't unnecessarily pollute the codebase.

Long and verbose is fine if it tells you what it does.

Except when it's so long and verbose it's distracting :)

adamziel · 2022-07-26T09:25:44Z

lib/experimental/class-wp-html-processor.php

+			if ( $a[0] === $b[0] ) {
+				return 0;
+			}
+
+			return $a[0] < $b[0] ? -1 : 1;


Suggested change

if ( $a[0] === $b[0] ) {

return 0;

}

return $a[0] < $b[0] ? -1 : 1;

return $b[0] - $a[0];

very good! should have seen that.

adamziel · 2022-07-26T09:32:50Z

lib/experimental/class-wp-html-processor.php

+
+
+	public function __toString() {
+		return $this->apply();


I'd cache this result. From there, we could do two things:

Check if there were any modifications requested since the last caching

Introduce a "locked"/"finished" state in which you cannot apply any further updates anymore

not something I want to do yet for the sake of simplicity of the code, but definitely something I thought about and planned on introducing at some point (if it still makes sense then)

adamziel · 2022-07-26T09:39:26Z

lib/experimental/class-wp-html-processor.php

+	 * }
+	 * @return WP_Tag_Find_Descriptor Used by WP_HTML_Processor when scanning HTML.
+	 */
+	public static function parse( $query ) {


Could this be a constructor instead of a static method?

probably, but at the time I wasn't sure if we'd want to create them through other means. I think there are still some questions I have about the role here. for instance, I do think if this were a construction I'd want to still call it like so

public function __construct( $query = array() ) { $parsed = self::parse( $query ); $this->tag_name = $parsed->tag_name; … }

and at that point I don't see what the value is in having the constructor. my original intent was to make the parse function failable, and maybe there's still a point in doing that, but for now it fails to safe defaults silently.

if this is the only way we'd want to create a descriptor then maybe we just convert it to a constructor, but I think for some purposes we might imagine creating one manually

$d = new Descriptor(); $d->tag_name = '<invalid';

adamziel · 2022-07-26T09:41:05Z

lib/experimental/class-wp-html-processor.php

+	 *
+	 * @var array<string|callable>
+	 */
+	public $data_attribute;


I'm not sure if this is an actual use-case. We may be able to remove the data attribute matching – let's keep an eye on it.

I think that's right. part of me still thinks it might be more common than we anticipate, but we should try and do some discovery and see how common this is in practice with existing plugins, if we can

In general, I'm vouching for adding features later instead of removing them – it's easier, especially with WordPress BC policy. A discovery is a good idea 👍

yeah that's a good plan. I'll remove references to it.

eventually I thought it could make sense if we have to go beyond tag name and class simply to add a function for the visitor. every day this looks more and more like a traditional visitor pattern and we could simply allow for that idiom. if we have to scan the entire document up until a termination we wouldn't be losing much by calling a callback on every tag anyway

function render_my_block( $html, $block ) { return WP_Tag_Visitor( $html, function ( $tag ) use ( $block ) { if ( ! $tag->is( 'div' ) ) { return; } $tag->add_class( 'special-image-wrapper' ); if ( $block['attrs']['isDisabled'] ) { $tag->set_attribute( 'disabled', '' ); } return 'done'; } ); }

that's far less declarative but also way more flexible. come to think of it, this could be a way to "parse HTML" without parsing HTML, because eventually we could do something like expose opening and closing tag events, plus depth. I think that's like how XmlReader and XmlWriter work as a streaming parser. If someone wants to track DOM structure they can do so in the application code space and, for example, only track as much of it as they want. for instance, maybe they only want to find the first img inside a div - they can do that while ignoring all other document structure 🤔

This is great, let's explore that!

@dmsnell, the idea you shared with all the opportunities that WP_Tag_Visitor could open is excellent. It would give maximum flexibility with minimal changes in the current implementation.

adamziel · 2022-07-26T09:41:52Z

lib/experimental/class-wp-html-processor.php

+	public $tag_name;
+
+
+	/**
+	 * We're looking for a tag also containing this CSS class name, up to
+	 * the comparable equivalence of those names. If we're not looking for
+	 * a class name this property will be `null`.
+	 *
+	 * @var string|null
+	 */
+	public $class_name;
+
+
+	/**
+	 * We're looking for a tag that also has some data-attribute values.
+	 * If all we care about is the presence of one of those values this
+	 * will contain the name of that attribute, up to the comparable
+	 * equivalence of that name. If we also want to constrain the value
+	 * of that attribute we will have a pair containing the name of the
+	 * attribute (to equivalence) with a predicate function. If we're
+	 * not looking for a data-attribute this property will be `null`.
+	 *
+	 * Example:
+	 * <code>
+	 *     // has a `data-wp-container` attribute
+	 *     'data-wp-container'
+	 *
+	 *     // has a `data-wp-container` attribute with the value `core/group`
+	 *     ['data-wp-block', 'core/group']
+	 *
+	 *     // has a `data-test-group` attribute with an even-valued id
+	 *     ['data-test-group', function ( $group ) { return 0 === (int) $group % 2; }]
+	 * </code>
+	 *
+	 * @var array<string|callable>
+	 */
+	public $data_attribute;
+
+
+	/**
+	 * Used to skip matches in case we expect more than one to exist.
+	 * This constraint applies after all other constraints have held.
+	 * For example, to find the third `<div>` tag containing the
+	 * `wp-block` class name set `match_offset = 3`. If not provided
+	 * the default indication is to find the first match.
+	 *
+	 * @default 1
+	 * @var int
+	 */
+	public $match_offset;


Suggested change

public $tag_name;

/**

* We're looking for a tag also containing this CSS class name, up to

* the comparable equivalence of those names. If we're not looking for

* a class name this property will be `null`.

*

* @var string|null

*/

public $class_name;

/**

* We're looking for a tag that also has some data-attribute values.

* If all we care about is the presence of one of those values this

* will contain the name of that attribute, up to the comparable

* equivalence of that name. If we also want to constrain the value

* of that attribute we will have a pair containing the name of the

* attribute (to equivalence) with a predicate function. If we're

* not looking for a data-attribute this property will be `null`.

*

* Example:

* <code>

* // has a `data-wp-container` attribute

* 'data-wp-container'

*

* // has a `data-wp-container` attribute with the value `core/group`

* ['data-wp-block', 'core/group']

*

* // has a `data-test-group` attribute with an even-valued id

* ['data-test-group', function ( $group ) { return 0 === (int) $group % 2; }]

* </code>

*

* @var array<string|callable>

*/

public $data_attribute;

/**

* Used to skip matches in case we expect more than one to exist.

* This constraint applies after all other constraints have held.

* For example, to find the third `<div>` tag containing the

* `wp-block` class name set `match_offset = 3`. If not provided

* the default indication is to find the first match.

*

* @default 1

* @var int

*/

public $match_offset;

private $tag_name;

/**

* We're looking for a tag also containing this CSS class name, up to

* the comparable equivalence of those names. If we're not looking for

* a class name this property will be `null`.

*

* @var string|null

*/

private $class_name;

/**

* We're looking for a tag that also has some data-attribute values.

* If all we care about is the presence of one of those values this

* will contain the name of that attribute, up to the comparable

* equivalence of that name. If we also want to constrain the value

* of that attribute we will have a pair containing the name of the

* attribute (to equivalence) with a predicate function. If we're

* not looking for a data-attribute this property will be `null`.

*

* Example:

* <code>

* // has a `data-wp-container` attribute

* 'data-wp-container'

*

* // has a `data-wp-container` attribute with the value `core/group`

* ['data-wp-block', 'core/group']

*

* // has a `data-test-group` attribute with an even-valued id

* ['data-test-group', function ( $group ) { return 0 === (int) $group % 2; }]

* </code>

*

* @var array<string|callable>

*/

private $data_attribute;

/**

* Used to skip matches in case we expect more than one to exist.

* This constraint applies after all other constraints have held.

* For example, to find the third `<div>` tag containing the

* `wp-block` class name set `match_offset = 3`. If not provided

* the default indication is to find the first match.

*

* @default 1

* @var int

*/

private $match_offset;

I'd make all the properties private to limit the public API surface.

adamziel · 2022-07-26T09:48:02Z

lib/experimental/class-wp-html-processor.php

+	 * @param $preg_match
+	 * @return self
+	 */
+	public static function from_preg_match( $preg_match ) {


Great shorthand

adamziel · 2022-07-26T09:51:14Z

lib/experimental/class-wp-html-processor.php

+			PREG_OFFSET_CAPTURE,
+			$this->start_at
+		) ) {
+			return false;


We need to reset the state if there's no match:

Suggested change

return false;

$this->start_at = null;

$this->tag_name = null;

$this->tag_start = null;

$this->attributes = array();

return false;

Otherwise the following code adds two attributes to the span:

$p = new WP_HTML_Processor( '<span/>' ); $p->find( ['tag_name'=>'span'] ); $p->set_attribute( 'attr-1', 'a' ); $p->find( ['tag_name'=>'div'] ); $p->set_attribute( 'attr-2', 'a' ); echo $p;

very good catch. I will update this

adamziel · 2022-07-26T09:52:47Z

lib/experimental/class-wp-html-processor.php

+		switch ( $this->document[ $this->start_at ] ) {
+			case '"':
+				$this->start_at += 1;
+				$pattern = '~(?P<VALUE>[^"]*)"~Smiu';
+				break;
+
+			case "'":
+				$this->start_at += 1;
+				$pattern = "~(?P<VALUE>[^']*)'~Smiu";
+				break;
+
+			default:
+				$pattern = '~(?P<VALUE>[^\t\x{0a}\x{0c}\x{0d} >]*)~Smiu';
+				break;
+		}


Nit: We can avoid the repetition:

Suggested change

switch ( $this->document[ $this->start_at ] ) {

case '"':

$this->start_at += 1;

$pattern = '~(?P<VALUE>[^"]*)"~Smiu';

break;

case "'":

$this->start_at += 1;

$pattern = "~(?P<VALUE>[^']*)'~Smiu";

break;

default:

$pattern = '~(?P<VALUE>[^\t\x{0a}\x{0c}\x{0d} >]*)~Smiu';

break;

}

$quote_maybe = $this->document[ $this->start_at ];

switch ( $quote_maybe ) {

case '"':

case "'":

$this->start_at += 1;

$pattern = "~(?P<VALUE>[^$quote_maybe]*)$quote_maybe~Smiu";

break;

default:

$pattern = '~(?P<VALUE>[^\t\x{0a}\x{0c}\x{0d} >]*)~Smiu';

break;

}

I was comfortable with the repetition for the purpose of searchability within the code and also of having a static PCRE pattern. I've also considered taking different branching here instead of PCRE searches but didn't get to that yet.

adamziel · 2022-07-26T09:56:14Z

lib/experimental/class-wp-html-processor.php

+		if ( 1 !== preg_match(
+			'~[\t\x{0a}\x{0c}\x{0d} ]*(?P<NAME>=?[^=/>\t\x{0C} ]*)~Smiu',
+			$this->document,
+			$attribute_match,
+			PREG_OFFSET_CAPTURE,
+			$this->start_at
+		) ) {
+			return false;
+		}


With three occurences of preg_match this is mostly a matter of taste, but FYI in #42485 I used a shorthand:

$attribute_match = $this->match( '~[\t\x{0a}\x{0c}\x{0d} ]*(?P<NAME>=?[^=/>\t\x{0C} ]*)~Smiu' );

It's possible because every call to preg_match uses the exact same of arguments aside of the regexp. $this->match also throws a No_Match exception on failure. Feel free to either ignore this or use it as an inspiration.

yeah as noted in #42485 I was preferring to leave the duplication in to keep the preg_ functions concrete; for searchability and comprehension within the code as well as for better IDE integration.

adamziel · 2022-07-26T10:00:48Z

lib/experimental/class-wp-html-processor.php

+			return false;
+		}
+
+		list( $full_match, $start_at ) = $attribute_match[0];


Nitpick – this could take advantage of the token logic:

Suggested change

list( $full_match, $start_at ) = $attribute_match[0];

$full_match = WP_HTML_Scanner_Token::from_preg_match( $attribute_match[0] );

// ...

$this->start_at = $full_match->end; // end would have to be added

adamziel · 2022-07-26T10:01:41Z

lib/experimental/class-wp-html-processor.php

+
+		$this->start_at = $start_at + strlen( $full_match );
+		if ( '=' !== $this->document[ $this->start_at ] ) {
+			return new WP_HTML_Attribute_Token( $name_token, null, $name_token->start, $start_at + strlen( $full_match ) );


Similar nitpick as above – it would be nice to have an end property:

Suggested change

return new WP_HTML_Attribute_Token( $name_token, null, $name_token->start, $start_at + strlen( $full_match ) );

return new WP_HTML_Attribute_Token( $name_token, null, $name_token->start, $name_token->end );

adamziel · 2022-07-26T10:10:58Z

lib/experimental/class-wp-html-processor.php

+			 * lead us to skip over other tags and lose track of our place. So we need to search for
+			 * _every_ tag and then check after we find one if it's the one we are looking for.
+			 */
+	"~<!--(?>.*?-->)|<!\[CDATA\[(?>.*?>)|<\?(?>.*?)>|<(?P<TAG>[a-z][^\t\x{0A}\x{0C} /?>]*)~Smui",


Doctype seems to be the only other Markup declaration open state case that we didn't account for yet – which is a good news, I worried we'd have to handle many more of these.

adamziel · 2022-07-26T10:14:35Z

lib/experimental/class-wp-html-processor.php

+			 * lead us to skip over other tags and lose track of our place. So we need to search for
+			 * _every_ tag and then check after we find one if it's the one we are looking for.
+			 */
+	"~<!--(?>.*?-->)|<!\[CDATA\[(?>.*?>)|<\?(?>.*?)>|<(?P<TAG>[a-z][^\t\x{0A}\x{0C} /?>]*)~Smui",


Suggested change

"~)|<!\[CDATA\[(?>.*?>)|<\?(?>.*?)>|<(?P<TAG>[a-z][^\t\x{0A}\x{0C} /?>]*)~Smui",

"~)|<!\[CDATA\[(?>.*?\]\]>)|<\?(?>.*?)>|<(?P<TAG>[a-z][^\t\x{0A}\x{0C} /?>]*)~Smui",

CDATA section ends with ]]>, not just >.

adamziel · 2022-07-26T10:17:18Z

lib/experimental/class-wp-html-processor.php

+	public function find_next_attribute() {
+		// Find the attribute name
+		if ( 1 !== preg_match(
+			'~[\t\x{0a}\x{0c}\x{0d} ]*(?P<NAME>=?[^=/>\t\x{0C} ]*)~Smiu',


Suggested change

'~[\t\x{0a}\x{0c}\x{0d} ]*(?P<NAME>=?[^=/>\t\x{0C} ]*)~Smiu',

'~[\t\x{0a}\x{0c} ]*(?P<NAME>=?[^=/>\t\x{0C} ]*)~Smiu',

The Before attribute name state ignores the following characters:

U+0009 CHARACTER TABULATION (tab)

U+000A LINE FEED (LF)

U+000C FORM FEED (FF)

U+0020 SPACE

But it does not mention U+000D

adamziel · 2022-07-26T10:19:20Z

lib/experimental/class-wp-html-processor.php

+	public function find_next_attribute() {
+		// Find the attribute name
+		if ( 1 !== preg_match(
+			'~[\t\x{0a}\x{0c}\x{0d} ]*(?P<NAME>=?[^=/>\t\x{0C} ]*)~Smiu',


Suggested change

'~[\t\x{0a}\x{0c}\x{0d} ]*(?P<NAME>=?[^=/>\t\x{0C} ]*)~Smiu',

'~[\t\x{0a}\x{0c}\x{0d} ]*(?P<NAME>=?[^=/>\t\x{0A}x{0C} ]*)~Smiu',

The Attribute Name State switches to after attribute name on any of the following:

U+0009 CHARACTER TABULATION (tab)

U+000A LINE FEED (LF)

U+000C FORM FEED (FF)

U+0020 SPACE

U+002F SOLIDUS (/)

U+003E GREATER-THAN SIGN (>)

EOF

So I added the missing U+000A

updated in a6b7d67

adamziel · 2022-07-26T10:24:58Z

lib/experimental/class-wp-html-processor.php

+
+		$name_token = WP_HTML_Scanner_Token::from_preg_match( $attribute_match['NAME'] );
+
+		$this->start_at = $start_at + strlen( $full_match );


This needs to ignore the following whitespace characters around the = just as the After attribute name state
and Before attribute value state specify:

U+0009 CHARACTER TABULATION (tab)

U+000A LINE FEED (LF)

U+000C FORM FEED (FF)

U+0020 SPACE

This markup correctly defines an attribute attr with a value value:

<div attr= "value" />

However, the processor currently doesn't deal with that as expected:

$p = new WP_HTML_Processor( '<div attr = "old" />' ); $p->find( ['tag_name'=>'span'] ); $p->set_attribute( 'attr', 'updated' ); echo $p; // <div attr="updated" = "old" />

I couldn't remember why we had to alternations for the leading attribute name = and so I removed it, but now I think this is the very reason, to swallow the post-attribute-equal whitespace. Will update the pattern 👍

adamziel · 2022-07-26T10:26:08Z

lib/experimental/class-wp-html-processor.php

+				break;
+
+			default:
+				$pattern = '~(?P<VALUE>[^\t\x{0a}\x{0c}\x{0d} >]*)~Smiu';


Suggested change

$pattern = '~(?P<VALUE>[^\t\x{0a}\x{0c}\x{0d} >]*)~Smiu';

$pattern = '~(?P<VALUE>[^\t\x{0a}\x{0c} >]*)~Smiu';

U+000D is not on the list of characters ignored by the Attribute value (unquoted) state. These are:

U+0009 CHARACTER TABULATION (tab)

U+000A LINE FEED (LF)

U+000C FORM FEED (FF)

U+0020 SPACE

updated in a6b7d67

@adamziel

Co-authored by: @adamziel

@adamziel

The modifications are only ever class names and textual patching and the intermediate class wasn't as useful as it first seemed. Co-authored by: @adamziel

A class ends up _significantly_ more efficient in terms of both memory use and (to a lesser extent) runtime performance. In testing with the HTML5 spec (single-page.html) adding a class and an attribute to every tag the following was observed on my laptop at the time of writing the patch: | | Runtime | MB before toString | |-------|---------|--------------------| | array | 2400ms | 158 | | class | 2100ms | 8 | | diff | 88% | 5% | While not a scientific experiment I repeated this multiple times and the results stayed within the approximate range in the table above.

…ments. This avoids doing a full sort at the end of the processing run, which for large operations substantially reduces runtime. On the `single-page.html` document, adding a class and attribute to every tag, this took the runtime from 2100ms to 1400ms, or to around 67% of the previous runtime.

In testing on `single-page.html` the replacements array went from using 121 MB to 98 MB, down to 81% of its pre-patch usage. Runtime performance wasn't notablly different.

dmsnell · 2022-08-05T15:55:25Z

Closing in favor of #42485

WIP: Introduce HTML processor for server-side block modification on r…

c197fed

…ender Currently provides some data structures necessary for tracking and modifying an HTML document. No implementation code exists yet and this is mainly here to establish system design and documentation.

dmsnell requested review from adamziel and gziolo July 18, 2022 14:02

dmsnell added 2 commits July 18, 2022 16:47

Start adding high-level functions to implementation.

4416714

Various updates, some implementation code

687bc33

dmsnell mentioned this pull request Jul 19, 2022

[Heading block] Add a wp-block-heading CSS class #42122

Merged

dmsnell added 3 commits July 20, 2022 15:56

More implementation work on the tokenizer

46d3de0

Fix some missing links in order to get the scanner to run

9de8a46

Fix some bugs discovered while processing the single-page.html HTML…

2c3a48f

…5 spec

gziolo reviewed Jul 21, 2022

View reviewed changes

gziolo added the [Type] Technical Prototype Offers a technical exploration into an idea as an example of what's possible label Jul 21, 2022

dmsnell added 5 commits July 21, 2022 21:23

Add RegExp flags, force strict type check (fix earlier typo)

2d17de1

add modification functions

4f0a7ec

Commit class changes before moving to next tag with find()

d6830e2

More small updates, comments, static function speedup

b2ef9d5

Remove unnecessary tracking information

6248ef4

adamziel reviewed Jul 26, 2022

View reviewed changes

dmsnell mentioned this pull request Jul 26, 2022

WP_HTML_Tag_Processor: Inject dynamic data to block HTML markup in PHP #42485

Merged

dmsnell added 7 commits July 26, 2022 13:38

Early-abort when applying changes if there are none.

67662d0

Fix HTML patterns from PR review

a6b7d67

Co-authored by: @adamziel

Remove $this->modifications indirection

5eb0e7e

The modifications are only ever class names and textual patching and the intermediate class wasn't as useful as it first seemed. Co-authored by: @adamziel

Hide functions that don't have a need for a public interface

0e79668

Store replacements in SplFixedArray to avoid memory bloat

54cfe37

In testing on `single-page.html` the replacements array went from using 121 MB to 98 MB, down to 81% of its pre-patch usage. Runtime performance wasn't notablly different.

dmsnell closed this Aug 5, 2022

dmsnell deleted the add/html-processor branch August 5, 2022 15:55



		public function add_class( $class_name ) {
		$this->modifications->class_changes[ $class_name ] = true;



		public function remove_class( $class_name ) {
		$this->modifications->class_changes[ $class_name ] = false;



		public function remove_attribute( $name ) {
		if ( ! array_key_exists( $name, $this->scanner->attributes ) ) {

	public function commit_class_changes() {
	private function commit_class_changes() {

		array_key_exists( $comparable_name, $this->modifications->class_changes ) &&
		false === $this->modifications->class_changes[ $comparable_name ]

-			return false;
+			$this->start_at = null;
+			$this->tag_name = null;
+			$this->tag_start = null;
+			$this->attributes = array();
+			return false;

-		list( $full_match, $start_at ) = $attribute_match[0];
+		$full_match =  WP_HTML_Scanner_Token::from_preg_match( $attribute_match[0] );
+		// ...
+		$this->start_at = $full_match->end;  // end would have to be added

	return new WP_HTML_Attribute_Token( $name_token, null, $name_token->start, $start_at + strlen( $full_match ) );
	return new WP_HTML_Attribute_Token( $name_token, null, $name_token->start, $name_token->end );

	"~<!--(?>.?-->)\|<!\[CDATA\[(?>.?>)\|<\?(?>.?)>\|<(?P<TAG>[a-z][^\t\x{0A}\x{0C} /?>])~Smui",
	"~<!--(?>.?-->)\|<!\[CDATA\[(?>.?\]\]>)\|<\?(?>.?)>\|<(?P<TAG>[a-z][^\t\x{0A}\x{0C} /?>])~Smui",

	'~[\t\x{0a}\x{0c}\x{0d} ](?P<NAME>=?[^=/>\t\x{0C} ])~Smiu',
	'~[\t\x{0a}\x{0c} ](?P<NAME>=?[^=/>\t\x{0C} ])~Smiu',


		$name_token = WP_HTML_Scanner_Token::from_preg_match( $attribute_match['NAME'] );

		$this->start_at = $start_at + strlen( $full_match );

	$pattern = '~(?P<VALUE>[^\t\x{0a}\x{0c}\x{0d} >]*)~Smiu';
	$pattern = '~(?P<VALUE>[^\t\x{0a}\x{0c} >]*)~Smiu';

WIP: Introduce HTML processor for server-side block modification on render #42507

WIP: Introduce HTML processor for server-side block modification on render #42507

Conversation

dmsnell commented Jul 18, 2022

What

Why?

How?

Testing Instructions

gziolo Jul 21, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

adamziel Jul 26, 2022 • edited Loading

Choose a reason for hiding this comment

adamziel Jul 26, 2022 • edited Loading

Choose a reason for hiding this comment

gziolo Jul 27, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gziolo Jul 27, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

adamziel Jul 26, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

adamziel Jul 26, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dmsnell Jul 28, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

adamziel Jul 26, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

adamziel Jul 26, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

adamziel Jul 26, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

adamziel Jul 26, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

adamziel Jul 26, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gziolo Jul 21, 2022 •

edited

Loading

adamziel Jul 26, 2022 •

edited

Loading

adamziel Jul 26, 2022 •

edited

Loading

gziolo Jul 27, 2022 •

edited

Loading

gziolo Jul 27, 2022 •

edited

Loading

adamziel Jul 26, 2022 •

edited

Loading

adamziel Jul 26, 2022 •

edited

Loading

dmsnell Jul 28, 2022 •

edited

Loading

adamziel Jul 26, 2022 •

edited

Loading

adamziel Jul 26, 2022 •

edited

Loading

adamziel Jul 26, 2022 •

edited

Loading

adamziel Jul 26, 2022 •

edited

Loading

adamziel Jul 26, 2022 •

edited

Loading

adamziel Jul 26, 2022 •

edited

Loading