Tag Processor: Add bookmark system for tracking semantic locations in document

It can be helpful to track a location in an HTML document while updates
are being made to it such that we can instruct the Tag Processor to seek
to the location of one of the bookmarks.

In this patch we're introducing a bookmarks system to do just that.
Bookmarks are referenced by name and handled internally by a tracking
object which will follow all updates made to the document. It will be
possible to rewind or jump around a document by setting a bookmark and
then calling `seek( $bookmark_name )` to move there.
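
As a rough illustration of what "following updates" means, the sketch below models a bookmark as a pair of byte offsets and shifts it when text is inserted earlier in the document. The `shift_span()` helper, the array-based span, and the inclusive end offset are assumptions made for this standalone example; they are not part of the Tag Processor API.

```php
<?php
/**
 * Hypothetical helper: shift a [start, end] span after the bytes
 * $at..($at + $removed) of a document are replaced with $inserted.
 * Offsets at or after the edit point move by the length difference.
 */
function shift_span( array $span, $at, $removed, $inserted ) {
    $delta = strlen( $inserted ) - $removed;

    if ( $span['start'] >= $at ) {
        $span['start'] += $delta;
    }
    if ( $span['end'] >= $at ) {
        $span['end'] += $delta;
    }

    return $span;
}

$html = '<main><h2>Surprising fact you may not know!</h2></main>';

// Bookmark the `<h2>` opener: bytes 6 through 9 (inclusive in this sketch).
$bookmark = array( 'start' => 6, 'end' => 9 );

// Insert ` class="clickbait"` right after `<main` (byte offset 5).
$attribute = ' class="clickbait"';
$html      = substr( $html, 0, 5 ) . $attribute . substr( $html, 5 );
$bookmark  = shift_span( $bookmark, 5, 0, $attribute );

// The bookmark still covers the `<h2>` opener in the updated document.
echo substr( $html, $bookmark['start'], $bookmark['end'] - $bookmark['start'] + 1 ), "\n";
```

The same edit applied after the bookmarked token would leave both offsets untouched, which is why only updates at or before a bookmark need to be considered.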

Co-authored-by: Adam Zielinski <adam@adamziel.com>
Co-authored-by: Dennis Snell <dennis.snell@automattic.com>
dmsnell and adamziel committed Dec 2, 2022
1 parent 32ba7bd commit 5fcf00d
Showing 5 changed files with 684 additions and 48 deletions.
52 changes: 52 additions & 0 deletions lib/experimental/html/class-wp-html-span.php
@@ -0,0 +1,52 @@
<?php
/**
* HTML Span: Represents a textual span inside an HTML document.
*
* @package WordPress
* @subpackage HTML
* @since 6.2.0
*/

/**
* Represents a textual span inside an HTML document.
*
* This is a two-tuple in disguise, used to avoid the memory
* overhead involved in using an array for the same purpose.
*
* This class is for internal usage of the WP_HTML_Tag_Processor class.
*
* @access private
* @since 6.2.0
*
* @see WP_HTML_Tag_Processor
*/
class WP_HTML_Span {
/**
* Byte offset into document where span begins.
*
* @since 6.2.0
* @var int
*/
public $start;

/**
* Byte offset into document where span ends.
*
* @since 6.2.0
* @var int
*/
public $end;

/**
* Constructor.
*
* @since 6.2.0
*
* @param int $start Byte offset into document where span begins.
* @param int $end Byte offset into document where span ends.
*/
public function __construct( $start, $end ) {
$this->start = $start;
$this->end = $end;
}
}
279 changes: 243 additions & 36 deletions lib/experimental/html/class-wp-html-tag-processor.php
@@ -180,6 +180,25 @@
* @since 6.2.0
*/
class WP_HTML_Tag_Processor {
/**
* The maximum number of bookmarks allowed to exist at
* any given time.
*
* @see set_bookmark()
* @since 6.2.0
* @var int
*/
const MAX_BOOKMARKS = 10;

/**
* Maximum number of times seek() can be called.
* Prevents accidental infinite loops.
*
* @see seek()
* @since 6.2.0
* @var int
*/
const MAX_SEEK_OPS = 1000;

/**
* The HTML document to parse.
@@ -349,11 +368,11 @@ class WP_HTML_Tag_Processor {
*
* Example:
* <code>
* // Add the `WP-block-group` class, remove the `WP-group` class.
* $class_changes = [
* // Add the `wp-block-group` class, remove the `wp-group` class.
* $classname_updates = [
* // Indexed by a comparable class name
* 'wp-block-group' => new WP_Class_Name_Operation( 'WP-block-group', WP_Class_Name_Operation::ADD ),
* 'wp-group' => new WP_Class_Name_Operation( 'WP-group', WP_Class_Name_Operation::REMOVE )
* 'wp-block-group' => WP_HTML_Tag_Processor::ADD_CLASS,
* 'wp-group' => WP_HTML_Tag_Processor::REMOVE_CLASS
* ];
* </code>
*
@@ -362,6 +381,15 @@ class WP_HTML_Tag_Processor {
*/
private $classname_updates = array();

/**
* Tracks a semantic location in the original HTML which
* shifts with updates as they are applied to the document.
*
* @since 6.2.0
* @var WP_HTML_Span[]
*/
private $bookmarks = array();

const ADD_CLASS = true;
const REMOVE_CLASS = false;
const SKIP_CLASS = null;
@@ -396,6 +424,16 @@ class WP_HTML_Tag_Processor {
*/
private $attribute_updates = array();

/**
* Tracks how many times we've performed a `seek()`
* so that we can prevent accidental infinite loops.
*
* @see seek()
* @since 6.2.0
* @var int
*/
private $seek_count = 0;

/**
* Constructor.
*
@@ -479,6 +517,123 @@ public function next_tag( $query = null ) {
return true;
}


/**
* Sets a bookmark in the HTML document.
*
* Bookmarks represent specific places or tokens in the HTML
* document, such as a tag opener or closer. When applying
* edits to a document, such as setting an attribute, the
* text offsets of that token may shift; the bookmark is
* kept updated with those shifts and remains stable unless
* the entire span of text in which the token sits is removed.
*
* Release bookmarks when they are no longer needed.
*
* Example:
* ```
* <main><h2>Surprising fact you may not know!</h2></main>
* ^ ^
* \-|-- this `H2` opener bookmark tracks the token
*
* <main class="clickbait"><h2>Surprising fact you may no…
* ^ ^
* \-|-- it shifts with edits
* ```
*
* Bookmarks provide the ability to seek to a previously-scanned
* place in the HTML document. This avoids the need to re-scan
* the entire thing.
*
* Example:
* ```
* <ul><li>One</li><li>Two</li><li>Three</li></ul>
* ^^^^
* want to note this last item
*
* $p = new WP_HTML_Tag_Processor( $html );
* $in_list = false;
* while ( $p->next_tag( [ 'tag_closers' => $in_list ? 'visit' : 'skip' ] ) ) {
* if ( 'UL' === $p->get_tag() ) {
* if ( $p->is_tag_closer() ) {
* $in_list = false;
* $p->set_bookmark( 'resume' );
* if ( $p->seek( 'last-li' ) ) {
* $p->add_class( 'last-li' );
* }
* $p->seek( 'resume' );
* $p->release_bookmark( 'last-li' );
* $p->release_bookmark( 'resume' );
* } else {
* $in_list = true;
* }
* }
*
* if ( 'LI' === $p->get_tag() ) {
* $p->set_bookmark( 'last-li' );
* }
* }
* ```
*
* Because bookmarks maintain their position they don't
* expose any internal offsets for the HTML document
* and can't be used with normal string functions.
*
* Because bookmarks allocate memory and require processing
* for every applied update they are limited and require
* a name. They should not be created inside a loop.
*
* Bookmarks are a powerful tool to enable complicated behavior;
* consider double-checking that you need this tool if you are
* reaching for it, as inappropriate use could lead to broken
* HTML structure or unwanted processing overhead.
*
* @param string $name Identifies this particular bookmark.
* @return bool Whether the bookmark was set.
* @throws Exception Throws when the bookmark limit is exceeded if WP_DEBUG set.
*/
public function set_bookmark( $name ) {
if ( null === $this->tag_name_starts_at ) {
return false;
}

if ( ! array_key_exists( $name, $this->bookmarks ) && count( $this->bookmarks ) >= self::MAX_BOOKMARKS ) {
if ( defined( 'WP_DEBUG' ) && WP_DEBUG ) {
throw new Exception( "Tried to set too many HTML bookmarks; cannot create {$name}." );
}
return false;
}

$this->bookmarks[ $name ] = new WP_HTML_Span(
$this->tag_name_starts_at - 1,
$this->tag_ends_at
);

return true;
}


/**
* Removes a bookmark if you no longer need to use it.
*
* Releasing a bookmark frees up the small performance
* overhead they require, mainly in the form of compute
* costs when modifying the document.
*
* @param string $name Name of the bookmark to remove.
* @return bool
*/
public function release_bookmark( $name ) {
if ( ! array_key_exists( $name, $this->bookmarks ) ) {
return false;
}

unset( $this->bookmarks[ $name ] );

return true;
}


/**
* Skips the contents of the title and textarea tags until an appropriate
* tag closer is found.
@@ -1104,9 +1259,77 @@ private function apply_attributes_updates() {
$this->updated_bytes = $diff->end;
}

foreach ( $this->bookmarks as $bookmark ) {
/**
* As we loop through $this->attribute_updates, we keep comparing
* $bookmark->start and $bookmark->end to $diff->start. We can't
* change it and still expect the correct result, so let's accumulate
* the deltas separately and apply them all at once after the loop.
*/
$head_delta = 0;
$tail_delta = 0;

foreach ( $this->attribute_updates as $diff ) {
$update_head = $bookmark->start >= $diff->start;
$update_tail = $bookmark->end >= $diff->start;

if ( ! $update_head && ! $update_tail ) {
break;
}

$delta = strlen( $diff->text ) - ( $diff->end - $diff->start );

if ( $update_head ) {
$head_delta += $delta;
}

if ( $update_tail ) {
$tail_delta += $delta;
}
}

$bookmark->start += $head_delta;
$bookmark->end += $tail_delta;
}

$this->attribute_updates = array();
}

/**
* Move the current pointer in the Tag Processor to a given bookmark's location.
*
* In order to prevent accidental infinite loops, there's a
* maximum limit on the number of times seek() can be called.
*
* @param string $bookmark_name Jump to the place in the document identified by this bookmark name.
* @return bool
* @throws Exception Throws on invalid bookmark name if WP_DEBUG set.
*/
public function seek( $bookmark_name ) {
if ( ! array_key_exists( $bookmark_name, $this->bookmarks ) ) {
if ( defined( 'WP_DEBUG' ) && WP_DEBUG ) {
throw new Exception( 'Invalid bookmark name' );
}
return false;
}

if ( ++$this->seek_count > self::MAX_SEEK_OPS ) {
if ( defined( 'WP_DEBUG' ) && WP_DEBUG ) {
throw new Exception( 'Too many calls to seek() - this can lead to performance issues.' );
}
return false;
}

// Flush out any pending updates to the document.
$this->get_updated_html();

// Point this tag processor before the sought tag opener and consume it.
$this->parsed_bytes = $this->bookmarks[ $bookmark_name ]->start;
$this->updated_bytes = $this->parsed_bytes;
$this->updated_html = substr( $this->html, 0, $this->updated_bytes );
return $this->next_tag();
}

/**
* Sort function to arrange objects with a start property in ascending order.
*
@@ -1411,47 +1634,31 @@ public function __toString() {
* @return string The processed HTML.
*/
public function get_updated_html() {
// Short-circuit if there are no updates to apply.
// Short-circuit if there are no new updates to apply.
if ( ! count( $this->classname_updates ) && ! count( $this->attribute_updates ) ) {
return $this->updated_html . substr( $this->html, $this->updated_bytes );
}

/*
* Parsing is in progress – let's apply the attribute updates without moving on to the next tag.
*
* In practice:
* 1. Apply the attributes updates to the original HTML
* 2. Replace the original HTML with the updated HTML
* 3. Point this tag processor to the current tag name's end in that updated HTML
*/

// Find tag name's end in the updated markup.
$markup_updated_up_to_a_tag_name_end = $this->updated_html . substr( $this->html, $this->updated_bytes, $this->tag_name_starts_at + $this->tag_name_length - $this->updated_bytes );
$updated_tag_name_ends_at = strlen( $markup_updated_up_to_a_tag_name_end );
$updated_tag_name_starts_at = $updated_tag_name_ends_at - $this->tag_name_length;
// Otherwise: apply the updates, rewind before the current tag, and parse it again.
$delta_between_updated_html_end_and_current_tag_end = substr(
$this->html,
$this->updated_bytes,
$this->tag_name_starts_at + $this->tag_name_length - $this->updated_bytes
);
$updated_html_up_to_current_tag_name_end = $this->updated_html . $delta_between_updated_html_end_and_current_tag_end;

// Apply attributes updates.
$this->updated_html = $markup_updated_up_to_a_tag_name_end;
$this->updated_bytes = $this->tag_name_starts_at + $this->tag_name_length;
// 1. Apply the attributes updates to the original HTML
$this->class_name_updates_to_attributes_updates();
$this->apply_attributes_updates();

// Replace $this->html with the updated markup.
$this->html = $this->updated_html . substr( $this->html, $this->updated_bytes );
// 2. Replace the original HTML with the updated HTML
$this->html = $this->updated_html . substr( $this->html, $this->updated_bytes );
$this->updated_html = $updated_html_up_to_current_tag_name_end;
$this->updated_bytes = strlen( $this->updated_html );

// Rewind this processor to the tag name's end.
$this->tag_name_starts_at = $updated_tag_name_starts_at;
$this->parsed_bytes = $updated_tag_name_ends_at;

// Restore the previous version of the updated_html as we are not finished with the current_tag yet.
$this->updated_html = $markup_updated_up_to_a_tag_name_end;
$this->updated_bytes = $updated_tag_name_ends_at;

// Parse the attributes in the updated markup.
$this->attributes = array();
while ( $this->parse_next_attribute() ) {
continue;
}
// 3. Point this tag processor at the original tag opener and consume it
$this->parsed_bytes = strlen( $updated_html_up_to_current_tag_name_end ) - $this->tag_name_length - 2;
$this->next_tag();

return $this->html;
}
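
The bookmark adjustment added to `apply_attributes_updates()` can be tried in isolation. The sketch below assumes the updates are sorted by ascending start offset, which is what makes the early `break` safe, and it uses plain arrays in place of the processor's internal diff objects. The function name `accumulate_bookmark_deltas()` is hypothetical.

```php
<?php
/**
 * Hypothetical standalone version of the bookmark-shifting loop.
 *
 * @param array $bookmark Array with 'start' and 'end' byte offsets.
 * @param array $diffs    Updates sorted by 'start'; each has 'start', 'end', 'text'.
 * @return array The bookmark with both offsets shifted.
 */
function accumulate_bookmark_deltas( array $bookmark, array $diffs ) {
    $head_delta = 0;
    $tail_delta = 0;

    foreach ( $diffs as $diff ) {
        // Always compare against the original offsets, never partially shifted ones.
        $update_head = $bookmark['start'] >= $diff['start'];
        $update_tail = $bookmark['end'] >= $diff['start'];

        // With sorted diffs, every later diff also starts past the bookmark.
        if ( ! $update_head && ! $update_tail ) {
            break;
        }

        // Bytes added (positive) or removed (negative) by this update.
        $delta = strlen( $diff['text'] ) - ( $diff['end'] - $diff['start'] );

        if ( $update_head ) {
            $head_delta += $delta;
        }
        if ( $update_tail ) {
            $tail_delta += $delta;
        }
    }

    // Apply the accumulated shifts once, after the loop.
    $bookmark['start'] += $head_delta;
    $bookmark['end']   += $tail_delta;

    return $bookmark;
}

$diffs = array(
    array( 'start' => 5, 'end' => 5, 'text' => ' id="a"' ),   // Inserts 7 bytes.
    array( 'start' => 20, 'end' => 26, 'text' => 'x' ),       // Replaces 6 bytes with 1.
    array( 'start' => 50, 'end' => 50, 'text' => ' hidden' ), // Starts after the bookmark.
);

$shifted = accumulate_bookmark_deltas( array( 'start' => 30, 'end' => 40 ), $diffs );
```

Accumulating the deltas separately matters because each comparison must be made against the pre-update offsets; shifting the bookmark inside the loop could push it past a later diff's start and count that diff incorrectly.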

3 comments on commit 5fcf00d

@anton-vlasenko (Contributor) commented on 5fcf00d Dec 15, 2022

This commit broke PHP 5.6 unit tests. FYI @dmsnell

1) Render_Block_Heading_Test::test_block_core_heading_render_appends_css_class_to_a_vanilla_element with data set "should add a class name to a vanilla h2 element" ('<h2>Hello World</h2>', '<h2 class="wp-block-heading">...d</h2>')
strpos(): Offset not contained in string

/var/www/html/wp-content/plugins/gutenberg/lib/experimental/html/class-wp-html-tag-processor.php:853
/var/www/html/wp-content/plugins/gutenberg/lib/experimental/html/class-wp-html-tag-processor.php:478
/var/www/html/wp-content/plugins/gutenberg/lib/experimental/html/class-wp-html-tag-processor.php:1661
/var/www/html/wp-content/plugins/gutenberg/build/block-library/blocks/heading.php:37
/var/www/html/wp-content/plugins/gutenberg/phpunit/blocks/render-block-heading-test.php:22
phpvfscomposer:///var/www/html/wp-content/plugins/gutenberg/vendor/phpunit/phpunit/phpunit:51

2) Render_Block_Heading_Test::test_block_core_heading_render_appends_css_class_to_a_vanilla_element with data set "should add a class name even when the class attribute is already defined" ('<h2 class="is-align-right">He...d</h2>', '<h2 class="is-align-right wp-...d</h2>')
strpos(): Offset not contained in string

/var/www/html/wp-content/plugins/gutenberg/lib/experimental/html/class-wp-html-tag-processor.php:853
/var/www/html/wp-content/plugins/gutenberg/lib/experimental/html/class-wp-html-tag-processor.php:478
/var/www/html/wp-content/plugins/gutenberg/lib/experimental/html/class-wp-html-tag-processor.php:1661
/var/www/html/wp-content/plugins/gutenberg/build/block-library/blocks/heading.php:37
/var/www/html/wp-content/plugins/gutenberg/phpunit/blocks/render-block-heading-test.php:22
phpvfscomposer:///var/www/html/wp-content/plugins/gutenberg/vendor/phpunit/phpunit/phpunit:51

3) Render_Block_Heading_Test::test_block_core_heading_render_appends_css_class_to_a_vanilla_element with data set "should handle single quotes" ('<h2 class='is-align-right'>He...d</h2>', '<h2 class="is-align-right wp-...d</h2>')
strpos(): Offset not contained in string

/var/www/html/wp-content/plugins/gutenberg/lib/experimental/html/class-wp-html-tag-processor.php:853
/var/www/html/wp-content/plugins/gutenberg/lib/experimental/html/class-wp-html-tag-processor.php:478
/var/www/html/wp-content/plugins/gutenberg/lib/experimental/html/class-wp-html-tag-processor.php:1661
/var/www/html/wp-content/plugins/gutenberg/build/block-library/blocks/heading.php:37
/var/www/html/wp-content/plugins/gutenberg/phpunit/blocks/render-block-heading-test.php:22
phpvfscomposer:///var/www/html/wp-content/plugins/gutenberg/vendor/phpunit/phpunit/phpunit:51

4) Render_Block_Heading_Test::test_block_core_heading_render_appends_css_class_to_a_vanilla_element with data set "should handle single quotes with double quotes inside" ('<h2 class='" is-align-right'>...d</h2>', '<h2 class="&quot; is-align-ri...d</h2>')
strpos(): Offset not contained in string

/var/www/html/wp-content/plugins/gutenberg/lib/experimental/html/class-wp-html-tag-processor.php:853
/var/www/html/wp-content/plugins/gutenberg/lib/experimental/html/class-wp-html-tag-processor.php:478
/var/www/html/wp-content/plugins/gutenberg/lib/experimental/html/class-wp-html-tag-processor.php:1661
/var/www/html/wp-content/plugins/gutenberg/build/block-library/blocks/heading.php:37
/var/www/html/wp-content/plugins/gutenberg/phpunit/blocks/render-block-heading-test.php:22
phpvfscomposer:///var/www/html/wp-content/plugins/gutenberg/vendor/phpunit/phpunit/phpunit:51

5) Render_Block_Heading_Test::test_block_core_heading_render_appends_css_class_to_a_vanilla_element with data set "should not add a class name even when it is already defined" ('<h2 class="is-align-right wp-...d</h2>', '<h2 class="is-align-right wp-...d</h2>')
strpos(): Offset not contained in string

/var/www/html/wp-content/plugins/gutenberg/lib/experimental/html/class-wp-html-tag-processor.php:853
/var/www/html/wp-content/plugins/gutenberg/lib/experimental/html/class-wp-html-tag-processor.php:478
/var/www/html/wp-content/plugins/gutenberg/lib/experimental/html/class-wp-html-tag-processor.php:1661
/var/www/html/wp-content/plugins/gutenberg/build/block-library/blocks/heading.php:37
/var/www/html/wp-content/plugins/gutenberg/phpunit/blocks/render-block-heading-test.php:22
phpvfscomposer:///var/www/html/wp-content/plugins/gutenberg/vendor/phpunit/phpunit/phpunit:51

6) Render_Block_Heading_Test::test_block_core_heading_render_appends_css_class_to_a_vanilla_element with data set "should add a class name even when there are other HTML attributes present" ('<h2 style="display: block">He...d</h2>', '<h2 class="wp-block-heading" ...d</h2>')
strpos(): Offset not contained in string

/var/www/html/wp-content/plugins/gutenberg/lib/experimental/html/class-wp-html-tag-processor.php:853
/var/www/html/wp-content/plugins/gutenberg/lib/experimental/html/class-wp-html-tag-processor.php:478
/var/www/html/wp-content/plugins/gutenberg/lib/experimental/html/class-wp-html-tag-processor.php:1661
/var/www/html/wp-content/plugins/gutenberg/build/block-library/blocks/heading.php:37
/var/www/html/wp-content/plugins/gutenberg/phpunit/blocks/render-block-heading-test.php:22
phpvfscomposer:///var/www/html/wp-content/plugins/gutenberg/vendor/phpunit/phpunit/phpunit:51

7) Render_Block_Heading_Test::test_block_core_heading_render_appends_css_class_to_a_vanilla_element with data set "should add a class name even when the class attribute is already defined and has many entries" ('<h2 class="is-align-right cus...d</h2>', '<h2 class="is-align-right cus...d</h2>')
strpos(): Offset not contained in string

/var/www/html/wp-content/plugins/gutenberg/lib/experimental/html/class-wp-html-tag-processor.php:853
/var/www/html/wp-content/plugins/gutenberg/lib/experimental/html/class-wp-html-tag-processor.php:478
/var/www/html/wp-content/plugins/gutenberg/lib/experimental/html/class-wp-html-tag-processor.php:1661
/var/www/html/wp-content/plugins/gutenberg/build/block-library/blocks/heading.php:37
/var/www/html/wp-content/plugins/gutenberg/phpunit/blocks/render-block-heading-test.php:22
phpvfscomposer:///var/www/html/wp-content/plugins/gutenberg/vendor/phpunit/phpunit/phpunit:51

8) Render_Block_Heading_Test::test_block_core_heading_render_appends_css_class_to_a_vanilla_element with data set "should not add a class name to a nested h2" ('<h2 class="is-align-right cus...></h2>', '<h2 class="is-align-right cus...></h2>')
strpos(): Offset not contained in string

/var/www/html/wp-content/plugins/gutenberg/lib/experimental/html/class-wp-html-tag-processor.php:853
/var/www/html/wp-content/plugins/gutenberg/lib/experimental/html/class-wp-html-tag-processor.php:478
/var/www/html/wp-content/plugins/gutenberg/lib/experimental/html/class-wp-html-tag-processor.php:1661
/var/www/html/wp-content/plugins/gutenberg/build/block-library/blocks/heading.php:37
/var/www/html/wp-content/plugins/gutenberg/phpunit/blocks/render-block-heading-test.php:22
phpvfscomposer:///var/www/html/wp-content/plugins/gutenberg/vendor/phpunit/phpunit/phpunit:51

9) Render_Block_Heading_Test::test_block_core_heading_render_appends_css_class_to_a_vanilla_element with data set "should not add a class name to a nested h2 when the parent has another attribute" ('<h2 style="display: block" cl...></h2>', '<h2 style="display: block" cl...></h2>')
strpos(): Offset not contained in string

/var/www/html/wp-content/plugins/gutenberg/lib/experimental/html/class-wp-html-tag-processor.php:853
/var/www/html/wp-content/plugins/gutenberg/lib/experimental/html/class-wp-html-tag-processor.php:478
/var/www/html/wp-content/plugins/gutenberg/lib/experimental/html/class-wp-html-tag-processor.php:1661
/var/www/html/wp-content/plugins/gutenberg/build/block-library/blocks/heading.php:37
/var/www/html/wp-content/plugins/gutenberg/phpunit/blocks/render-block-heading-test.php:22
phpvfscomposer:///var/www/html/wp-content/plugins/gutenberg/vendor/phpunit/phpunit/phpunit:51

10) Render_Block_Heading_Test::test_block_core_heading_render_appends_css_class_to_a_vanilla_element with data set "should add a class name even when the class attribute is surrounded by other attributes" ('<h2 style="display: block" cl...></h2>', '<h2 style="display: block" cl...></h2>')
strpos(): Offset not contained in string

/var/www/html/wp-content/plugins/gutenberg/lib/experimental/html/class-wp-html-tag-processor.php:853
/var/www/html/wp-content/plugins/gutenberg/lib/experimental/html/class-wp-html-tag-processor.php:478
/var/www/html/wp-content/plugins/gutenberg/lib/experimental/html/class-wp-html-tag-processor.php:1661
/var/www/html/wp-content/plugins/gutenberg/build/block-library/blocks/heading.php:37
/var/www/html/wp-content/plugins/gutenberg/phpunit/blocks/render-block-heading-test.php:22
phpvfscomposer:///var/www/html/wp-content/plugins/gutenberg/vendor/phpunit/phpunit/phpunit:51

11) Render_Block_Heading_Test::test_block_core_heading_render_appends_css_class_to_a_vanilla_element with data set "should add a class name without getting confused when there is a tricky data-class attribute present" ('<h2 data-class="corner case!"...></h2>', '<h2 data-class="corner case!"...></h2>')
strpos(): Offset not contained in string

/var/www/html/wp-content/plugins/gutenberg/lib/experimental/html/class-wp-html-tag-processor.php:853
/var/www/html/wp-content/plugins/gutenberg/lib/experimental/html/class-wp-html-tag-processor.php:478
/var/www/html/wp-content/plugins/gutenberg/lib/experimental/html/class-wp-html-tag-processor.php:1661
/var/www/html/wp-content/plugins/gutenberg/build/block-library/blocks/heading.php:37
/var/www/html/wp-content/plugins/gutenberg/phpunit/blocks/render-block-heading-test.php:22
phpvfscomposer:///var/www/html/wp-content/plugins/gutenberg/vendor/phpunit/phpunit/phpunit:51

@dmsnell (Member, Author)

hm. sorry about that @anton-vlasenko - I rely on the CI test suites to confirm that something doesn't break and I didn't realize that was a false safety.

I'll try and look into this and see what happened.

@anton-vlasenko (Contributor)

I absolutely agree with you on that, @dmsnell.
The CI jobs should be fixed to run PHPUnit tests on all PHP versions.
There is a PR (work in progress) to change that: #46510
Hopefully, it will be ready and merged before the end of the year.
Thank you for looking into it!
