Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add/html character reference decoder #47040

Closed
wants to merge 10 commits into from

Conversation

dmsnell
Copy link
Member

@dmsnell dmsnell commented Jan 10, 2023

Trac ticket: Core-60841

What?

Adds a new HTML character reference decoder class, used by tag processor, for properly decoding HTML character references (entities). Leaves junk input in output, e.g. when HTML calls to replace a sequence with the replacement character U+FFFD (�) this decoder leaves in the raw input so that it won't change something it doesn't need to.

Why?

html_entity_decode() is an insufficient function in two ways:

  • it isn't aware of the ambiguous ampersand rule which leads to different decoding based on whether the encoded text comes from an HTML attribute or from data (or a few other contexts)
  • it doesn't properly decode all the allowable entities and variations allowed by HTML5
    • it requires a terminating ;
    • doesn't allow zero-extended prefixes to numerical character references e.g. &#x0000065
    • doesn't decode the C1 control character replacements, e.g. &#x80 is not the padding character

How?

Scans an input string for character entities and decodes them as numeric references or as named referenced, looking up names from the HTML5 spec

When performing named character decoding this decoder groups names by their first two letters, forming a naming "group." That group usually contains only a few named references. When finding the appropriate group, we iterate over the candidate names in that group to determine if the input contains that exact name match, and if we do, use that match to determine which text to replace in the input string.

Testing

Need to add tests to confirm this behavior.

Some basic tests so far with the HTML5 spec single-page.html shows very little or no noticeable impact on performance, but slightly increased memory use, probably because of how this is string-copying the class attribute for comparison. There are optimizations we could explore to avoid this allocation.

raw input how it should be decoded in an attribute (this PR) how it should be decoded in markup (this PR) how PHP decodes it
test test test test
&sirnotinthisfilm; &sirnotinthisfilm; &sirnotinthisfilm; &sirnotinthisfilm;
&#x0000065 e e &#x0000065
a&#x1F170b a&#x1F170b a&#x1F170b a&#x1F170b
a🅰b a🅰b a🅰b a🅰b
a🅰b a🅰b a🅰b a🅰b
a�b a�b a�b a�b
&#x10FFFCt 􏿼t 􏿼t &#x10FFFCt
&#x10FFFC5 &#x10FFFC5 &#x10FFFC5 &#x10FFFC5
&#1114101 􏿵 􏿵 &#1114101
&#1114101t 􏿵t 􏿵t &#1114101t
&#11141015 &#11141015 &#11141015 &#11141015
a&#x1F170x;b a🅰x;b a🅰x;b a&#x1F170x;b
a&#1337b aԹb aԹb a&#1337b
aԹb aԹb aԹb aԹb
a&ampb a&ampb a&b a&ampb
a&b a&b a&b a&b
a&notin a bind a&notin a bind a¬in a bind a&notin a bind
a∉b a∉b a∉b a∉b
a&notinb a&notinb a¬inb a&notinb
Ă Ă Ă Ă
&Abreve &Abreve &Abreve &Abreve
Á Á Á Á
&Aacute Á Á &Aacute
Ála carte Ála carte Ála carte Ála carte
&Aacutela carte &Aacutela carte Ála carte &Aacutela carte
&Aacute la carte Á la carte Á la carte &Aacute la carte
&Aacute=la carte &Aacute=la carte Á=la carte &Aacute=la carte
&Aacute*la carte Á*la carte Á*la carte &Aacute*la carte
€‡•œ €‡•œ €‡•œ €‡•œ
&#t; &#t; &#t; &#t;
�&#xdd70 �&#xdd70 �&#xdd70 �&#xdd70
   
   
&#x09&#x10&#x20 &#x10 &#x10 &#x09&#x10&#x20
&#x-65; &#x-65; &#x-65; &#x-65;

@dmsnell dmsnell requested a review from spacedmonkey as a code owner January 10, 2023 21:49
@dmsnell dmsnell marked this pull request as draft January 10, 2023 21:50
@github-actions
Copy link

github-actions bot commented Jan 10, 2023

Flaky tests detected in b8e9430.
Some tests passed with failed attempts. The failures may not be related to this commit but are still reported for visibility. See the documentation for more information.

🔍 Workflow run URL: https://github.com/WordPress/gutenberg/actions/runs/3994110924
📝 Reported issues:

*
* @var string[] named character reference information
*/
public static $lookup_table = array(
Copy link
Contributor

@adamziel adamziel Jan 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any way to initialize this on the first use and avoid allocating the memory until the first use? E.g. implementing something like...

public static get_lookup_table() {
    if ( ! self::$lookup_table ) {
        self::$lookup_table = /* ... */;
    }
    return self::$lookup_table;
}

If PHP always brings the methods opcodes into memory then it wouldn't help at all, but it's worth a shot.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as static string literals I don't think this is possible, or that it will save any memory, because those strings already have to be in memory for it to lazily-initialize them.

this is down to a negligible amount IMO and constrasts a few MBs from clearer constructs, such as the array you mention before, such as the array introduced in d663e70

Comment on lines 110 to 112
$at += $digit_count;
$buffer .= substr( $input, $next, $at - $next );
continue;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: just continue; should suffice here as well – the next loop iteration will skip over the digits anyway.

Comment on lines +172 to +190
* For group "qu", during lookup that will find """
*
* ┌─────┬────┬──────┬────┬──────────────┬────┬─────┐
* │ ... │ N5 │ Name │ S5 │ Substitution │ N6 │ ... │
* ├─────┼────┼──────┼────┼──────────────┼────┼─────┤
* │ ... │ 04 │ ote; │ 01 │ " │ 03 │ ... │
* └─────┴────┴──────┴────┴──────────────┴────┴─────┘
* ^^ ^^
* | |
* | ╰ The substitution is one byte,
* | even though it's represented in
* | the string literal as "\x22", which
* | is done for the sake of avoiding
* | quoting issues in PHP.
* |
* ╰ The "ote;" is four bytes (the finishing of &quo̱ṯe̱;̱)
*
* The part of the group string this represents follows:
* > ...\x04ote;\x01\x22\x03...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused with this comment. It says the group qu will find ", but the example talks about &quote;. What does N5, S5, and N6 mean in the header? I figured out the group string contains different possible substitutions to reduce the size of the lookup table, but I'm not sure what else is there to know about it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked up that entry in $lookup_table and it differs from what's described in this comment:

"QU" => "\x03OT;\x01\x22\x02OT\x01\x22",           // " &QUOT

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the ote; is a typo, but you're not looking up the listed entry, since you looked up QU instead of qu. QU might be a better example though since it only has the two entries and this one has more.

// ℍ ⨖ ≟ ? " &quot
"qu" => "\x0aaternions;\x03ℍ\x06atint;\x03⨖\x06esteq;\x03≟\x04est;\x01?\x03ot;\x01\x22\x02ot\x01\x22",

still working on the table; trying to show the structural layout of the lookups

@@ -13,13 +13,7 @@
* @TODO: Clean up attribute token class after is_true addition
* @TODO: Prune whitespace when removing classes/attributes: e.g. "a b c" -> "c" not " c"
* @TODO: Skip over `/` in attributes area, split attribute names by `/`
* @TODO: Decode HTML references/entities in class names when matching.
* E.g. match having class `1<"2` needs to recognize `class="1&lt;&quot;2"`.
* @TODO: Decode character references in `get_attribute()`
* @TODO: Properly escape attribute value in `set_attribute()`
Copy link
Contributor

@adamziel adamziel Jan 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does that still apply? As long as the " character is escaped everything should be fine even if nothing else is escaped?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not following your question. can you rephrase it?

the class value of a&#x65a should be interpreted as having class aaa, so the stylesheet to apply rules to it would look like this…

.aaa { … }

these TODOs thus addressed two different issues:

  • if we search for a tag with class wp-block it should find a tag even if it's written in the tag as class="wp&#x2dblock" or class='wp&#45block' or any other variant with character references. strangely the class &#x1FFFF (and others like it with invalid character references) map to the CSS class name .
  • when decoding attribute values PHP has been returning potentially invalid strings since html_entity_decode is both broken by implementation and by design.

* @var string[] named character reference information
*/
public static $lookup_table = array(
"AE" => "\x04lig;\x02Æ\x03lig\x02Æ", // &AElig; &AElig
Copy link
Contributor

@adamziel adamziel Jan 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused by this $lookup_table. Why not store it as "AE" => ["lig;" => "Æ", "lig" => "Æ"]? Is this a memory optimization? Here's how _zend_string is stored:

struct _zend_string {
	zend_refcounted_h gc;
	zend_ulong        h;                /* hash value */
	size_t            len;
	char              val[1];
};

len is just like size byte you use, and val[1] is used to store the actual contents (the length 1 is a "struct hack"). On top of that there's a few integer allocations for the hash (computed on the first use) and the gc reference. So each zend_string would take up a few integer allocations more and enable less complex code.

What are your thoughts on that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was an optimization; see d663e70 for the previous data structure. I looked at another, which was storing these as static properties of an auto-generated class using a Character_Reference class with $name and $substitution properties, but each one had its own costs.

you're right that Zend strings don't require more storage, but a couple things led me away from the array as you wrote it:

  • array access involves a few steps of memory indirection and we don't know if the second or third name in a group will be close or distant in memory. the packed string format I'm using currently should keep most candidate names for a group within a single cache line and all name/substitution values except potentially a few very long ones at that. I haven't measured the impact of this, but at least we know we're preserving locality during iteration of the candidates and will avoid all further memory lookups.
  • array key lookup requires an additional hashing step which here I don't think will be necessary. we're still doing it for groups, but I'm trying to balance the different tradeoffs and minimize the search space as quickly as possible. the packed string is mostly this same structure you're proposing except we know it comes from a contiguous sequence of bytes vs. a group of pointer dereferences.

because it's all been somewhat fast I've only been looking at memory overhead and as before, arrays have surprised me how much they add. I thought through another few approaches whereby the substitutions and names were all found in a single string, but those didn't offer much and were much messier in an editor.

additionally I looked at packing a tree in a single string which could potentially reduce the size of the table dramatically by avoiding most duplicated letters; it would let us scan one letter at a time (and also avoid new allocations) but I didn't code it up yet for no reason other than time.

there are 2231 named character references and we know we'd need at least a few bytes for each one, let alone the substitutions which total 6420 bytes. if we can get this down to a couple extra bytes of overhead for each one we're still looking at around a minimum of 10KB vs. 135KB, within one order of magnitude, and I don't think it's bad to add that much code on the server.

I'm not exactly sure what this approach would be like, but maybe I'll do an exploration at some point. this seemed pretty much good enough for now 🤷‍♂️

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just played with packing trie into a string for fun: https://gist.github.com/adamziel/a628fc2d63a7733f9fed0dc9c3c34705

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adamziel given that your trie representation still has some issues it's hard for me to test, but, I also find it maybe much more difficult to comprehend when reading. I do admit that my packed structure is itself a bit difficult to see, but at least the groups are easy to scan to and most are short enough where you can see all the names and replacements.

I predict that given we are doing one array lookup in my code based on the first two characters, that it's sufficiently optimized over the more abstract tree where we have approximately as many lookups as we have letters in the character name (less an optimization for when only one name remains given the prefix). the string scanning within a group also should be prompt given that the names within a group should be very local in memory and read in the same burst.

that being said, and knowing that in most cases we won't have any character references in an attribute, I guess we could relax this packing and go back to what's in the original commit (or a slight improvement) and see if we care about the performance.

ironically I do find the packed string in some ways more legible than the noisier array. concerning memory though there's a big difference between 'AE' => "\x04lig;\x02Æ\x03lig\x02Æ" and this…

	'AE' => array( 'lig;', 'Æ', 'lig', 'Æ' ),
	'AM' => array( 'P;', '&', 'P', '&' ),
	…

Whereas the packed string is a single memory lookup where all the values coincide, the array version includes string values which might be at very different locations in memory, so we still have in the case of &lig at least four separate memory lookups once we arrive at the group, since each value in the array is going to be a pointer to some other location with the actual string data.


As a side-note I find that any optimization to make the semi-colon optional is a bit fraught because it's not the case that all semi-colons are optional or required. so as far as the spec goes, the semicolon might as well be a normal letter.


For memory limiting I wonder how the trie implementation would look after fixing it. We have to make sure it finds &lig; before it finds &lig and it has to know at each prefix if there are branches, but also if there's a termination.

For example, the trie as written encodes AElig; and AElig in the same structure…

array(52) {
  ["A"]=>
  array(16) {
    ["E"]=>
    array(1) {
      ["l"]=>
      array(1) {
        ["i"]=>
        array(1) {
          ["g"]=>
          string(2) "Æ"
        }
      }
    }
  }
}

it looks like the problem here is that it finds the version without the semicolon and when the path is empty it overrides the existing [";"]=>string(2) "Æ" that was there already.

I was able to easily modify the script so that it stores the terminators at each branch point, but that made the packed table increase to 26 KB, and this trie traversal looks relatively slow compared to the string comparing. is 110 KB, if we can fit it in there, worth the trie?

@dmsnell dmsnell force-pushed the add/html-character-reference-decoder branch from 026e439 to b8e9430 Compare January 24, 2023 07:59
@gziolo gziolo added the [Feature] Block API API that allows to express the block paradigm. label Jan 31, 2023
@adamziel adamziel mentioned this pull request Feb 9, 2023
26 tasks
@gziolo gziolo added [Feature] HTML API An API for updating HTML attributes in markup and removed [Feature] Block API API that allows to express the block paradigm. labels Apr 18, 2023
@dmsnell
Copy link
Member Author

dmsnell commented May 24, 2024

Replaced by WordPress/wordpress-develop#6387

@dmsnell dmsnell closed this May 24, 2024
@dmsnell dmsnell deleted the add/html-character-reference-decoder branch May 24, 2024 23:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
[Feature] HTML API An API for updating HTML attributes in markup
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants