-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add/html character reference decoder #47040
Conversation
Flaky tests detected in b8e9430. 🔍 Workflow run URL: https://github.com/WordPress/gutenberg/actions/runs/3994110924
|
* | ||
* @var string[] named character reference information | ||
*/ | ||
public static $lookup_table = array( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any way to initialize this on the first use and avoid allocating the memory until the first use? E.g. implementing something like...
public static get_lookup_table() {
if ( ! self::$lookup_table ) {
self::$lookup_table = /* ... */;
}
return self::$lookup_table;
}
If PHP always brings the methods opcodes into memory then it wouldn't help at all, but it's worth a shot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as static string literals I don't think this is possible, or that it will save any memory, because those strings already have to be in memory for it to lazily-initialize them.
this is down to a negligible amount IMO and constrasts a few MBs from clearer constructs, such as the array you mention before, such as the array introduced in d663e70
$at += $digit_count; | ||
$buffer .= substr( $input, $next, $at - $next ); | ||
continue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: just continue;
should suffice here as well – the next loop iteration will skip over the digits anyway.
* For group "qu", during lookup that will find """ | ||
* | ||
* ┌─────┬────┬──────┬────┬──────────────┬────┬─────┐ | ||
* │ ... │ N5 │ Name │ S5 │ Substitution │ N6 │ ... │ | ||
* ├─────┼────┼──────┼────┼──────────────┼────┼─────┤ | ||
* │ ... │ 04 │ ote; │ 01 │ " │ 03 │ ... │ | ||
* └─────┴────┴──────┴────┴──────────────┴────┴─────┘ | ||
* ^^ ^^ | ||
* | | | ||
* | ╰ The substitution is one byte, | ||
* | even though it's represented in | ||
* | the string literal as "\x22", which | ||
* | is done for the sake of avoiding | ||
* | quoting issues in PHP. | ||
* | | ||
* ╰ The "ote;" is four bytes (the finishing of &quo̱ṯe̱;̱) | ||
* | ||
* The part of the group string this represents follows: | ||
* > ...\x04ote;\x01\x22\x03... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused with this comment. It says the group qu
will find "
, but the example talks about "e;
. What does N5
, S5
, and N6
mean in the header? I figured out the group string contains different possible substitutions to reduce the size of the lookup table, but I'm not sure what else is there to know about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked up that entry in $lookup_table
and it differs from what's described in this comment:
"QU" => "\x03OT;\x01\x22\x02OT\x01\x22", // " "
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the ote;
is a typo, but you're not looking up the listed entry, since you looked up QU
instead of qu
. QU
might be a better example though since it only has the two entries and this one has more.
// ℍ ⨖ ≟ ? " "
"qu" => "\x0aaternions;\x03ℍ\x06atint;\x03⨖\x06esteq;\x03≟\x04est;\x01?\x03ot;\x01\x22\x02ot\x01\x22",
still working on the table; trying to show the structural layout of the lookups
@@ -13,13 +13,7 @@ | |||
* @TODO: Clean up attribute token class after is_true addition | |||
* @TODO: Prune whitespace when removing classes/attributes: e.g. "a b c" -> "c" not " c" | |||
* @TODO: Skip over `/` in attributes area, split attribute names by `/` | |||
* @TODO: Decode HTML references/entities in class names when matching. | |||
* E.g. match having class `1<"2` needs to recognize `class="1<"2"`. | |||
* @TODO: Decode character references in `get_attribute()` | |||
* @TODO: Properly escape attribute value in `set_attribute()` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does that still apply? As long as the "
character is escaped everything should be fine even if nothing else is escaped?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not following your question. can you rephrase it?
the class
value of aٚ
should be interpreted as having class aaa
, so the stylesheet to apply rules to it would look like this…
.aaa { … }
these TODOs thus addressed two different issues:
- if we search for a tag with class
wp-block
it should find a tag even if it's written in the tag asclass="wp˛lock"
orclass='wp-block'
or any other variant with character references. strangely the class
(and others like it with invalid character references) map to the CSS class name�
. - when decoding attribute values PHP has been returning potentially invalid strings since
html_entity_decode
is both broken by implementation and by design.
* @var string[] named character reference information | ||
*/ | ||
public static $lookup_table = array( | ||
"AE" => "\x04lig;\x02Æ\x03lig\x02Æ", // Æ Æ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused by this $lookup_table
. Why not store it as "AE" => ["lig;" => "Æ", "lig" => "Æ"]
? Is this a memory optimization? Here's how _zend_string
is stored:
struct _zend_string {
zend_refcounted_h gc;
zend_ulong h; /* hash value */
size_t len;
char val[1];
};
len
is just like size byte you use, and val[1]
is used to store the actual contents (the length 1 is a "struct hack"). On top of that there's a few integer allocations for the hash (computed on the first use) and the gc reference. So each zend_string
would take up a few integer allocations more and enable less complex code.
What are your thoughts on that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this was an optimization; see d663e70 for the previous data structure. I looked at another, which was storing these as static properties of an auto-generated class using a Character_Reference
class with $name
and $substitution
properties, but each one had its own costs.
you're right that Zend strings don't require more storage, but a couple things led me away from the array as you wrote it:
- array access involves a few steps of memory indirection and we don't know if the second or third name in a group will be close or distant in memory. the packed string format I'm using currently should keep most candidate names for a group within a single cache line and all name/substitution values except potentially a few very long ones at that. I haven't measured the impact of this, but at least we know we're preserving locality during iteration of the candidates and will avoid all further memory lookups.
- array key lookup requires an additional hashing step which here I don't think will be necessary. we're still doing it for groups, but I'm trying to balance the different tradeoffs and minimize the search space as quickly as possible. the packed string is mostly this same structure you're proposing except we know it comes from a contiguous sequence of bytes vs. a group of pointer dereferences.
because it's all been somewhat fast I've only been looking at memory overhead and as before, arrays have surprised me how much they add. I thought through another few approaches whereby the substitutions and names were all found in a single string, but those didn't offer much and were much messier in an editor.
additionally I looked at packing a tree in a single string which could potentially reduce the size of the table dramatically by avoiding most duplicated letters; it would let us scan one letter at a time (and also avoid new allocations) but I didn't code it up yet for no reason other than time.
there are 2231 named character references and we know we'd need at least a few bytes for each one, let alone the substitutions which total 6420 bytes. if we can get this down to a couple extra bytes of overhead for each one we're still looking at around a minimum of 10KB vs. 135KB, within one order of magnitude, and I don't think it's bad to add that much code on the server.
I'm not exactly sure what this approach would be like, but maybe I'll do an exploration at some point. this seemed pretty much good enough for now 🤷♂️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just played with packing trie into a string for fun: https://gist.github.com/adamziel/a628fc2d63a7733f9fed0dc9c3c34705
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@adamziel given that your trie representation still has some issues it's hard for me to test, but, I also find it maybe much more difficult to comprehend when reading. I do admit that my packed structure is itself a bit difficult to see, but at least the groups are easy to scan to and most are short enough where you can see all the names and replacements.
I predict that given we are doing one array lookup in my code based on the first two characters, that it's sufficiently optimized over the more abstract tree where we have approximately as many lookups as we have letters in the character name (less an optimization for when only one name remains given the prefix). the string scanning within a group also should be prompt given that the names within a group should be very local in memory and read in the same burst.
that being said, and knowing that in most cases we won't have any character references in an attribute, I guess we could relax this packing and go back to what's in the original commit (or a slight improvement) and see if we care about the performance.
ironically I do find the packed string in some ways more legible than the noisier array. concerning memory though there's a big difference between 'AE' => "\x04lig;\x02Æ\x03lig\x02Æ"
and this…
'AE' => array( 'lig;', 'Æ', 'lig', 'Æ' ),
'AM' => array( 'P;', '&', 'P', '&' ),
…
Whereas the packed string is a single memory lookup where all the values coincide, the array version includes string values which might be at very different locations in memory, so we still have in the case of &lig
at least four separate memory lookups once we arrive at the group, since each value in the array is going to be a pointer to some other location with the actual string data.
As a side-note I find that any optimization to make the semi-colon optional is a bit fraught because it's not the case that all semi-colons are optional or required. so as far as the spec goes, the semicolon might as well be a normal letter.
For memory limiting I wonder how the trie implementation would look after fixing it. We have to make sure it finds &lig;
before it finds &lig
and it has to know at each prefix if there are branches, but also if there's a termination.
For example, the trie as written encodes AElig;
and AElig
in the same structure…
array(52) {
["A"]=>
array(16) {
["E"]=>
array(1) {
["l"]=>
array(1) {
["i"]=>
array(1) {
["g"]=>
string(2) "Æ"
}
}
}
}
}
it looks like the problem here is that it finds the version without the semicolon and when the path is empty it overrides the existing [";"]=>string(2) "Æ"
that was there already.
I was able to easily modify the script so that it stores the terminators at each branch point, but that made the packed table increase to 26 KB, and this trie traversal looks relatively slow compared to the string comparing. is 110 KB, if we can fit it in there, worth the trie?
- more data locality within each group - fewer memory indirections - less string scanning for length lookup
026e439
to
b8e9430
Compare
Replaced by WordPress/wordpress-develop#6387 |
Trac ticket: Core-60841
What?
Adds a new HTML character reference decoder class, used by tag processor, for properly decoding HTML character references (entities). Leaves junk input in output, e.g. when HTML calls to replace a sequence with the replacement character U+FFFD (�) this decoder leaves in the raw input so that it won't change something it doesn't need to.
Why?
html_entity_decode()
is an insufficient function in two ways:;
e
€
is€
not the padding characterHow?
Scans an input string for character entities and decodes them as numeric references or as named referenced, looking up names from the HTML5 spec
When performing named character decoding this decoder groups names by their first two letters, forming a naming "group." That group usually contains only a few named references. When finding the appropriate group, we iterate over the candidate names in that group to determine if the input contains that exact name match, and if we do, use that match to determine which text to replace in the input string.
Testing
Need to add tests to confirm this behavior.
Some basic tests so far with the HTML5 spec
single-page.html
shows very little or no noticeable impact on performance, but slightly increased memory use, probably because of how this is string-copying theclass
attribute for comparison. There are optimizations we could explore to avoid this allocation.test
test
test
test
&sirnotinthisfilm;
&sirnotinthisfilm;
&sirnotinthisfilm;
&sirnotinthisfilm;
e
e
e
e
a�
a�
a�
a�
a🅰b
a🅰b
a🅰b
a🅰b
a🅰b
a🅰b
a🅰b
a🅰b
a�b
a�b
a�b
a�b
􏿼t
t
t
􏿼t
�
�
�
�
􏿵
􏿵
􏿵t
t
t
􏿵t
�
�
�
�
a🅰x;b
a🅰x;b
a🅰x;b
a🅰x;b
aԹb
aԹb
aԹb
aԹb
aԹb
aԹb
aԹb
aԹb
a&b
a&b
a&b
a&b
a&b
a&b
a&b
a&b
a¬in a bind
a¬in a bind
a¬in a bind
a¬in a bind
a∉b
a∉b
a∉b
a∉b
a¬inb
a¬inb
a¬inb
a¬inb
Ă
Ă
Ă
Ă
&Abreve
&Abreve
&Abreve
&Abreve
Á
Á
Á
Á
Á
Á
Á
Á
Ála carte
Ála carte
Ála carte
Ála carte
Ála carte
Ála carte
Ála carte
Ála carte
Á la carte
Á la carte
Á la carte
Á la carte
Á=la carte
Á=la carte
Á=la carte
Á=la carte
Á*la carte
Á*la carte
Á*la carte
Á*la carte
€‡•œ
€‡•œ
€‡•œ
€‡•œ
&#t;
&#t;
&#t;
&#t;
��
��
��
��








	 


	 
&#x-65;
&#x-65;
&#x-65;
&#x-65;