-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HTML API: Parse doctypes and set full parser quirks mode correctly #7195
Changes from 22 commits
42be2f6
d41d9b3
17aff1d
6d73db8
7550d10
2003f1d
a36ceeb
150d18c
16df2cd
7217cc7
b5c5cfa
deafee4
d359997
8d22b60
cda6f58
5aeadf2
601f53b
edba9e1
8250bcb
fee4b70
8ea4451
2ffff8d
edac23f
f38fe1c
65cca88
621afad
364d348
7578082
a68d5ca
78e8c64
ef734be
2a4807e
995129b
8e68dd6
c59993a
3bda3f5
dd9cb57
122e393
bab37e5
aa1912f
271681f
2a424e9
f76e4eb
907987e
452a98c
c44a0b6
4ab8c64
0cb32e9
84c1faa
3c4aa1d
fe6cac9
3db7230
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,269 @@ | ||
<?php | ||
/** | ||
* HTML API: WP_HTML_Doctype_Info class | ||
* | ||
* @package WordPress | ||
* @subpackage HTML-API | ||
* @since 6.7.0 | ||
*/ | ||
|
||
/** | ||
* Core class used by the HTML API processor to represent a DOCTYPE declaration. | ||
* | ||
* @since 6.7.0 | ||
*/ | ||
class WP_HTML_Doctype_Info { | ||
/** | ||
* The name of the DOCTYPE. | ||
* | ||
* @var string|null | ||
*/ | ||
private $name; | ||
|
||
/** | ||
* The public identifier of the DOCTYPE. | ||
* | ||
* @var string|null | ||
*/ | ||
private $public_identifier; | ||
|
||
/** | ||
* The system identifier of the DOCTYPE. | ||
* | ||
* @var string|null | ||
*/ | ||
private $system_identifier; | ||
|
||
/** | ||
* Whether the DOCTYPE token force-quirks flag is set. | ||
* | ||
* @var bool | ||
*/ | ||
private $force_quirks_flag; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. perhaps There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'd prefer to keep this There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I've re-read some of your other related comments and I think I agree after all. I'm going to re-work some of this to establish the resulting quirks mode on construction instead of re-running the algorithm in the |
||
|
||
/** | ||
* Constructor. | ||
* | ||
* The arguments to this constructor correspond to the "DOCTYPE token" as defined in the | ||
* HTML specification. | ||
* | ||
* > DOCTYPE tokens have a name, a public identifier, a system identifier, | ||
* > and a force-quirks flag. | ||
* | ||
* @see https://html.spec.whatwg.org/multipage/parsing.html#tokenization | ||
* | ||
* @since 6.7.0 | ||
* | ||
* @param string|null $name The name of the DOCTYPE. | ||
* @param string|null $public_identifier The public identifier of the DOCTYPE. | ||
* @param string|null $system_identifier The system identifier of the DOCTYPE. | ||
* @param bool $force_quirks_flag Whether the DOCTYPE token force-quirks flag is set. | ||
*/ | ||
public function __construct( | ||
?string $name, | ||
?string $public_identifier, | ||
?string $system_identifier, | ||
bool $force_quirks_flag | ||
) { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't love this constructor even though it's probably fine with Did you consider moving the parsing code inside of it? We could replace this with a static creator method like There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. moving the parse of the token into this class could help resolve the when constructing, as soon as There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I moved the parsing into the constructor, made it access-private, and added some notes that it should really be constructed from a doctype token's The parser now expects what's "in" the doc type, and assumes that whatever it is passed is correct. These classes become quite closely coupled. That may OK at this time given the narrow usefulness of this class (it's practically internal) and the documentation. Curious to hear your thoughts. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. instead of tightly coupling these and recreating some of the pains that Core already has, such as the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I almost passed the full token and then changed direction after looking at
Thinking about that more, I think it's OK to miss updates because I don't think there should be any updates. Changing the doctype seems very dangerous because it has the potential to alter the document's compat mode which could affect how it's parsed. I do have concerns about the failure mode. What happens if we don't get a valid doctype string? We can return a quirks mode doctype instance with all empty strings, but it would be nice to indicate somehow that the provided string wasn't a valid doctype. Would it be OK to throw an exception in that case given this constructor is There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. thankfully we don't currently support setting modifiable text for doctype declarations. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
it seems like if we can't parse it then we can't make any judgement about the document mode because it's not a real DOCTYPE declaration token. returning |
||
$this->name = $name; | ||
$this->public_identifier = $public_identifier; | ||
$this->system_identifier = $system_identifier; | ||
$this->force_quirks_flag = $force_quirks_flag; | ||
} | ||
|
||
/** | ||
* Gets the name of the DOCTYPE. | ||
* | ||
* @since 6.7.0 | ||
* | ||
* @return string The name of the DOCTYPE. | ||
*/ | ||
public function get_name(): string { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I almost think it would be more valuable to represent the absence of these identifiers, just as if I were writing a method to copy the token from a source document into a normalized output, however, I would want to know the difference. I also find it unlikely that userspace code is going to be doing much more with this than possibly copying HTML. I would like to see what people want or expect to do with the information, but it's not like we see people parsing DOCTYPE tokens already using the existing tools available |
||
return $this->name ?? ''; | ||
} | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. given that this basically serves as a record class with an intelligent constructor, do you see much value in hiding these behind what if the properties were all There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think it's OK to set these properties, that's one reason to use getters. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There's also the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. it's not okay to set them, but also seems needless to recreate state-oriented interfaces that do nothing. I know "once public always public" occurs here, but I also see these as being closed to extension or updates. we can also document them saying "don't change these" but I don't know why someone would when the class doesn't do anything else. it doesn't support serialization and it shouldn't be re-parsing itself to change the suggested document mode. in essence, they are read only unless someone changes them and re-reads them, and I'm not sure what the point of that would be. |
||
|
||
/** | ||
* Gets the public identifier of the DOCTYPE. | ||
* | ||
* @since 6.7.0 | ||
* | ||
* @return string The public identifier of the DOCTYPE. | ||
*/ | ||
public function get_public_identifier(): string { | ||
return $this->public_identifier ?? ''; | ||
} | ||
|
||
/** | ||
* Gets the system identifier of the DOCTYPE. | ||
* | ||
* @since 6.7.0 | ||
* | ||
* @return string The system identifier of the DOCTYPE. | ||
*/ | ||
public function get_system_identifier(): string { | ||
return $this->system_identifier ?? ''; | ||
} | ||
|
||
/** | ||
* Determines the resulting document compatibility mode for this DOCTYPE. | ||
* | ||
* When a DOCTYPE appears in the appropriate place in a document, its contents determine | ||
* the compatibility mode of the document. This implements an algorithm described in the | ||
* HTML standard for handling a DOCTYPE token in the "initial" insertion mode. | ||
* | ||
* @see https://html.spec.whatwg.org/multipage/parsing.html#the-initial-insertion-mode | ||
* | ||
* @since 6.7.0 | ||
* | ||
* @return string A string indicating the resulting quirks mode. One of "quirks", | ||
* "limited-quirks", or "no-quirks". | ||
*/ | ||
public function get_compatibility_mode(): string { | ||
/* | ||
* > A system identifier whose value is the empty string is not considered missing for the | ||
* > purposes of the conditions above. | ||
*/ | ||
$system_identifier_is_missing = null === $this->system_identifier; | ||
|
||
/* | ||
* > The system identifier and public identifier strings must be compared to the values | ||
* > given in the lists above in an ASCII case-insensitive manner. A system identifier whose | ||
* > value is the empty string is not considered missing for the purposes of the conditions above. | ||
*/ | ||
$public_identifier = null === $this->public_identifier ? '' : strtolower( $this->public_identifier ); | ||
$system_identifier = null === $this->system_identifier ? '' : strtolower( $this->system_identifier ); | ||
|
||
/* | ||
* > [If] the DOCTYPE token matches one of the conditions in the following list, then set | ||
* > the Document to quirks mode: | ||
*/ | ||
|
||
// > The force-quirks flag is set to on. | ||
if ( $this->force_quirks_flag ) { | ||
return 'quirks'; | ||
} | ||
|
||
// > The name is not "html". | ||
if ( 'html' !== $this->name ) { | ||
return 'quirks'; | ||
} | ||
|
||
// > The public identifier is set to… | ||
if ( | ||
'-//w3o//dtd w3 html strict 3.0//en//' === $public_identifier || | ||
'-/w3c/dtd html 4.0 transitional/en' === $public_identifier || | ||
'html' === $public_identifier | ||
) { | ||
return 'quirks'; | ||
} | ||
|
||
// > The system identifier is set to… | ||
if ( 'http://www.ibm.com/data/dtd/v11/ibmxhtml1-transitional.dtd' === $system_identifier ) { | ||
return 'quirks'; | ||
} | ||
|
||
/* | ||
* All of the following conditions depend on matching the public identifier. | ||
* If the public identifier is falsy, none of the following conditions will match. | ||
*/ | ||
if ( '' === $public_identifier ) { | ||
return 'no-quirks'; | ||
} | ||
|
||
// > The public identifier starts with… | ||
if ( | ||
str_starts_with( $public_identifier, '+//silmaril//dtd html pro v0r11 19970101//' ) || | ||
str_starts_with( $public_identifier, '-//as//dtd html 3.0 aswedit + extensions//' ) || | ||
str_starts_with( $public_identifier, '-//advasoft ltd//dtd html 3.0 aswedit + extensions//' ) || | ||
str_starts_with( $public_identifier, '-//ietf//dtd html 2.0 level 1//' ) || | ||
str_starts_with( $public_identifier, '-//ietf//dtd html 2.0 level 2//' ) || | ||
str_starts_with( $public_identifier, '-//ietf//dtd html 2.0 strict level 1//' ) || | ||
str_starts_with( $public_identifier, '-//ietf//dtd html 2.0 strict level 2//' ) || | ||
str_starts_with( $public_identifier, '-//ietf//dtd html 2.0 strict//' ) || | ||
str_starts_with( $public_identifier, '-//ietf//dtd html 2.0//' ) || | ||
str_starts_with( $public_identifier, '-//ietf//dtd html 2.1e//' ) || | ||
str_starts_with( $public_identifier, '-//ietf//dtd html 3.0//' ) || | ||
str_starts_with( $public_identifier, '-//ietf//dtd html 3.2 final//' ) || | ||
str_starts_with( $public_identifier, '-//ietf//dtd html 3.2//' ) || | ||
str_starts_with( $public_identifier, '-//ietf//dtd html 3//' ) || | ||
str_starts_with( $public_identifier, '-//ietf//dtd html level 0//' ) || | ||
str_starts_with( $public_identifier, '-//ietf//dtd html level 1//' ) || | ||
str_starts_with( $public_identifier, '-//ietf//dtd html level 2//' ) || | ||
str_starts_with( $public_identifier, '-//ietf//dtd html level 3//' ) || | ||
str_starts_with( $public_identifier, '-//ietf//dtd html strict level 0//' ) || | ||
str_starts_with( $public_identifier, '-//ietf//dtd html strict level 1//' ) || | ||
str_starts_with( $public_identifier, '-//ietf//dtd html strict level 2//' ) || | ||
str_starts_with( $public_identifier, '-//ietf//dtd html strict level 3//' ) || | ||
str_starts_with( $public_identifier, '-//ietf//dtd html strict//' ) || | ||
str_starts_with( $public_identifier, '-//ietf//dtd html//' ) || | ||
str_starts_with( $public_identifier, '-//metrius//dtd metrius presentational//' ) || | ||
str_starts_with( $public_identifier, '-//microsoft//dtd internet explorer 2.0 html strict//' ) || | ||
str_starts_with( $public_identifier, '-//microsoft//dtd internet explorer 2.0 html//' ) || | ||
str_starts_with( $public_identifier, '-//microsoft//dtd internet explorer 2.0 tables//' ) || | ||
str_starts_with( $public_identifier, '-//microsoft//dtd internet explorer 3.0 html strict//' ) || | ||
str_starts_with( $public_identifier, '-//microsoft//dtd internet explorer 3.0 html//' ) || | ||
str_starts_with( $public_identifier, '-//microsoft//dtd internet explorer 3.0 tables//' ) || | ||
str_starts_with( $public_identifier, '-//netscape comm. corp.//dtd html//' ) || | ||
str_starts_with( $public_identifier, '-//netscape comm. corp.//dtd strict html//' ) || | ||
str_starts_with( $public_identifier, "-//o'reilly and associates//dtd html 2.0//" ) || | ||
str_starts_with( $public_identifier, "-//o'reilly and associates//dtd html extended 1.0//" ) || | ||
str_starts_with( $public_identifier, "-//o'reilly and associates//dtd html extended relaxed 1.0//" ) || | ||
str_starts_with( $public_identifier, '-//sq//dtd html 2.0 hotmetal + extensions//' ) || | ||
str_starts_with( $public_identifier, '-//softquad software//dtd hotmetal pro 6.0::19990601::extensions to html 4.0//' ) || | ||
str_starts_with( $public_identifier, '-//softquad//dtd hotmetal pro 4.0::19971010::extensions to html 4.0//' ) || | ||
str_starts_with( $public_identifier, '-//spyglass//dtd html 2.0 extended//' ) || | ||
str_starts_with( $public_identifier, '-//sun microsystems corp.//dtd hotjava html//' ) || | ||
str_starts_with( $public_identifier, '-//sun microsystems corp.//dtd hotjava strict html//' ) || | ||
str_starts_with( $public_identifier, '-//w3c//dtd html 3 1995-03-24//' ) || | ||
str_starts_with( $public_identifier, '-//w3c//dtd html 3.2 draft//' ) || | ||
str_starts_with( $public_identifier, '-//w3c//dtd html 3.2 final//' ) || | ||
str_starts_with( $public_identifier, '-//w3c//dtd html 3.2//' ) || | ||
str_starts_with( $public_identifier, '-//w3c//dtd html 3.2s draft//' ) || | ||
str_starts_with( $public_identifier, '-//w3c//dtd html 4.0 frameset//' ) || | ||
str_starts_with( $public_identifier, '-//w3c//dtd html 4.0 transitional//' ) || | ||
str_starts_with( $public_identifier, '-//w3c//dtd html experimental 19960712//' ) || | ||
str_starts_with( $public_identifier, '-//w3c//dtd html experimental 970421//' ) || | ||
str_starts_with( $public_identifier, '-//w3c//dtd w3 html//' ) || | ||
str_starts_with( $public_identifier, '-//w3o//dtd w3 html 3.0//' ) || | ||
str_starts_with( $public_identifier, '-//webtechs//dtd mozilla html 2.0//' ) || | ||
str_starts_with( $public_identifier, '-//webtechs//dtd mozilla html//' ) | ||
) { | ||
return 'quirks'; | ||
} | ||
|
||
// > The system identifier is missing and the public identifier starts with… | ||
if ( | ||
$system_identifier_is_missing && ( | ||
str_starts_with( $public_identifier, '-//w3c//dtd html 4.01 frameset//' ) || | ||
str_starts_with( $public_identifier, '-//w3c//dtd html 4.01 transitional//' ) | ||
) | ||
) { | ||
return 'quirks'; | ||
} | ||
|
||
/* | ||
* > Otherwise, [if] the DOCTYPE token matches one of the conditions in | ||
* > the following list, then set the Document to limited-quirks mode. | ||
*/ | ||
|
||
// > The public identifier starts with… | ||
if ( | ||
str_starts_with( $public_identifier, '-//w3c//dtd xhtml 1.0 frameset//' ) || | ||
str_starts_with( $public_identifier, '-//w3c//dtd xhtml 1.0 transitional//' ) | ||
) { | ||
return 'limited-quirks'; | ||
} | ||
|
||
// > The system identifier is not missing and the public identifier starts with… | ||
if ( | ||
! $system_identifier_is_missing && ( | ||
str_starts_with( $public_identifier, '-//w3c//dtd html 4.01 frameset//' ) || | ||
str_starts_with( $public_identifier, '-//w3c//dtd html 4.01 transitional//' ) | ||
) | ||
) { | ||
return 'limited-quirks'; | ||
} | ||
|
||
return 'no-quirks'; | ||
} | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -459,6 +459,24 @@ private function bail( string $message ) { | |
throw $this->unsupported_exception; | ||
} | ||
|
||
/** | ||
* Sets the document compatibility mode (quirks or no-quirks) based on a DOCTYPE declaration. | ||
* | ||
* @see https://html.spec.whatwg.org/multipage/parsing.html#parser-cannot-change-the-mode-flag | ||
* | ||
* @since 6.7.0 | ||
*/ | ||
private function update_document_mode_from_doctype(): void { | ||
$doctype = $this->get_doctype_info(); | ||
if ( null === $doctype ) { | ||
return; | ||
} | ||
|
||
if ( 'quirks' === $doctype->get_compatibility_mode() ) { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this reads a bit awkward to me. also one of the reasons I find direct property access helpful. the DOCTYPE token doesn't specify a document's compatibility mode - it can only suggest it when none is already provided or forced, right? so we might say "this token indicates quirks mode" or "this token indicates no-quirks mode" if ( 'quirks' === $doctype->indicated_compatability_mode ) { } but on this note do we want to constrain this function to only operate on a DOCTYPE token? I would imagine this could lead to people needing to try and push their own doctype declaration into existing HTML, when they could instead set this instead $processor->set_compatability_mode( WP_HTML_Tag_Processor::QUIRKS_MODE );
$processor->set_compatability_mode( WP_HTML_Tag_Processor::NO_QUIRKS_MODE ); Although this does in a sense expose an opportunity for a foot-gun, for someone to change the compatibility mode randomly, it doesn't force new rules onto the Tag Processor that I don't think fully exist there. we're not going to have a token in the fragment case you mention, so maybe we analyze the parsed DOCTYPE in the HTML Processor and then change modes in the Tag Processor via method call. we can even |
||
$this->state->document_mode = WP_HTML_Processor_State::QUIRKS_MODE; | ||
} | ||
} | ||
|
||
/** | ||
* Returns the last error, if any. | ||
* | ||
|
@@ -1076,19 +1094,12 @@ private function step_initial(): bool { | |
* > A DOCTYPE token | ||
*/ | ||
case 'html': | ||
$contents = $this->get_modifiable_text(); | ||
if ( ' html' !== $contents ) { | ||
/* | ||
* @todo When the HTML Tag Processor fully parses the DOCTYPE declaration, | ||
* this code should examine the contents to set the compatability mode. | ||
*/ | ||
$this->bail( 'Cannot process any DOCTYPE other than a normative HTML5 doctype.' ); | ||
} | ||
|
||
$this->update_document_mode_from_doctype(); | ||
/* | ||
* > Then, switch the insertion mode to "before html". | ||
*/ | ||
$this->state->insertion_mode = WP_HTML_Processor_State::INSERTION_MODE_BEFORE_HTML; | ||
$this->insert_html_element( $this->state->current_token ); | ||
return true; | ||
} | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm just guessing but I bet a lot could be added to these properties and this class to educate people on the value (or danger) of these parts of the DOCTYPE. Just a comment saying something like "This exists mostly to be able to represent and copy HTML from one document from another, but it's best to use only the standard
<!DOCTYPE html>
declaration for HTML."just doodling