HTML API: Parse doctypes and set full parser quirks mode correctly #7195
Similar to text nodes, this change adds DOCTYPE tokens to the stack of open elements so they can be reached when stepping through the document via `next_token`.
This method handles parsing the doctype name from a doctype declaration. This is important for the full HTML processor to be able to correctly determine whether it is in quirks mode.
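As a rough illustration of what parsing the doctype name involves, here is a minimal sketch. It is not the method from this PR: `sketch_doctype_name()` is a hypothetical helper, it assumes the input is exactly one DOCTYPE declaration, and it ignores the public and system identifiers as well as the NULL-byte replacement the HTML specification requires.

```php
<?php
/**
 * Hypothetical helper (not part of this PR): extracts the lowercased
 * DOCTYPE name from a declaration such as "<!DOCTYPE html>".
 *
 * Returns null when the input is not a DOCTYPE declaration or has no name.
 */
function sketch_doctype_name( string $html ): ?string {
	$length = strlen( $html );

	// A declaration starts with "<!DOCTYPE" (ASCII case-insensitive) and ends with ">".
	if (
		$length < 10 ||
		0 !== substr_compare( $html, '<!DOCTYPE', 0, 9, true ) ||
		'>' !== $html[ $length - 1 ]
	) {
		return null;
	}

	// Skip HTML whitespace after the keyword, then read up to whitespace or ">".
	$whitespace  = " \t\n\f\r";
	$name_at     = 9 + strspn( $html, $whitespace, 9 );
	$name_length = strcspn( $html, $whitespace . '>', $name_at );

	if ( 0 === $name_length ) {
		// A missing name, e.g. "<!DOCTYPE>"; a real parser would force quirks mode here.
		return null;
	}

	// The tokenizer lowercases ASCII letters in the name.
	return strtolower( substr( $html, $name_at, $name_length ) );
}
```

For example, `sketch_doctype_name( '<!doctype HTML>' )` yields `html`, which is the only name that can lead to no-quirks mode.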
	return $doctype[0];
}

public function parse_doctype(): ?array {
I think this should be protected, not public. We can set doctype details on processor state and add getters for them. Then we can do this parsing just once when the doctype is reached.
Maybe there's a better place for this parsing to happen.
There are some discrepancies with browser behavior to work out.
> Anything else > … set the Document to quirks mode.
Simplify tag processor doctype tests
Change the DOCTYPE status to suggest get_doctype_info over modifiable text. It's a bit confusing because doctypes cannot set their "modifiable" text, which makes the name "modifiable" awkward. It's unlikely this will be supported because most doctypes are skipped, while other doctypes change how a document is parsed.
@sirreal this is quality work of the level I've come to expect from you. thank you so much!
in my latest push I've made some renames and documentation updates, plus added a special-case for the normative `<!DOCTYPE html>`.
if you have any issues with these we can address them in a follow-up, but as we discussed out-of-band, I intend to merge this when the tests pass and when I'm able.
in follow-up work we can examine how the document compatibility mode interacts with the Tag Processor and CSS functions.
Thank you! I've reviewed your changes and I'm happy with them.
…er-doctype-quirks-mode-handling
This patch adds until-now missing code to parse the structure of HTML DOCTYPE declarations. The DOCTYPE is mostly unused but can dictate the document compatibility mode, which governs whether CSS class names match in an ASCII-case-insensitive way or not, and whether TABLE elements close an open P element.

The DOCTYPE information is made available through a new method on the Tag Processor, `get_doctype_info()`.

Developed in #7195
Discussed in https://core.trac.wordpress.org/ticket/61576

Props dmsnell, jonsurrell.
See #61576.

git-svn-id: https://develop.svn.wordpress.org/trunk@58925 602fd350-edb4-49c9-b593-d223f7449a82
This patch adds until-now missing code to parse the structure of HTML DOCTYPE declarations. The DOCTYPE is mostly unused but can dictate the document compatibility mode, which governs whether CSS class names match in an ASCII-case-insensitive way or not, and whether TABLE elements close an open P element.

The DOCTYPE information is made available through a new method on the Tag Processor, `get_doctype_info()`.

Developed in WordPress/wordpress-develop#7195
Discussed in https://core.trac.wordpress.org/ticket/61576

Props dmsnell, jonsurrell.
See #61576.

Built from https://develop.svn.wordpress.org/trunk@58925

git-svn-id: http://core.svn.wordpress.org/trunk@58321 1a063a9b-81f0-0310-95a4-ce76da25c4cd
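To make the class-name consequence described in the commit message concrete, here is a minimal sketch of how the compatibility mode could change class matching. `sketch_has_class()` is a hypothetical helper written for illustration only; it is not the Tag Processor's API, and it assumes the mode is passed in as one of the strings `quirks`, `limited-quirks`, or `no-quirks`.

```php
<?php
/**
 * Hypothetical helper (not WordPress code): reports whether a class attribute
 * value contains the wanted class name under a given compatibility mode.
 *
 * In quirks mode class names match ASCII case-insensitively; in the other
 * modes they must match byte for byte.
 */
function sketch_has_class( string $class_attribute, string $wanted, string $compat_mode ): bool {
	// Class names are separated by runs of HTML whitespace.
	$class_names = preg_split( '/[ \t\n\f\r]+/', $class_attribute, -1, PREG_SPLIT_NO_EMPTY );

	foreach ( $class_names as $class_name ) {
		$matches = 'quirks' === $compat_mode
			? 0 === strcasecmp( $class_name, $wanted )
			: $class_name === $wanted;

		if ( $matches ) {
			return true;
		}
	}

	return false;
}
```

With this sketch, looking for `wide` in `class="Wide hero"` succeeds only when the mode is `quirks`; limited-quirks mode does not relax class matching.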
Adds a new class, `WP_HTML_Doctype_Info`, that handles parsing DOCTYPE tokens and exposes information about them. 46 skipped tests from HTML5lib are now run. 1 test was disabled due to the way some whitespace is handled in the full parser.
This change adds a new class to handle DOCTYPE token information according to the specification. The class is exposed from Tag and HTML processors when a DOCTYPE token is reached. DOCTYPE token information can be retrieved for inspection by calling `$processor->get_doctype_info();`. See this example from the HTML5lib tests: wordpress-develop/tests/phpunit/tests/html-api/wpHtmlProcessorHtml5lib.php, lines 192 to 200 in 2ffff8d.
The new class parses DOCTYPE tokens in greater detail. This is useful because DOCTYPE tokens may appear in many places in HTML but are ignored in most situations. Detailed parsing of DOCTYPE tokens can therefore be handled on demand, when a DOCTYPE token is reached under the appropriate circumstances.
The `WP_HTML_Doctype_Info` class also handles the complex rules for determining quirks mode, which involve inspecting the DOCTYPE token name, public identifier, system identifier, and force_quirks_flag.

Trac ticket: https://core.trac.wordpress.org/ticket/61576
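The shape of those quirks-mode rules can be sketched as follows. This is a simplified illustration based on the HTML specification, not the code in `WP_HTML_Doctype_Info`: `sketch_compat_mode()` is a hypothetical function, it assumes the name has already been lowercased by the tokenizer, it ignores the iframe `srcdoc` exception, and it includes only two entries from the specification's long list of legacy public identifier prefixes.

```php
<?php
/**
 * Hypothetical sketch (not the PR's implementation) of choosing a document
 * compatibility mode from parsed DOCTYPE fields. A null identifier means
 * the identifier is missing.
 */
function sketch_compat_mode( ?string $name, ?string $public_id, ?string $system_id, bool $force_quirks ): string {
	if ( $force_quirks || 'html' !== $name ) {
		return 'quirks';
	}

	// Identifier comparisons are ASCII case-insensitive, so compare lowercased copies.
	$public = null === $public_id ? '' : strtolower( $public_id );
	$system = null === $system_id ? null : strtolower( $system_id );

	$starts_with_any = static function ( string $subject, array $prefixes ): bool {
		foreach ( $prefixes as $prefix ) {
			if ( 0 === strncmp( $subject, $prefix, strlen( $prefix ) ) ) {
				return true;
			}
		}
		return false;
	};

	// Exact matches that always indicate quirks mode.
	$quirky_public = array( '-//w3o//dtd w3 html strict 3.0//en//', '-/w3c/dtd html 4.0 transitional/en', 'html' );
	if ( in_array( $public, $quirky_public, true ) || 'http://www.ibm.com/data/dtd/v11/ibmxhtml1-transitional.dtd' === $system ) {
		return 'quirks';
	}

	// ASSUMPTION: abbreviated list. The specification names many more legacy prefixes.
	if ( $starts_with_any( $public, array( '-//ietf//dtd html//', '-//w3c//dtd html 3.2//' ) ) ) {
		return 'quirks';
	}

	// HTML 4.01 Frameset and Transitional depend on whether a system identifier is present.
	if ( $starts_with_any( $public, array( '-//w3c//dtd html 4.01 frameset//', '-//w3c//dtd html 4.01 transitional//' ) ) ) {
		return null === $system ? 'quirks' : 'limited-quirks';
	}

	if ( $starts_with_any( $public, array( '-//w3c//dtd xhtml 1.0 frameset//', '-//w3c//dtd xhtml 1.0 transitional//' ) ) ) {
		return 'limited-quirks';
	}

	return 'no-quirks';
}
```

The ordering matters: the force-quirks flag and the name check come first, so a malformed declaration or one whose name is not `html` selects quirks mode whatever its identifiers say.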
Survey of existing DOCTYPE declarations
Download the DOCTYPE report and `cat report-doctypes.txt` to see the color output. Here is a preview: