HTML API: Add support for missing FRAMESET and "after" insertion modes. #7165

Closed

@dmsnell dmsnell commented Aug 8, 2024

Trac ticket: Core-61576.

As part of work to add more spec support to the HTML API, this patch
adds support for the FRAMESET-related insertion modes, as well as the
set of missing "after" insertion modes. These modes run at the end of
parsing a document, closing it and taking care of any lingering tags.

Developed in https://github.com/wordpress/wordpress-develop
Discussed in https://core.trac.wordpress.org/ticket/61576

See #61576.

html5lib tests

-Tests: 1497, Assertions: 975, Skipped: 522.
+Tests: 1494, Assertions: 1074, Skipped: 420.

github-actions bot commented Aug 8, 2024

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props dmsnell, jonsurrell.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

@dmsnell dmsnell force-pushed the html-api/support-after-insertion-modes branch from 51894ef to a71c63f Compare August 8, 2024 19:36

dmsnell commented Aug 8, 2024

@sirreal there are failures in the tests that look like BODY isn't closing. If you have insight, I'd appreciate it.

sirreal commented Aug 9, 2024

> @sirreal there are failures in the tests that look like BODY isn't closing. If you have insight, I'd appreciate it.

I've been reviewing the specification and this may be very tricky! The body tag never seems to close, but things may be inserted outside of it 😵

When the BODY and HTML tags close, nothing changes the stack of open elements. This is different from most other tags and insertion mode transitions. There's no pop here!

> An end tag whose tag name is "body"
>
> Switch the insertion mode to "after body".

In "after body" (BODY closed) and "after after body" (HTML closed), comments are inserted in-place (outside BODY or HTML elements) but other things are handled using rules for "in body" insertion mode. The things that are inserted in place have special handling like "Insert a comment as the last child of the first element in the stack of open elements (the html element)" or "Insert a comment as the last child of the Document object."

This is fascinating. We basically operate just like we're in body with a few invisible exceptions that are inserted out of place. The stack of open elements even continues to be modified!

<i><s>
</body>
<!-- "after body" -->
body » i » s » #text
</s><em>
</html>
<!-- "after after body" -->
body » i » em » #text
</i></em>
body » #text


This is similar to TABLE elements, where some elements may be bumped before the table (foster parenting). Some things, mostly comments, are bumped to after BODY or HTML nodes.

It also has some similarity to FORM element closers that may be removed from the stack of open elements out-of-place.


I'm not sure how best to handle this 😕 Reviewing the spec, I believe that only comments are inserted outside of the BODY and HTML nodes. I wonder if we could collect these in-place comments in a couple of lists, then iterate through them when the document is finished.

Update:

I had overlooked that "after body" and "after after body" actually switch the mode back to "in body" and reprocess the token. So we effectively step out of the BODY or HTML tags, have an opportunity to insert some comments, then switch back if anything else is found. We do not process those tokens in place using the rules for "in body".

> Switch the insertion mode to "in body" and reprocess the token.
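
To make the mode switching above concrete, here is a minimal toy sketch in Python. It is not the HTML API and not a full implementation of the specification: the function name `place_tokens`, the `(kind, value)` token tuples, and the flattening of every in-body node to a single `BODY` parent are all assumptions made for illustration. It models only comments, text, and the `</body>` and `</html>` end tags.

```python
def place_tokens(tokens):
    """Toy model: return (parent, kind, value) triples showing where nodes land.

    Assumes parsing has already reached "in body". Tokens are hypothetical
    (kind, value) tuples with kind in {"text", "comment", "end"}.
    """
    mode = "in body"
    placed = []
    for kind, value in tokens:
        while True:  # loop again whenever a rule says "reprocess the token"
            if mode == "in body":
                if kind == "end" and value == "body":
                    mode = "after body"  # nothing is popped from the stack
                elif kind == "end" and value == "html":
                    mode = "after body"
                    continue  # reprocess </html> in "after body"
                elif kind in ("text", "comment"):
                    placed.append(("BODY", kind, value))
                break
            # From here on, mode is "after body" or "after after body".
            if kind == "text" and value.strip() == "":
                placed.append(("BODY", kind, value))  # uses the "in body" rules
                break
            if kind == "comment":
                parent = "HTML" if mode == "after body" else "#document"
                placed.append((parent, kind, value))
                break
            if mode == "after body" and kind == "end" and value == "html":
                mode = "after after body"
                break
            mode = "in body"  # anything else: switch back, then reprocess
    return placed


tokens = [("text", "x"), ("end", "body"), ("comment", "a"),
          ("end", "html"), ("comment", "b"), ("text", "y")]
for parent, kind, value in place_tokens(tokens):
    print(parent, kind, value)
```

In this toy run the text "y" lands inside BODY even though it is visited after two comments that were placed outside of BODY, which is the out-of-order construction discussed in this thread.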

sirreal commented Aug 9, 2024

I explored some ideas for dealing with those comments in dmsnell#18.

For now, I'm going to push a change to this PR to switch the bail conditions in after body. It seems better to bail on comments outside of BODY and HTML nodes (we might even be able to skip this) and support a </body> tag in the middle of the document that has almost no impact if text or tags follow it.

sirreal added 8 commits August 9, 2024 13:42
Ideally, we could support all of this and only bail if the processor
prints a comment and then re-enters.
Some content can exist outside of BODY and HTML tags. See the after body
and after after body insertion modes.

These modes are difficult because the tree may be constructed
out-of-order. Disallow this through the use of a few flags that are used
to bail if possible out-of-order behavior is detected.

This allows some HTML to be processed as long as elements are
encountered in-order. Although the body tag may close, elements can
still be found in-order. For example:

`</html>x</body>y</body></!></html></!>` is not problematic because text
nodes are found and return to "in body" insertion mode before any
content is produced outside of the body or html nodes:

Document
  ├HTML
  │ ├── HEAD
  │ ├── /HEAD
  │ ├── BODY
  │ │    ├── #text: "x"
  │ │    └── #text: "y"
  │ ├── /BODY
  │ └── #funky-comment: "!"
  ├── /HTML
  └── #funky-comment: "!"

However, as soon as content is produced outside of body, the processor
will bail if it attempts to return to in body or to produce content
inside body tags. For example `</html></!></body>x` bails. It would
produce this tree:

Document
  ├HTML
  │ ├── HEAD
  │ ├── /HEAD
  │ └── BODY
  │      └── #text: "x"
  ├── /HTML
  └── #funky-comment: "!"

In this case, the `#funky-comment` was found first in the document root,
then a text node "x" was added to the body. These out-of-order nodes are
disallowed and bail.
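
The out-of-order check described in this commit message can be sketched roughly as follows. This is a hedged illustration in Python, not the flags used in the patch: `OUTSIDE_RANK` and `would_bail` are hypothetical names, and the input is simply the sequence of parents under which nodes are produced, in the order the parser encounters them.

```python
# Hypothetical ranking of how far "outside" of BODY an insertion point is.
OUTSIDE_RANK = {"BODY": 0, "HTML": 1, "#document": 2}


def would_bail(parents):
    """Return True if nodes would be produced out of document order.

    `parents` lists, in source order, the parent each new node is added to.
    Once content has been produced further outside, any return to a more
    deeply nested parent counts as out-of-order.
    """
    furthest = 0
    for parent in parents:
        rank = OUTSIDE_RANK[parent]
        if rank < furthest:
            return True  # e.g. text in BODY after a comment in the document root
        furthest = rank
    return False


# First example above: "x", "y" in BODY, then a comment in HTML, then one in the root.
print(would_bail(["BODY", "BODY", "HTML", "#document"]))
# Second example above: a comment in the document root, then "x" in BODY.
print(would_bail(["#document", "BODY"]))
```

Tracking a single "furthest outside so far" value is one simple way to express the idea; the patch itself describes using a few flags for the same purpose.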
These modes effectively ignore non-whitespace (not supported) and insert
whitespace text nodes under HTML node.

Bail if out-of-order behavior is detected.
/*
 * > A comment token
 */
case '#cdata-section':

note to self: I'm not sure how these got in here, but they aren't comments, and I don't see a line in the spec for them. I'm thinking it could have been a copy/paste error I missed.

dmsnell commented Aug 9, 2024

> I'm not sure how best to handle this 😕 Reviewing the spec, I believe that only comments are inserted outside of the BODY and HTML nodes. I wonder if we could collect these in-place comments in a couple of lists, then iterate through them when the document is finished.

This seems like a plausible idea, @sirreal. Presumably we could reuse the event queue, such that we track these comments in those lists, and next_token() would pull them out when it sees the POP event for BODY and then for HTML.

From a performance standpoint, I don't see this being that troublesome, especially if all we stored in those lists were the WP_HTML_Token or bookmark name.
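
A rough sketch of that idea, written in Python for brevity rather than as HTML API code: comments found after BODY or HTML are held in two lists and only released once the matching POP event is seen. The event names (`"comment-after-body"`, `"comment-after-html"`, `"push"`, `"pop"`) and the function `replay_in_document_order` are assumptions for illustration. Only a small reference is stored per comment, standing in for a bookmark name.

```python
def replay_in_document_order(events):
    """Reorder a hypothetical event stream so deferred comments follow their POP.

    `events` is a list of (op, value) tuples in the order the parser found them.
    Comments tagged as found after BODY or after HTML are held back in lists
    and emitted right after the POP event for BODY or HTML respectively.
    """
    after_body, after_html, out = [], [], []
    for op, value in events:
        if op == "comment-after-body":
            after_body.append(value)  # store only a small reference
        elif op == "comment-after-html":
            after_html.append(value)
        else:
            out.append((op, value))
            if op == "pop" and value == "BODY":
                out.extend(("comment", ref) for ref in after_body)
                after_body.clear()
            elif op == "pop" and value == "HTML":
                out.extend(("comment", ref) for ref in after_html)
                after_html.clear()
    return out


events = [
    ("push", "HTML"), ("push", "BODY"), ("text", "x"),
    ("comment-after-body", "bookmark-a"),
    ("comment-after-html", "bookmark-b"),
    ("text", "y"),
    ("pop", "BODY"), ("pop", "HTML"),
]
for event in replay_in_document_order(events):
    print(event)
```

With this ordering, both text nodes are visited inside BODY before either deferred comment is released.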

sirreal commented Aug 12, 2024

> > I wonder if we could collect these in-place comments in a couple of lists, then iterate through them when the document is finished.
>
> This seems like a plausible idea, @sirreal. Presumably we could reuse the event queue, such that we track these comments in those lists, and next_token() would pull them out when it sees the POP event for BODY and then for HTML.

I implemented that in this PR in a1304b5, then reverted the change. I had issues getting the text for the comments outside of body because get_modifiable_text is a method on the Tag Processor and it expects to operate on the token it's currently parsing. It may be enough to seek to each of the comments to get their text; I did not try that.

Simplify this PR to handle common scenarios and avoid deviation from specification.
@sirreal sirreal left a comment

This seems to be in a good place 👍 We get a good increase in HTML5lib-tests which also helps my confidence:

 OK, but incomplete, skipped, or risky tests!
-Tests: 1498, Assertions: 930, Skipped: 568.
+Tests: 1495, Assertions: 1026, Skipped: 469.

sirreal and others added 2 commits August 22, 2024 21:11
These should not have appeared as CDATA cannot appear in HTML.

dmsnell commented Aug 23, 2024

Merged in [58926]
15dca4e

@dmsnell dmsnell closed this Aug 23, 2024