Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[css-fonts] Drop bracket matching step from @font-face src: line parsing #6340

Closed
drott opened this issue Jun 2, 2021 · 21 comments
Closed
Assignees
Labels
Closed Accepted by Editor Discretion Commenter Satisfied Commenter has indicated satisfaction with the resolution / edits. css-fonts-4 Current Work Needs Testcase (WPT)

Comments

@drott
Copy link
Collaborator

drott commented Jun 2, 2021

For context, the original resolution from which the spec text for parsing the src: line of the @font-face descriptor was tracked in issue #633. This issue is related to #6328.

The current spec text says:

The value of the 'src!!descriptor' descriptor is parsed piecewise. First, opening and closing parentheses are matched. If these parentheses cannot be matched, the value is a parse error. Matching the parentheses has the effect of partitioning the value into multiple regions, each of which is either inside outermost parentheses or outside outermost parentheses. Once the value is partitioned, all the commas outside outermost parentheses are located. The value is then split at these comma locations.

Why do we need the bracket matching? In my quick analysis, at least WebKit and Blink do not do that but the top level step for parsing the src: descriptors value is to split this into a comma separated list of values, then parse these individually. I am not so familiar with reading Rust code, but it roughly looks like Gecko parses this as a comma separated list at the top level as well? (@jfkthame, could you comment?).

I suggest to drop this text and explain that the first segmentation step is separating by commas, then parsing each individual value separately, unless I am missing something on why this might be needed. To me it seems this would not conflict with any of the second level parsing of individual source values. (It does not look to me as commas would have any other meaning in this descriptor except separating sources, do they?)

This would also address the TODO issue in the text and explain that after comma separation, values that are not parsed correctly are dropped from the list and not considered as downloadable resources.

@svgeesus
Copy link
Contributor

svgeesus commented Jun 2, 2021

Right, because after segmentation on commas the procedure is

Each item in the split value is then parsed against this grammar:

and the grammar includes parentheses. Also Fonts 3 didn't define parsing this way so I suspect implementations just extended what they already had for Fonts 3.

@litherum you added that section could you explain the reasoning?

@svgeesus svgeesus added the css-fonts-4 Current Work label Jun 2, 2021
@svgeesus
Copy link
Contributor

svgeesus commented Jun 2, 2021

Also this resolution:

RESOLVED: Parse the value of src throwing out invalid parts like media queries and not like selectors. aka You split on the commas and throw out the pieces you don't understand not the whole thing.

@svgeesus
Copy link
Contributor

svgeesus commented Jun 2, 2021

I suspect the split on parantheses came about because at one point we were considering syntax like

@font-face {
  font-family: heading-font;
  src: url(fancy-font.woff2) format("woff2;tables=CPAL,FVAR"),
        url(fallback-font.woff2) format("woff2"),
        url(fallback-fallback-font.woff) format("woff"),
        url(how-old-is-your-browser-font.ttf) format("ttf");
}

Note the commas inside the overloaded format specifier.

But we ended up not doing that, so segmenting first on balanced parentheses no longer has any value.

@svgeesus svgeesus self-assigned this Jun 2, 2021
@svgeesus
Copy link
Contributor

svgeesus commented Jun 2, 2021

Agenda+ in case there are any good arguments in favor of the balanced-parens initial step. Otherwise, I'm fine dropping that part to improve internal spec consistency and also better align with actual implementations

@jfkthame
Copy link
Contributor

jfkthame commented Jun 2, 2021

If the first step is to split on commas, what happens if there's a URL that includes a comma?

@svgeesus
Copy link
Contributor

svgeesus commented Jun 2, 2021

comma is in the set sub-delims which also says

If data for a URI component would conflict with a reserved
character's purpose as a delimiter, then the conflicting data must be
percent-encoded before the URI is formed.

I guess that was never an issue for CSS Fonts 3, which didn't have the split-on-balanced-parens step?

@jfkthame
Copy link
Contributor

jfkthame commented Jun 2, 2021

I don't think @drott's original comment here:

In my quick analysis, at least WebKit and Blink do not do that but the top level step for parsing the src: descriptors value is to split this into a comma separated list of values, then parse these individually

can be quite accurate, given that an example like:

@font-face {
  font-family: test;
  src: url(data:font/ttf;base64,AAEA[...snipped...]QjUA);
}

successfully loads the base64-encoded font resource in all of Safari, Chrome and Firefox. If they were splitting the src: value on comma as a top-level step, surely that wouldn't work. They'd end up seeing two items, neither of them complete or parsable.

@tabatkins
Copy link
Member

I'll note that I added an algo to Syntax for this exact case (for other reasons) that behaves correctly, so the spec can just invoke that: https://drafts.csswg.org/css-syntax/#parse-comma-separated-list-of-component-values

@tabatkins
Copy link
Member

tabatkins commented Jun 2, 2021

(CSS specifications can't specify text-level parsing details anyway; by the time they have a value that they know is part of a particular property, it's already been tokenized and had basic parsing done on it, so all you've got is tokens, not characters. The above algo works at the correct level.)

@drott
Copy link
Collaborator Author

drott commented Jun 3, 2021

Thanks @jfkthame and @tabatkins for observations and clarifications. Of course, Jonathan, you're right that the data url scheme you're giving as an example does work. And it's also true that the CSS parsing operates on a token stream, not on a character stream.

What I looked at during my analysis for WebKit and Chrome were these bits of code for where a font face src descriptor is parsed:
ConsumeFontFaceSrc Chrome and consumeFontFaceSrc in WebKit

In particular, the behavior of the parser here consuming and progressing to each next value comma-token by comma-token. But indeed while still inner component values such as url() can contain commas which are consumed in nested parsing functions called consumeFontFaceSrcLocal and consumeFontFaceURI. So my earlier description of first separating by commas is not quite accurate.

With that clarified, I am still convinced that the prose for separating by bracket matching should be dropped as that's definitely not what's happening or needed.

Because @tabatkins provides this very useful reference, as an additional improvement to the current text, we may want to refer to "5.3.10. Parse a comma-separated list of component values" as the algorithm for parsing this value, assuming that it's fully applicable: In that respect, are url(...) and local(...) defined as function tokens?

@tabatkins
Copy link
Member

tabatkins commented Jun 3, 2021

Yeah, per the Syntax spec, they definitely get parsed as function tokens. (Depending on its contents, url() might get parsed as a url-token, but for these purposes that's identical.)

A little more precisely, by the time this algo can get called, they're already processed into functions, a very slightly higher-level object that contains all the stuff between their parens, so a pass over the top-level component values won't see any commas inside of there unless you recurse manually.

@litherum
Copy link
Contributor

litherum commented Jun 9, 2021

The idea was just that I didn’t want “foo, bar(baz, quux)” to be broken into 3 pieces: “foo” “bar(baz” and “quux)”. If there’s logic elsewhere that already satisfies this desire, I’m happy to use it instead.

WebKit hasn’t implemented parsing support for this new src: descriptor grammar yet.

@drott
Copy link
Collaborator Author

drott commented Jun 9, 2021

Thanks for commenting, @litherum. To summarize, I suggest to improve this section by dropping

The value of the 'src!!descriptor' descriptor is parsed piecewise. First, opening and closing parentheses are matched. If these parentheses cannot be matched, the value is a parse error. Matching the parentheses has the effect of partitioning the value into multiple regions, each of which is either inside outermost parentheses or outside outermost parentheses.

Once the value is partitioned, all the commas outside outermost parentheses are located. The value is then split at these comma locations.

Each item in the split value is then parsed against this grammar:

and instead say:

The src: descriptor value must be parsed according to section "5.3.10. Parse a comma-separated list of component values". Then each component value is parsed according to this grammar:

I also suggest to address "ISSUE 4" and the definition of "thrown out" by saying:

If a component value is parsed correctly and is of a format and font technology that the UA supports, add it to the list of supported sources. If parsing a component value results in a parsing error or its format or technology are unsupported, do not add it to the list of supported sources.

If there are no supported entries at the end of this process, the value for the src: descriptor is a parse error.

@svgeesus
Copy link
Contributor

svgeesus commented Jun 9, 2021

Thanks for the suggested wording, @drott I just added it.

@svgeesus
Copy link
Contributor

svgeesus commented Jun 9, 2021

This issue is still on the agenda for today's CSS call, but unless there are problems with the suggested wording then it looks like we can just close the issue as resolved?

It would be good to have some known-invalid-component src parsing tests in WPT.

@drott
Copy link
Collaborator Author

drott commented Jun 9, 2021

Yes, I support resolving and closing with this change. Unfortunately I have a conflict and can't attend today's call, but at least @lilles said they'd be there from our side.

I plan to add tests in parallel to implementing the supports syntax and mechanism in Chrome. I'll add tests for known-invalid-components as you suggest.

@svgeesus
Copy link
Contributor

svgeesus commented Jun 9, 2021

@jfkthame okay to close this?

@svgeesus
Copy link
Contributor

svgeesus commented Jun 9, 2021

@jfkthame
Copy link
Contributor

jfkthame commented Jun 9, 2021

Yes, sounds OK to me.

@svgeesus svgeesus closed this as completed Jun 9, 2021
@svgeesus svgeesus added Closed Accepted by Editor Discretion Commenter Satisfied Commenter has indicated satisfaction with the resolution / edits. and removed Agenda+ labels Jun 9, 2021
@tabatkins
Copy link
Member

+1 to committed change, but @svgeesus INCONSISTENT TABS AND SPACE INDENTATION???

You could also just link the algorithm directly with [=parse a comma-separated list of component values=], but that's a nit.

@svgeesus
Copy link
Contributor

Linking to the algorithm directly might well have been better, yes.

I'm aware that Tabs vs. Spaces is one of the GREAT HOLY WARS OF OUR TIME and I am really not interested in signing up as a combatant.

chromium-wpt-export-bot pushed a commit to web-platform-tests/wpt that referenced this issue Jun 19, 2023
Adapt font-face src: parsing behavior to more leniently accept a src:
line that has parse errors in an individual component. Part of Interop
2023 in the font feature detection and palettes category [1].

Also return serialized values in lowercase to match CSSOM spec.

After CSS WG resolution [1] and updated spec text [2] individually
failing src: line components for incorrect format() or tech(), or other
parse failures per component should not lead to dropping the whole src:
line, but only reject the individual component.

Implement checks for component sanity and on failure, skip to next comma
or end.

[1] https://wpt.fyi/results/css/css-fonts/parsing?q=label%3Ainterop-2023-fonts
[1] w3c/csswg-drafts#6340
[2] https://drafts.csswg.org/css-fonts-4/#font-face-src-parsing

Fixed: 1398933, 857226
Change-Id: Ib94f745fb8b453e21fd981973c29dccb4b722439
chromium-wpt-export-bot pushed a commit to web-platform-tests/wpt that referenced this issue Jun 19, 2023
Adapt font-face src: parsing behavior to more leniently accept a src:
line that has parse errors in an individual component. Part of Interop
2023 in the font feature detection and palettes category [1].

Also return serialized values in lowercase to match CSSOM spec.

After CSS WG resolution [1] and updated spec text [2] individually
failing src: line components for incorrect format() or tech(), or other
parse failures per component should not lead to dropping the whole src:
line, but only reject the individual component.

Implement checks for component sanity and on failure, skip to next comma
or end.

[1] https://wpt.fyi/results/css/css-fonts/parsing?q=label%3Ainterop-2023-fonts
[1] w3c/csswg-drafts#6340
[2] https://drafts.csswg.org/css-fonts-4/#font-face-src-parsing

Fixed: 1398933, 857226
Change-Id: Ib94f745fb8b453e21fd981973c29dccb4b722439
aarongable pushed a commit to chromium/chromium that referenced this issue Jun 21, 2023
Adapt font-face src: parsing behavior to more leniently accept a src:
line that has parse errors in an individual component. Part of Interop
2023 in the font feature detection and palettes category [1].

Also return serialized values in lowercase to match CSSOM spec.

After CSS WG resolution [1] and updated spec text [2] individually
failing src: line components for incorrect format() or tech(), or other
parse failures per component should not lead to dropping the whole src:
line, but only reject the individual component.

Implement checks for component sanity and on failure, skip to next comma
or end.

Test updates:
* fast/css/font-face-src-parsing.html updated to account for more cases
  being allowed. For example, trailing commas in a src: line are not a
  full parse failure.
* fast/css/font-face-unquoted-local.html removed. The test is too
  restrictive after the behavior change, and relatively trivial and
  already covered in WPT tests, for example
  font-format-src-parsing.html.
* fast/css/fontfaceset-multiple-families.html updated to account for a
  @font-face with format("svg") being rejected early and thus not
  becoming part of the FontFaceSet of the page.
* external/wpt/css/css-fonts/format-specifiers-variations.html disabled
  until review in web-platform-tests/wpt#40666
  is complete. Test seems to be expecting an incorrect error type,
  NetworkError vs. SyntaxError.

[1] https://wpt.fyi/results/css/css-fonts/parsing?q=label%3Ainterop-2023-fonts
[1] w3c/csswg-drafts#6340
[2] https://drafts.csswg.org/css-fonts-4/#font-face-src-parsing

Fixed: 1398933, 857226
Change-Id: Ib94f745fb8b453e21fd981973c29dccb4b722439
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/4624052
Reviewed-by: Rune Lillesveen <futhark@chromium.org>
Commit-Queue: Dominik Röttsches <drott@chromium.org>
Cr-Commit-Position: refs/heads/main@{#1160567}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Closed Accepted by Editor Discretion Commenter Satisfied Commenter has indicated satisfaction with the resolution / edits. css-fonts-4 Current Work Needs Testcase (WPT)
Projects
None yet
Development

No branches or pull requests

5 participants