manifest processing model, what if null base URL? (related to origin issue) #12

Closed
danielweck opened this issue Dec 5, 2018 · 17 comments

Comments

@danielweck
Member

Issue originally raised in the "opaque origin" conversation:
w3c/wpub#321 (comment)

@iherman
Member

iherman commented Dec 5, 2018

If the manifest is embedded, the only way this can happen (see w3c/wpub#321 (comment)) is if the value of baseURI in the DOM for the <script> element is null. The question is when that would happen per the HTML or the DOM specs. I do not have a precise answer, but I suspect that it may happen in the case of a file: URL, i.e., when the entry page is read from the file system. If, as we referred to in w3c/wpub#321, we disallow that (or we just say the effect depends on the user agent and users should be prepared), then we are done, aren't we?

@iherman
Member

iherman commented Dec 5, 2018

One step further in https://www.w3.org/TR/DOM-Level-3-Core/core.html#Node3-baseURI:

baseURI of type DOMString, readonly, introduced in DOM Level 3
The absolute base URI of this node or null if the implementation wasn't able to obtain an absolute URI.

@iherman
Member

iherman commented Dec 5, 2018

Related to the original question: I am fine with modifying the processing model to state that if this happens, the processing stops.

@danielweck
Member Author

In my original comment I mentioned data: URLs. I believe this is a problematic edge case.
w3c/wpub#321 (comment)

@iherman
Member

iherman commented Dec 5, 2018

@danielweck I must admit I do not understand your remark with the data URL. Can you give a somewhat more detailed example of what this would be and mean?

@danielweck
Member Author

In the following edge case example, the data: URL encodes an HTML document which does not specify <base href="..."> in its head. Consequently, the <script>-embedded WebPub manifest has a null base URI (inherited from its parent document context).

Please ignore the lack of character escaping, this is pseudo-code:

https://domain.org/index.html
=>

<html>
<body>
<iframe
    src='data:text/html,<html><head><script type="application/ld+json">{...}</script></head><body>...</body></html>'
></iframe>
</body>
</html>

Let's not try to explain why such convoluted markup would exist in the first place. Let's just handle the edge case regardless of its possible causes. I see two options:

  1. Early termination: no point continuing to load the WebPub manifest without a base URI. If I understand correctly, at this point in time the JSON-LD processing model is being discussed / finalized, with respect to handling base URI in embedded contexts. However the WebPub processing model can isolate itself from this potentially moving target, by aborting as soon as the failure criterion is met.
  2. Allow the WebPub manifest to load: if/when a base URI is required as part of the JSON-LD processing model in order to resolve an absolute URL from a relative "path", and this base URI is missing, then let the JSON-LD processor raise the appropriate error. This may be a complete abort, or a skip-resource-and-continue kind of algorithm (I am not sure; do you know, Ivan?)

@iherman
Member

iherman commented Dec 5, 2018

(2) of course sounds like a viable and reasonable option, except that I would expect many reading systems to want to parse and interpret the manifest directly for the purposes of publications, without relying on a full-blown JSON-LD processor. I.e., relying on that may be an issue.

On (1): yes, there are discussions in the JSON-LD WG, but on (other) edge cases of embedding a manifest (e.g., whether it is required to escape certain HTML terms within the script element). I actually do not think this type of edge case has been discussed. Yes, the WebPub model can isolate itself, but I would think it is better to align with the JSON-LD WG.

Bottom line, I think this question should be raised in the JSON-LD WG. I can of course raise the issue, but it may be better if you did it (on https://github.com/w3c/json-ld-syntax/issues).

Do you know what the baseURI value will be on the DOM element for <script>? Will it be null (which is what I expect)?

@danielweck
Member Author

Quick test:

<html>
<body>
<iframe
    width="100%"
    height="100%"

    src="data:text/html;base64,CjxodG1sPgo8aGVhZD4KPGJhc2UgaHJlZj0iaHR0cHM6Ly9kb21haW4ub3JnL3BhdGgvIiAvPgoKPHNjcmlwdCBpZD0ic2NyaXB0IiB0eXBlPSJ0ZXh0L2phdmFzY3JpcHQiPgogIGRvY3VtZW50LmFkZEV2ZW50TGlzdGVuZXIoIkRPTUNvbnRlbnRMb2FkZWQiLCBmdW5jdGlvbihldmVudCkgewogICAgY29uc29sZS5sb2coIkRPTUNvbnRlbnRMb2FkZWQiKTsKICAgIAogICAgLy8gd2luZG93LmxvY2F0aW9uLm9yaWdpbiB0b28KICAgIGxldCB0MSA9ICJ3aW5kb3cub3JpZ2luOiAiICsgd2luZG93Lm9yaWdpbjsKICAgIGNvbnNvbGUubG9nKHQxKTsKICAgIGRvY3VtZW50LmdldEVsZW1lbnRCeUlkKCJfMSIpLmlubmVySFRNTCA9IHQxOwogICAgCiAgICBsZXQgdDIgPSAiZG9jdW1lbnQuYmFzZVVSSTogIiArIGRvY3VtZW50LmJhc2VVUkk7CiAgICBjb25zb2xlLmxvZyh0Mik7CiAgICBkb2N1bWVudC5nZXRFbGVtZW50QnlJZCgiXzIiKS5pbm5lckhUTUwgPSB0MjsKCiAgICBsZXQgdDMgPSAibG9jYXRpb24uaHJlZjogIiArIGxvY2F0aW9uLmhyZWY7CiAgICBjb25zb2xlLmxvZyh0Myk7CiAgICBkb2N1bWVudC5nZXRFbGVtZW50QnlJZCgiXzMiKS5pbm5lckhUTUwgPSB0MzsKCiAgICBsZXQgdDQgPSAic2NyaXB0LmJhc2VVUkk6ICIgKyBkb2N1bWVudC5nZXRFbGVtZW50QnlJZCgic2NyaXB0IikuYmFzZVVSSTsKICAgIGNvbnNvbGUubG9nKHQ0KTsKICAgIGRvY3VtZW50LmdldEVsZW1lbnRCeUlkKCJfNCIpLmlubmVySFRNTCA9IHQ0OwogIH0pOwo8L3NjcmlwdD4KPC9oZWFkPgo8Ym9keT4KPGgxIGlkPSJfMSI+MTwvaDE+CjxoMSBpZD0iXzMiPjM8L2gxPgo8aDEgaWQ9Il8yIj4yPC9oMT4KPGgxIGlkPSJfNCI+NDwvaDE+CjwvYm9keT4KPC9odG1sPg=="
></iframe>
</body>
</html>
<!--
<html>
<head>
<base href="https://domain.org/path/" />

<script id="script" type="text/javascript">
  document.addEventListener("DOMContentLoaded", function(event) {
    console.log("DOMContentLoaded");
    
    // window.location.origin too
    let t1 = "window.origin: " + window.origin;
    console.log(t1);
    document.getElementById("_1").innerHTML = t1;
    
    let t2 = "document.baseURI: " + document.baseURI;
    console.log(t2);
    document.getElementById("_2").innerHTML = t2;

    let t3 = "location.href: " + location.href;
    console.log(t3);
    document.getElementById("_3").innerHTML = t3;

    let t4 = "script.baseURI: " + document.getElementById("script").baseURI;
    console.log(t4);
    document.getElementById("_4").innerHTML = t4;
  });
</script>
</head>
<body>
<h1 id="_1">1</h1>
<h1 id="_3">3</h1>
<h1 id="_2">2</h1>
<h1 id="_4">4</h1>
</body>
</html>
-->

Result:

window.origin: null

location.href: data:text/html;base64,LONG_BASE64_STRING

document.baseURI: https://domain.org/path/

script.baseURI: https://domain.org/path/

If the <base href="https://domain.org/path/" /> element is removed, then baseURI for both document and script is in fact not null, it is the same as location.href (i.e. the data: URL) ... which cannot be used for resolving absolute URLs from relative paths anywhere in the document (such as when processing an embedded WebPub manifest).
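A minimal sketch of why a data: URL is unusable as a base, assuming the WHATWG `URL` constructor as exposed in browsers and Node.js. The helper name `tryResolve` and the `chapter1.html` path are hypothetical, chosen only for illustration:

```javascript
// Hypothetical helper: resolve `relative` against `base`, or return null
// when the URL parser rejects the combination.
function tryResolve(relative, base) {
  try {
    return new URL(relative, base).href;
  } catch (err) {
    return null; // the URL constructor throws a TypeError on failure
  }
}

// A hierarchical base (as provided by <base href="...">) resolves normally.
const withBase = tryResolve("chapter1.html", "https://domain.org/path/");

// A data: URL has an opaque path, so a relative path cannot be resolved.
const withDataUrl = tryResolve("chapter1.html", "data:text/html;base64,AAAA");

console.log(withBase);    // "https://domain.org/path/chapter1.html"
console.log(withDataUrl); // null
```

So baseURI being non-null is not sufficient on its own: the base also has to be a URL that relative references can be resolved against.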

Based on this simple experiment, I am starting to wonder whether, just like for opaque origin, the WebPub specification should simply remain silent about baseURI edge cases. Once again, I think that the rationale for explicitly null-testing origin/baseURI (e.g. fail => terminate) in the WP manifest acquisition algorithm should be that origin/baseURI is explicitly needed later in the algorithm. For origin, the processing steps rely on the fetch API response status (e.g. bad CORS -> error response). For baseURI, it depends on whether the WebPub specification describes in great detail how to resolve absolute URLs in the manifest, or if this is "automatically" inherited from the JSON-LD processing model (in which case this concern becomes a polyfill / user-agent implementation detail).

Thoughts?

@iherman
Member

iherman commented Dec 6, 2018

Taking this out from @danielweck's long comment for an easier reference:

Based on this simple experiment, I am starting to wonder whether, just like for opaque origin, the WebPub specification should simply remain silent about baseURI edge cases. Once again, I think that the rationale for explicitly null-testing origin/baseURI (e.g. fail => terminate) in the WP manifest acquisition algorithm should be that origin/baseURI is explicitly needed later in the algorithm. For origin, the processing steps rely on the fetch API response status (e.g. bad CORS -> error response). For baseURI, it depends on whether the WebPub specification describes in great detail how to resolve absolute URLs in the manifest, or if this is "automatically" inherited from the JSON-LD processing model (in which case this concern becomes a polyfill / user-agent implementation detail).

I came to a similar conclusion, so I wholeheartedly agree. Although weird, the example with the data: URL makes sense; but it is also perfectly possible to create a manifest using absolute URLs only and, consequently, the interpretation of the manifest could be oblivious to the null baseURI value.
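To illustrate the absolute-URLs-only case, here is a rough sketch that checks whether a manifest needs a base at all. The manifest shape is a simplifying assumption (only a `readingOrder` array of plain URL strings), and `isAbsoluteUrl` and `needsBase` are hypothetical helper names:

```javascript
// Hypothetical helper: true when `value` parses as a URL without any base.
function isAbsoluteUrl(value) {
  try {
    new URL(value);
    return true;
  } catch (err) {
    return false;
  }
}

// Assumed, simplified manifest shape: `readingOrder` holds plain URL strings.
function needsBase(manifest) {
  return manifest.readingOrder.some((url) => !isAbsoluteUrl(url));
}

const absoluteOnly = {
  readingOrder: ["https://domain.org/path/chapter1.html"],
};
const mixed = {
  readingOrder: ["https://domain.org/path/chapter1.html", "chapter2.html"],
};

console.log(needsBase(absoluteOnly)); // false
console.log(needsBase(mixed));        // true
```

Only the second manifest would ever run into the missing or unusable base.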

I think for both this issue and w3c/wpub#321 we should try to find a blanket formulation in the processing which says that if a processing step runs into an error (or an OWP-related error?), then the processing stops and there is no manifest. (We could add a note giving examples of such situations, and we can refer to the origin or the baseURI null problem, but that should only be an informal note.) I am not sure how exactly to formulate that, but maybe @mattgarrish can come up with the best terminology...

@iherman
Member

iherman commented Dec 6, 2018

N.B. I have raised an explicit issue with the JSON-LD WG (w3c/json-ld-syntax#103), a.k.a. passing the buck :-)

@danielweck
Member Author

Thanks Ivan!

Let me also clarify this statement:

For baseURI, it depends on whether the WebPub specification describes in great detail how to resolve absolute URLs in the manifest, or if this is "automatically" inherited from the JSON-LD processing model (in which case this concern becomes a polyfill / user-agent implementation detail).

If the former (i.e. the WP specification describes "parsing" rules, probably as an extension to the JSON-LD processing model), then the manifest algorithm must be clear about what happens when an absolute URL cannot be resolved:

  1. complete failure (i.e. abort loading the manifest entirely)
    or:
  2. skip the unresolved URL (i.e. ignore the resource), and continue loading the rest of the data.
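The two behaviors can be sketched side by side. This is only an illustration under assumptions: `resolveResources` and its `policy` argument ("abort" or "skip") are hypothetical names, and resolution relies on the WHATWG `URL` constructor rather than on any algorithm defined by the specification:

```javascript
// Hypothetical sketch: resolve every URL against `base` under one of two policies.
//   "abort": the first unresolved URL fails the whole manifest load.
//   "skip":  unresolved URLs are ignored and the rest are kept.
function resolveResources(urls, base, policy) {
  const resolved = [];
  for (const url of urls) {
    try {
      resolved.push(new URL(url, base).href);
    } catch (err) {
      if (policy === "abort") {
        throw new Error(`Cannot resolve "${url}"; manifest loading aborted`);
      }
      // policy "skip": drop this resource and continue with the next one
    }
  }
  return resolved;
}

const dataBase = "data:text/html;base64,AAAA"; // unusable for relative paths
const urls = ["https://domain.org/cover.jpg", "chapter1.html"];

console.log(resolveResources(urls, dataBase, "skip"));
// [ "https://domain.org/cover.jpg" ]

try {
  resolveResources(urls, dataBase, "abort");
} catch (err) {
  console.log(err.message); // Cannot resolve "chapter1.html"; manifest loading aborted
}
```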

@mattgarrish transferred this issue from w3c/wpub on Aug 7, 2019

@iherman
Member

iherman commented Aug 9, 2019

All this in a new setting, where we are "only" talking about the strict vocabulary and not the processing models anymore...

Looking at the canonicalization algorithm, the only place where the base is used is step 11, i.e., when relative URLs are turned into absolute ones. I see two simple options:

  1. remove that step altogether. I.e., how to handle relative URLs should be left fully under the control of the processor using the manifest, and this should be specified in the corresponding extension. For example, one could say that if the manifest is used in a packaged audiobook, then the relative URLs are relative to the top level of the 'file system' within LPF.

  2. alternatively, if the base is null, then all relative URLs are left as they are.
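Option (2) could look roughly like the following. This is a sketch under assumptions, not the canonicalization algorithm itself: `absolutize` is a hypothetical name, the input is a flat list of URL strings, and resolution uses the WHATWG `URL` constructor:

```javascript
// Hypothetical sketch of option (2): resolve relative URLs only when a base
// is available; with a null base, every URL is passed through untouched.
function absolutize(urls, base) {
  if (base === null) {
    return urls.slice(); // leave interpretation to the consuming processor
  }
  return urls.map((url) => new URL(url, base).href);
}

console.log(absolutize(["chapter1.html"], null));
// [ "chapter1.html" ]

console.log(absolutize(["chapter1.html"], "https://domain.org/path/"));
// [ "https://domain.org/path/chapter1.html" ]
```

The explicit null check matters here because passing `null` directly to the `URL` constructor would be treated as the string "null", which is not a valid base.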

In fact, the consequence of (2) is still (1), in the sense that the processor specification should still define what a relative URI means within the publication. How is that formally defined in EPUB?


I am mildly in favor of (2), i.e., allowing an explicit base setting but falling back on the processor behavior if it is not used. Note that if we decide on (1), that makes #11 moot as well.

@BigBlueHat
Member

@iherman looks like your "canonicalization algorithm" link is going to the wrong spec.

I'd suggest not doing anything that forks from the JSON-LD processing semantics for @base, and being sure to build any "base" calculations on the same foundation, RFC 3986.

@iherman
Member

iherman commented Aug 9, 2019

I am sorry, the right link is https://w3c.github.io/pub-manifest/#canonical-manifest

@iherman
Member

iherman commented Aug 9, 2019

I certainly wouldn't want to fork. Both (1) and (2) amount to being silent about the issue in the canonicalization...

@iherman
Member

iherman commented Sep 10, 2019

This issue was discussed in a meeting.

  • No actions or resolutions
Transcript: 5. Issue #12 Manifest processing model, what if null base URL?
Garth Conboy: Is Daniel on the call to talk about (?)
… issue 12 Manifest processing model, what if null base URL?
Garth Conboy: See Issue #12
Wendy Reid: I need to read this over before I have any opinions… I think we can save this one for discussion. Maybe Ivan has more info?
Ivan Herman: Related to what I said before - at the moment we have the publication manifest, where the base comes from is up to the various profiles…
… it was all about what happens if web content has an iframe, what is the base URL?
… we haven’t solved this issue, but it’s not relevant any more for the manifest…
Garth Conboy: Was that a ‘leave to TPAC’ or ‘close now’?
Ivan Herman: Leave to TPAC…
Garth Conboy: We’ll have Laurent with us at TPAC, so that makes sense.

@wareid closed this as completed on Sep 16, 2019

@iherman
Member

iherman commented Sep 25, 2019

This issue was discussed in a meeting.

  • RESOLVED: Close Issue #12, the canonicalization algorithm has been changed, origin is no longer a concern for Publication Manifest, but should be considered for specifications concerning discovery
Transcript:
Wendy Reid: #12
Wendy Reid: this is my favorite issue!
… what if there’s a null base URL?
… in light of recent changes to the specification, we have gotten rid of the canonicalization model algorithm
… so maybe this is a non-issue
Benjamin Young: we don’t know where these json files are used
… we don’t have an origin now
… if LPF would be to go to REC, we might have to figure out how the base url is calculated
… but until this JSON file is related to some HTML document that can express a base URL, we don’t need to say anything
… it’s blank/null by default
… there are other concerns, but this issue is not an issue
Proposed resolution: Close Issue #12, the canonicalization algorithm has been removed, origin is no longer a concern for Publication Manifest (Wendy Reid)
Benjamin Young: before we vote
… the canonicalization thing has not been removed but renamed
… maybe leave that bit out
… just say it’s a json data document thingy. might not be at a URL
Ralph Swick: do you want to capture bigbluehat’s thought that this will be a concern in the future when the manifest is included in some future transfer protocol(s)?
Proposed resolution: Close Issue #12, the canonicalization algorithm has been changed, origin is no longer a concern for Publication Manifest, but should be considered for specifications concerning discovery (Wendy Reid)
Benjamin Young: +1
Wendy Reid: +1
Laurent Le Meur: +1
Gregorio Pellegrino: +1
Juan Corona: +1
Dave Cramer: +1 with an error of 1
Brady Duga: +1
Toshiaki Koike: +1
Charles LaPierre: +1
Resolution #3: Close Issue #12, the canonicalization algorithm has been changed, origin is no longer a concern for Publication Manifest, but should be considered for specifications concerning discovery
