What is an "opaque origin" and why do we care? #321

Closed
dauwhe opened this issue Aug 27, 2018 · 50 comments

Comments

@dauwhe
Contributor

dauwhe commented Aug 27, 2018

This comes up in obtaining a manifest and in the life cycle diagrams. But a trip to the HTML spec is no help, defining "opaque origin" as:

An internal value, with no serialisation, for which the only meaningful operation is testing for equality.

How would a document end up having an opaque origin?

@mattgarrish
Member

See a little further down from the definition for a list of objects that have opaque origins: sandboxed documents, data urls, network schemes, cross-origin images, etc.
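
To make the definition concrete, here is a minimal TypeScript sketch of the two kinds of origin. It is not the HTML Standard's own text: the type names, the numeric `id` used as an internal identity, and the helper functions are all assumptions made for illustration.

```typescript
// A tuple origin carries scheme/host/port; an opaque origin carries nothing usable.
type TupleOrigin = { kind: "tuple"; scheme: string; host: string; port: number | null };
type OpaqueOrigin = { kind: "opaque"; id: number }; // hypothetical internal identity
type WebOrigin = TupleOrigin | OpaqueOrigin;

let nextOpaqueId = 0;

// Each call yields a fresh opaque origin that is equal only to itself.
function newOpaqueOrigin(): OpaqueOrigin {
  nextOpaqueId += 1;
  return { kind: "opaque", id: nextOpaqueId };
}

// Opaque origins serialize to the string "null"; tuple origins to scheme://host[:port].
function serializeOrigin(o: WebOrigin): string {
  if (o.kind === "opaque") return "null";
  return `${o.scheme}://${o.host}${o.port === null ? "" : ":" + o.port}`;
}

// Equality is the only meaningful operation on an opaque origin.
function isSameOrigin(a: WebOrigin, b: WebOrigin): boolean {
  if (a.kind === "opaque" || b.kind === "opaque") {
    return a.kind === "opaque" && b.kind === "opaque" && a.id === b.id;
  }
  return a.scheme === b.scheme && a.host === b.host && a.port === b.port;
}

const sandboxedA = newOpaqueOrigin();
const sandboxedB = newOpaqueOrigin();
console.log(serializeOrigin(sandboxedA));          // "null"
console.log(isSameOrigin(sandboxedA, sandboxedA)); // true
console.log(isSameOrigin(sandboxedA, sandboxedB)); // false, although both serialize to "null"
```

The last line is the part that tends to surprise people: two documents can both report `null` and still not be same-origin with each other.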

@iherman
Member

iherman commented Aug 28, 2018

The origin of that reference goes back to the web app manifest. I agree that it may be too much detail to include in the lifecycle, but it does no harm either...

@JayPanoz

But a trip to the HTML spec is no help, defining "opaque origin"…

That’s because it’s handled in different specs (HTML, fetch, URL): cf. “When browsers must internally set origin to a value that’ll get serialized as null”. But then I’m afraid you’re finding yourself in the cross/same-origin rabbit hole pretty fast.

Note “this is majorly confusing” for a lot of people.

@RachelComerford

So, we use the term "opaque origin" in the spec but none of us can define it? LOL
Agreed on the majorly confusing piece...

What problem does the use of Opaque Origin solve in the spec and is the need something we can use to define it?

@iherman
Member

iherman commented Oct 6, 2018

I am fine removing this term altogether. We may always put it back if external reviews (e.g., security) make it necessary.

@JayPanoz

JayPanoz commented Oct 6, 2018

For documentation’s sake, here’s a list of references that helped me get the details of the Web Origin Concept:

But yeah there’s no way you can get around it as it’s part of the browsers’ (and WebViews’) security model and you usually discover it the hard way.

Hope that can help, independently of removing/keeping the term in the spec of course.

Note:

Other specifications can override the above definitions by themselves specifying the origin of a particular Document object, image, or media element.

So I guess it’s the default people must refer to anyway.

@JayPanoz

JayPanoz commented Oct 6, 2018

That said, it would be nice to have a reworked (i.e. author-understandable) definition in the HTML spec, esp. as it is used pretty often in the issues

@rdeltour
Member

rdeltour commented Oct 6, 2018

So, we use the term "opaque origin" in the spec but none of us can define it? LOL
Agreed on the majorly confusing piece...

As far as I can tell, the issue isn't that the term isn't defined (it is, in the HTML standard), it's more that the Web's security model (and associated specs) –for which the 'origin' concept is fundamental– is rather complex and few people have a good understanding of it (I for one consider its details are way above my head 😄).

What problem does the use of Opaque Origin solve in the spec and is the need something we can use to define it?

This term comes from copy/pasting the algorithm from Web App Manifest (there's a little monkey-patching smell to this, btw). As far as I understand they need it to process the start_url field, and to check same-origin constraints for the scope field.

In our own algorithm, the origin is used as the value of the manifest URL when the manifest is embedded in the document.
If we don't abort on opaque origins, it means that the manifest URL may in some cases not be deserializable.

I'm not sure exactly what this entails, what are our needs in terms of same-origin checks etc. But rather than rewording, discarding or keeping this term as the result of an uninformed consensus, I think it would be very important to get an expert security review! (in other words, my suggestion is a bit similar to Ivan's but the other way around: get a security review first).

@JayPanoz

JayPanoz commented Oct 6, 2018

Never mind, I edited this post as I don’t want to introduce even more opacity – you can check the history though, if you do want to make it even more opaque.

@llemeurfr
Contributor

To come back to the context in which "opaque origin" is used, i.e. How to obtain a manifest: as

  • the sentence "If origin is an [html] opaque origin, terminate this algorithm" is a copy-paste from the Web App Manifest spec,
  • and we don't have the same constraint re. scope and same-origin,
  • and we'd like the algorithm to be usable when we'll define EPUB4, a case where the origin of the entry page is a mystery for me.

=> therefore I propose that we remove this clause "2." from the algorithm, and don't try to solve html issues that are not in our scope.

To put it differently:

What problem does the use of Opaque Origin solve in the spec ?

none, therefore delete.

@JayPanoz

JayPanoz commented Oct 6, 2018

and we don't have the same constraint re. scope and same-origin

Hmmmm, I’d respectfully beg to differ there: the same-origin policy will apply whatever you spec. Nowadays, browsers may not even allow devs to disable it with a flag – that authors must then use CORS to handle some use cases is another issue.

But even data: (base64) and file: have an origin.

EPUB4, a case where the origin of the entry page is a mystery for me.

If you’re running a local server, it’s http://localhost:3000 for instance.

If you’re relying on the file:// scheme, it will probably be null – probably because the thing is underspecified and there is an interoperability issue. In that case, severe restrictions are applied.

So it depends how the Reading App handles it.
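
A quick way to see the difference is the `origin` getter of the standard `URL` class. The sketch below is only an illustration: the sample URLs are made up, and the results noted in the comments are what Node.js reports. A browser or WebView decides the origin of an actual document on its own, so treat this as an approximation of the URL-level rule only.

```typescript
// Sample URLs are hypothetical; only the scheme/host/port matter here.
const sampleUrls: string[] = [
  "http://localhost:3000/book/index.html", // served by a local server
  "file:///Users/me/book/index.html",      // opened straight from disk
  "data:text/html,hello",                  // document generated from a data: URL
];

const serializedOrigins: string[] = sampleUrls.map((href) => new URL(href).origin);

sampleUrls.forEach((href, i) => console.log(href, "->", serializedOrigins[i]));
// http://localhost:3000/book/index.html -> http://localhost:3000
// file:///Users/me/book/index.html      -> null
// data:text/html,hello                  -> null
```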


More generally, there’s a significant number of open issues related to origin, cf. https://github.com/whatwg/html/issues?q=is%3Aissue+label%3A%22topic%3A+origin%22+is%3Aopen

So it’s definitely something user agents are paying attention to.

[Edit] Sorry, wrong link for file:// scheme.

@JayPanoz

JayPanoz commented Oct 6, 2018

That said, I’d vastly prefer UAs (e.g. browsers/webviews) to weigh in since they are probably the only ones having a complete understanding of (opaque) origin.

@JayPanoz

JayPanoz commented Oct 6, 2018

Update: https://twitter.com/annevk/status/1048642347800649729?s=20

Note the following issue is probably the best one to do so: whatwg/html#2761

This is a personal opinion, really, but I guess you’re currently on the safe side i.e. “if the document can’t be trusted (because of its origin), don’t try to get the manifest.”

@JayPanoz

JayPanoz commented Oct 6, 2018

This probably impacts #205 BTW.

On a related note, Safari’s Reader Mode, Pocket, etc. let users cache articles so it could be interesting to check how they are dealing with different origins, as regards the bounds of a publication…

@iherman
Member

iherman commented Oct 7, 2018 via email

@rdeltour
Member

rdeltour commented Oct 7, 2018

Your last sentence seems to be a very good possible replacement text and it is probably better than to either to use what is currently in the spec or my earlier proposal to simply remove a reference to the opaque origin.

@iherman, I'm not sure I understand: are you suggesting to replace step (2) in the "obtaining the manifest" algorithm? by which sentence exactly? Jiminy's text “if the document can’t be trusted (because of its origin), don’t try to get the manifest.” is good to get the gist of the issue, but can't work as a drop-in replacement for step (2).

@JayPanoz

JayPanoz commented Oct 7, 2018

Yeah I can confirm that’s an oversimplification.

For instance, Anne van Kesteren used the terms “isolated” and “restricted” yesterday.

My biggest worry is that the Origin Concept is fundamental – it is already creating issues and/or complications in EPUB for instance, including security issues –, and if it’s not addressed in detail when applicable, that’s something you’ll have to do later anyway – if left to the appreciation of implementers, then there’s a huge risk of interoperability issues.

@iherman
Member

iherman commented Oct 7, 2018

@iherman, I'm not sure I understand: are you suggesting to replace step (2) in the "obtaining the manifest" algorithm? by which sentence exactly? Jiminy's text “if the document can’t be trusted (because of its origin), don’t try to get the manifest.” is good to get the gist of the issue, but can't work as a drop-in replacement for step (2).

@rdeltour I must admit I did not think it through in detail but, I would think, step (2) may be replaced by essentially the sentence of @JayPanoz:

If origin cannot be trusted (e.g., because of its opaqueness [html]), terminate the algorithm

I realize it remains fairly vague, and may need further review (who can do that?), but at least it makes the algorithm a little bit more understandable.

@iherman
Member

iherman commented Oct 7, 2018

@JayPanoz

My biggest worry is that the Origin Concept is fundamental – it is already creating issues and/or complications in EPUB for instance, including security issues

Maybe it would help (it would certainly help me...) if you could give some examples of issues or complications with EPUB as of today. Thanks.

@JayPanoz

JayPanoz commented Oct 7, 2018

I’ll stick to examples I am familiar with because I learnt it the hard way and I have huge scars to show.

Note I’ll restrict those examples to Blink + WebKit, as there are enough differences already.

Say app A is using the file:// scheme i.e. like opening a static HTML file in the browser:

  • both return null for window.origin (their origin is opaque);
  • Web Storage API’s localStorage:
    • Blink is totally OK sharing it across all file: URLs;
    • WebKit, on the other hand, blocks it for all file: URLs – it doesn’t throw an error BTW, it blocks it silently.
  • framing (iframe) documents embedded in the EPUB itself (no matter they are in folders, etc.):
    • Blink displays it but disallows DOM access;
    • WebKit treats it as about:blank and pops up the Finder/iOS browser.

Say app B is running a local server e.g. http://localhost:3000 to get around those issues:

  • both return http://localhost:3000 for window.origin (it isn’t opaque);
  • Web Storage API’s localStorage: both are now sharing it across all EPUB documents, because they are on the same origin – this is really not good, security-wise, because an EPUB document can check stored items from any other EPUB document, and even write additional ones;
  • framing is OK.
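
The localStorage problem in the app B scenario can be modelled in a few lines. This is only a sketch under stated assumptions: `storageByOrigin` and `storageFor` are hypothetical stand-ins for the browser's origin-partitioned Web Storage, and the book paths and the second port are invented.

```typescript
// Hypothetical in-memory stand-in for Web Storage, partitioned by serialized origin.
const storageByOrigin: { [originKey: string]: { [key: string]: string } } = {};

// Returns the storage bucket a document at `documentUrl` would see.
function storageFor(documentUrl: string): { [key: string]: string } {
  const originKey = new URL(documentUrl).origin;
  if (!storageByOrigin[originKey]) storageByOrigin[originKey] = {};
  return storageByOrigin[originKey];
}

// Two different publications served by the same local server (paths are made up).
const bookA = "http://localhost:3000/book-a/chapter1.xhtml";
const bookB = "http://localhost:3000/book-b/chapter1.xhtml";

storageFor(bookA)["bookmark"] = "page 42";
console.log(storageFor(bookB)["bookmark"]); // "page 42": book B reads book A's data

// A publication served from a different origin (here a different port) gets its own bucket.
const bookC = "http://localhost:3001/book-c/chapter1.xhtml";
console.log(storageFor(bookC)["bookmark"]); // undefined
```

The path plays no role in the partitioning, which is why serving every EPUB from one localhost origin exposes each book's stored items to all the others.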

App C therefore uses a custom scheme to solve the Web Storage API’s localStorage. Actually, @danielweck did it a while ago and wrote an explainer. Note this is a security consideration in the spec, but very few Reading Systems bother there.

For the record, I was the one to report the file:// scheme localStorage issue to iBooks (through their bug reporter) a few years ago, it was very surprising to have Apple’s security team contact me within 12 days as they considered it serious, and iBooks switched to a custom scheme since.

Let’s now turn to cloud readers.

Typically, the EPUB resource is loaded in an iframe so it must abide by the same-origin policy.

  • DOM access is disallowed when:
    • you have the cloud reader @ https://reader.mycompany.com;
    • you have the EPUB file @ https://epubs.mycompany.com.
  • It is OK when:
    • you have the cloud reader @ https://www.mycompany.com/reader/;
    • you have the EPUB file @ https://www.mycompany.com/epubs/.
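
The two setups above can be checked with a small helper. This is a sketch, not how a browser implements the same-origin policy: `sharesOrigin` is a hypothetical function that compares serialized origins, and the `book.epub` file name is made up.

```typescript
// Returns true when both URLs have the same (non-opaque) origin.
function sharesOrigin(a: string, b: string): boolean {
  const originA = new URL(a).origin;
  const originB = new URL(b).origin;
  // Two opaque origins both serialize to "null" but must not be treated as equal.
  if (originA === "null" || originB === "null") return false;
  return originA === originB;
}

// Reader and content on different subdomains: cross-origin, DOM access is disallowed.
console.log(sharesOrigin("https://reader.mycompany.com/", "https://epubs.mycompany.com/book.epub")); // false

// Reader and content under different paths of one host: same origin.
console.log(sharesOrigin("https://www.mycompany.com/reader/", "https://www.mycompany.com/epubs/book.epub")); // true
```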

So at this point the person in charge of the server is already hating you with a passion.

What happens when the cloud reader and the EPUB files are not hosted by the same company in the first place? Typically, the whole HTML document is loaded as srcdoc (and not as a simple src), because relaxing the same-origin policy would take outstanding cooperation there.

It’s worth mentioning I’ve seen some cloud readers fail @ fetching the stylesheet for instance – because of the Content Security Policy. So it’s definitely not trivial, even for backend engineers – and you’ll need those people for WPUB, or else it will be painful.


Now how does this translate to WPUB, I honestly can’t tell as I didn’t read the spec with the origin considerations in mind. What’s for sure though, is that if some (say for the lack of better word) features depend on the same-origin policy, no exception will be made.

In the best-case scenario, the manifest can’t even have an opaque origin and you can remove it entirely. In the worst-case scenario, a lot of issue resolutions/design choices/etc. are impacted. But it’d be nice to see whether there is a risk in the first place.

A quick question to illustrate that: a subdomain is not the same origin as the parent domain by default (i.e. it’s cross-origin, though not opaque in the HTML sense), and I could find about 4 instances of the term “subdomain” in 3 issues. But is there any guidance for authors anywhere? Because that one will surprise a lot of people for sure – note you can allow specific cross-origin requests with CORS.

@JayPanoz

JayPanoz commented Oct 7, 2018

On a superficial sight though – which is provided AS-IS, comment under MIT license –, how I understand it right now: if your website is compromised and the manifest results in an opaque origin – don’t ask me why, I’m not a black hat –, then the UA should abort.

edit: if you received that comment by mail, I edited it as I’d once again defer to UAs because I don’t want to oversimplify things but my gut feeling is “if it doesn’t fit into the security model, it will be a huge issue.”

@JayPanoz

JayPanoz commented Oct 8, 2018

Since this is a legacy of the app manifest, here’s some context:

At this point though, it becomes difficult, I guess for everyone, to keep track there – even I have issues at times.

So I’d personally be in favor of sticking to treat web security model/same-origin policies/etc. as another issue if needed – but to check whether it is, I’m afraid you’ll need someone with a very good understanding of such topics, who could also explain scope & al. in W3C manifest and how it relates to same-origin, etc. – I guess it allowed them to remove an entire algorithm checking for same-origin at some point.

Finally, the more I think about it, the more I dig Anne van Kesteren’s “isolated/restricted” explanation. Essentially, this is what UAs are doing: there are “sandboxing” objects whose origin can’t be trusted under the security policy, and restricting some APIs/features accordingly (e.g. DOM access in an <iframe> on a cross-origin, Web Storage, etc.).

@iherman
Member

iherman commented Oct 15, 2018

@JayPanoz trying to move on with the draft... I proposed in #321 (comment) to change the draft by replacing step (2) with

If origin cannot be trusted (e.g., because of its opaqueness [html]), terminate the algorithm

(i.e., essentially your text), and have a reference to the separate section on security. I am painfully aware that that section is currently empty, and something should be put there at some point, with additional explanation (I would not even dare to do it myself :-), but it may make the draft editorially cleaner. We could then close this issue and, as you suggest, open a separate issue on this whole problem area...

WDYT?

(See you soon in Lyon!)

@JayPanoz

Hmmm maybe if you want to remove “opaque” entirely, you could use

if the document is sandboxed (e.g. a cross-origin document in an iframe), terminate the algorithm

This would probably be the typical use-case for such a rule, with <iframe sandbox="…">. So if say you allow embedding of the Web Publication and/or a sample, the UA will ignore the manifest in this context – that should also prevent duplicates and some dark patterns.

It drops some other opaque origins (e.g. a document created using a data: URL, or the file: scheme, the latter being underspecified anyway) but if you come back to the security topic later, then they can be addressed properly – maybe you don’t even want to restrict the file: scheme for instance but that could be a tough sell considering how restricted it already is in some UAs.

@JayPanoz

JayPanoz commented Oct 15, 2018

Note the W3C manifest lifecycle rewriting is really interesting as well.

It is being redesigned since Microsoft created a wrapper turning Web Apps into Packaged Web Apps to make them available in their app store – Twitter for Windows 10 is a Progressive Web App for instance. That sounds a lot like Packaged Web Publications.

[Edit] See PWA Builder, esp. this doc (.appx)

@iherman
Member

iherman commented Oct 15, 2018

@JayPanoz see #343

@iherman
Member

iherman commented Oct 16, 2018

See the discussion in #343; propose closing.

Cc: @dauwhe? @TzviyaSiegman? @wareid? @GarthConboy?

@danielweck
Member

Possibly related:
#352 (comment)
(the part about CORS HTTP headers)

@BigBlueHat
Member

Also related #104 Browsing contexts and origins are intertwingled.

@JayPanoz

@danielweck it is my understanding that it indeed is, cf. audio and video elements, and the terminology for CORS same-origin and CORS cross-origin.

@mattgarrish
Member

I'm curious: can a document have an opaque origin and belong to a web publication?

An opaque origin document can't be identified as belonging to any web publication, as it can't be identified in the reading order or resource list without an address.

Are there scenarios in which an opaque origin document can link to a manifest on another domain, or will sandboxing and security rules prevent this? If not, then the manifest has to be embedded and that leads to a web publication with an opaque address, since the manifest can only be embedded in the entry page... the address of the web publication. A web publication without an address isn't a web publication.

The frosted side of me says to hell with restrictions, but the whole wheat side says maybe halting web publication initiation at the first sign of incompatibility is just a good thing to do.

@dauwhe
Contributor Author

dauwhe commented Dec 3, 2018

Consequence: since file: is a null origin, if there is eventually browser support for WPUB, then testing locally would require using localhost rather than just opening the entry page as a file in a browser.

@mattgarrish
Member

Isn't that a bit inevitable, though? file: URLs come with restrictions on scripting, API access, etc. How much will be crippled even if you can initiate it?

@JayPanoz

JayPanoz commented Dec 4, 2018

How much will be crippled even if you can initiate it?

Yeah objectively this wouldn’t be a bad thing, given it’s not even consistent across browsers (little interop based on lack of standardisation) so it would currently be a bad idea to test with file:// instead of localhost anyway.

Other than that, the typical example others have been using is a web app/feature e.g. payment to be found in an iframe → shouldn’t try to retrieve the manifest/shouldn’t enable the feature.

Then just to be sure, I’d like to restate, as a formal non-spec note, that subdomains are a different origin by default, so you must use CORS & al. – and this always, always raises issues for authors not familiar with the web security policies at first.

I couldn’t necessarily keep up as I’ve ironically had to research and document origin for EPUB and all the nasty issues it might create… but the manifest can’t be on a subdomain anyway, right?

@mattgarrish
Member

Other than that, the typical example others have been using is a web app/feature e.g. payment to be found in an iframe → shouldn’t try to retrieve the manifest/shouldn’t enable the feature.

Right, opaque origins are a signal of insecurity/untrustworthiness, so while I agree with @dauwhe that it would make life a lot simpler to be able to test from a file:// url, I'm not sure browsers will initiate a web publication given how they restrict other features.

And if it's unrealistic that they would initiate a web publication for a document with an opaque origin (and I'm not the one to ask if they definitively will, of course, but I see it as probable), then we really have no choice but to live with the restriction. We're not enabling anything by removing it.

@JayPanoz

JayPanoz commented Dec 4, 2018

@mattgarrish to clarify as I feel it might not have been clear enough, I was just adding this example to your previous comment, which I am agreeing with.

@mattgarrish
Member

I was just adding this example to your previous comment, which I am agreeing with.

Ya, I was just expanding on my earlier answer, as I wrote it a bit hastily last night. I'm not unsympathetic to the file: use case, but I think it's probably unrealistic we could enable it even if we wanted, and then the second hurdle is what else gets disabled. Your example just adds to the evidence of how opaque origins will be treated, which reinforces leaving the step that terminates the initiation of a web publication when a document has an opaque origin.

This issue came up on the call yesterday and there was an open question at the end whether there were any useful scenarios we were disallowing by having this restriction, so I wanted to spend a bit more time searching for an answer to that.

@dauwhe
Contributor Author

dauwhe commented Dec 4, 2018

Isn't that a bit inevitable, though? file: URLs come with restrictions on scripting, API access, etc. How much will be crippled even if you can initiate it?

To be clear, I'm completely fine with file not working.

@danielweck
Member

danielweck commented Dec 4, 2018

In 5.1) Obtaining a manifest ( https://www.w3.org/TR/wpub/#obtaining-manifest ): step 2) If origin is an [html] opaque origin, terminate this algorithm., but the algorithm subsequently does nothing with the origin of the document (null or not).
By contrast, the notion of "base URL" used to resolve relative paths (i.e. construct absolute URLs) in the "publication manifest" (JSON) is important, and if I am not mistaken this is inferred from the URL of the JSON manifest itself, or that of the enclosing document in the embed case (I am not sure whether the Web Publications specification will allow overriding the "base URL" like XML and HTML do, but that's a separate issue).
So, unless the algorithm makes use of the document's origin (for example to compare it with other origins, for some kind of pass/fail check), then I feel that we should drop step (2) from the sequence. Any origin-related problem (like insufficient CORS headers) will trigger a fail that is already covered by step 9) If response is a network error, terminate this algorithm..
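
The two exit points being compared here can be sketched as follows. This is not the algorithm from the WPUB draft: only the two "terminate" conditions are taken from the steps quoted above, while `obtainManifestSketch`, the `load` callback (a stand-in for the fetch step that returns null on a network error) and the example URLs are assumptions. Deriving the origin from the document URL is also a simplification, since a sandboxed document has an opaque origin whatever its URL.

```typescript
type ManifestResult =
  | { ok: true; manifestUrl: string; manifestText: string }
  | { ok: false; reason: "opaque-origin" | "network-error" };

function obtainManifestSketch(
  documentUrl: string,
  manifestHref: string,
  load: (absoluteUrl: string) => string | null // hypothetical fetch stand-in
): ManifestResult {
  // Step 2 as quoted: terminate when the document's origin is opaque.
  if (new URL(documentUrl).origin === "null") {
    return { ok: false, reason: "opaque-origin" };
  }
  // The manifest reference is resolved against the document URL.
  const manifestUrl = new URL(manifestHref, documentUrl).href;
  const body = load(manifestUrl);
  // Step 9 as quoted: terminate when the response is a network error.
  if (body === null) {
    return { ok: false, reason: "network-error" };
  }
  return { ok: true, manifestUrl: manifestUrl, manifestText: body };
}

// Fake loader: only one URL "exists"; anything else behaves like a network error.
const fakeLoad = (u: string): string | null =>
  u === "https://example.org/book/manifest.json" ? "{}" : null;

console.log(obtainManifestSketch("https://example.org/book/index.html", "manifest.json", fakeLoad));
console.log(obtainManifestSketch("file:///tmp/book/index.html", "manifest.json", fakeLoad));
console.log(obtainManifestSketch("https://example.org/book/index.html", "missing.json", fakeLoad));
```

In this sketch the origin is read once for the early exit and never used again, which is the observation behind the proposal to drop step (2).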

@danielweck
Member

Another thought: the opaque origin (or from a browser engine perspective: "internally unique origin that gets serialized to null") generated for a data: URL can be avoided by using the BlobURI construct instead, but in principle it is indeed technically possible for content authors to feed an iframe with a document resource that does not originate from a typical asset locator / external file (URL), and a data: URL is one possible technique that the Web Publications specification should not forbid (in my opinion), regardless of whether there is a valid / sensible use-case for this. It may be an edge-case, but it is a possibility.

The Web Publications specification cannot address every possible edge-case explicitly, but when it defines processing algorithms that depend on (for example) the web's "fetch" API, then a code path is expected to handle failures such as those emerging from (cross-)origin issues. We do not have to rehash this logic, it is already handled by other Open Web Platform tools that the WP spec. references.

I am a little more concerned about the potential problems related to "base URL" (or lack thereof), which would make it impossible for a user-agent to resolve absolute URLs from relative paths in the JSON manifest. I am not entirely sure how that ties into the "origin" model, thus why I am raising this point here.

@JayPanoz

JayPanoz commented Dec 4, 2018

I am a little more concerned about the potential problems related to "base URL" (or lack thereof), which would make it impossible for a user-agent to resolve absolute URLs from relative paths in the JSON manifest.

Sorry, brain dead right now but isn’t that why you have start_url and scope* in the web manifest? (MDN)

* which also happens to be a thing in service workers.

@iherman
Member

iherman commented Dec 4, 2018

I am not sure I see the problem. Relative URLs are handled by the JSON-LD spec.

@danielweck
Member

@JayPanoz is the Web Publication manifest an extension of Web App manifest?

@danielweck
Member

@iherman what if the Web Publication manifest has no URL to root itself onto? (isn't that a corollary issue to the opaque origin problem that arises from data: URLs used for HTML documents?) Sorry if I misunderstand, I just want to make sure we cover the ground fully.

@danielweck
Member

danielweck commented Dec 4, 2018

PS (sorry, didn't send in my previous message):
<base href="..."> can be used inside a data: URL payload, so I am of course referring to the case when no base is defined for the URL-encoded document, and when the manifest is embedded (no external URL).
https://developer.mozilla.org/en-US/docs/Web/HTML/Element/base

@JayPanoz

JayPanoz commented Dec 4, 2018

@danielweck not that I know of but some parts happen to be a legacy of web app manifest so some clarification would be welcome – as a disclaimer, I’m lazy, which should explain why I’m also a huge believer in prior art and when several specs are taking the same path, I must admit I’m wondering why it happens to be this way for the sake of being a pita – OCD.

@iherman
Member

iherman commented Dec 5, 2018

@danielweck

per json-ld:

  • If the json-ld (ie, the publ manifest) is a standalone file, the url of the file is the base for relative url-s
  • If it is embedded, then the baseURI of the script element is the base (this may be influenced, per html spec, by a base element in the html header)

The latter is fresh in the json-ld wg, and was also a decision of the TAG recently, after discussion with the json-ld WG.

If there are still uncertainties, it must be raised and solved by the json-ld WG, and not by this WG, imho. It is a good time, that wg is currently busy with details of exactly that, and must solve that issue due to the predominance of embedded json-ld in schema.org.
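
The base URL rules listed above, and the data: URL edge case raised earlier, can be tried out with the standard `URL` constructor. This is a sketch only: `resolveAgainst` is a hypothetical helper, the example.org URLs are invented, and it models URL resolution rather than JSON-LD processing itself.

```typescript
// Resolves `relative` against `base`; returns null when the base cannot serve as a base URL.
function resolveAgainst(relative: string, base: string): string | null {
  try {
    return new URL(relative, base).href;
  } catch (_err) {
    return null;
  }
}

// Standalone manifest: the URL of the manifest file is the base.
console.log(resolveAgainst("chapter1.html", "https://example.org/book/manifest.json"));
// "https://example.org/book/chapter1.html"

// Embedded manifest: the document's base URL is the base.
console.log(resolveAgainst("chapter1.html", "https://example.org/book/index.html"));
// "https://example.org/book/chapter1.html"

// Embedded in a data: URL document with no <base>: nothing to resolve against.
console.log(resolveAgainst("chapter1.html", "data:text/html,entry"));
// null
```

The last case is the one with no usable base: the relative reference in the manifest cannot be turned into an absolute URL at all.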

@danielweck
Member

@iherman indeed, which is why I am asking whether the manifest "obtention" / "processing" algorithm(s) in the Web Publication specification should explicitly terminate if the base URI cannot be computed (as my example, albeit a freak edge case, illustrates).
I believe that the special "null origin" check in the WP manifest "obtention" algorithm (current WP draft) is not useful, but I am wondering whether processing should interrupt in the absence of base URL for the manifest itself. This way, irrespective of the final outcome of JSON-LD's ongoing standardization effort, the WP processing model makes it clear that no-base-URL is a failure case.

Sorry if I am going off on a tangent, but origin and base URL are related concepts, so rather than having two parallel disconnected discussions, I raised the point here (I am opening a new issue nonetheless :)

@danielweck
Member

New issue for base URL:
https://github.com/w3c/wpub/issues/374
Regardless of whether we reach a consensus about removing step 2 (termination upon opaque origin), I expect people who have poured their thoughts into this issue to also have ideas / opinions about the base URL :)

@wareid

wareid commented Feb 5, 2019

As discussed on Feb 4 2019, closing this issue, it is being worked on in #374.


10 participants