Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More CORB-protected MIME types - safelist-based approach #721

Open
annevk opened this issue May 7, 2018 · 35 comments
Open

More CORB-protected MIME types - safelist-based approach #721

annevk opened this issue May 7, 2018 · 35 comments
Labels
security/privacy There are security or privacy implications topic: orb

Comments

@annevk
Copy link
Member

annevk commented May 7, 2018

@anforowicz @csreis did you do an exhaustive evaluation of MIME types? E.g., any reason text/webvtt isn't there? We also blocked text/csv for scripts (standardized and in Firefox at least, not sure if Chrome has fixed that yet), can we block that MIME type for CORB?

@annevk annevk added the security/privacy There are security or privacy implications label May 7, 2018
@anforowicz
Copy link
Contributor

No good reason for not covering other MIME types - covering only HTML/XML/JSON was an outcome of trying to protect the most resources likely to contain sensitive data while at the same time minimizing the risk of breaking existing websites. I very much agree that CORB should be extended to cover more resources, although I think we'd like to wait with this until HTML/XML/JSON implementation from Chromium reaches Chrome stable channel and doesn't uncover any worrisome issues.

I think we have the following options for extending CORB to other MIME types:

  • Explicitly adding MIME types one-by-one (csv, pdf, zip, etc.).
    • It seems that mislabeling of images/audio/video/scripts with these MIME types is unlikely,
      so maybe CORB confirmation sniffing is not necessary for them
    • This doesn't scale and won't help protect 1) many existing types or 2) a hypothetical new JSON-or-XML-like-format-from-the-future
  • Changing CORB so that instead of using a blacklist (HTML/XML/JSON currently) it uses a whitelist (everything other than javascript/css/images/audio/video/fonts/... - see the explainer section here)
    • This seems to be much cleaner/pure design, but it also seems riskier than other approaches - it requires enumerating all the legacy types that cross-origin-safe resources are served with (for example the application/octet-stream case that burned Firefox back in the day).
  • Using an opt-in header (like maybe the From-Origin header)

PS. I am not sure I understand the text/webvtt example. Isn't this an example of a resource that (like video/audio/images) can be embedded cross-origin?

@annevk
Copy link
Member Author

annevk commented May 8, 2018

@anforowicz text/webvtt cannot be fetched cross-origin without CORS, unless I'm missing something.

@jakearchibald
Copy link
Collaborator

I think the safelist approach is the way forward. That way, new APIs that wanted to use no-cors requests would need to make a change to the fetch spec. Through review, we could make sure those responses are distinguishable from content that's likely to contain user data (or encourage them to use CORS instead).

@annevk
Copy link
Member Author

annevk commented May 8, 2018

I think we should forbid no-cors for new APIs (we already held the line for module scripts). Safelist would be great, but it's a little unclear to me how feasible that is.

@anforowicz
Copy link
Contributor

@anforowicz text/webvtt cannot be fetched cross-origin without CORS, unless I'm missing something.

TIL :-) In that case I don't have concerns with including text/webvtt on the blacklist, next to HTML/XML/JSON/etc.

I think the safelist approach is the way forward.

I agree. From Chrome side we can probably use https://crbug.com/802836 for tracking. Before we proceed I think we need to:

  • Decide how to handle a responses with a missing Content-Type header
  • Flesh out the contents of the blacklist. I think so far we have:
    • JavaScript MIME type
      like application/javascript or text/jscript
    • text/css
    • image types like types
      matching image/*
    • audio or video types
      like audio/*, video/* or application/ogg
    • font/* or one of legacy font types
    • multipart/*
    • Other MIME types like application/octet-stream
    • I am also not sure what special treatment (if any) application/x-shockwave-flash should get

Also - if (for whatever reason) we decide to stick with the blacklist approach, then how can we decide what MIME types to cover in addition to HTML/XML/JSON? So far I think I've heard:

  • text/webvtt (and also text/vtt? how likely is it that sensitive/origin-bound data is present here?)
  • application/pdf (example of sensitive data: bank statement)
  • application/zip (example of sensitive data: downloading a directory from Google Drive / Dropbox / etc.)

@jakearchibald
Copy link
Collaborator

A good start on the safelist approach would be to list all the APIs that can consume opaque responses.

@annevk
Copy link
Member Author

annevk commented May 10, 2018

text/vtt is the correct MIME type. text/webvtt is something I made up. I don't think it's likely to contain private information, but if we stick to a blocklist we should block as many MIME types as we can. I think these are all the no-cors consumers:

  • Plugins
  • Images
  • Media (except for subtitles, those are CORS-bound)
  • CSS (this requires a correct MIME type cross-origin fortunately so only need to safelist text/css for this not to break)
  • Classic JavaScript

I think fonts are CORS-bound and therefore do not need to be considered, but perhaps not in Safari still.

However, all of the above are pretty notorious for being served with the wrong MIME type. Perhaps it's slightly better if require X-Content-Type-Options as well as allow some of the wrong MIME types (such as application/octet-stream), but then the question becomes how many sites specify X-Content-Type-Options. (Since otherwise sites still need to perform work in order to get protection.)

@jakearchibald
Copy link
Collaborator

So the safelist approach would be:

  1. If destination is style, and mime type is not text/css, return blocked.
  2. If the response has opted out of sniffing, and the mime-type is not one of the allowed no-cors types, return blocked.
  3. Use the destination of the response to filter the set of sniffing algorithms to try (so if the destination is 'image', only keep the image sniffing algorithms. This is a performance optimisation, nothing to do with security).
  4. Try each of the sniffing algorithms. If any match, return allowed.
  5. Return blocked.

In cases where the destination doesn't filter the sniffing algorithms (eg fetch(url, {mode: 'no-cors'}), it might be slow due to the amount of checking. We either just take the hit here, or find a way to keep an opaque response's body in another process until we can perform a specific sniff.

I think fonts are CORS-bound and therefore do not need to be considered, but perhaps not in Safari still.

Yeah they're CORS requests according to the spec. If Safari's still violating the spec there, they'll probably have to violate/extend the spec here too, to add font mime type checking and content sniffing.

@anforowicz
Copy link
Contributor

If destination is style, and mime type is not text/css, return blocked.

"destination" might not always be available (not sure if the "network service" in Chromium will be aware of it - I think it tries to shy away from any content::ResourceType knowledge) or trustworthy (the renderer can just lie). Therefore, I'd rather avoid using the "destination" for a decision here.

[...] keep the image sniffing algorithms [...]

Currently CORB only sniffs for HTML/XML/JSON to confirm that blocking is okay (sniffing for blocked resources). I think this is a fine approach going forward as well (we can discuss whether similar confirmation sniffing is needed for some other types like PDF or ZIP - I think the answer is "no - not needed" because mislabeling of an image or script as PDF should be hopefully rare).

I don't understand why we'd switch this approach to sniffing for allowed resources:

  • This might be problematic if "destination" is not available.
  • It is difficult to properly cover a hypothetical image/future-format in this kind of sniffing

fonts are CORS-bound

Good to know. This possibly makes things slightly easier.

However, all of the above are pretty notorious for being served with the wrong MIME type.

Yes :-(

The proposed restriction of safelist-based blocking to nosniff responses might help, but I do note that it didn't help in the application/octetstream case encountered earlier by Firefox. There are probably other similar cases like this :-/

@annevk
Copy link
Member Author

annevk commented May 10, 2018

I think it's fine to block future (image) formats. CORS can be used for those.

@jakearchibald
Copy link
Collaborator

@anforowicz

"destination" might not always be available (not sure if the "network service" in Chromium will be aware of it - I think it tries to shy away from any content::ResourceType knowledge) or trustworthy (the renderer can just lie). Therefore, I'd rather avoid using the "destination" for a decision here.

That step's more of an early-exit. It'll also be handled in the later steps in case of fetch(url, { mode: 'no-cors' }).

I don't understand why we'd switch this approach to sniffing for allowed resources:

Isn't it better to further limit the amount of no-cors data that can end up in the process? It just seems like no-cors is a source of so many security issues, so trying to restrict it as much as possible seems like a good thing.

@anforowicz
Copy link
Contributor

I thought that I'd mention that @arturjanc and @mikewest published https://www.arturjanc.com/cross-origin-infoleaks.pdf where they list "doesn't protect all resources" as one of major cons of the current CORB algorithm.

@anforowicz
Copy link
Contributor

I don't understand why we'd switch this approach to sniffing for allowed resources:

Isn't it better to further limit the amount of no-cors data that can end up in the process? It just seems like no-cors is a source of so many security issues, so trying to restrict it as much as possible seems like a good thing.

I agree that it is better to limit the about of no-cors data that can end up in a cross-origin process. I don't understand why above you are proposing to use "image sniffing algorithms" rather than continue using the current CORB approach of only doing confirmation sniffing. In other words, I don't understand what is wrong with the following sniffing scheme for CORB:

  • Content-Type: text/html => sniff to confirm that the response really contains html (and not a html/js polyglot or an image for example). block only if 100% confident that the response really contains html.
  • Content-Type: text/json, Content-Type: application/xml => sniff similarily to the above (i.e. only attempt to confirm the declared Content-Type)
  • New Content-Type (e.g. PDF/ZIP/etc.) - block outright without any CORB sniffing (assuming that mislabeling an image as PDF is unlikely / behaving as-if nosniff was present in response headers).

annevk pushed a commit that referenced this issue May 17, 2018
CORB is an additional filter for responses of cross-origin "no-cors" 
fetches. It aims to provide defense-in-depth protection for JSON, 
HTML, XML (though not image/svg+xml), and (sometimes) text/plain 
resources against cross-process CPU exploits. It also makes it harder 
to use incorrectly labeled resources as scripts, images, fonts, etc.

Discussion and further work is tracked by #681 and #721.

Tests are in web-platform-tests's fetch/corb directory.
@annevk
Copy link
Member Author

annevk commented May 17, 2018

@anforowicz if you want to switch to a safelist-based approach, you wouldn't want to confirm something is HTML (which is not allowed anyway), but rather you'd want to confirm something is an image (which is allowed).

I was thinking of splitting up this issue earlier. Create a new issue for how to go about switching to a safelist. And keep this issue for extending the current approach. That might help.

@anforowicz
Copy link
Contributor

@anforowicz if you want to switch to a safelist-based approach, you wouldn't want to confirm something is HTML (which is not allowed anyway), but rather you'd want to confirm something is an image (which is allowed).

I guess there might have been a misunderstanding of what each of us means by a "safelist-based approach" - possibly I didn't look carefully enough at @jakearchibald's earlier comment with a more detailed proposal of what is meant by "safelist-based approach".

I think there are 2 options of introducing a "safelist-based approach":

  • Option Typo #1: add a safelist-based decision to the current CORB algorithm:

    • HTML/XML/JSON and text/plain: current behavior of html/json/xml confirmation sniffing (unless nosniff or 206)
    • image/audio/javascript/other-safelisted-type: allow
    • any other type: always block without any sniffing
  • Option Added note about basic fetch being an internal algo #2: replace the current CORB algorithm with a safelist-based decision:

    • image/audio/javascript/other-safelisted-type: allow
    • any other type: sniff to check if the response is an image/audio/javascript/other-safelisted-type and only block if it is not

I think @annevk and @jakearchibald are proposing Option #2. I think that Option #1 might be easier to implement (with a well-defined, small set of formats to sniff for).

I was thinking of splitting up this issue earlier. Create a new issue for how to go about switching to a safelist. And keep this issue for extending the current approach. That might help.

I am not opposed, although I do note that we've accumulated quite a bit of safelist discussion here already...

@jakearchibald
Copy link
Collaborator

Yeah, I'm suggesting option 2.

@anforowicz
Copy link
Contributor

I agree that a "safelist"-based approach is correct for the long-term, but it may take some time to implement (considering the non-negligible risk of breaking existing websites). So, I've opened #860 to discuss if in the short-term it might make sense to add extra individual MIME types to the list of types protected by CORB.

@annevk - I wonder if you could maybe change the subject of the current issue (#721) to something like "More CORB-protected MIME types - 'safelist'-based approach"? I couldn't figure out how to change the subject myself (maybe I cannot do it as I am not the reporter?)... :-/

@annevk annevk changed the title More CORB-protected MIME types More CORB-protected MIME types - safelist-based approach Jan 21, 2019
@anforowicz
Copy link
Contributor

@csreis and me discussed this yesterday and I'd like to share some more thoughts about the "safelist"-based approach (having CORB block everything other than known safe types like images/audio/video/javascript/stylesheets). These are just more notes about options 1 and 2 from #721 (comment):

  • Option 1: Look at Content-Type header to detect if response is a safe type

    • This is easy to implement
    • This seems likely to break existing websites (e.g. see the application/octet-stream trouble here). To prevent breaking existing websites, the current html/xml/json confirmation sniffing would probably need to be retained.
    • This should cover future image/audio/video types without any implementation changes (assuming they are not xml-based...)
  • Option 2: Sniff to detect if response is a safe type

    • Sniffing of Javascript and CSS seems difficult to implement in practice (?)
    • Robust sniffing seems unlikely to break existing websites
    • It is unclear if/how this might extend to future image/audio/video types (would they need to pretend that they sniff as one of earlier types so that CORB allows them? would we force CORS for them (not sure if this is doable for img tag)).

@annevk
Copy link
Member Author

annevk commented Feb 28, 2019

See also #870. There's many Content-Type values that'll be happily parsed as script. For option 2:

  1. I think we should figure out to what extent cross-origin style sheets are a problem. It would be interesting to know how many Chrome fetches that lack a Content-Type header or have a Content-Type header value that cannot be parsed. (Otherwise a strict match for text/css is required.) https://bugzilla.mozilla.org/show_bug.cgi?id=1531405 tracks this idea in Firefox.
  2. For the subset of JavaScript fetches that lack or have an improper Content-Type header it might be feasible to "validate" it in a separate process by running a parser. This would slow them down and a valid MIME type could be used to get out of that.
  3. I think requiring CORS for new "types" is very reasonable and we already successfully did that for JavaScript modules.

@anforowicz
Copy link
Contributor

For option 2 (sniffing) @acolwell also points out that sniffing is challenging for range requests (which have to be supported for media content). FWIW, the current CORB implementation treats responses to range requests as if the X-Content-Type-Options: nosniff response header was present.

And I would also like to point out that sniffing should ideally be based on the first X bytes of the response (right not Chromium sniffing only looks at most at the first 1024 bytes to avoid increasing the latency).

@annevk
Copy link
Member Author

annevk commented Mar 1, 2019

What effect does X-Content-Type-Options: nosniff have for media requests? It's not defined to have any. If we can require Content-Type for range requests that'd be ideal though, I don't really see how we'd handle those otherwise indeed. (Agreed on a 1024 limit.)

@annevk
Copy link
Member Author

annevk commented Mar 5, 2019

I chatted with @jakearchibald and I'm no longer convinced range requests are much of an issue. It seems we first always determine the type of a media resource and then any range requests will go to a fixed URL, meaning it'll only potentially "leak" content of a resource identified to be a media resource. See also #144 (comment). The main thing remaining here is an exact list of file signatures we would want to safelist. And not allow extensions of that list (use CORS).

@lukewagner also considers a basic JavaScript-validating function to be doable (if you don't want the performance hit, use a valid Content-Type or CORS).

Which leaves CSS, getting telemetry on that would be great.

@jakearchibald
Copy link
Collaborator

any range requests will go to a fixed URL

The URL can change as long as all data is visible. Some research on this #145 (comment).

But yeah, response URLs cannot be mixed if data is opaque. And opaque data cannot be mixed with visible data. That's what casued https://jakearchibald.com/2018/i-discovered-a-browser-bug/.

@anforowicz
Copy link
Contributor

So, is your proposal to add some state to CORB, so that range requests are handled like this:

  • If CORB sees a 206 response then
    • If this is a same-origin response, then return allowed
    • If this is a cross-origin response, then check if 1) non-range response was received for the identical URL in the past and 2) whether the past response sniffed as one of safelist content types (e.g. as media / js / css).

Did I get that right?

@annevk
Copy link
Member Author

annevk commented Mar 5, 2019

Yeah, due to range responses it's not a clean filter, but a filter and some state about prior actions.

@jakearchibald
Copy link
Collaborator

jakearchibald commented Mar 5, 2019

In order to 'unlock' opaque responses that start at a range other than 0, Fetch could require the caller to provide the response for the first chunk so it can be validated. Kinda similar to https://wicg.github.io/background-fetch/#validate-partial-response-algorithm.

@annevk
Copy link
Member Author

annevk commented Mar 5, 2019

A safelisted MIME type in a Content-Type header should work as well. Those would generally bypass the filter. Perhaps we should sketch out the high-level algorithm so folks can poke it at more easily.

@anforowicz
Copy link
Contributor

One thing that might need some clarification is how the state's lifetime is managed. If all previous sniff results need to be stored, then there will be unbounded memory growth. OTOH, if we discard previous sniff results, then we risk that a range request issued after waking up from a long sleep will be incorrectly blocked by CORB.

@anforowicz
Copy link
Contributor

Actually, I guess that we can have separate CORB state for each execution context (frame, service worker, etc.) and discard CORB state when the frame goes away. So maybe the memory growth concerns are not a big isssue. And some other heuristics can also help in a non-malicious case (e.g. only storing state for range-requestable resources - for audio and video, but maybe not for javascript, stylesheets and images). I am not sure if there is something that can be done to prevent unbounded memory growth in a malicious case - maybe we can keep at most X entries in the cache and purge least recently used ones when overflowing the threshold.

@annevk
Copy link
Member Author

annevk commented Mar 6, 2019

Range requests are only a problem for audio/video at the moment I think, but there is some model where we could tie things to the element doing the requests that would minimize unbounded memory growth. I suspect tying it to the lifetime of the fetch group is more reasonable though. I sketched out the algorithm and necessary state below:


A user agent has a no-CORS-safelisted requesters set.

A request has an associated no-CORS media identifier (null or an opaque identifier). Null unless explicitly stated otherwise.

[The idea here is that the no-CORS media identifier is owned by the media element (audio/video only; I'm assuming we won't do range for images without at least requiring MIME types at this point). As part of the element being GC'd, it would send a message to get all the relevant entries from the user agent's no-CORS-safelisted requesters set removed. There might be better strategies available here and it's not clear to me to what extent we need to specify this, but it's probably good to have a model that does not leak memory forever so the set needs to be keyed to something. The fetch group might also be reasonable.]

A no-CORS-safelisted media MIME type is ... (audio, video).

A no-CORS-safelisted non-media MIME type is ... (image, text/css, JavaScript MIME types).

To determine whether to allow response response to a request request, run these steps:

  1. Let mimeType be the result of parsing Content-Type of response.
  2. If mimeType is a no-CORS-safelisted media MIME type, then:
    1. Append (request's no-CORS media identifier, request's current URL) to the user agent's no-CORS-safelisted requesters set.
    2. Return true.
  3. If mimeType is a no-CORS-safelisted non-media MIME type, then return true.
  4. If the user agent's no-CORS-safelisted requesters set contains (request's no-CORS media identifier, request's current URL), then return true.
  5. Wait for 1024 bytes of response or end-of-file, whichever comes first and let bytes be those bytes.
  6. If the image type pattern matching algorithm given bytes does not return undefined, then return true.
  7. If the audio or video type pattern matching algorithm given bytes does not return undefined, then:
    1. Append (request's no-CORS media identifier, request's current URL) to the user agent's no-CORS-safelisted requesters set.
    2. Return true.
  8. If the JavaScript pattern matching algorithm does not return undefined, then return true.
  9. Return false.

Note: user agents are encouraged to optimize the above algorithm, by taking into account the context where the request originated from.

@acolwell
Copy link

acolwell commented Mar 6, 2019

I'm not sure I completely understand all the issues here so I apologize if I am missing something. I just wanted to point out that range requests are not only limited to requests initiated by a audio/video tag. Web applications that use Media Source Extensions (MSE) typically use XHR and fetch to get media data and will use range requests to fetch media segments from the middle of a file. I think in most sane scenarios @jakearchibald suggestion of referencing a request that accesses the beginning of the file could work because these applications usually access the beginning of the file to get the "initialization segment" for initializing codecs.

There are some "advanced" scenarios where this initialization data might be passed down embedded in the HTML/JS to "save a round trip" and in that case it is likely that an XHR or fetch with a Content-Range would occur on a URL w/o accessing the beginning of the resource. I understand we probably can't accommodate all scenarios but I just wanted to point this out as a thing applications might do.

Another thing I noticed while trying to follow this discussion is that the "audio or video type pattern matching algorithm" might not properly detect Basic Media Segments(http://wiki.webmproject.org/adaptive-streaming/webm-dash-specification) as media because they don't contain the EBML header that the sniffing logic is looking for. These types of segment are used to avoid the overhead of resending the initialization segment in each segment request. IIUC though this sniffing only occurs if the Content-Type isn't set correctly so I guess it is ok if this fails. It might encourage sites to fix their Content-Type.

@jakearchibald
Copy link
Collaborator

@annevk

but there is some model where we could tie things to the element doing the requests that would minimize unbounded memory growth

I think this is the right approach. In the case of opaque data, all responses need to come from the same place. The only thing that can assert this is the API that processes the data (eg the <audio>). It's the only thing smart enough to know that if the src changes, it's requesting a 'new' resource.

Since <audio> would need to keep the original response url (and maybe headers) around anyway, it may as well be used for CORB too.

@annevk
Copy link
Member Author

annevk commented Mar 7, 2019

@acolwell in the case of MSE via fetch() or XMLHttpRequest there's no opaque responses aiui, so those responses would not end up being filtered at all by this algorithm. Those responses need to be same-origin or use CORS.

@annevk annevk mentioned this issue Nov 11, 2019
annevk added a commit to annevk/orb that referenced this issue Feb 10, 2020
annevk added a commit to annevk/orb that referenced this issue Feb 11, 2020
whatwg/fetch#721 (comment) with some lessons learned.

Co-Authored-By: Jake Archibald <jaffathecake@gmail.com>
Co-Authored-By: Nathan Froyd <froydnj@gmail.com>
@annevk
Copy link
Member Author

annevk commented Feb 11, 2020

I created https://github.com/annevk/orb to flush out the idea above more concretely. The challenges to me seem:

  1. Range requests. I'm not confident the current model works with redirects and such. Or are they not allowed? Maintaining a requesters set is also not great.

  2. One of the last steps:

    If response's body parses as JavaScript and does not parse as JSON, then return true.

    Perhaps waiting for and parsing the complete response is fine as it only affects a small number of responses (and maybe there are ways to further bring that number down), but I'm not sure if that's acceptable to all.

Thoughts on those more than appreciated!

@annevk
Copy link
Member Author

annevk commented May 17, 2022

There is now a PR for this in #1442.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
security/privacy There are security or privacy implications topic: orb
Development

No branches or pull requests

4 participants