# Redirects aren't followed if `Location` header contains non-ASCII symbols #315
Thanks for filing such a detailed issue! I'll look deeper into this when I can, but off the top of my head I suspect we're failing to parse the `Location` header.
This is actually a really interesting and complex issue! I'm not 100% confident yet how it should be addressed, as there's quite a lot to consider here, so apologies for the wall of text.

### Is the response valid?

The URL in question responds with a redirect whose `Location` header contains the raw, non-ASCII character "é".
Generally speaking, we are handling this correctly, since we defer to a third-party URI parser to interpret the header value.
So at least as far as parsing the response headers goes, we have parsed the `Location` value correctly as opaque bytes; the question is what to do with those bytes.
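For illustration, here is a minimal sketch of the strict behavior. The thread never names the parser, so the `http` crate's `Uri` type is assumed here as a stand-in for a strict, spec-compliant URI parser:

```rust
// Illustrative only: `http::Uri` is an assumption, standing in for
// whichever strict URI parser the project actually defers to.
use http::Uri;

fn main() {
    // A percent-encoded "é" parses fine...
    assert!("https://example.com/caf%C3%A9".parse::<Uri>().is_ok());

    // ...but a raw "é" is rejected as an invalid URI character.
    assert!("https://example.com/café".parse::<Uri>().is_err());
}
```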
Following the rabbit trail, we can find out what a valid URI actually looks like from RFC 3986.
The RFC then goes on to define which characters are and are not valid, and non-ASCII characters most certainly are not. Therefore returning the character as a raw octet violates the URI grammar. Furthermore, note that the parsing rules for URIs aren't much different historically in older, obsolete RFCs either, such as RFC 2396, RFC 1808, or RFC 1738. This is not only a current protocol violation, but a historical one as well.

Now using the character "é" in a URI is certainly allowed, but it must be percent-encoded as outlined by the spec. Instead of raw bytes, it should be included in the path as the literal string `%C3%A9`.
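As a quick illustration of the required encoding, a sketch using the `percent-encoding` crate (chosen purely for illustration; not necessarily what this project uses):

```rust
// Sketch: percent-encoding "é" per RFC 3986.
use percent_encoding::{utf8_percent_encode, NON_ALPHANUMERIC};

fn main() {
    // "é" is U+00E9, i.e. the two UTF-8 bytes 0xC3 0xA9.
    assert_eq!(utf8_percent_encode("é", NON_ALPHANUMERIC).to_string(), "%C3%A9");
}
```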
### What can we do about it?

Having established that the response from the URL in question is against the spec, insofar as the `Location` header contains characters that are not valid in a URI, it is worth looking at what web browsers do, which does call out a relevant point: when visiting https://pt.wowhead.com/npc=171184 in a web browser (Firefox in my case), it recovers and makes a best-effort attempt to handle the redirect anyway, and does so correctly.

There are, however, some practical concerns. Neither of the third-party URI parsers we use has any such leniency. The approach I am leaning towards would be to have a fallback mechanism: if parsing the `Location` header strictly fails, make a best-effort attempt to interpret it anyway.
Upon further inspection, it seems that actually the bytes in question are valid UTF-8 (for "é", the two bytes `0xC3 0xA9`) rather than ISO-8859-1. But in another sense this is worse, since it means that Wowhead actually is breaking the HTTP message spec, because they are using UTF-8 in an HTTP header field, which is definitely not allowed, rather than US-ASCII, which is preferred, or ISO-8859-1, which is allowed but discouraged. This can actually be seen played out in the Firefox developer tools, which assume that the header bytes are UTF-8 and display the "é" accordingly.

There are some interesting discussions on this topic in threads such as aspnet/KestrelHttpServer#1144 and nodejs/node#17390. Ultimately one of the problems we are going to have is that the two encodings cannot be reliably told apart: every byte sequence is valid ISO-8859-1, so we know that the header bytes could be decoded either way, with different results.

Since it seems like it is probably unlikely to use ISO-8859-1 for URLs (which strictly represent UTF-8 data), it would probably work if we try valid US-ASCII as defined in the RFC first; if that fails, check whether the bytes would be valid as UTF-8, and if so decode that. If not, ignore the header with a warning as we currently do.
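A minimal sketch of that decision order (illustrative only, not the project's actual code; the function name is hypothetical):

```rust
// Sketch of the proposed fallback: prefer strict US-ASCII, fall back
// to UTF-8, and give up (with a warning) on anything else.
fn decode_location(raw: &[u8]) -> Option<&str> {
    if raw.is_ascii() {
        // Valid US-ASCII per the RFC; ASCII is also valid UTF-8.
        std::str::from_utf8(raw).ok()
    } else if let Ok(s) = std::str::from_utf8(raw) {
        // Out-of-spec but unambiguous UTF-8; accept leniently.
        Some(s)
    } else {
        // Neither ASCII nor UTF-8: ignore the header with a warning.
        eprintln!("warning: ignoring malformed Location header");
        None
    }
}

fn main() {
    assert_eq!(decode_location(b"/ok"), Some("/ok"));
    assert_eq!(decode_location("/café".as_bytes()), Some("/café"));
    assert_eq!(decode_location(&[0xE9]), None); // lone ISO-8859-1 byte
}
```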
Try to parse the `Location` header as UTF-8 bytes as a fallback if the header value is not valid US-ASCII. This is technically against the URI spec, which requires all literal characters in the URI to be US-ASCII (see [RFC 3986, Section 4.1](https://tools.ietf.org/html/rfc3986#section-4.1)). It is also more or less against the HTTP spec, which historically allowed ISO-8859-1 text in header values but has since been restricted to US-ASCII plus opaque bytes; UTF-8 has never been encouraged or allowed as such. See [RFC 7230, Section 3.2.4](https://tools.ietf.org/html/rfc7230#section-3.2.4) for more info.

However, some bad or misconfigured web servers will do this anyway, and most web browsers recover from this by allowing and interpreting UTF-8 characters as themselves, even though they _should_ have been percent-encoded. The third-party URI parsers that we use have no such leniency, so we percent-encode such bytes (if legal UTF-8) ahead of time before handing them off to the URI parser. This is in the spirit of being generous with what we accept (within reason) while being strict in what we produce. Since real websites exhibit this out-of-spec behavior, it is worth handling.

Note that the underlying `tiny_http` library that our HTTP test mocking is based on does not allow UTF-8 header values right now, so we can't really test this efficiently. We already have a couple of tests out there doing some raw TCP munging for one reason or another, so in the future we need to rewrite `testserver` to allow such headers and then enable the test. For now I've manually verified that this works.

Fixes #315.
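A sketch of that pre-encoding step (hypothetical names; the real implementation may differ):

```rust
// Hypothetical sketch of the fix described above: take the raw
// Location bytes and, if they are valid UTF-8, percent-encode any
// non-ASCII bytes so that a strict URI parser will accept the result.
fn prepare_location(raw: &[u8]) -> Option<String> {
    // If the bytes aren't even valid UTF-8, give up on the header.
    let s = std::str::from_utf8(raw).ok()?;

    let mut out = String::with_capacity(s.len());
    for b in s.bytes() {
        if b.is_ascii() {
            out.push(b as char);
        } else {
            // Percent-encode each non-ASCII byte, e.g. 0xC3 -> "%C3".
            out.push_str(&format!("%{:02X}", b));
        }
    }
    Some(out)
}

fn main() {
    assert_eq!(prepare_location("/café".as_bytes()).as_deref(), Some("/caf%C3%A9"));
    assert_eq!(prepare_location(&[0xE9]), None); // not UTF-8: ignored
}
```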
@Zireael-N This is now fixed in the 1.3.0 release!
Thank you for thoroughly investigating and fixing this, 1.3.0 works great!
Here's a minimal reproducible example:
Cargo.toml:
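The original snippet is not preserved in the thread; below is a plausible reconstruction, assuming the client in question is the `isahc` crate (the package name and versions are assumptions):

```toml
# Assumed manifest: crate name and versions are guesses, since the
# original snippet is not preserved in the thread.
[package]
name = "redirect-repro"
version = "0.1.0"
edition = "2018"

[dependencies]
isahc = "1.2"
```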
main.rs:
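Again a reconstructed sketch rather than the reporter's exact code, requesting the URL discussed earlier in the thread with redirects enabled:

```rust
// Sketch of the reported repro: follow redirects for a URL whose
// Location header contains raw non-ASCII bytes. On affected versions,
// the redirect is not followed and the 301 response itself comes back.
use isahc::config::{Configurable, RedirectPolicy};
use isahc::{prelude::*, Request};

fn main() {
    let response = Request::get("https://pt.wowhead.com/npc=171184")
        .redirect_policy(RedirectPolicy::Follow)
        .body(())
        .unwrap()
        .send()
        .unwrap();

    // Expected: 200 OK after following the redirect.
    println!("{}", response.status());
    for (name, value) in response.headers() {
        println!("{}: {:?}", name, value);
    }
}
```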
I noticed that non-ASCII symbols are escaped when I'm printing response headers to stdout; might that be related? I tried making the same request with `curl -i`, and they aren't escaped there.
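That escaping is most likely just `Debug` formatting rather than data corruption. A sketch, assuming the headers are the `http` crate's types:

```rust
// Sketch: the `http` crate's HeaderValue keeps the raw bytes intact,
// but its Debug output escapes non-ASCII bytes, which would explain
// escaped output when printing headers with "{:?}". curl -i writes
// the raw bytes to the terminal, so nothing appears escaped there.
use http::header::HeaderValue;

fn main() {
    let value = HeaderValue::from_bytes("/café".as_bytes()).unwrap();

    // The stored bytes themselves are unmodified UTF-8...
    assert_eq!(value.as_bytes(), "/café".as_bytes());

    // ...but Debug escapes them, printing something like "/caf\xc3\xa9".
    println!("{:?}", value);
}
```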