
Redirects aren't followed if Location header contains non-ASCII symbols #315

Closed
Zireael-N opened this issue Mar 28, 2021 · 5 comments · Fixed by #317

Comments

@Zireael-N

Zireael-N commented Mar 28, 2021

Here's a minimal reproducible example:

Cargo.toml:

[dependencies]
isahc = { version = "1.2", default-features = false, features = ["http2", "static-curl"] }

main.rs:

use isahc::{
    config::{Configurable, RedirectPolicy},
    HttpClient,
    Response,
    Request,
};

fn dump_info<T>(name: &str, res: &Response<T>) {
    println!("{}\n", name);
    println!("Status code: {:?}", res.status());
    println!("Headers: {:#?}", res.headers());
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = HttpClient::new()?;

    let req1 = Request::get("https://www.wowhead.com/npc=171184")
        .redirect_policy(RedirectPolicy::None)
        .body(())?;
    let res1 = client.send(req1)?;

    dump_info("Request 1", &res1);

    // This one is fine
    // Location: /npc=171184/mythresh-skys-talons
    let req2 = Request::get("https://www.wowhead.com/npc=171184")
        .redirect_policy(RedirectPolicy::Follow)
        .body(())?;
    let res2 = client.send(req2)?;

    println!("\n============================\n");
    dump_info("Request 2", &res2);

    // This one fails
    // Location: /npc=171184/mythresh-as-garras-do-céu
    let req3 = Request::get("https://pt.wowhead.com/npc=171184")
        .redirect_policy(RedirectPolicy::Follow)
        .body(())?;
    let res3 = client.send(req3)?;

    println!("\n============================\n");
    dump_info("Request 3", &res3);

    Ok(())
}

I noticed that non-ASCII symbols are escaped when I print the response headers to stdout -- might that be related? When I make the same request with curl -i, they aren't escaped.

@sagebind
Owner

Thanks for filing such a detailed issue! I'll look deeper into this when I can, but off the top of my head I suspect we're failing to parse the Location response header value with the url crate (which we use to resolve absolute vs. relative redirects).

@sagebind
Owner

sagebind commented Mar 31, 2021

This is actually a really interesting and complex issue! I'm not 100% confident yet how it should be addressed as there's quite a lot to consider here, so apologies for the wall of text.

Is the response valid?

The URL https://pt.wowhead.com/npc=171184 returns a Location header with the value /npc=171184/mythresh-as-garras-do-céu. Specifically, though, it is encoded using ISO-8859-1; the "é" is encoded as 0xC3. This is valid ISO-8859-1, and while RFC 7230 discourages the use of ISO-8859-1 for header values, it is technically legal for header fields standardized before that recommendation. Quote:

Historically, HTTP has allowed field content with text in the ISO-8859-1 charset [ISO-8859-1], supporting other charsets only through use of [RFC2047] encoding. In practice, most HTTP header field values use only a subset of the US-ASCII charset [USASCII]. Newly defined header fields SHOULD limit their field values to US-ASCII octets. A recipient SHOULD treat other octets in field content (obs-text) as opaque data.

-- RFC 7230, Section 3.2.4

Generally speaking, we are handling this correctly, since we defer to HeaderValue, which also notes in the docs:

In practice, HTTP header field values are usually valid ASCII. However, the HTTP spec allows for a header value to contain opaque bytes as well. In this case, the header field value is not able to be represented as a string.

So at least as far as parsing the response headers goes, we have parsed the Location header correctly. Redirects, however, are not so simple: we must not merely parse the Location value, we must also interpret it as a request URI in order to follow it. A sister document specifies that the value of the Location response header must be a URI:

The "Location" header field is used in some responses to refer to a specific resource in relation to the response. The type of relationship is defined by the combination of request method and status code semantics.

Location = URI-reference

The field value consists of a single URI-reference. When it has the form of a relative reference ([RFC3986], Section 4.2), the final value is computed by resolving it against the effective request URI ([RFC3986], Section 5).

-- RFC 7231, Section 7.1.2
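The resolution step that quote refers to can be sketched for the simplest case -- a path-absolute reference such as the Location values in this issue. This is a toy illustration, not Isahc's implementation; real resolution (RFC 3986, Section 5) also handles ".." segments, scheme-relative references, queries, and fragments:

```rust
/// Toy resolver for the simplest case in RFC 3986, Section 5: a
/// path-absolute reference like "/npc=..." replaces the base URL's path.
fn resolve_path_absolute(base: &str, reference: &str) -> Option<String> {
    if !reference.starts_with('/') || reference.starts_with("//") {
        return None; // only handle path-absolute references here
    }
    // Keep everything up to the end of the authority (scheme://host[:port]).
    let scheme_end = base.find("://")? + 3;
    let path_start = base[scheme_end..]
        .find('/')
        .map(|i| scheme_end + i)
        .unwrap_or(base.len());
    Some(format!("{}{}", &base[..path_start], reference))
}

fn main() {
    let base = "https://www.wowhead.com/npc=171184";
    let loc = "/npc=171184/mythresh-skys-talons";
    // Resolves to "https://www.wowhead.com/npc=171184/mythresh-skys-talons"
    println!("{:?}", resolve_path_absolute(base, loc));
}
```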

Following the rabbit trail to find out what a URI-reference is, we see that RFC 3986 goes into great detail on the URI syntax, but in general states the following:

The ABNF notation defines its terminal values to be non-negative integers (codepoints) based on the US-ASCII coded character set [ASCII]. [...]

A URI is composed from a limited set of characters consisting of digits, letters, and a few graphic symbols. A reserved subset of those characters may be used to delimit syntax components within a URI while the remaining characters, including both the unreserved set and those reserved characters not acting as delimiters, define each component's identifying data.

-- RFC 3986, Section 4.1

The RFC then goes on to define which characters are and are not valid, and non-ASCII characters most certainly are not. Therefore, returning the character as the literal, un-encoded octet 0xC3 in the URI is most certainly a violation of the spec -- that is, the spec of what a valid URI is, which a separate spec told us the value of the Location header should be interpreted as.

Furthermore, note that the parsing rules for URIs in the older, obsoleted RFCs -- such as RFC 2396, RFC 1808, and RFC 1738 -- aren't much different. This is not only a current protocol violation, but a historical one as well.

Now, using the character "é" in a URI is certainly allowed, but it must be percent-encoded as outlined by the spec. Instead of a single raw byte, it should be included in the path as the literal string %C3 (or %c3), if indeed the author is using ISO-8859-1 as the encoding of the path (which is a separate concern).
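A minimal sketch (not Isahc's code) of what that escaping looks like, operating on the raw octets exactly as received from the server, whatever encoding the server intended:

```rust
/// Percent-encode every non-ASCII octet, leaving ASCII bytes untouched.
/// A sketch only; a real implementation would also need to escape
/// reserved ASCII characters per RFC 3986 where appropriate.
fn percent_encode_non_ascii(input: &[u8]) -> String {
    let mut out = String::with_capacity(input.len());
    for &b in input {
        if b.is_ascii() {
            out.push(b as char);
        } else {
            out.push_str(&format!("%{:02X}", b));
        }
    }
    out
}

fn main() {
    // A lone 0xC3 octet becomes the literal string "%C3".
    println!("{}", percent_encode_non_ascii(&[0xC3]));
}
```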

What can we do about it?

Having established that the response from the URL in question is against the spec, insofar as the Location semantics are concerned, Isahc's current behavior is technically perfectly legal and not a bug per se. Indeed, RFC 7230, Section 2.5 speaks to this exact situation directly:

Unless noted otherwise, a recipient MAY attempt to recover a usable protocol element from an invalid construct. HTTP does not define specific error handling mechanisms except when they have a direct impact on security, since different applications of the protocol require different error handling strategies. For example, a Web browser might wish to transparently recover from a response where the Location header field doesn't parse according to the ABNF, whereas a systems control client might consider any form of error recovery to be dangerous.

which calls out a relevant point: when visiting https://pt.wowhead.com/npc=171184 in a web browser (Firefox, in my case), the browser recovers, makes a best-effort attempt to handle the redirect anyway, and does so correctly. curl does as well. The question, then, is whether Isahc should replicate this behavior. Without any other constraints, I would say that yes, we should. One of Isahc's strengths as an HTTP client is that it is generous in what it accepts (though strict in what it produces), which makes it useful for a variety of applications.

There are, however, some practical concerns. Neither the Uri type nor the Url type has such leniency, and we use both (the former is also part of our public API). One option would be to pre-normalize the Location value, looking for such invalid sequences and fixing them before parsing, but that would incur a performance penalty that even in-spec responses would have to pay.

The approach I am leaning towards is a fallback mechanism: if parsing the Location fails, we retry the operation with a more lenient, custom parser supplied with Isahc. That path would be slower, since we'd essentially be parsing the field two or three times, but the happy path would remain performant.

@sagebind
Owner

sagebind commented Mar 31, 2021

Upon further inspection, it turns out the bytes in question are actually 0xC3 0xA9, which is the UTF-8 representation of "é" rather than the ISO-8859-1 representation (that would be the single byte 0xE9). In one sense this is better, since URL percent-encoding is supposed to operate on UTF-8 bytes, and this means the URL is kinda-correct -- just not percent-encoded like it ought to be.

But in another sense this is worse, since it means Wowhead is breaking the HTTP message spec by using UTF-8 in a header field -- which is definitely not allowed -- rather than US-ASCII, which is preferred, or ISO-8859-1, which is allowed but discouraged. This plays out in the Firefox developer tools, which assume the 0xC3 0xA9 bytes are ISO-8859-1 encoded and therefore display them as the string "Ã©".

[Screenshot 2021-03-30 224852: Firefox developer tools showing the mis-decoded Location header]

There are some interesting discussions on this topic in threads such as aspnet/KestrelHttpServer#1144 and nodejs/node#17390. Ultimately, one of the problems we face is this: we know the Location isn't valid US-ASCII -- is it UTF-8, or ISO-8859-1? It can be hard to tell. In this example the bytes can be successfully decoded with either encoding, though as humans we recognize that it is UTF-8. Based on my reading of the Firefox code, it seems they use heuristics to guess whether the bytes are UTF-8 or ISO-8859-1 and parse accordingly.
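The ambiguity is easy to demonstrate with the two bytes in question: both decodes succeed, since every ISO-8859-1 byte maps directly to the Unicode code point of the same value. A quick std-only sketch:

```rust
fn main() {
    let bytes: [u8; 2] = [0xC3, 0xA9];

    // Interpreted as UTF-8, the two bytes form a single character: "é".
    let as_utf8 = std::str::from_utf8(&bytes).unwrap();
    assert_eq!(as_utf8, "é");

    // Interpreted as ISO-8859-1, each byte maps to the Unicode code point
    // with the same value, yielding two characters: "Ã©".
    let as_latin1: String = bytes.iter().map(|&b| char::from(b)).collect();
    assert_eq!(as_latin1, "Ã©");

    // Both decodes succeed, so inspecting the bytes alone cannot tell us
    // which encoding the server meant.
    println!("utf8: {as_utf8}, latin1: {as_latin1}");
}
```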

Since ISO-8859-1 is unlikely to be used for URLs (whose percent-encoded components are defined in terms of UTF-8 data), it would probably work to first try valid US-ASCII as defined in the RFC; failing that, check whether the bytes are valid UTF-8 and, if so, decode them as such; failing that, ignore the header with a warning as we currently do.
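That decision procedure can be sketched roughly as follows (the helper name is hypothetical, not Isahc's actual API):

```rust
/// Sketch of the proposed fallback: accept a Location value if it is
/// visible US-ASCII; otherwise, if it is valid UTF-8, percent-encode the
/// non-ASCII bytes before URI parsing; otherwise give up (the caller
/// would log a warning and ignore the header).
fn recover_location(raw: &[u8]) -> Option<String> {
    if raw.iter().all(|&b| b.is_ascii_graphic()) {
        // Already valid per RFC 3986: pass through unchanged.
        return String::from_utf8(raw.to_vec()).ok();
    }
    // Lenient fallback: only attempt recovery for well-formed UTF-8.
    let s = std::str::from_utf8(raw).ok()?;
    Some(
        s.bytes()
            .map(|b| {
                if b.is_ascii() {
                    (b as char).to_string()
                } else {
                    format!("%{:02X}", b)
                }
            })
            .collect(),
    )
}

fn main() {
    // The "é" (0xC3 0xA9) is percent-encoded so a strict URI parser accepts it.
    println!("{:?}", recover_location("/npc=171184/mythresh-as-garras-do-céu".as_bytes()));
}
```

A lone 0xE9 byte (ISO-8859-1 "é") is not valid UTF-8, so it would fall through to the existing ignore-with-warning behavior.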

sagebind added a commit that referenced this issue Apr 1, 2021
Try to parse the `Location` header as UTF-8 bytes as a fallback if the header value is not valid US-ASCII. This is technically against the URI spec which requires all literal characters in the URI to be US-ASCII (see [RFC 3986, Section 4.1](https://tools.ietf.org/html/rfc3986#section-4.1)).

This is also more or less against the HTTP spec, which historically allowed ISO-8859-1 text in header values but has since been restricted to US-ASCII plus opaque bytes. UTF-8 has never been encouraged or allowed as such. See [RFC 7230, Section 3.2.4](https://tools.ietf.org/html/rfc7230#section-3.2.4) for more info.

However, some bad or misconfigured web servers will do this anyway, and most web browsers recover from this by allowing and interpreting UTF-8 characters as themselves even though they _should_ have been percent-encoded. The third-party URI parsers that we use have no such leniency, so we percent-encode such bytes (if legal UTF-8) ahead of time before handing them off to the URI parser.

This is in the spirit of being generous with what we accept (within reason) while being strict in what we produce. Since real websites exhibit this out-of-spec behavior it is worth handling it.

Note that the underlying `tiny_http` library that our HTTP test mocking is based on does not allow UTF-8 header values right now, so we can't really test this efficiently. We already have a couple tests out there doing some raw TCP munging for one reason or another, so in the future we need to make sure to rewrite `testserver` to allow such headers and then enable the test. For now I've manually verified that this works.

Fixes #315.
sagebind added a commit that referenced this issue Apr 6, 2021
@sagebind
Owner

sagebind commented Apr 7, 2021

@Zireael-N This is now fixed in the 1.3.0 release!

@Zireael-N
Author

Thank you for thoroughly investigating and fixing this, 1.3.0 works great!
