Handle raw UTF-8 bytes in redirect headers #317

sagebind · 2021-04-01T02:31:17Z

Try to parse the Location header as UTF-8 bytes as a fallback if the header value is not valid US-ASCII. This is technically against the URI spec which requires all literal characters in the URI to be US-ASCII (see RFC 3986, Section 4.1).

This is also more or less against the HTTP spec, which historically allowed for ISO-8859-1 text in header values but since was restricted to US-ASCII plus opaque bytes. Never has UTF-8 been encouraged or allowed as-such. See RFC 7230, Section 3.2.4 for more info.

However, some bad or misconfigured web servers will do this anyway, and most web browsers recover from this by allowing and interpreting UTF-8 characters as themselves even though they should have been percent-encoded. The third-party URI parsers that we use have no such leniency, so we percent-encode such bytes (if legal UTF-8) ahead of time before handing them off to the URI parser.

This is in the spirit of being generous with what we accept (within reason) while being strict in what we produce. Since real websites exhibit this out-of-spec behavior it is worth handling it.

Note that the underlying tiny_http library that our HTTP test mocking is based on does not allow UTF-8 header values right now, so we can't really test this efficiently. We already have a couple tests out there doing some raw TCP munging for one reason or another, so in the future we need to make sure to rewrite testserver to allow such headers and then enable the test. For now I've manually verified that this works.

Fixes #315.

Try to parse the `Location` header as UTF-8 bytes as a fallback if the header value is not valid US-ASCII. This is technically against the URI spec which requires all literal characters in the URI to be US-ASCII (see [RFC 3986, Section 4.1](https://tools.ietf.org/html/rfc3986#section-4.1)). This is also more or less against the HTTP spec, which historically allowed for ISO-8859-1 text in header values but since was restricted to US-ASCII plus opaque bytes. Never has UTF-8 been encouraged or allowed as-such. See [RFC 7230, Section 3.2.4](https://tools.ietf.org/html/rfc7230#section-3.2.4) for more info. However, some bad or misconfigured web servers will do this anyway, and most web browsers recover from this by allowing and interpreting UTF-8 characters as themselves even though they _should_ have been percent-encoded. The third-party URI parsers that we use have no such leniency, so we percent-encode such bytes (if legal UTF-8) ahead of time before handing them off to the URI parser. This is in the spirit of being generous with what we accept (within reason) while being strict in what we produce. Since real websites exhibit this out-of-spec behavior it is worth handling it. Note that the underlying `tiny_http` library that our HTTP test mocking is based on does not allow UTF-8 header values right now, so we can't really test this efficiently. We already have a couple tests out there doing some raw TCP munging for one reason or another, so in the future we need to make sure to rewrite `testserver` to allow such headers and then enable the test. For now I've manually verified that this works. Fixes #315.

codecov · 2021-04-01T02:38:35Z

Codecov Report

Merging #317 (bb6d283) into master (3522593) will decrease coverage by 0.31%.
The diff coverage is 55.10%.

@@            Coverage Diff             @@
##           master     #317      +/-   ##
==========================================
- Coverage   75.25%   74.94%   -0.32%     
==========================================
  Files          57       57              
  Lines        2983     2993      +10     
==========================================
- Hits         2245     2243       -2     
- Misses        738      750      +12

Impacted Files	Coverage Δ
src/cookies/psl/mod.rs	`82.92% <ø> (ø)`
src/error.rs	`44.66% <0.00%> (-0.67%)`	⬇️
tests/redirects.rs	`100.00% <ø> (ø)`
src/redirect.rs	`71.08% <33.33%> (-6.70%)`	⬇️
src/agent/mod.rs	`72.68% <62.50%> (-0.05%)`	⬇️
src/agent/selector.rs	`74.41% <64.28%> (-0.59%)`	⬇️
src/agent/timer.rs	`100.00% <100.00%> (ø)`
src/handler.rs	`69.72% <0.00%> (-1.38%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3522593...bb6d283. Read the comment docs.

sagebind · 2021-04-01T02:46:18Z

@teto-bot rustfmt

Apply rustfmt

bb6d283

sagebind merged commit 70100a4 into master Apr 6, 2021

sagebind deleted the 315-raw-utf8-redirect-header branch April 6, 2021 16:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handle raw UTF-8 bytes in redirect headers #317

Handle raw UTF-8 bytes in redirect headers #317

sagebind commented Apr 1, 2021

codecov bot commented Apr 1, 2021 •

edited

Loading

sagebind commented Apr 1, 2021

Handle raw UTF-8 bytes in redirect headers #317

Handle raw UTF-8 bytes in redirect headers #317

Conversation

sagebind commented Apr 1, 2021

codecov bot commented Apr 1, 2021 • edited Loading

Codecov Report

sagebind commented Apr 1, 2021

codecov bot commented Apr 1, 2021 •

edited

Loading