`http` silently corrupts the request URL when it contains non-Latin-1 codepoints #13296

Flimm · 2017-05-30T13:21:37Z

I experience the bug on Node commit 399cb25

Darwin Tephs-Mac-Pro.local 16.6.0 Darwin Kernel Version 16.6.0: Fri Apr 14 16:21:16 PDT 2017; root:xnu-3789.60.24~6/RELEASE_X86_64 x86_64

Subsystem: http

Issue description

When making a request using http.get with the path set to /café🐶, the server receives /café=6. This is not the URL that was sent; it is not even the percent-encoded version of the URL (as produced by encodeURI), which would be /caf%C3%A9%F0%9F%90%B6. I expected the URL to be passed along without data corruption.
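The reported result is consistent with each UTF-16 code unit of the path being truncated to its low byte, which is how Node's 'latin1' (also called 'binary') string encoding behaves. The sketch below reproduces the mangled string without any network I/O. It assumes the request line is encoded as 'latin1', as discussed later in this thread; it illustrates the symptom and is not a trace through the http internals.

```javascript
// Sketch: reproduce the "/café=6" corruption by encoding the path as Latin-1.
// Assumption: the request line ends up encoded as 'latin1'.
const path = '/café🐶';

// 'latin1' keeps only the low byte of every UTF-16 code unit.
// '🐶' is U+1F436, i.e. the surrogate pair D83D DC36, so it becomes 3D 36 ("=6").
const wireBytes = Buffer.from(path, 'latin1');
const received = wireBytes.toString('latin1');

// What the percent-encoded form of the same path looks like.
const percentEncoded = encodeURI(path);

console.log(wireBytes);       // <Buffer 2f 63 61 66 e9 3d 36>
console.log(received);        // /café=6
console.log(percentEncoded);  // /caf%C3%A9%F0%9F%90%B6
```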

I've created a pull request, #13297, which contains a test case that currently fails, illustrating this bug.

This test currently fails. It illustrates that Unicode in the URL does not arrive intact to the server, there is silent data corruption along the way at some point. This test is for the issue nodejs#13296.

jasnell · 2017-05-30T13:35:01Z

For the sake of performance, the implementation does not transcode or escape the input into valid ASCII. It assumes the user has done the necessary escaping. That is unlikely to change, but it can be documented.
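As a rough sketch of what "the user has done the necessary escaping" could look like in practice, the path can be percent-encoded before it is handed to http.get or http.request. The host name example.com is a placeholder, used only so that the URL parses.

```javascript
// Sketch: escape the path yourself so that only ASCII reaches the http module.
const rawPath = '/café🐶';

// Option A: encodeURI percent-encodes non-ASCII characters as UTF-8 bytes.
const escapedA = encodeURI(rawPath);

// Option B: the WHATWG URL parser does the same for the path component.
// 'http://example.com' is a placeholder base URL.
const escapedB = new URL(rawPath, 'http://example.com').pathname;

console.log(escapedA); // /caf%C3%A9%F0%9F%90%B6
console.log(escapedB); // /caf%C3%A9%F0%9F%90%B6
```

Both results are pure ASCII, so the choice of header encoding no longer affects what the server receives.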

Flimm · 2017-05-30T13:44:22Z

@jasnell That's great, but that's not what this issue is about. I'm not asking in this issue to transcode or escape the input into valid ASCII. All I'm asking is that silent data corruption not happen. There are solutions for that, one of them is to pass the data without corrupting it. Another would be to throw an exception when Unicode strings are passed in of course. I'll leave it up to core contributors to decide on the best solution to fix the bug that data is getting corrupted silently.

When /café is passed as a URL, it gets sent as /café (encoded in UTF-8, I assume) to the server. No data loss or data corruption happens, even though that URL is not ASCII. It could be that the server can't handle it, but at least no silent data loss or corruption has happened.

So when is the silent data corruption happening?

When /café🐶 is passed as a URL, it gets sent as /café=6, instead of /café🐶. Why is it corrupting the data? This data corruption is helping nobody. No server is going to read =6 and understand that that is meant to be the dog emoji.

(It's worth noting that http.get will check the passed URL for whitespace and throw an exception in the case that it finds a space, despite performance costs.)

seishun · 2017-05-31T11:09:34Z

I agree, this behavior is unnecessarily unintuitive. It should either:

1. silently percent-encode non-ASCII characters, or
2. encode as UTF-8, or
3. throw on non-Latin-1 characters instead of silently breaking them

Options 1 or 2 could be breaking if someone is deliberately using "binary" encoding to send UTF-8 encoded paths, e.g. path: Buffer.from('/wiki/サクラ').toString('binary'). But if anyone is actually doing that, we should just let them specify the path as a Buffer instead of resorting to such hacks.

They could also be breaking if someone actually wants Latin-1 encoded paths (or the server they are talking to is smart enough to recognize it's not UTF-8) and they just happen to use only the Latin-1 range of characters. So path: '/wiki/café' just works by luck. But that scenario seems even less likely than the above.
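To make the two compatibility scenarios concrete, here is a small sketch using only Buffer operations. It assumes the path string is written to the socket as 'latin1', and it does not open a connection.

```javascript
// Scenario 1: the "binary" hack. The caller pre-encodes the path as UTF-8 and
// smuggles the bytes through a one-byte-per-character string.
const utf8Bytes = Buffer.from('/wiki/サクラ', 'utf8');
const binaryString = utf8Bytes.toString('binary');

// Assumption: the http client writes the path with 'latin1' encoding.
const onTheWire = Buffer.from(binaryString, 'latin1');

console.log(onTheWire.equals(utf8Bytes)); // true
console.log(onTheWire.toString('utf8'));  // /wiki/サクラ

// Scenario 2: a path that only uses Latin-1 characters goes out as Latin-1,
// so 'é' is the single byte E9 rather than the UTF-8 pair C3 A9.
const cafeLatin1 = Buffer.from('/wiki/café', 'latin1');
console.log(cafeLatin1.toString('hex')); // 2f77696b692f636166e9
```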

Here's my proposal to fix this mess:

1. Allow specifying paths as Buffers.
2. Runtime-deprecate non-ASCII Latin-1 characters in string paths (a warning notifying that the characters are encoded as Latin-1 and suggesting using Buffer instead).
3. Throw on non-ASCII non-Latin-1 characters in string paths (they are broken anyway).
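The string-path part of this proposal could be approximated in user land along the following lines. This is only a sketch: checkRequestPath is a hypothetical helper, not part of Node.js, and the warning text is made up for illustration.

```javascript
// Hypothetical helper (not a Node.js API): classify a string path the way the
// proposal above describes.
function checkRequestPath(path) {
  // Pure ASCII: nothing to do.
  if (/^[\x00-\x7f]*$/.test(path)) {
    return 'ascii';
  }
  // Non-ASCII but within Latin-1: warn, since the bytes sent will be Latin-1.
  if (/^[\x00-\xff]*$/.test(path)) {
    process.emitWarning(
      'Path contains non-ASCII characters and will be encoded as Latin-1',
      'DeprecationWarning'
    );
    return 'latin1';
  }
  // Anything above U+00FF would be silently truncated, so reject it.
  throw new TypeError('Path contains characters outside Latin-1');
}

console.log(checkRequestPath('/wiki'));      // ascii
console.log(checkRequestPath('/wiki/café')); // latin1 (and emits a warning)
```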

In a later major version of Node.js, we could consider one of the following:

- Encode paths as UTF-8 instead of Latin-1.
- Percent-encode non-ASCII characters in paths.
- Disallow non-ASCII characters in paths completely.
- Leave it as-is.

cc @nodejs/collaborators feel free to criticize.

bnoordhuis · 2017-05-31T11:26:13Z

@seishun See #3062 (comment). Allow me to paraphrase myself:

The set of characters to reject depends on the encoding used for the request headers. That in turn is influenced by the encoding of the request body because node.js tries hard to pack the headers and the body into a single outgoing packet.

An example: U+010A ('Ċ') is fine with encoding="utf8"; it encodes to bytes C4 8A. The same codepoint should be rejected with encoding="binary" (or "latin1") because it encodes to byte 0A, a newline.
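The U+010A example can be checked directly with Buffer. The sketch below only encodes strings and sends nothing over the network.

```javascript
// U+010A, LATIN CAPITAL LETTER C WITH DOT ABOVE.
const ch = '\u010A';

// As UTF-8 it is a harmless two-byte sequence.
const asUtf8 = Buffer.from(ch, 'utf8');
console.log(asUtf8);   // <Buffer c4 8a>

// As 'latin1' only the low byte survives, which is 0A, a line feed.
const asLatin1 = Buffer.from(ch, 'latin1');
console.log(asLatin1); // <Buffer 0a>

// Inside a path, that turns one character into a line break on the wire.
const pathOnWire = Buffer.from('/a\u010Ab', 'latin1').toString('latin1');
console.log(JSON.stringify(pathOnWire)); // "/a\nb"
```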

IOW, the following statement:

> Throw on non-ASCII non-Latin-1 characters in string paths (they are broken anyway).

Is not true (or only conditionally true.) As well, UTF-8 URLs are used widely enough that rejecting them outright is probably not going to fly.

I think the first order of business is to untangle the conflation of headers and body somehow. Unfortunately, the naive approach is riddled with performance pitfalls and some backwards compatibility concerns.

seishun · 2017-05-31T13:35:41Z

@bnoordhuis
I see, so there are two issues here:

1. Headers are encoded using the same encoding as the first data chunk.
2. In the absence of any data chunks, headers are encoded using latin-1.

I think both of these could be fixed without touching the conflation of headers and body.

> The set of characters to reject depends on the encoding used for the request headers.

So it boils down to two questions:

1. Do we want to allow setting the encoding for the headers?
2. Which encoding do we use/default to for the headers?

I think most would agree that defaulting to/using the encoding of the first data chunk or 'latin1' is broken.

jasnell · 2017-05-31T14:00:51Z

Indeed. Encoding of the headers should have absolutely nothing to do with the encoding of the payload.

seishun · 2017-06-01T09:08:58Z

Great, if we agree on that, then the next question is which assumptions we can make.

- If we assume that (almost) no one relies on the headers being encoded as Latin-1 (i.e. when there is no payload), then we can just always encode the headers as UTF-8 in the next major version.
- If we assume that (almost) no one relies on the headers being encoded as UTF-8 (i.e. when the first payload chunk is UTF-8), then see my proposal in #13296 (comment).
- If we can't assume either of the above, then, as @bnoordhuis pointed out, we can't throw on non-Latin-1 codepoints (or we can do that only if it's going to be encoded as Latin-1, which seems pretty complicated). But if we agree that the header encoding should be independent of the payload, then we should introduce a warning for non-ASCII codepoints about the planned change in behavior, and allow specifying the path as a Buffer as an unambiguous alternative.

This test currently fails. It illustrates that Unicode in the URL does not arrive intact to the server, there is silent data corruption along the way at some point. This test is for the issue #13296. PR-URL: #13297 Reviewed-By: James M Snell <jasnell@gmail.com>

seishun · 2017-06-02T14:00:28Z

@Flimm

> When /café is passed as a URL, it gets sent as /café (encoded in UTF-8) to the server.

If you aren't sending any payload, then it should be encoded in Latin-1. Could you re-check?

Flimm · 2017-06-13T17:17:11Z

@seishun Sorry, that was just an assumption on my part. I've edited my previous comment to add "I assume".

assert.strictEqual message argument removed to replace with default assert message to show the expected vs actual values Refs: nodejs#13296

assert.strictEqual message argument removed to replace with default assert message to show the expected vs actual values PR-URL: nodejs#18259 Refs: nodejs#13296 Reviewed-By: Luigi Pinca <luigipinca@gmail.com> Reviewed-By: Shingo Inoue <leko.noor@gmail.com> Reviewed-By: Jon Moss <me@jonathanmoss.me> Reviewed-By: Colin Ihrig <cjihrig@gmail.com> Reviewed-By: James M Snell <jasnell@gmail.com>

Trott · 2018-04-30T04:04:32Z

@nodejs/http Thoughts on what to do with this? Doc update? Code change? Neither? Something else?

TimothyGu · 2018-04-30T07:39:24Z

This has more or less been fixed with #20270, which will be out in v11.x.

mscdex added the http Issues or PRs related to the http subsystem. label May 30, 2017

seishun changed the title ~~http silently corrupts the request URL when it contains Unicode~~ http silently corrupts the request URL when it contains non-Latin-1 codepoints Jun 1, 2017

bnoordhuis mentioned this issue Oct 16, 2017

http: disallow two-byte characters in URL path #16237

Closed

3 tasks

Mickael-van-der-Beek mentioned this issue Nov 29, 2017

Node.js / HTTP-Parser not handling UTF-8 encoded HTTP header values #17390

Closed

ryanmahan added a commit to ryanmahan/node that referenced this issue Jan 19, 2018

test: change assert message to default

45a938a

assert.strictEqual message argument removed to replace with default assert message to show the expected vs actual values Refs: nodejs#13296

ryanmahan mentioned this issue Jan 19, 2018

test: change assert message to default #18259

Closed

4 tasks

bnoordhuis mentioned this issue Jan 27, 2018

http module does not enforce spec #18405

Closed

TimothyGu closed this as completed Apr 30, 2018

cloudrac3r mentioned this issue Nov 20, 2018

No accentuated characters cloudrac3r/cadencegq#4

Closed
