Node.js / HTTP-Parser not handling UTF-8 encoded HTTP header values #17390

Mickael-van-der-Beek · 2017-11-29T17:06:52Z

Hi,

I've encountered an issue regarding the way HTTP header values are decoded.
The HTTP Parser project might be a better place to post this issue to but I thought I'd post here first.

Currently it would seem that Node.js is decoding HTTP header values as US-ASCII / ASCII-7.

This becomes an issue now that browsers and servers started supporting UTF-8 values as well.

A simple example would be a website that has a URL who redirects to a non-percent-encoded UTF-8 URL. e.g:

const http = require('http');
const net = require('net');

const HOST = '127.0.0.1';
const PORT = 3000;

const server = net.createServer(res => {
  res.end([
    'HTTP/1.1 301 Moved Permanently',
    'Location: /fÖÖbÃÃr',
    '\r\n'
  ].join('\r\n'));
});

server.listen(PORT, HOST, err => {
  const req = http.request({
    hostname: HOST,
    port: PORT,
    path: '/'
  }, res => {
    console.log(`Location: ${res.headers.location}`);   
    console.log(`Location: ${Buffer.from(res.headers.location, 'binary').toString('utf8')}`);   
  });

  req.end();
});

The first log will produce: Location: /fÃ�Ã�bÃ�Ã�r (there are invisible characters next to the Ã's).
The second log will produce: Location: /fÖÖbÃÃr, which is the correct and expected result

In this example, to follow the redirect, you'd need to first instantiate a buffer in 'binary' encoding and then stringify it to it's 'utf8' representation.

The original RFC2616 that defined HTTP seemed to allow for any byte value with a few restrictions on control characters:

TEXT = <any OCTET except CTLs, but including LWS>

cf: https://tools.ietf.org/html/rfc7230#section-3.2.6
cf: https://tools.ietf.org/html/rfc2616#section-2.2

The follow-up update to HTTP, RFC7230, seems to change that and restrict them to US-ASCII / ASCII-7:

Non-US-ASCII content in header fields and the reason phrase has been obsoleted and made opaque (the TEXT rule was removed). (Section 3.2.6)

cf: https://tools.ietf.org/html/rfc7230#appendix-A.2

I would expect most Node.js HTTP clients to thus fail on the example I provided above.

Since browsers seems to support it and servers started sending it (I've seen examples in the wild), I think we can say that it has become a de facto standard and that it would be nice if either Node.js core or HTTP Parser would support reading HTTP header values as UTF-8 by default.

Version: Node.js v6.x and v8.x
Platform: OSX Darwin Kernel Version 15.6.0
Subsystem: http, http-parser

The text was updated successfully, but these errors were encountered:

Mickael-van-der-Beek · 2017-11-29T17:08:24Z

This issue might also be related to:

#2629

and

#13296
#16237

apapirovski · 2017-11-29T17:14:06Z

Since browsers seems to support it and servers started sending it (I've seen examples in the wild), I think we can say that it has become a de facto standard and that it would be nice if either Node.js core or HTTP Parser would support reading HTTP header values as UTF-8 by default.

I don't think Node should support something that goes against the spec. In this case Node assumes that any encoding is done by the user and values received conform to the spec.

The problem with changing this is that there's going to be a significant performance penalty.

More to the point, if you've seen examples in the wild then file a bug against those implementations and not Node. The HTTP spec doesn't (and shouldn't) evolve based on faulty implementations.

apapirovski · 2017-11-29T17:17:19Z

/cc @nodejs/http @nodejs/http-parser

Mickael-van-der-Beek · 2017-11-29T17:24:21Z

@apapirovski On the other hand, IRIs, who's value is reflected in HTTP, are starting to get adopted and gain momentum.

Maybe it's possible to have a white-list of headers like Location, Origin, Referer and Link to not mitigate the performance impact but that would seem a bit hackish and trigger obscure bugs. So it's probably a bad idea.

Setting it as an option for the request parser with a default to 'ascii' would at least make it easy for people that are ok with the trade-of to use the utf8 encoding.

apapirovski · 2017-11-29T17:28:12Z

Even IRIs specifically mention this as being problematic in relation to http and cover percent encoding extensively.

A protocol or format element should be explicitly designated to
be able to carry IRIs. The intent is not to introduce IRIs into
contexts that are not defined to accept them. For example, XML
schema [XMLSchema] has an explicit type "anyURI" that includes
IRIs and IRI references. Therefore, IRIs and IRI references can
be in attributes and elements of type "anyURI". On the other
hand, in the HTTP protocol [RFC2616], the Request URI is defined
as a URI, which means that direct use of IRIs is not allowed in
HTTP requests.

Mickael-van-der-Beek · 2017-11-30T01:10:28Z

@apapirovski The spec here is actually wrong or at least not specific enough.

Following RFC2616, IRIs are only not allowed in the context of the URL and path related tokens but they are as part of header values.

(By path related tokens I mean: "URI-reference", "absoluteURI", "relativeURI", "port", "host", "abs_path", "rel_path" and "authority" defined in RFC1808 and RFC2396 and used in RFC2616.)

Historically standards haven't always been very precise or reflecting the current state of affairs which is how the concept of "de facto standard" appeared. This on top of the fact that before RFC7230, the bytes of a UTF-8 representation were actually allowed by spec to be transmitted.

If Node.js should follow the standard or de facto standard is of course debatable but I would argue that it wouldn't be the first time as suggested by the relatively recent URL parser implementation in Node.js 7.x. e.g:

new url.URL('http://127.0x00.0000.1/').toString() === 'http://127.0.0.1/' // true

Which shows that Node.js's URL parser will accept the explicitly forbidden (rare IP address formats)[https://tools.ietf.org/html/rfc3986#section-7.4] so as to correctly convert it to the dotted decimal format.

jasnell · 2017-11-30T01:38:01Z

RFC 7230 is quite specific on the matter,

     token          = 1*tchar
     tchar          = "!" / "#" / "$" / "%" / "&" / "'" / "*"
                    / "+" / "-" / "." / "^" / "_" / "`" / "|" / "~"
                    / DIGIT / ALPHA
                    ; any VCHAR, except delimiters

And here:

   Historically, HTTP has allowed field content with text in the
   ISO-8859-1 charset [ISO-8859-1], supporting other charsets only
   through use of [RFC2047] encoding.  In practice, most HTTP header
   field values use only a subset of the US-ASCII charset [USASCII].
   Newly defined header fields SHOULD limit their field values to
   US-ASCII octets.  A recipient SHOULD treat other octets in field
   content (obs-text) as opaque data.

The spec cannot be "wrong", but it is possible that some implementations of the spec are not consistent or compliant with the spec.

Specifically, the rule in the spec that says "A recipient SHOULD treat octets in field content (obs-text) as opaque data" specifically means that implementations are not supposed to interpret UTF-8 bytes.

Before RFC 7230, RFC 2616 was ambiguous about header values. The spec did not say anything about how header value bytes were to be interpreted so any claim that UTF-8 was allowed by the spec is incorrect. Pre RFC 7230, header values were strictly opaque sequences of octets. RFC 7230 specifically eliminates that ambiguity. Extended characters, including those contained within IRI's are to be handled using RFC2047 encoding and anything beyond that is considered opaque, uninterpreted data.

The new WHATWG URL parser is a fundamentally different thing, with a different set of rules. It was written to conform to the WHATWG URL Standard and not to RFC 3986.

apapirovski · 2017-12-06T16:23:56Z

I would say this has been answered in a satisfactory manner by @jasnell. Please feel free to reopen if you believe I've closed this in error or there's any new information.

ilijaz · 2019-05-29T09:44:00Z

Why not support UTF-8 strings in headers?
In instance, I have SOAP RPC with UTF-8 soap action name that puts in HTTP Header. Why I can't use Nodejs for this? Nodejs throws validation error on this header and I can't avoid this. What do I suppose to do? An answer is just to use PHP or other languages.
Are you sure that it's right to strictly follow all specs, sit and wait when they will be changed? I'm disappointed...

xgqfrms · 2023-10-23T15:48:50Z

for who use the `express.js` server

UTF-8

app.get('/sse', (req, res) => {
  res.writeHead(200, {
    // 'Content-Type': 'image/jpg',
    // 'Content-Type': 'image/png',
    // 'Content-Type': 'application/json',
    // 'Content-Type': 'text/event-stream',
    'Content-Type': 'text/event-stream; charset=utf-8',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
    // 'Access-Control-Allow-Origin': "*",
    // 'CharacterEncoding': 'UTF-8',
    // 'Accept-Encoding': 'UTF-8',
  });
  // response.setContentType("text/event-stream");
  // response.setCharacterEncoding("UTF-8");
  let i = 0;
  let uid = setInterval(() => {
    console.log(`i`, i);
    if(i < 10) {
      i++;
      const data = convertImageToBase64URL("./public/test.png");
      console.log(`SSE ${i}`, data.slice(0, 100));
      // ❌
      // res.write(data);
      // event id `lastEventId` ✅
      const id = 2023;
      res.write(`id: ${id}\n`);
      // event type `message` \n
      // event data `data` \n\n
      // res.write(`event: message\n`);
      // custom event type `custom_message`\n ✅
      res.write(`event: custom_event_message\n`);
      res.write(`data: ${data}\n\n`);
    } else {
      clearInterval(uid);
    }
  }, 3000);
});
// http://localhost:3000/sse

apapirovski added http Issues or PRs related to the http subsystem. http_parser Issues and PRs related to the HTTP Parser dependency or the http_parser binding. labels Nov 29, 2017

apapirovski closed this as completed Dec 6, 2017

matrosov-nikita mentioned this issue Jan 2, 2018

Request is not redirected due to encoded location header request/request#2835

Closed

mlitwin mentioned this issue May 11, 2018

downloadApp fails when redirect location has funky unicode characters phonegap-build/pgb-api#1

Closed

JunTaoLuo mentioned this issue Aug 20, 2018

Configure header parsing to allow non-compliant headers aspnet/KestrelHttpServer#1144

Closed

sagebind mentioned this issue Mar 31, 2021

Redirects aren't followed if Location header contains non-ASCII symbols sagebind/isahc#315

Closed

gugu mentioned this issue Jul 30, 2021

Escape URL in reply.redirect, fixes #3217 fastify/fastify#3220

Closed

4 tasks

netroy mentioned this issue Feb 21, 2024

HTTP Request Node - Maximum number of redirects exceeded n8n-io/n8n#8701

Closed

mxschmitt mentioned this issue May 20, 2024

[Bug]: redirect in route.fetch may jump to a wrong url microsoft/playwright#30903

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Node.js / HTTP-Parser not handling UTF-8 encoded HTTP header values #17390

Node.js / HTTP-Parser not handling UTF-8 encoded HTTP header values #17390

Mickael-van-der-Beek commented Nov 29, 2017

Mickael-van-der-Beek commented Nov 29, 2017

apapirovski commented Nov 29, 2017 •

edited

Loading

apapirovski commented Nov 29, 2017

Mickael-van-der-Beek commented Nov 29, 2017 •

edited

Loading

apapirovski commented Nov 29, 2017

Mickael-van-der-Beek commented Nov 30, 2017

jasnell commented Nov 30, 2017

apapirovski commented Dec 6, 2017

ilijaz commented May 29, 2019

xgqfrms commented Oct 23, 2023 •

edited

Loading

Node.js / HTTP-Parser not handling UTF-8 encoded HTTP header values #17390

Node.js / HTTP-Parser not handling UTF-8 encoded HTTP header values #17390

Comments

Mickael-van-der-Beek commented Nov 29, 2017

Mickael-van-der-Beek commented Nov 29, 2017

apapirovski commented Nov 29, 2017 • edited Loading

apapirovski commented Nov 29, 2017

Mickael-van-der-Beek commented Nov 29, 2017 • edited Loading

apapirovski commented Nov 29, 2017

Mickael-van-der-Beek commented Nov 30, 2017

jasnell commented Nov 30, 2017

apapirovski commented Dec 6, 2017

ilijaz commented May 29, 2019

xgqfrms commented Oct 23, 2023 • edited Loading

for who use the express.js server

apapirovski commented Nov 29, 2017 •

edited

Loading

Mickael-van-der-Beek commented Nov 29, 2017 •

edited

Loading

xgqfrms commented Oct 23, 2023 •

edited

Loading

for who use the `express.js` server