Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Node.js / HTTP-Parser not handling UTF-8 encoded HTTP header values #17390

Closed
Mickael-van-der-Beek opened this issue Nov 29, 2017 · 10 comments
Closed
Labels
http_parser Issues and PRs related to the HTTP Parser dependency or the http_parser binding. http Issues or PRs related to the http subsystem.

Comments

@Mickael-van-der-Beek
Copy link

Hi,

I've encountered an issue regarding the way HTTP header values are decoded.
The HTTP Parser project might be a better place to post this issue to but I thought I'd post here first.

Currently it would seem that Node.js is decoding HTTP header values as US-ASCII / ASCII-7.

This becomes an issue now that browsers and servers started supporting UTF-8 values as well.

A simple example would be a website that has a URL who redirects to a non-percent-encoded UTF-8 URL. e.g:

const http = require('http');
const net = require('net');

const HOST = '127.0.0.1';
const PORT = 3000;

const server = net.createServer(res => {
  res.end([
    'HTTP/1.1 301 Moved Permanently',
    'Location: /fÖÖbÃÃr',
    '\r\n'
  ].join('\r\n'));
});

server.listen(PORT, HOST, err => {
  const req = http.request({
    hostname: HOST,
    port: PORT,
    path: '/'
  }, res => {
    console.log(`Location: ${res.headers.location}`);   
    console.log(`Location: ${Buffer.from(res.headers.location, 'binary').toString('utf8')}`);   
  });

  req.end();
});

The first log will produce: Location: /fÃ�Ã�bÃ�Ã�r (there are invisible characters next to the Ã's).
The second log will produce: Location: /fÖÖbÃÃr, which is the correct and expected result

In this example, to follow the redirect, you'd need to first instantiate a buffer in 'binary' encoding and then stringify it to it's 'utf8' representation.

The original RFC2616 that defined HTTP seemed to allow for any byte value with a few restrictions on control characters:

TEXT = <any OCTET except CTLs, but including LWS>

cf: https://tools.ietf.org/html/rfc7230#section-3.2.6
cf: https://tools.ietf.org/html/rfc2616#section-2.2

The follow-up update to HTTP, RFC7230, seems to change that and restrict them to US-ASCII / ASCII-7:

Non-US-ASCII content in header fields and the reason phrase has been obsoleted and made opaque (the TEXT rule was removed). (Section 3.2.6)

cf: https://tools.ietf.org/html/rfc7230#appendix-A.2

I would expect most Node.js HTTP clients to thus fail on the example I provided above.

Since browsers seems to support it and servers started sending it (I've seen examples in the wild), I think we can say that it has become a de facto standard and that it would be nice if either Node.js core or HTTP Parser would support reading HTTP header values as UTF-8 by default.

  • Version: Node.js v6.x and v8.x
  • Platform: OSX Darwin Kernel Version 15.6.0
  • Subsystem: http, http-parser
@Mickael-van-der-Beek
Copy link
Author

This issue might also be related to:

#2629

and

#13296
#16237

@apapirovski
Copy link
Member

apapirovski commented Nov 29, 2017

Since browsers seems to support it and servers started sending it (I've seen examples in the wild), I think we can say that it has become a de facto standard and that it would be nice if either Node.js core or HTTP Parser would support reading HTTP header values as UTF-8 by default.

I don't think Node should support something that goes against the spec. In this case Node assumes that any encoding is done by the user and values received conform to the spec.

The problem with changing this is that there's going to be a significant performance penalty.


More to the point, if you've seen examples in the wild then file a bug against those implementations and not Node. The HTTP spec doesn't (and shouldn't) evolve based on faulty implementations.

@apapirovski
Copy link
Member

/cc @nodejs/http @nodejs/http-parser

@apapirovski apapirovski added http Issues or PRs related to the http subsystem. http_parser Issues and PRs related to the HTTP Parser dependency or the http_parser binding. labels Nov 29, 2017
@Mickael-van-der-Beek
Copy link
Author

Mickael-van-der-Beek commented Nov 29, 2017

@apapirovski On the other hand, IRIs, who's value is reflected in HTTP, are starting to get adopted and gain momentum.

Maybe it's possible to have a white-list of headers like Location, Origin, Referer and Link to not mitigate the performance impact but that would seem a bit hackish and trigger obscure bugs. So it's probably a bad idea.

Setting it as an option for the request parser with a default to 'ascii' would at least make it easy for people that are ok with the trade-of to use the utf8 encoding.

@apapirovski
Copy link
Member

Even IRIs specifically mention this as being problematic in relation to http and cover percent encoding extensively.

A protocol or format element should be explicitly designated to
be able to carry IRIs. The intent is not to introduce IRIs into
contexts that are not defined to accept them. For example, XML
schema [XMLSchema] has an explicit type "anyURI" that includes
IRIs and IRI references. Therefore, IRIs and IRI references can
be in attributes and elements of type "anyURI". On the other
hand, in the HTTP protocol [RFC2616], the Request URI is defined
as a URI, which means that direct use of IRIs is not allowed in
HTTP requests.

@Mickael-van-der-Beek
Copy link
Author

@apapirovski The spec here is actually wrong or at least not specific enough.

Following RFC2616, IRIs are only not allowed in the context of the URL and path related tokens but they are as part of header values.

(By path related tokens I mean: "URI-reference", "absoluteURI", "relativeURI", "port", "host", "abs_path", "rel_path" and "authority" defined in RFC1808 and RFC2396 and used in RFC2616.)

Historically standards haven't always been very precise or reflecting the current state of affairs which is how the concept of "de facto standard" appeared. This on top of the fact that before RFC7230, the bytes of a UTF-8 representation were actually allowed by spec to be transmitted.

If Node.js should follow the standard or de facto standard is of course debatable but I would argue that it wouldn't be the first time as suggested by the relatively recent URL parser implementation in Node.js 7.x. e.g:

new url.URL('http://127.0x00.0000.1/').toString() === 'http://127.0.0.1/' // true

Which shows that Node.js's URL parser will accept the explicitly forbidden (rare IP address formats)[https://tools.ietf.org/html/rfc3986#section-7.4] so as to correctly convert it to the dotted decimal format.

@jasnell
Copy link
Member

jasnell commented Nov 30, 2017

RFC 7230 is quite specific on the matter,

     token          = 1*tchar
     tchar          = "!" / "#" / "$" / "%" / "&" / "'" / "*"
                    / "+" / "-" / "." / "^" / "_" / "`" / "|" / "~"
                    / DIGIT / ALPHA
                    ; any VCHAR, except delimiters

And here:

   Historically, HTTP has allowed field content with text in the
   ISO-8859-1 charset [ISO-8859-1], supporting other charsets only
   through use of [RFC2047] encoding.  In practice, most HTTP header
   field values use only a subset of the US-ASCII charset [USASCII].
   Newly defined header fields SHOULD limit their field values to
   US-ASCII octets.  A recipient SHOULD treat other octets in field
   content (obs-text) as opaque data.

The spec cannot be "wrong", but it is possible that some implementations of the spec are not consistent or compliant with the spec.

Specifically, the rule in the spec that says "A recipient SHOULD treat octets in field content (obs-text) as opaque data" specifically means that implementations are not supposed to interpret UTF-8 bytes.

Before RFC 7230, RFC 2616 was ambiguous about header values. The spec did not say anything about how header value bytes were to be interpreted so any claim that UTF-8 was allowed by the spec is incorrect. Pre RFC 7230, header values were strictly opaque sequences of octets. RFC 7230 specifically eliminates that ambiguity. Extended characters, including those contained within IRI's are to be handled using RFC2047 encoding and anything beyond that is considered opaque, uninterpreted data.

The new WHATWG URL parser is a fundamentally different thing, with a different set of rules. It was written to conform to the WHATWG URL Standard and not to RFC 3986.

@apapirovski
Copy link
Member

I would say this has been answered in a satisfactory manner by @jasnell. Please feel free to reopen if you believe I've closed this in error or there's any new information.

@ilijaz
Copy link

ilijaz commented May 29, 2019

Why not support UTF-8 strings in headers?
In instance, I have SOAP RPC with UTF-8 soap action name that puts in HTTP Header. Why I can't use Nodejs for this? Nodejs throws validation error on this header and I can't avoid this. What do I suppose to do? An answer is just to use PHP or other languages.
Are you sure that it's right to strictly follow all specs, sit and wait when they will be changed? I'm disappointed...

@xgqfrms
Copy link

xgqfrms commented Oct 23, 2023

for who use the express.js server

UTF-8

app.get('/sse', (req, res) => {
  res.writeHead(200, {
    // 'Content-Type': 'image/jpg',
    // 'Content-Type': 'image/png',
    // 'Content-Type': 'application/json',
    // 'Content-Type': 'text/event-stream',
    'Content-Type': 'text/event-stream; charset=utf-8',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
    // 'Access-Control-Allow-Origin': "*",
    // 'CharacterEncoding': 'UTF-8',
    // 'Accept-Encoding': 'UTF-8',
  });
  // response.setContentType("text/event-stream");
  // response.setCharacterEncoding("UTF-8");
  let i = 0;
  let uid = setInterval(() => {
    console.log(`i`, i);
    if(i < 10) {
      i++;
      const data = convertImageToBase64URL("./public/test.png");
      console.log(`SSE ${i}`, data.slice(0, 100));
      // ❌
      // res.write(data);
      // event id `lastEventId` ✅
      const id = 2023;
      res.write(`id: ${id}\n`);
      // event type `message` \n
      // event data `data` \n\n
      // res.write(`event: message\n`);
      // custom event type `custom_message`\n ✅
      res.write(`event: custom_event_message\n`);
      res.write(`data: ${data}\n\n`);
    } else {
      clearInterval(uid);
    }
  }, 3000);
});
// http://localhost:3000/sse

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
http_parser Issues and PRs related to the HTTP Parser dependency or the http_parser binding. http Issues or PRs related to the http subsystem.
Projects
None yet
Development

No branches or pull requests

5 participants