This repository has been archived by the owner on Apr 22, 2023. It is now read-only.

Buffer Incorrectly Encodes "\0" as 32 for ASCII Encoding #297

Closed
bigeasy opened this issue Sep 21, 2010 · 11 comments

@bigeasy

bigeasy commented Sep 21, 2010

To reproduce:

(new Buffer("\0", "ascii"))[0] == 32

Correct for UTF-8:

(new Buffer("\0", "utf8"))[0] == 0

For reference, here is an ASCII table.
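
A consolidated reproduction of both cases, using the new Buffer(string, encoding) constructor as it existed at the time (current Node deprecates it in favour of Buffer.from); the expected result in both cases would be a single 0x00 byte:

// repro.js -- run against the Node build this issue was filed on.
var ascii = new Buffer("\0", "ascii");
var utf8 = new Buffer("\0", "utf8");
console.log(ascii[0]); // reported: 32 (a space)
console.log(utf8[0]);  // reported: 0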

@ry

ry commented Sep 21, 2010

The reason for it is this line:

http://github.com/ry/node/blob/9922e4e433996722a76edb46d14f1729f33b4bed/deps/v8/src/api.cc#L3005

which I am not sure the purpose of... I'll ask in v8-users.
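
One way to probe that substitution from the JavaScript side, without reading the V8 source (the only output asserted here is the one reported in this issue):

// Probe how the "ascii" encoding maps a few code points.
["\0", "A", "\u00e9", "\u00ff"].forEach(function (c) {
  console.log(JSON.stringify(c), (new Buffer(c, "ascii"))[0]);
});
// On the build discussed here, "\0" reportedly comes out as 32;
// what the other characters map to depends on that line in api.cc.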

@ry

ry commented Sep 21, 2010

@bnoordhuis
Member

Side note: the behaviour for UTF-8 has changed since this issue was raised.

> typeof (new Buffer("\0", "utf8"))[0]
'undefined'

Perhaps we should add this as a caveat emptor to the docs and tell people to pass non-printable characters as an array instead of a string, e.g. [0] instead of "\0" or "\u0000"?
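
A minimal illustration of the suggested workaround, passing the byte values directly so no string encoder is involved:

// Raw byte values sidestep the string encoders entirely.
var viaArray = new Buffer([0]);            // a single 0x00 byte
var viaString = new Buffer("\0", "ascii"); // subject to the substitution above
console.log(viaArray[0]);  // 0
console.log(viaString[0]); // 32 on the build this issue reports against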

@koichik

koichik commented Jul 9, 2011

How about d39f23a?

@bigeasy
Author

bigeasy commented Jul 10, 2011

I am so confused. Don't we have two bugs now? The "ascii" encoding converts '\0' to 32, which is a space, ' ', and 'utf8' converts it to undefined? So the documentation patch not only excuses a bug, it gives you instructions that do not solve the problem but instead trigger a different bug.

Why does the character with the code 0 not translate to a number 0?

The behavior is inconsistent, seemingly arbitrary:

[alan@postojna ~]$ node
> "\0".charAt(0)
'\u0000'
> "\0".charCodeAt(0)
0
> typeof (new Buffer("\0", "utf8"))[0]
'undefined'
> new Buffer([0]).toString("utf8")
'\u0000'
> typeof (new Buffer("\0", "ascii"))[0]
'number'
> (new Buffer("\0", "ascii"))[0]
32

What is the reasoning behind any of it?

@koichik

koichik commented Jul 10, 2011

Yes, we have two problems.

The 'utf8' encoding problem is the same as #394; we can fix it (#1210).
That is why I did not mention it in the documentation patch.

The 'ascii' encoding problem (this issue) is V8's doing.
If we fix it, Node will have to copy the bytes itself, which may be slower.
So I think we won't fix it.
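
For illustration only, a userland sketch of what copying the bytes in Node rather than in V8 could look like (a hypothetical helper, not the patch under discussion; the real code path is C++, which is why koichik worries a fix may be slower):

// Hypothetical helper: encode a string as ASCII byte by byte,
// bypassing V8's ASCII write path.
function asciiBuffer(str) {
  var buf = new Buffer(str.length);
  for (var i = 0; i < str.length; i++) {
    buf[i] = str.charCodeAt(i) & 0x7f; // mask to the 7-bit ASCII range
  }
  return buf;
}
console.log(asciiBuffer("\0")[0]); // 0, the value the reporter expects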

@bnoordhuis
Member

@bigeasy

The bug with UTF-8 is that if the last byte is a nul byte, it gets stripped. Consider:

> Buffer('\0')
<Buffer >
> Buffer('a\0')
<Buffer 61>
> Buffer('a\0b')
<Buffer 61 00 62>
> Buffer('a\0b\0')
<Buffer 61 00 62>

koichik posted a patch for that in #394.

@bigeasy
Author

bigeasy commented Jul 10, 2011

Wow. That is really, really bad. Who is implementing this upstream?

Can we implement a correct encoding in Node.js and skip the capricious V8 implementations? They are not right and there is no justification for why they are the way they are.

Why does Node.js show deference to them? Maybe in Chrome these encodings don't matter, but Node.js is for network programming, and for network programming, correct and fast encoding implementations matter.

@bigeasy
Author

bigeasy commented Jul 10, 2011

I just noticed:

@koichik ~ "The 'ascii' encoding problem (this issue) is V8's doing. If we fix it, Node will have to copy the bytes itself, which may be slower."

If it is written in C, it won't be slower.

@koichik

koichik commented Jul 11, 2011

@bigeasy - I think that Node cannot access the string's data directly, so Node needs to copy twice:
first, copy the Unicode data to a temporary buffer using v8::String::Write;
second, copy from the temporary buffer into the target Buffer, converting the encoding along the way.

@bigeasy
Author

bigeasy commented Jul 11, 2011

@koichik - Icky. I did post an inquiry a while back on the V8 mailing list. Maybe we can patch V8 and lobby for the patch's inclusion? My guess is that the 32 is a copy-and-paste from somewhere else and that it doesn't solve a problem for them. The omitted UTF-8 \0 is some string conversion logic that got pushed into the encoder where it does not belong.

It will probably take some time to figure out how to argue the point. It seems obvious to me that decoding should be agnostic about the string implementation. It is probably very logical to them that 0 is stripped and adulterated. In the last exchange there, I was being told that 0 is not a valid ASCII character.
