This repository has been archived by the owner on Apr 22, 2023. It is now read-only.

Buffer Incorrectly Encodes "\0" as 32 for ASCII Encoding #297

Closed
bigeasy opened this issue Sep 21, 2010 · 11 comments

@bigeasy

bigeasy commented Sep 21, 2010

To reproduce:

(new Buffer("\0", "ascii"))[0] == 32

Correct for UTF-8:

(new Buffer("\0", "utf8"))[0] == 0

For reference, here is an ASCII table.
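
A consolidated reproduction of both cases, using the new Buffer(string, encoding) constructor as it existed at the time (current Node deprecates it in favour of Buffer.from); the expected result in both cases would be a single 0x00 byte:

// repro.js -- run against the Node build this issue was filed on.
var ascii = new Buffer("\0", "ascii");
var utf8 = new Buffer("\0", "utf8");
console.log(ascii[0]); // reported: 32 (a space)
console.log(utf8[0]);  // reported: 0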

@ry

ry commented Sep 21, 2010

The reason for it is this line:

http://github.com/ry/node/blob/9922e4e433996722a76edb46d14f1729f33b4bed/deps/v8/src/api.cc#L3005

which I am not sure the purpose of... I'll ask in v8-users.
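
One way to probe that substitution from the JavaScript side, without reading the V8 source (the only output asserted here is the one reported in this issue):

// Probe how the "ascii" encoding maps a few code points.
["\0", "A", "\u00e9", "\u00ff"].forEach(function (c) {
  console.log(JSON.stringify(c), (new Buffer(c, "ascii"))[0]);
});
// On the build discussed here, "\0" reportedly comes out as 32;
// what the other characters map to depends on that line in api.cc.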

@ry

ry commented Sep 21, 2010

@bnoordhuis
Member

Side note: the behaviour for UTF-8 has changed since this issue was raised.

> typeof (new Buffer("\0", "utf8"))[0]
'undefined'

Perhaps we should add this as a caveat emptor to the docs and tell people to pass non-printable characters as an array instead of a string, e.g. [0] instead of "\0" or "\u0000"?
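
A minimal illustration of the suggested workaround, passing the byte values directly so no string encoder is involved:

// Raw byte values sidestep the string encoders entirely.
var viaArray = new Buffer([0]);            // a single 0x00 byte
var viaString = new Buffer("\0", "ascii"); // subject to the substitution above
console.log(viaArray[0]);  // 0
console.log(viaString[0]); // 32 on the build this issue reports against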

@koichik

koichik commented Jul 9, 2011

How about d39f23a?

@bigeasy
Author

bigeasy commented Jul 10, 2011

I am so confused. Don't we have two bugs now? The "ascii" encoding converts '\0' to 32, which is a space, ' ', and 'utf8' converts it to undefined? So the documentation patch not only excuses a bug, it gives you instructions that do not solve the problem but instead trigger a different bug.

Why does the character with the code 0 not translate to a number 0?

The behavior is inconsistent, seemingly arbitrary:

[alan@postojna ~]$ node
> "\0".charAt(0)
'\u0000'
> "\0".charCodeAt(0)
0
> typeof (new Buffer("\0", "utf8"))[0]
'undefined'
> new Buffer([0]).toString("utf8")
'\u0000'
> typeof (new Buffer("\0", "ascii"))[0]
'number'
> (new Buffer("\0", "ascii"))[0]
32

What is the reasoning behind any of it?

@koichik

koichik commented Jul 10, 2011

Yes, we have two problems.

The 'utf8' encoding problem is the same as #394; we can fix it (#1210).
That is why I did not mention it in the documentation patch.

The 'ascii' encoding problem (this issue) is V8's doing.
If we fix it, Node will have to copy the bytes itself, which may be slower.
So I think we won't fix it.
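
For illustration only, a userland sketch of what copying the bytes in Node rather than in V8 could look like (a hypothetical helper, not the patch under discussion; the real code path is C++, which is why koichik worries a fix may be slower):

// Hypothetical helper: encode a string as ASCII byte by byte,
// bypassing V8's ASCII write path.
function asciiBuffer(str) {
  var buf = new Buffer(str.length);
  for (var i = 0; i < str.length; i++) {
    buf[i] = str.charCodeAt(i) & 0x7f; // mask to the 7-bit ASCII range
  }
  return buf;
}
console.log(asciiBuffer("\0")[0]); // 0, the value the reporter expects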

@bnoordhuis
Member

@bigeasy

The bug with UTF-8 is that if the last byte is a nul byte, it gets stripped. Consider:

> Buffer('\0')
<Buffer >
> Buffer('a\0')
<Buffer 61>
> Buffer('a\0b')
<Buffer 61 00 62>
> Buffer('a\0b\0')
<Buffer 61 00 62>

koichik posted a patch for that in #394.

@bigeasy
Author

bigeasy commented Jul 10, 2011

Wow. That is really, really bad. Who is implementing this upstream?

Can we implement a correct encoding in Node.js and skip the capricious V8 implementations? They are not right and there is no justification for why they are the way they are.

Why does Node.js show deference to them? Maybe in Chrome these encodings don't matter, but Node.js is for network programming, and for network programming, correct and fast encoding implementations matter.

@bigeasy
Author

bigeasy commented Jul 10, 2011

I just noticed:

@koichik ~ "The 'ascii' encoding problem (this issue) is V8's doing. If we fix it, Node will have to copy the bytes itself, which may be slower."

If it is written in C, it won't be slower.

@koichik

koichik commented Jul 11, 2011

@bigeasy - I think that Node cannot access the string's data directly, so Node needs to copy twice:
first, copy the Unicode data to a temporary buffer using v8::String::Write;
second, copy from the temporary buffer into the target Buffer, converting the encoding along the way.

@bigeasy
Author

bigeasy commented Jul 11, 2011

@koichik - Icky. I did post an inquiry a while back on the V8 mailing list. Maybe we can patch V8 and lobby for the patch's inclusion? My guess is that the 32 is a copy-and-paste from somewhere else and that it doesn't solve a problem for them. The omitted UTF-8 \0 is some string conversion logic that got pushed into the encoder where it does not belong.

It will probably take some time to figure out how to argue the point. It seems obvious to me that decoding should be agnostic about the string implementation. It is probably very logical to them that 0 is stripped and adulterated. In the last exchange there, I was being told that 0 is not a valid ASCII character.
