
Support for utf8 chars #7

Merged

merged 13 commits into master from utf-8-support on Jan 20, 2016
Conversation

mcollina
Owner

Fixes #6.

However, I am not happy with this solution, because it depends on https://www.npmjs.com/package/utf-8-validate.

This is a highly popular module, so possibly we should look for a different solution that is JS-only, possibly the same approach that is used by https://github.com/mafintosh/csv-parser, which is very fast.

@mafintosh @maxogden I might need some of your help here

@mcollina
Owner Author

Removed utf-8-validate, as not really needed.

@mcollina
Owner Author

Wow, this is already 10% faster than the previous implementation.

The main difference here is that we depend on bl.

Does anybody see a problem with that?

@mcollina
Owner Author

@heroboy please test.

push(this, this.mapper(list[i]))
if (needsSplit) {
list = this._list.toString('utf8').split(this.matcher)
if (list) {
Collaborator

won't this always fire? [] is still truthy.

Owner Author

you are right, good spot. Removed.

@heroboy

heroboy commented Jan 19, 2016

Thanks. It works.
I have two suggestions to improve performance.

  1. Use https://nodejs.org/api/string_decoder.html to convert chunks to strings; it remembers the last unused bytes, so you don't need to store the buffer list.
  2. Add {decodeStrings: false} to the through options. Then, with code like:

var s = split();
s.write("aaaaa");

the strings will not be converted to a Buffer only for you to convert them back again. But you would need to check the type of each chunk.

@mcollina
Owner Author

  1. No, you would still need to store things: the StringDecoder only remembers the last bytes of a partial multibyte utf8 character, not the full line.
  2. Possibly, but then I won't be able to use bl.

Why don't you send a PR, and we compare the perf?

@heroboy

heroboy commented Jan 19, 2016

The second line is broken.

var s = split();
s.on("data", data=> { 
    console.log(data);
})

var str = "烫烫烫\r\n烫烫烫"; // each 烫 is 3 bytes in utf8
var buf = new Buffer(str, "utf8");
// writing 2 bytes at a time splits every character across chunks
for (var i = 0; i < buf.length; i += 2)
{
    s.write(buf.slice(i, i + 2));
}
s.end();

@heroboy

heroboy commented Jan 19, 2016

  1. I don't know how to use git.
  2. Version control systems like git need to download all the history, and the experience is not good on China's network.
    But I will try to make my version and compare the perf.

@mcollina
Owner Author

Please try this one. It's based on StringDecoder; it should be good, and performance is equivalent to the current version.

@heroboy

heroboy commented Jan 19, 2016

I think this one is right.
This is my bench result.
For the small file "package.json":

benchSplit*10000: 1394.734ms
benchOldSplit*10000: 1327.251ms
benchThrough*10000: 1062.809ms
benchSplit*10000: 1125.691ms
benchOldSplit*10000: 1156.593ms
benchThrough*10000: 985.630ms

For a large file (about 3MB):

benchSplit*100: 3900.208ms
benchOldSplit*100: 4079.545ms
benchThrough*100: 200.518ms
benchSplit*100: 3863.971ms
benchOldSplit*100: 3941.405ms
benchThrough*100: 195.780ms

So I think the large-file bench is better; it shows the big difference between through() and split().

@mcollina
Owner Author

what does benchThrough do?

@heroboy

heroboy commented Jan 19, 2016

Just replace split() with require('through2')()

@mcollina
Owner Author

awesome, are you ok to merge then?

@heroboy

heroboy commented Jan 19, 2016

Very ok.

@mcollina
Owner Author

@mafintosh any other opinion on this one?

this._last += this._decoder.write(chunk)

var list = this._last.toString('utf8').split(this.matcher)
  , remaining = list.pop()
Collaborator

toString() is not needed when we use a string decoder

Owner Author

done :)

@mafintosh
Collaborator

@mcollina other than my minor nitpick comment i'm 👍

@heroboy

heroboy commented Jan 19, 2016

I think flush should be this:

function flush(cb) {
  var str = this._last + this._decoder.end();
  if (str)
    push(this, this.mapper(str))

  cb()
}

@yoshuawuyts

@heroboy

Version control systems like git need to download all the history

For future reference: git allows you to specify a --depth flag when cloning which greatly improves download speeds. Cheers! ✨

$ git clone --depth 1 git@github.com:mcollina/split2.git

@mcollina
Owner Author

@heroboy I've tried to write a test case where that flush is needed, but I can't come up with anything.

Also doing var str = this._last + this._decoder.end(); is going to return a broken utf8 string.

@heroboy

heroboy commented Jan 19, 2016

I'm not very clear about the problem. If the input buffer is not valid utf8, decoder.end() will use the last unused bytes to form an incorrect character. Is that needed? And maybe a future StringDecoder will cache bytes and emit strings later?

@mcollina
Owner Author

flush is called after everything incoming has been processed, so it means an incomplete utf8 char has been written and the stream has probably been cut. If there is anything left in the StringDecoder, it's probably better to emit an error, or just skip it.

@mafintosh any opinions on this one?

@heroboy

heroboy commented Jan 19, 2016

through({encoding:"utf8"}) doesn't emit an error in this case:

var s = through({encoding:"utf8"});
s.on("data", data=> { 
    console.log("data",data);
})

var str = "烫烫烫\r\n烫烫烫";
var buf = new Buffer(str, "utf8");
// write one byte at a time and drop the final byte,
// leaving a truncated utf8 character at the end
for (var i = 0; i < buf.length - 1; i += 1)
{
    s.write(buf.slice(i, i + 1));
}
s.end();

@mcollina
Owner Author

@heroboy I've just pushed a test that reproduces the problem I'm speaking about:

https://github.com/mcollina/split2/blob/utf-8-support/test.js#L270-L285

This writes a truncated utf-8 char. If that char is never completed, nothing is pushed downstream for it.

@heroboy

heroboy commented Jan 19, 2016

But in other node modules, if the input utf8 buffer is incomplete, the last "incorrect" chars are still output.

var fs = require("fs");
var buf = new Buffer("烫烫", "utf8");
fs.writeFileSync("broken.txt", buf.slice(0, buf.length - 1));
var s = fs.readFileSync("broken.txt", "utf8");
console.log(s);// output is "烫��"

@mcollina
Owner Author

Done, let me know @heroboy.

@heroboy

heroboy commented Jan 20, 2016

Great. I think there are no problems. Thank you very much for making the effort to fix this issue.

@mcollina mcollina merged commit a9825aa into master Jan 20, 2016
@mcollina mcollina deleted the utf-8-support branch June 28, 2016 08:18
4 participants