
Support for utf8 chars #7

Merged

merged 13 commits into master from utf-8-support on Jan 20, 2016
Conversation

mcollina
Owner

Fixes #6.

However, I am not happy with this solution, because it depends on https://www.npmjs.com/package/utf-8-validate.

This is a highly popular module, so possibly we should look for a different solution that is JS-only, possibly the same approach that is used by https://github.com/mafintosh/csv-parser, which is very fast.

@mafintosh @maxogden I might need some of your help here

@mcollina
Owner Author

Removed utf-8-validate, as not really needed.

@mcollina
Owner Author

Wow, this is already 10% faster than the previous implementation.

The main difference here is that we depend on bl.

Does anybody see a problem with that?

@mcollina
Owner Author

@heroboy please test.

push(this, this.mapper(list[i]))
if (needsSplit) {
list = this._list.toString('utf8').split(this.matcher)
if (list) {
Collaborator

won't this always fire? [] is still truthy.

Owner Author

you are right, good spot. Removed.

@heroboy

heroboy commented Jan 19, 2016

Thanks. It works.
I have two suggestions to improve performance.

  1. Use https://nodejs.org/api/string_decoder.html to convert chunks to strings; it remembers the last unused bytes, so you don't need to store the buffer list.
  2. Add {decodeStrings: false} to the through options. Then, with code like:

var s = split();
s.write("aaaaa");

the strings will not be converted to a Buffer only for you to convert them back again. But you would need to check the type of each chunk.

@mcollina
Owner Author

  1. No, you would still need to store things: the StringDecoder only remembers the last bytes of a partial multibyte utf8 character, not the full line.
  2. Possibly, but then I won't be able to use bl.

Why don't you send a PR, and we compare the perf?

@heroboy

heroboy commented Jan 19, 2016

The second line is broken.

var s = split();
s.on("data", data=> { 
    console.log(data);
})

var str = "烫烫烫\r\n烫烫烫"; // each 烫 is 3 bytes in utf8
var buf = new Buffer(str, "utf8");
// writing 2 bytes at a time splits every character across chunks
for (var i = 0; i < buf.length; i += 2)
{
    s.write(buf.slice(i, i + 2));
}
s.end();

@heroboy

heroboy commented Jan 19, 2016

  1. I don't know how to use git.
  2. Version control systems like git need to download all the history, and the experience is not good on China's network.
    But I will try to make my version and compare the perf.

@mcollina
Owner Author

Please try this one. It's based on StringDecoder; it should be good, and performance is equivalent to the current version.

@heroboy

heroboy commented Jan 19, 2016

I think this one is right.
This is my bench result.
For the small file "package.json":

benchSplit*10000: 1394.734ms
benchOldSplit*10000: 1327.251ms
benchThrough*10000: 1062.809ms
benchSplit*10000: 1125.691ms
benchOldSplit*10000: 1156.593ms
benchThrough*10000: 985.630ms

For a large file (about 3MB):

benchSplit*100: 3900.208ms
benchOldSplit*100: 4079.545ms
benchThrough*100: 200.518ms
benchSplit*100: 3863.971ms
benchOldSplit*100: 3941.405ms
benchThrough*100: 195.780ms

So I think the large-file bench is better; it shows the big difference between through() and split().

@mcollina
Owner Author

what does benchThrough do?

@heroboy

heroboy commented Jan 19, 2016

Just replace split() with require('through2')()

@mcollina
Owner Author

awesome, are you ok to merge then?

@heroboy

heroboy commented Jan 19, 2016

Very ok.

@mcollina
Owner Author

@mafintosh any other opinion on this one?

this._last += this._decoder.write(chunk)

var list = this._last.toString('utf8').split(this.matcher)
  , remaining = list.pop()
Collaborator

toString() is not needed when we use a string decoder

Owner Author

done :)

@mafintosh
Collaborator

@mcollina other than my minor nitpick comment i'm 👍

@heroboy

heroboy commented Jan 19, 2016

I think flush should be this:

function flush(cb) {
  var str = this._last + this._decoder.end();
  if (str)
    push(this, this.mapper(str))

  cb()
}

@yoshuawuyts

@heroboy

Version control systems like git need to download all the history

For future reference: git allows you to specify a --depth flag when cloning which greatly improves download speeds. Cheers! ✨

$ git clone --depth 1 git@github.com:mcollina/split2.git

@mcollina
Owner Author

@heroboy I've tried to write a test case where that flush is needed, but I can't come up with anything.

Also doing var str = this._last + this._decoder.end(); is going to return a broken utf8 string.

@heroboy

heroboy commented Jan 19, 2016

I'm not very clear about the problem. If the input buffer is not valid utf8, decoder.end() will use the last unused bytes to form an incorrect character. Is that needed? And maybe a future StringDecoder will cache bytes and emit strings later?

@mcollina
Owner Author

flush is called after everything incoming has been processed, so it means an incomplete utf8 char has been written and the stream has probably been cut. If there is anything left in the StringDecoder, it's probably better to emit an error, or just skip it.

@mafintosh any opinions on this one?

@heroboy

heroboy commented Jan 19, 2016

through({encoding:"utf8"}) doesn't emit an error in this case:

var s = through({encoding:"utf8"});
s.on("data", data=> { 
    console.log("data",data);
})

var str = "烫烫烫\r\n烫烫烫";
var buf = new Buffer(str, "utf8");
// write one byte at a time and drop the final byte,
// leaving a truncated utf8 character at the end
for (var i = 0; i < buf.length - 1; i += 1)
{
    s.write(buf.slice(i, i + 1));
}
s.end();

@mcollina
Owner Author

@heroboy I've just pushed a test that reproduces the problem I'm speaking about:

https://github.com/mcollina/split2/blob/utf-8-support/test.js#L270-L285

This writes a truncated utf-8 char. If that char is never completed, nothing is pushed downstream for it.

@heroboy

heroboy commented Jan 19, 2016

But in other node modules, if the input utf8 buffer is incomplete, the last "incorrect" chars are still output.

var fs = require("fs");
var buf = new Buffer("烫烫", "utf8");
fs.writeFileSync("broken.txt", buf.slice(0, buf.length - 1));
var s = fs.readFileSync("broken.txt", "utf8");
console.log(s);// output is "烫��"

@mcollina
Owner Author

Done, let me know @heroboy.

@heroboy

heroboy commented Jan 20, 2016

Great. I think there are no problems. Thank you very much for making the effort to fix this issue.

@mcollina mcollina merged commit a9825aa into master Jan 20, 2016
@mcollina mcollina deleted the utf-8-support branch June 28, 2016 08:18
4 participants