-
-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for utf8 chars #7
Conversation
Removed utf-8-validate, as not really needed. |
Wow, this is already 10% faster than the previous implementation. The main difference here is that we depend on bl. Anybody sees a problem with that? |
@heroboy please test. |
push(this, this.mapper(list[i])) | ||
if (needsSplit) { | ||
list = this._list.toString('utf8').split(this.matcher) | ||
if (list) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
won't this always fire? [] is still thruthy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are right, good spot. Removed.
Thanks. It works.
the strings will not be converted to |
Why don't you send a PR, and we compare the perf? |
The second line is broken.
|
|
Please try this one, it's based on StringDecoder, it should be good, and performance is equivalent as the current version. |
I think this one is right.
for large file(about 3MB)
So I think large file bench is better, it make big difference between |
what does |
Just replace |
awesome, are you ok to merge then? |
Very ok. |
@mafintosh any other opinion on this one? |
, remaining = list.pop() | ||
this._last += this._decoder.write(chunk) | ||
|
||
var list = this._last.toString('utf8').split(this.matcher) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
toString() is not needed when we use a string decoder
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done :)
@mcollina other than my minor nitpick comment i'm 👍 |
I think
|
For future reference: $ git clone --depth 1 git@github.com:mcollina/split2.git |
@heroboy I've tried to have a test case where that Also doing |
I'm not very clear about the problem. If the input buffer is not a valid utf8 string, the |
@mafintosh any opinions on this one? |
|
@heroboy I've just pushed a test that reproduce the problem I'm speaking about: https://github.com/mcollina/split2/blob/utf-8-support/test.js#L270-L285 This writes a truncated utf-8 char. If that char is not completed, it does not push anything down regarding that char. |
But in other node module, if the input utf8 buffer is not completed,it will output the last "incorrect" chars.
|
Done, let me know @heroboy. |
Great. I think there are no problems. And thank you very much make effort to fix this issue. |
Fixes #6.
However, I am not happy with this solution, because it depends on https://www.npmjs.com/package/utf-8-validate.
This is a highly popular module, so possibly we should look for a different solution that is JS-only, possibly the same approach that is used by https://github.com/mafintosh/csv-parser, which is very fast.
@mafintosh @maxogden I might need some of your help here