-
Notifications
You must be signed in to change notification settings - Fork 30k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
string_decoder: support Uint8Array input to methods #11613
Conversation
@@ -85,7 +85,7 @@ assert.strictEqual(decoder.end(), '\ufffd'); | |||
|
|||
// Additional utf8Text test | |||
decoder = new StringDecoder('utf8'); | |||
assert.strictEqual(decoder.text(Buffer.from([0x41]), 2), ''); | |||
assert.strictEqual(decoder.text(Buffer.from([0x41]), 1), ''); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mscdex What was/is testing? The 2
would always going to be out of range…
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was added by @EricPoker in 48f8869.
lib/buffer.js
Outdated
@@ -432,6 +432,16 @@ Object.defineProperty(Buffer.prototype, 'offset', { | |||
} | |||
}); | |||
|
|||
const { | |||
hexSlice, utf8Slice, asciiSlice, latin1Slice, base64Slice, ucs2Slice | |||
} = binding; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I'm really not a fan of this syntax.. but oh well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Me neither. but what are the alternatives if I want to avoid the cost of property lookups? Splitting this into 6 lines? That’s something I can do if you prefer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nah, this is fine as is. Just had to gripe about it ;-)
@mscdex Any further thoughts on this? Otherwise I’d like to land this in the next 1 or 2 days. |
lib/string_decoder.js
Outdated
return buf.toString(this.encoding); | ||
for (const enc of [ 'latin1', 'ascii', 'hex' ]) { | ||
const method = bufferBinding[enc + 'Slice']; | ||
simpleWrite[enc] = (buf) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will create hidden classes, which could add to the lookup overhead.
lib/string_decoder.js
Outdated
@@ -14,6 +16,8 @@ function normalizeEncoding(enc) { | |||
return nenc || enc; | |||
} | |||
|
|||
const simpleWrite = {}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using a prototypeless object would be better, but I'm not sure that having a lookup object like this is best.
Taking into account my comment from below, I would suggest creating a prototypeless object with the properties assigned at the same time using Object.create(null, { ... })
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, it might be worthwhile comparing other lookup strategies, such as using a function that returns the correct function based on the encoding.
Shouldn't this be semver-major if we're now (explicitly) changing the behavior of strings passed to |
@@ -46,9 +50,12 @@ function StringDecoder(encoding) { | |||
this.lastChar = Buffer.allocUnsafe(nb); | |||
} | |||
|
|||
// TODO(addaleax): This method should not accept strings as input. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the comment here. Is the suggestion that in the future it should throw on a string? Otherwise the comment seems at odds with the string check below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the comment here. Is the suggestion that in the future it should throw on a string? Otherwise the comment seems at odds with the string check below.
Yes… do you have different thoughts? It doesn’t really make sense to pass in a string here, does it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just leave the behavior about string inputs as it is and make it throw in another PR? (that would be semver-major, I guess) (EDIT: OK this PR is already semver-major..)
Also, have you benchmarked the |
ping @addaleax :-) |
7014a3e
to
db0181e
Compare
@jasnell Thanks for the ping… I’ve rebased this and edited a bit, @mscdex was right to ask for individual benchmarks for the Here’s the current benchmark situation:
I would be okay with accepting these, especially given how the improvements tend to affect the more common encodings (esp.
@mscdex I wouldn’t consider string input covered as part of the API, but it’s a reasonable point of view. I’m changing the label to be careful. |
lib/string_decoder.js
Outdated
this.end = simpleEnd; | ||
return; | ||
case 'ascii': | ||
this.write = asciiText; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These additions concern me a bit because they make the function size exceed Crankshaft's max inlineable source size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mscdex … yeah. Do you have a better alternative in mind?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd have to look into it.
@jasnell I plan on taking a look this week. |
FWIW I think I've found some more optimizations to avoid the current performance regressions and then some. I am running benchmarks now ... |
Ok, here are the results (compared to the current node master branch) with this PR + my changes to
|
@mscdex Are your modifications pushed somewhere? Are you okay with this change (possibly pending applying them)? |
Not yet, I wasn't sure where to push it for review. |
You can just push to this branch if you like. |
CI is green. @jasnell Do you mind taking another look? |
After #12223, I wonder if it would make sense to add support for all ArrayBuffer views instead of just Uint8Array, here rather than in a later PR. |
@TimothyGu I am not sure that makes sense … for something that decodes byte sequences, shouldn’t the input be an Uint8Array? |
]; | ||
|
||
function translateEncoding(enc) { | ||
if (!enc) return 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it affect performance if we use constants with names instead of number literals for indices?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC yes.
Makes the string slice methods of buffers available on the binding object in addition to the `Buffer` prototype. This enables subsequent `string_decoder` changes to use these methods directly without performance loss, since all parameters are known by the string decoder in those cases.
This is a bit odd since `string_decoder` does currently not perform any type checking. Also, this adds an explicit check for `string` input, which does not really make sense but is relied upon by our test suite.
b67e1e9
to
525fabd
Compare
Rebased. @mscdex Do my changes LGTY? This is basically only waiting for a second CTC member approval. |
inline void StringSlice<UCS2>(const FunctionCallbackInfo<Value>& args, | ||
Local<Value> buffer_arg, | ||
Local<Value> start_arg, | ||
Local<Value> end_arg) { | ||
Isolate* isolate = args.GetIsolate(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we could match StringSlice()
above and instead do Isolate* isolate = env->isolate();
for consistency?
LGTM with one minor nit that shouldn't block this from landing. CI again: https://ci.nodejs.org/job/node-test-pull-request/7898/ |
sequence.forEach((write) => { | ||
output += decoder.write(input.slice(write[0], write[1])); | ||
for (const useUint8array of [ false, true ]) { | ||
sequences.forEach((sequence) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While at it, change this to a for-of loop?
Well maybe, but I feel it is plausible for the user to use a Uint16Array for UTF-16/UCS-2 input, for example |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with a polyfill for https://github.com/rvagg/string_decoder (it can be down there too!)
Needs a rebase but I guess this is otherwise pretty much good to go? |
Ping @addaleax I would love to get this in and I think this only needs a rebase. Otherwise I would go ahead and close this. |
@BridgeAR benchmarks would probably need to be re-ran before merging because this was all done before TurboFan. |
@BridgeAR @mcollina’s review was dependent on there being a polyfill for the corresponding npm module, which I haven’t done yet; feel free to take this over if you like @mscdex I agree, but I wouldn’t expect much of a difference since I don’t think TurboFan had impact on how native bindings are called |
@addaleax I was referring more to the js-land stuff, especially the commit I pushed. |
@BridgeAR this can land independently, we pick the content from core releases, so we can fetch them But good catch on the other PR, I'll get it updated and landed. This is semver-major anyway, so we have time. |
Ping @addaleax |
Closing due to long inactivity. @addaleax please reopen if you want to follow up on this :-) (in that case the benchmarks should be rerun though). |
Yeah, I think there’s no point in pursuing this given that we now have |
This includes a bit of refactoring for the
Buffer
internals to keep up performance. Some quick benchmark results (only thestring-decoder
benchmark, excluding the bigger input/chunk sizes and with reducedn
):Checklist
make -j4 test
(UNIX), orvcbuild test
(Windows) passesAffected core subsystem(s)
string_decoder, buffer