string_decoder: support Uint8Array input to methods #11613

Closed
wants to merge 3 commits into from


addaleax
Member

This includes a bit of refactoring for the Buffer internals to keep up performance. Some quick benchmark results (only the string-decoder benchmark, excluding the bigger input/chunk sizes and with reduced n):

$ ./node benchmark/compare.js --new ./node --old ./node-d08836003c57 --runs 5 --filter string-decoder.js string_decoder | Rscript benchmark/compare.R
[00:01:37|% 100| 1/1 files | 10/10 runs | 20/20 configs]: Done
                                                                                      improvement confidence      p.value
 string_decoder/string-decoder.js n=250000 chunk=16 inlen=128 encoding="ascii"            14.65 %        *** 1.092735e-06
 string_decoder/string-decoder.js n=250000 chunk=16 inlen=128 encoding="base64-ascii"      8.52 %        *** 6.202910e-05
 string_decoder/string-decoder.js n=250000 chunk=16 inlen=128 encoding="base64-utf8"       5.27 %            7.176679e-02
 string_decoder/string-decoder.js n=250000 chunk=16 inlen=128 encoding="utf16le"           4.37 %          * 1.891394e-02
 string_decoder/string-decoder.js n=250000 chunk=16 inlen=128 encoding="utf8"             14.01 %        *** 1.475088e-04
 string_decoder/string-decoder.js n=250000 chunk=16 inlen=32 encoding="ascii"             20.33 %        *** 1.058625e-08
 string_decoder/string-decoder.js n=250000 chunk=16 inlen=32 encoding="base64-ascii"       8.51 %          * 1.025246e-02
 string_decoder/string-decoder.js n=250000 chunk=16 inlen=32 encoding="base64-utf8"       -0.67 %            8.260933e-01
 string_decoder/string-decoder.js n=250000 chunk=16 inlen=32 encoding="utf16le"           10.16 %        *** 6.193190e-05
 string_decoder/string-decoder.js n=250000 chunk=16 inlen=32 encoding="utf8"               7.54 %        *** 6.250636e-04
 string_decoder/string-decoder.js n=250000 chunk=64 inlen=128 encoding="ascii"            16.56 %        *** 1.856548e-05
 string_decoder/string-decoder.js n=250000 chunk=64 inlen=128 encoding="base64-ascii"      8.68 %        *** 2.254509e-05
 string_decoder/string-decoder.js n=250000 chunk=64 inlen=128 encoding="base64-utf8"       6.20 %            7.669403e-02
 string_decoder/string-decoder.js n=250000 chunk=64 inlen=128 encoding="utf16le"           7.12 %        *** 2.359718e-05
 string_decoder/string-decoder.js n=250000 chunk=64 inlen=128 encoding="utf8"              3.33 %          * 1.040546e-02
 string_decoder/string-decoder.js n=250000 chunk=64 inlen=32 encoding="ascii"             18.07 %        *** 9.677031e-07
 string_decoder/string-decoder.js n=250000 chunk=64 inlen=32 encoding="base64-ascii"      12.49 %         ** 4.455920e-03
 string_decoder/string-decoder.js n=250000 chunk=64 inlen=32 encoding="base64-utf8"        2.96 %            2.299725e-01
 string_decoder/string-decoder.js n=250000 chunk=64 inlen=32 encoding="utf16le"            9.82 %        *** 8.056526e-06
 string_decoder/string-decoder.js n=250000 chunk=64 inlen=32 encoding="utf8"               7.09 %        *** 7.882838e-04
Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • tests and/or benchmarks are included
  • documentation is changed or added
  • commit message follows commit guidelines
Affected core subsystem(s)

string_decoder, buffer
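
As an illustration of the use case, here is a minimal sketch of decoding `Uint8Array` chunks with `StringDecoder`. It does not use the code from this PR: the `asBuffer` and `decodeChunks` helpers are hypothetical names, and the sketch wraps each chunk in a `Buffer` view so that it does not depend on whether `write()` accepts a plain `Uint8Array` in a given Node.js release.

```javascript
'use strict';
const { StringDecoder } = require('string_decoder');

// Hypothetical helper: view the bytes of a Uint8Array as a Buffer (no copy).
function asBuffer(u8) {
  return Buffer.from(u8.buffer, u8.byteOffset, u8.byteLength);
}

// Hypothetical helper: decode a sequence of Uint8Array chunks into one string.
function decodeChunks(chunks, encoding = 'utf8') {
  const decoder = new StringDecoder(encoding);
  let out = '';
  for (const chunk of chunks) {
    // write() buffers incomplete multi-byte characters between calls.
    out += decoder.write(asBuffer(chunk));
  }
  return out + decoder.end();
}

// U+20AC (euro sign) is three bytes in UTF-8; split it across two chunks.
const euro = new Uint8Array([0xe2, 0x82, 0xac]);
console.log(decodeChunks([euro.subarray(0, 1), euro.subarray(1)]));
```

The point of the PR is to make the `asBuffer` step unnecessary for callers.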

@addaleax addaleax added buffer Issues and PRs related to the buffer subsystem. semver-minor PRs that contain new features and should be released in the next minor version. string_decoder Issues and PRs related to the string_decoder subsystem. labels Feb 28, 2017
@nodejs-github-bot nodejs-github-bot added buffer Issues and PRs related to the buffer subsystem. c++ Issues and PRs that require attention from people who are familiar with C++. string_decoder Issues and PRs related to the string_decoder subsystem. labels Feb 28, 2017
@@ -85,7 +85,7 @@ assert.strictEqual(decoder.end(), '\ufffd');

// Additional utf8Text test
decoder = new StringDecoder('utf8');
-assert.strictEqual(decoder.text(Buffer.from([0x41]), 2), '');
+assert.strictEqual(decoder.text(Buffer.from([0x41]), 1), '');
Member Author

@mscdex What was/is this testing? The 2 was always going to be out of range…

Contributor

This was added by @EricPoker in 48f8869.

@addaleax addaleax requested review from mscdex and jasnell February 28, 2017 17:17
lib/buffer.js Outdated
@@ -432,6 +432,16 @@ Object.defineProperty(Buffer.prototype, 'offset', {
}
});

const {
hexSlice, utf8Slice, asciiSlice, latin1Slice, base64Slice, ucs2Slice
} = binding;
Member

nit: I'm really not a fan of this syntax.. but oh well.

Member Author

Me neither, but what are the alternatives if I want to avoid the cost of property lookups? Splitting this into 6 lines? That’s something I can do if you prefer.

Member

nah, this is fine as is. Just had to gripe about it ;-)
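
For readers comparing the two styles discussed in this thread, a small sketch follows. The `binding` object here is a hypothetical stand-in (the real internal binding is not reachable from user code), and only two of the six methods are shown; the names `hexSliceLong` and `utf8SliceLong` exist only for the comparison.

```javascript
'use strict';

// Hypothetical stand-in for the internal binding object.
const binding = {
  hexSlice: (buf) => buf.toString('hex'),
  utf8Slice: (buf) => buf.toString('utf8'),
};

// Destructuring form: each property is read once, when this statement runs.
const { hexSlice, utf8Slice } = binding;

// Equivalent long form, one line per method.
const hexSliceLong = binding.hexSlice;
const utf8SliceLong = binding.utf8Slice;

// Both forms bind the same function objects to local constants.
console.log(hexSlice === hexSliceLong, utf8Slice === utf8SliceLong);
console.log(hexSlice(Buffer.from([0xde, 0xad])));
```

Either way, later calls go through a local binding instead of a property lookup on `binding`.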

@addaleax
Member Author

addaleax commented Mar 6, 2017

@mscdex Any further thoughts on this? Otherwise I’d like to land this in the next 1 or 2 days.

return buf.toString(this.encoding);
for (const enc of [ 'latin1', 'ascii', 'hex' ]) {
const method = bufferBinding[enc + 'Slice'];
simpleWrite[enc] = (buf) => {
Contributor

This will create hidden classes, which could add to the lookup overhead.

@@ -14,6 +16,8 @@ function normalizeEncoding(enc) {
return nenc || enc;
}

const simpleWrite = {};
Contributor

@mscdex mscdex Mar 7, 2017

Using a prototypeless object would be better, but I'm not sure that having a lookup object like this is best.

Taking into account my comment from below, I would suggest creating a prototypeless object with the properties assigned at the same time using Object.create(null, { ... }).

Contributor

Also, it might be worthwhile comparing other lookup strategies, such as using a function that returns the correct function based on the encoding.
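
The two lookup strategies suggested in this thread could look roughly like the sketch below. This is not the PR's code: the writer functions call `buf.toString()` instead of the binding's `*Slice` methods, and `getSimpleWrite` is a hypothetical name. Which variant is faster would have to be measured.

```javascript
'use strict';

// Strategy 1: a prototypeless lookup table whose properties are all defined
// at creation time via Object.create(null, descriptors).
const simpleWrite = Object.create(null, {
  latin1: { value: (buf) => buf.toString('latin1'), enumerable: true },
  ascii: { value: (buf) => buf.toString('ascii'), enumerable: true },
  hex: { value: (buf) => buf.toString('hex'), enumerable: true },
});

// Strategy 2: a function that returns the right writer for an encoding.
function latin1Write(buf) { return buf.toString('latin1'); }
function asciiWrite(buf) { return buf.toString('ascii'); }
function hexWrite(buf) { return buf.toString('hex'); }

function getSimpleWrite(enc) {
  switch (enc) {
    case 'latin1': return latin1Write;
    case 'ascii': return asciiWrite;
    case 'hex': return hexWrite;
    default: return undefined;
  }
}

// The table has no inherited keys such as 'toString'.
console.log('toString' in simpleWrite);
console.log(simpleWrite.hex(Buffer.from([0xff])));
console.log(getSimpleWrite('ascii')(Buffer.from('ok')));
```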

@mscdex
Contributor

mscdex commented Mar 7, 2017

Shouldn't this be semver-major if we're now (explicitly) changing the behavior of strings passed to .write()?

@@ -46,9 +50,12 @@ function StringDecoder(encoding) {
this.lastChar = Buffer.allocUnsafe(nb);
}

// TODO(addaleax): This method should not accept strings as input.
Contributor

I don't understand the comment here. Is the suggestion that in the future it should throw on a string? Otherwise the comment seems at odds with the string check below.

Member Author

> I don't understand the comment here. Is the suggestion that in the future it should throw on a string? Otherwise the comment seems at odds with the string check below.

Yes… do you have different thoughts? It doesn’t really make sense to pass in a string here, does it?

Member

@joyeecheung joyeecheung May 5, 2017

Maybe just leave the behavior about string inputs as it is and make it throw in another PR? (that would be semver-major, I guess) (EDIT: OK this PR is already semver-major..)
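
To make the trade-off concrete, here is a sketch of what an explicit input check around `write()` might look like. It is an assumption-laden illustration, not the PR's implementation: `writeChecked` is a hypothetical wrapper, and returning string input unchanged is only one plausible reading of the "explicit check for `string` input" that the test suite relies on.

```javascript
'use strict';
const { StringDecoder } = require('string_decoder');

// Hypothetical wrapper illustrating an explicit type check.
function writeChecked(decoder, input) {
  // Assumed legacy behavior: strings are passed through untouched.
  if (typeof input === 'string') return input;
  // Buffer is a subclass of Uint8Array, so this accepts both.
  if (!(input instanceof Uint8Array)) {
    throw new TypeError('input must be a string, Buffer, or Uint8Array');
  }
  // Wrap the bytes in a Buffer view (no copy) before decoding.
  return decoder.write(
    Buffer.from(input.buffer, input.byteOffset, input.byteLength));
}

const decoder = new StringDecoder('utf8');
console.log(writeChecked(decoder, 'abc'));
console.log(writeChecked(decoder, new Uint8Array([0x41])));
```

Making string input throw instead would mean replacing the first branch with a `TypeError`, which is the semver-major change discussed above.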

@mscdex
Contributor

mscdex commented Mar 7, 2017

Also, have you benchmarked the Buffer changes independently?

@jasnell
Member

jasnell commented Mar 17, 2017

ping @addaleax :-)

@addaleax addaleax force-pushed the string-decoder-uint8array branch from 7014a3e to db0181e Compare March 20, 2017 20:42
@addaleax
Member Author

@jasnell Thanks for the ping…

I’ve rebased this and edited it a bit; @mscdex was right to ask for individual benchmarks for the Buffer changes. Instead, the *Slice methods are now made available twice, once on the Buffer prototype and once on the binding.

Here’s the current benchmark situation:

$ ./node benchmark/compare.js --new ./node --old ./node-bd496e0187 --runs 15 string_decoder | Rscript benchmark/compare.R
[00:07:51|% 100| 2/2 files | 30/30 runs | 20/20 configs]: Done
                                                                                      improvement confidence      p.value
 string_decoder/string-decoder-create.js n=2500000 encoding="ascii"                      -25.22 %        *** 3.102600e-30
 string_decoder/string-decoder-create.js n=2500000 encoding="AscII"                      -14.26 %        *** 7.347390e-14
 string_decoder/string-decoder-create.js n=2500000 encoding="base64"                      -4.54 %        *** 2.573403e-07
 string_decoder/string-decoder-create.js n=2500000 encoding="ucs2"                        -6.64 %        *** 6.163783e-09
 string_decoder/string-decoder-create.js n=2500000 encoding="UTF-16LE"                    -3.15 %         ** 8.447201e-03
 string_decoder/string-decoder-create.js n=2500000 encoding="utf-8"                       -7.28 %        *** 1.657616e-17
 string_decoder/string-decoder-create.js n=2500000 encoding="utf8"                        -7.37 %        *** 7.429393e-10
 string_decoder/string-decoder-create.js n=2500000 encoding="UTF-8"                        0.21 %            8.402134e-01
 string_decoder/string-decoder.js n=250000 chunk=16 inlen=128 encoding="ascii"            23.66 %        *** 2.569594e-23
 string_decoder/string-decoder.js n=250000 chunk=16 inlen=128 encoding="base64-ascii"    -10.56 %        *** 1.821717e-08
 string_decoder/string-decoder.js n=250000 chunk=16 inlen=128 encoding="base64-utf8"      -9.02 %        *** 1.316690e-06
 string_decoder/string-decoder.js n=250000 chunk=16 inlen=128 encoding="utf16le"          11.54 %        *** 9.783854e-05
 string_decoder/string-decoder.js n=250000 chunk=16 inlen=128 encoding="utf8"              5.14 %        *** 1.415359e-06
 string_decoder/string-decoder.js n=250000 chunk=16 inlen=32 encoding="ascii"             28.12 %        *** 1.476634e-22
 string_decoder/string-decoder.js n=250000 chunk=16 inlen=32 encoding="base64-ascii"      -9.58 %        *** 2.745890e-06
 string_decoder/string-decoder.js n=250000 chunk=16 inlen=32 encoding="base64-utf8"      -10.01 %        *** 6.174522e-08
 string_decoder/string-decoder.js n=250000 chunk=16 inlen=32 encoding="utf16le"            9.09 %        *** 8.884515e-07
 string_decoder/string-decoder.js n=250000 chunk=16 inlen=32 encoding="utf8"               9.04 %        *** 7.448446e-10
 string_decoder/string-decoder.js n=250000 chunk=64 inlen=128 encoding="ascii"            27.16 %        *** 3.700006e-14
 string_decoder/string-decoder.js n=250000 chunk=64 inlen=128 encoding="base64-ascii"    -11.61 %        *** 1.095245e-07
 string_decoder/string-decoder.js n=250000 chunk=64 inlen=128 encoding="base64-utf8"     -11.04 %        *** 7.624943e-08
 string_decoder/string-decoder.js n=250000 chunk=64 inlen=128 encoding="utf16le"           8.53 %        *** 1.069139e-11
 string_decoder/string-decoder.js n=250000 chunk=64 inlen=128 encoding="utf8"              3.68 %        *** 4.301468e-05
 string_decoder/string-decoder.js n=250000 chunk=64 inlen=32 encoding="ascii"             28.03 %        *** 3.022976e-20
 string_decoder/string-decoder.js n=250000 chunk=64 inlen=32 encoding="base64-ascii"      -7.10 %        *** 1.286362e-06
 string_decoder/string-decoder.js n=250000 chunk=64 inlen=32 encoding="base64-utf8"       -7.74 %        *** 1.830142e-05
 string_decoder/string-decoder.js n=250000 chunk=64 inlen=32 encoding="utf16le"            9.93 %        *** 6.290946e-04
 string_decoder/string-decoder.js n=250000 chunk=64 inlen=32 encoding="utf8"               8.06 %        *** 8.187544e-09

I would be okay with accepting these, especially given how the improvements tend to affect the more common encodings (esp. utf8).

> Shouldn't this be semver-major if we're now (explicitly) changing the behavior of strings passed to .write()?

@mscdex I wouldn’t consider string input covered as part of the API, but it’s a reasonable point of view. I’m changing the label to be careful.

@addaleax addaleax added semver-major PRs that contain breaking changes and should be released in the next major version. and removed semver-minor PRs that contain new features and should be released in the next minor version. labels Mar 20, 2017
this.end = simpleEnd;
return;
case 'ascii':
this.write = asciiText;
Contributor

@mscdex mscdex Mar 20, 2017

These additions concern me a bit because they make the function size exceed Crankshaft's max inlineable source size.

Member Author

@mscdex … yeah. Do you have a better alternative in mind?

Contributor

I'd have to look into it.

@jasnell
Member

jasnell commented Apr 4, 2017

ping @addaleax @mscdex .. what do you want to do with this one? If this is going to make it into 8.0.0 it needs to get landed this week. Today is technically the cut-off, but I'll be doing the "release candidate" build next Tuesday, so there's a slight bit more time.

@mscdex
Contributor

mscdex commented Apr 4, 2017

@jasnell I plan on taking a look this week.

@mscdex
Contributor

mscdex commented Apr 7, 2017

FWIW I think I've found some more optimizations to avoid the current performance regressions and then some. I am running benchmarks now ...

@mscdex
Contributor

mscdex commented Apr 7, 2017

Ok, here are the results (compared to the current node master branch) with this PR + my changes to StringDecoder's encoding normalization:

                                                                                        improvement confidence      p.value
 string_decoder/string-decoder-create.js n=25000000 encoding="ascii"                       37.19 %        *** 7.731410e-79
 string_decoder/string-decoder-create.js n=25000000 encoding="AscII"                       33.64 %        *** 1.707788e-27
 string_decoder/string-decoder-create.js n=25000000 encoding="base64"                      40.86 %        *** 2.873623e-47
 string_decoder/string-decoder-create.js n=25000000 encoding="ucs2"                        27.75 %        *** 9.452042e-60
 string_decoder/string-decoder-create.js n=25000000 encoding="UTF-16LE"                    25.59 %        *** 1.022009e-54
 string_decoder/string-decoder-create.js n=25000000 encoding="utf-8"                       31.80 %        *** 1.731379e-56
 string_decoder/string-decoder-create.js n=25000000 encoding="UTF-8"                       32.08 %        *** 1.138630e-72
 string_decoder/string-decoder-create.js n=25000000 encoding="utf8"                        32.78 %        *** 4.094873e-61
 string_decoder/string-decoder.js n=2500000 chunk=16 inlen=128 encoding="ascii"            22.32 %        *** 1.476634e-48
 string_decoder/string-decoder.js n=2500000 chunk=16 inlen=128 encoding="base64-ascii"      0.88 %            4.800656e-01
 string_decoder/string-decoder.js n=2500000 chunk=16 inlen=128 encoding="base64-utf8"       0.53 %            7.488514e-01
 string_decoder/string-decoder.js n=2500000 chunk=16 inlen=128 encoding="utf16le"           9.56 %        *** 2.426733e-32
 string_decoder/string-decoder.js n=2500000 chunk=16 inlen=128 encoding="utf8"              5.75 %        *** 2.816096e-28
 string_decoder/string-decoder.js n=2500000 chunk=16 inlen=32 encoding="ascii"             21.25 %        *** 2.168653e-39
 string_decoder/string-decoder.js n=2500000 chunk=16 inlen=32 encoding="base64-ascii"       1.45 %          * 2.909510e-02
 string_decoder/string-decoder.js n=2500000 chunk=16 inlen=32 encoding="base64-utf8"        2.09 %            1.014384e-01
 string_decoder/string-decoder.js n=2500000 chunk=16 inlen=32 encoding="utf16le"            9.81 %        *** 1.713600e-09
 string_decoder/string-decoder.js n=2500000 chunk=16 inlen=32 encoding="utf8"               7.90 %        *** 2.832152e-07
 string_decoder/string-decoder.js n=2500000 chunk=64 inlen=128 encoding="ascii"            26.26 %        *** 3.872336e-32
 string_decoder/string-decoder.js n=2500000 chunk=64 inlen=128 encoding="base64-ascii"     -0.13 %            8.786887e-01
 string_decoder/string-decoder.js n=2500000 chunk=64 inlen=128 encoding="base64-utf8"      -0.43 %            6.920594e-01
 string_decoder/string-decoder.js n=2500000 chunk=64 inlen=128 encoding="utf16le"           6.79 %        *** 6.201167e-16
 string_decoder/string-decoder.js n=2500000 chunk=64 inlen=128 encoding="utf8"              4.28 %        *** 1.457799e-10
 string_decoder/string-decoder.js n=2500000 chunk=64 inlen=32 encoding="ascii"             24.15 %        *** 3.802638e-17
 string_decoder/string-decoder.js n=2500000 chunk=64 inlen=32 encoding="base64-ascii"      -0.92 %            4.692021e-01
 string_decoder/string-decoder.js n=2500000 chunk=64 inlen=32 encoding="base64-utf8"        2.65 %        *** 1.026323e-04
 string_decoder/string-decoder.js n=2500000 chunk=64 inlen=32 encoding="utf16le"            4.14 %         ** 2.253467e-03
 string_decoder/string-decoder.js n=2500000 chunk=64 inlen=32 encoding="utf8"               5.45 %        *** 9.106784e-11

@addaleax
Member Author

addaleax commented Apr 9, 2017

@mscdex Are your modifications pushed somewhere? Are you okay with this change (possibly pending applying them)?

@mscdex
Contributor

mscdex commented Apr 9, 2017

Not yet, I wasn't sure where to push it for review.

@addaleax
Member Author

addaleax commented Apr 9, 2017

You can just push to this branch if you like.

@addaleax
Member Author

CI is green. @jasnell Do you mind taking another look?

@TimothyGu
Member

After #12223, I wonder if it would make sense to add support for all ArrayBuffer views instead of just Uint8Array, here rather than in a later PR.

@addaleax
Member Author

@TimothyGu I am not sure that makes sense … for something that decodes byte sequences, shouldn’t the input be a Uint8Array?

];

function translateEncoding(enc) {
if (!enc) return 0;
Member

Would it affect performance if we use constants with names instead of number literals for indices?

Contributor

IIRC yes.

addaleax and others added 3 commits May 5, 2017 11:28
Makes the string slice methods of buffers available
on the binding object in addition to the `Buffer` prototype.

This enables subsequent `string_decoder` changes to use these
methods directly without performance loss, since all parameters
are known by the string decoder in those cases.

This is a bit odd since `string_decoder` does currently not
perform any type checking. Also, this adds an explicit check
for `string` input, which does not really make sense but is relied upon
by our test suite.
@addaleax addaleax force-pushed the string-decoder-uint8array branch from b67e1e9 to 525fabd Compare May 5, 2017 09:30
@addaleax
Member Author

addaleax commented May 5, 2017

Rebased. @mscdex Do my changes LGTY? This is basically only waiting for a second CTC member approval.

inline void StringSlice<UCS2>(const FunctionCallbackInfo<Value>& args,
Local<Value> buffer_arg,
Local<Value> start_arg,
Local<Value> end_arg) {
Isolate* isolate = args.GetIsolate();
Contributor

Perhaps we could match StringSlice() above and instead do Isolate* isolate = env->isolate(); for consistency?

@mscdex
Contributor

mscdex commented May 5, 2017

LGTM with one minor nit that shouldn't block this from landing.

CI again: https://ci.nodejs.org/job/node-test-pull-request/7898/

sequence.forEach((write) => {
output += decoder.write(input.slice(write[0], write[1]));
for (const useUint8array of [ false, true ]) {
sequences.forEach((sequence) => {
Member

While at it, change this to a for-of loop?

@TimothyGu
Member

> for something that decodes byte sequences, shouldn’t the input be an Uint8Array?

Well maybe, but I feel it is plausible for the user to use a Uint16Array for UTF-16/UCS-2 input, for example
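
A sketch of that scenario, for context: a `Uint16Array` holding UTF-16 code units can already be decoded by viewing its underlying bytes. The `bytesOf` helper is a hypothetical name, and the result matches the code units only on little-endian hosts, because typed arrays use the platform byte order while the `'utf16le'` encoding always reads little-endian.

```javascript
'use strict';
const os = require('os');
const { StringDecoder } = require('string_decoder');

// Hypothetical helper: expose the bytes behind any ArrayBuffer view (no copy).
function bytesOf(view) {
  if (!ArrayBuffer.isView(view)) {
    throw new TypeError('expected an ArrayBuffer view');
  }
  return Buffer.from(view.buffer, view.byteOffset, view.byteLength);
}

// Two UTF-16 code units: 'h' (0x0068) and 'i' (0x0069).
const codeUnits = new Uint16Array([0x0068, 0x0069]);

const decoder = new StringDecoder('utf16le');
const text = decoder.write(bytesOf(codeUnits)) + decoder.end();

// On a little-endian host the decoded text equals the code units above.
console.log(os.endianness(), JSON.stringify(text));
```

The byte-order caveat is one argument for keeping the API byte-oriented, as suggested earlier in the thread.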

Member

@mcollina mcollina left a comment

LGTM with a polyfill for https://github.com/rvagg/string_decoder (it can be down there too!)

@BridgeAR
Member

Needs a rebase but I guess this is otherwise pretty much good to go?

@BridgeAR BridgeAR added the stalled Issues and PRs that are stalled. label Sep 8, 2017
@BridgeAR
Member

Ping @addaleax I would love to get this in and I think this only needs a rebase. Otherwise I would go ahead and close this.

@mscdex
Contributor

mscdex commented Sep 12, 2017

@BridgeAR benchmarks would probably need to be re-run before merging because this was all done before TurboFan.

@addaleax
Member Author

@BridgeAR @mcollina’s review was dependent on there being a polyfill for the corresponding npm module, which I haven’t done yet; feel free to take this over if you like

@mscdex I agree, but I wouldn’t expect much of a difference since I don’t think TurboFan had an impact on how native bindings are called

@mscdex
Contributor

mscdex commented Sep 13, 2017

@addaleax I was referring more to the js-land stuff, especially the commit I pushed.

@BridgeAR
Member

@mcollina I think it would be fine to land this as is for now, as your PR to update the module to 8.1.2 has not landed yet either. From my perspective, that one should be merged first.

@addaleax this needs a rebase though.

@mcollina
Member

@BridgeAR this can land independently, we pick the content from core releases, so we can fetch them

But good catch on the other PR, I'll get it updated and landed. This is semver-major anyway, so we have time.

@BridgeAR
Member

BridgeAR commented Oct 2, 2017

Ping @addaleax

@BridgeAR
Member

Closing due to long inactivity. @addaleax please reopen if you want to follow up on this :-) (in that case the benchmarks should be rerun though).

@BridgeAR BridgeAR closed this Nov 22, 2017
@addaleax addaleax deleted the string-decoder-uint8array branch November 22, 2017 13:05
@addaleax
Member Author

Yeah, I think there’s no point in pursuing this given that we now have TextDecoder support …
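
For comparison, a minimal sketch of the `TextDecoder` route mentioned here, which takes `Uint8Array` chunks directly. `decodeStream` is a hypothetical helper name; `TextDecoder` is imported from `util` on the assumption that the running Node.js version exports it there.

```javascript
'use strict';
const { TextDecoder } = require('util');

// Hypothetical helper: decode Uint8Array chunks with a streaming TextDecoder.
function decodeStream(chunks, label = 'utf-8') {
  const decoder = new TextDecoder(label);
  let out = '';
  for (const chunk of chunks) {
    // { stream: true } keeps incomplete multi-byte sequences for the next call.
    out += decoder.decode(chunk, { stream: true });
  }
  // A final call without input flushes any buffered bytes.
  return out + decoder.decode();
}

// U+20AC (euro sign) split across two chunks.
const euro = new Uint8Array([0xe2, 0x82, 0xac]);
console.log(decodeStream([euro.subarray(0, 2), euro.subarray(2)]));
```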

Labels
buffer Issues and PRs related to the buffer subsystem. c++ Issues and PRs that require attention from people who are familiar with C++. semver-major PRs that contain breaking changes and should be released in the next major version. stalled Issues and PRs that are stalled. string_decoder Issues and PRs related to the string_decoder subsystem.
8 participants