Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add TextEncoder.prototype.encodeInto #26904

Closed
RReverser opened this issue Mar 25, 2019 · 11 comments
Closed

Add TextEncoder.prototype.encodeInto #26904

RReverser opened this issue Mar 25, 2019 · 11 comments
Assignees
Labels
buffer Issues and PRs related to the buffer subsystem. c++ Issues and PRs that require attention from people who are familiar with C++. encoding Issues and PRs related to the TextEncoder and TextDecoder APIs. feature request Issues that request new features to be added to Node.js.

Comments

@RReverser
Copy link
Member

Is your feature request related to a problem? Please describe.

TextEncoder.prototype.encodeInto is a new method available in latest Chrome and Firefox that allows efficient encoding of strings into a given buffer, avoiding intermediate allocation and copying.

Its primary usecase is a more efficient way to pass strings to WASM (one of the most expensive things one has to do when communicating with JavaScript) and it's already leveraged by Rust wasm-bindgen where available. wasm-bindgen has support for a Node.js target, so it would be great to speed things up here too.

Describe the solution you'd like

Adding native encodeInto method to the TextEncoder class already available in util module.

Please describe the desired behavior.

MDN: https://developer.mozilla.org/en-US/docs/Web/API/TextEncoder/encodeInto

Describe alternatives you've considered

Sending PR to wasm-bindgen to wrap memory into a Node-specific Buffer and use Buffer.prototype.write instead.

It has very similar semantics, and produces similar speed-ups (up to 35% on some of my string-interaction-heavy WASM libraries), but I wanted to see if it's possible to solve this on the Node.js side first so that the same method could be used for all platforms.

cc @jasnell as the author of the original implementation of WHATWG Encoding API in Node.js (#13644)

@RReverser RReverser added the buffer Issues and PRs related to the buffer subsystem. label Mar 25, 2019
@devsnek devsnek added encoding Issues and PRs related to the TextEncoder and TextDecoder APIs. feature request Issues that request new features to be added to Node.js. labels Mar 25, 2019
@jasnell
Copy link
Member

jasnell commented Mar 25, 2019

Should be fairly straightforward to implement. Interested in giving it a try?

@RReverser
Copy link
Member Author

RReverser commented Mar 25, 2019

Potentially, but currently pretty busy with another higher-level project (in which I found out this perf differentiator between running my WASM in different environments).

If nobody takes this before, can give it a try sometime later.

Do you think it would be reasonable if I put good first issue label on this?

@jasnell
Copy link
Member

jasnell commented Mar 25, 2019

Labeling it good first issue should be fine but it will require a bit deeper knowledge than a beginner might have, and would likely require a bit of C++

@RReverser
Copy link
Member Author

RReverser commented Mar 25, 2019

Yeah, depends whether that targets beginners in general or just beginners to the project... I think I can put "good first issue" + "C++" to make it more clear.

@RReverser RReverser added c++ Issues and PRs that require attention from people who are familiar with C++. good first issue Issues that are suitable for first-time contributors. labels Mar 25, 2019
@RReverser
Copy link
Member Author

One difference with Buffer::write I should mention is that encodeInto additionally returns number of string code units written, not just bytes. This is important to allow partial writes. (In theory people using Buffer::write for partial writes would need something like that too, but somehow they got away without it?)

RReverser added a commit to RReverser/wasm-bindgen that referenced this issue Mar 25, 2019
Node.js doesn't currently implement `TextEncoder::encodeInto`. I've raised an upstream issue to add it - nodejs/node#26904 - but it's likely to take some time and will be available only in new releases.

In the meanwhile, it's worth noting that Node.js already has `Buffer::write` which has pretty similar semantics, but doesn't require creating an intermediate view using `.subarray` and instead accepts pointer and length directly.

Also, Node.js has `Buffer::byteLength` helper which allows to efficiently retrieve an encoded byte length of a string upfront, and so allows us to avoid a loop with reallocations.

This change takes leverage of these methods by generating an additional Buffer-based view into the WASM memory and using it for string operations.

I'm seeing up to 35% increase in performance in string-heavy library benchmarks.
@JungMinu JungMinu self-assigned this Mar 25, 2019
@JungMinu
Copy link
Member

going to take care of this

@zandaqo
Copy link

zandaqo commented Jul 5, 2019

This would be really helpful for anyone dealing with binary. As much as I like Buffer, it's not an option for isomorphic code. Is there any ETA on this one yet?

@AtticusYang
Copy link

AtticusYang commented Jul 23, 2019

@RReverser ,I tried to implement this task and I did it. but I am confused by the test case in
https://github.com/nodejs/node/blob/master/test/fixtures/wpt/encoding/encodeInto.any.js,
when input string is "input": "\u{1D306}", the expected read is 2, but the actual is 4, i also did a simple test as followed, the byte length of "\u{1D306}" is 4.

let a = "\u{1D306}";
console.log(a);
console.log(Buffer.byteLength(a, 'utf8'));

and the dest is a uint8array, the byteOffset is 0 and the length is 4.

so I wonder the test case is right or wrong, thanks!

@zandaqo
Copy link

zandaqo commented Jul 23, 2019

@AtticusYang TextEncoder#encodeInto returns the number of code points not the byte length--"\u{1D306}" is encoded with two code points in UTF-16, and indeed four bytes in UTF-8.

@AtticusYang
Copy link

@zandaqo thank for your advice.

AtticusYang pushed a commit to AtticusYang/node that referenced this issue Jul 25, 2019
Add TextEncoder.prototype.encodeInto.

Refs: nodejs#26904
AtticusYang pushed a commit to AtticusYang/node that referenced this issue Jul 26, 2019
Add function encodeInto to TextEncoder, fix bug MessageChannel is not defined
in encodeInto.any.js.

Fixes: nodejs#28851
Refs: nodejs#26904
AtticusYang pushed a commit to AtticusYang/node that referenced this issue Sep 4, 2019
Add function encodeInto to TextEncoder, fix bug MessageChannel is not
defined in encodeInto.any.js.

Fixes: nodejs#28851
Refs: nodejs#26904
@addaleax addaleax removed the good first issue Issues that are suitable for first-time contributors. label Sep 10, 2019
addaleax added a commit to addaleax/node that referenced this issue Sep 10, 2019
Add function encodeInto to TextEncoder, and add MessageChannel
to the encodeInto.any.js test.

Fixes: nodejs#28851
Fixes: nodejs#26904
Refs: nodejs#28862
Co-authored-by: AtticusYang <yyongtai@163.com>
targos pushed a commit that referenced this issue Sep 20, 2019
Add function encodeInto to TextEncoder, and add MessageChannel
to the encodeInto.any.js test.

Fixes: #28851
Fixes: #26904
Refs: #28862
Co-authored-by: AtticusYang <yyongtai@163.com>

PR-URL: #29524
Reviewed-By: David Carlier <devnexen@gmail.com>
Reviewed-By: Joyee Cheung <joyeec9h3@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
@zandaqo
Copy link

zandaqo commented Sep 23, 2019

I was playing with the latest nightly build (Sep 23) and noticed an odd behavior when non-string is used as a source:

const encoder = new TextEncoder();
const destination = new Uint8Array(10);

encoder.encodeInto(null, destination);
//=> TypeError: The "src" argument must be of type string. Received type object

While the spec does say that the source should be a USVString, its current implementations in both Chrome and Firefox opt for converting the source into a string first instead of throwing an error. So, are we sticking with the current implementation, that is, Node.js going to differ from browsers and throw if a non-string source is provided?

Also, this throwing makes less sense considering that we are ok with non-string values in TextEncoder#encode:

const encoder = new TextEncoder();
encoder.encode(null)
//=> Uint8Array [ 110, 117, 108, 108 ]

BridgeAR pushed a commit that referenced this issue Sep 25, 2019
Add function encodeInto to TextEncoder, and add MessageChannel
to the encodeInto.any.js test.

Fixes: #28851
Fixes: #26904
Refs: #28862
Co-authored-by: AtticusYang <yyongtai@163.com>

PR-URL: #29524
Reviewed-By: David Carlier <devnexen@gmail.com>
Reviewed-By: Joyee Cheung <joyeec9h3@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
buffer Issues and PRs related to the buffer subsystem. c++ Issues and PRs that require attention from people who are familiar with C++. encoding Issues and PRs related to the TextEncoder and TextDecoder APIs. feature request Issues that request new features to be added to Node.js.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants