TextEncoder#encode - write to existing Uint8Array #69
Comments
Thank you for reporting this, I've heard a similar request aired for …
Yep, legit request we've heard from potential users before. I don't know what the pattern should be like, or if it should be done at this level (new/extended method) or if we should add Stream support and there should be a generic stream-into-buffer mechanism. With the …
Right, if we get transform streams for text encoding, then you'd be able to use a BYOB reader to target specific segments of your destination buffer. A lower-level method might make sense. If so, don't have separate offset/length parameters; those are already part of the Uint8Array (which is a { byteOffset, byteLength, buffer } tuple).
To my mind, writing to/reading from a specified position in an existing ArrayBuffer would be a better interface. Alternately …
Adding this feature would be problematic, because it would expose to the Web the encoder's behavior when approaching the end of the buffer. Considering the experience with encoding_rs, it's very useful not to try to fill the output buffer exactly but to …
Specifying one particular behavior is likely to lead to problems with encoding back ends that don't exhibit the specified behavior already. encoding_rs doesn't even have a single behavior: the ASCII fast path is more willing to write near the end of the buffer, while the handling for non-ASCII wants to see output space for the worst case (an astral character). Allowing implementation-defined behavior, on the other hand, could easily be an interop problem if Web devs test with an implementation that e.g. is willing to convert a final BMP code point when there are 3 bytes of output space remaining, but another implementation wants to see 4 bytes of space even if the remaining input would be a BMP code point.
I believe we could avoid exposing implementation details (while making the feature less useful) by specifying that the conversion fails immediately if the available output space in bytes is not greater than or equal to the number of input UTF-16 code units times three plus one. In other words, we wouldn't support incremental encode but would require the worst-case output space (assuming an implementation that wants to see worst-case available space on a per-input-code-point basis, hence the plus one) and would always consume the entire input string if we didn't throw right away.
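The worst-case bound described here is easy to check from script. A small illustrative sketch (the helper name is mine, not from any spec):

```js
// Each UTF-16 code unit encodes to at most 3 UTF-8 bytes: a BMP code point
// can take up to 3 bytes, and an astral code point takes 4 bytes but
// occupies 2 code units (2 bytes per unit). So 3 * length is a safe bound.
function worstCaseUtf8Bytes(str) {
  return str.length * 3;
}

const encoder = new TextEncoder();
const sample = "héllo \u{1D11E}"; // mixed ASCII, Latin-1, and an astral char
const actual = encoder.encode(sample).length;
console.log(actual, "<=", worstCaseUtf8Bytes(sample));
```

This over-reserves for mostly-ASCII input, which is exactly the trade-off being discussed: guaranteed room in exchange for wasted space.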
I've thought about this some more. If we are doing the bring-your-own-buffer approach …

I'd be unhappy to have to do that for all encodings, though, so let's be careful if we ever get a request for decoding to …

In any case, the exact buffer-filling behavior in the last 3 bytes of the output buffer is an area that is going to need special interop attention to avoid compatibility problems in cases where Web devs misuse a streaming API as a non-streaming API.
Strawman API proposal:

```js
let offset = 0;
while (offset < string.length) {
  offset = encoder.encodeInto(view, string, offset);
  if (offset < string.length) {
    view = getANewViewSomehow();
  }
}
```
As @hsivonen noted, implementations would probably fall back to a slower algorithm when they get close to the end of the output buffer.
This would be very useful for getting JS strings into and out of webassembly memory.
@ricea given that a view is already a { buffer, byteOffset, byteLength } tuple, I would think the argument should just be a view, not a (view, offset) tuple.
@domenic Sorry, I should have made it clear. offset is an offset into the string, not into the view. It would be inefficient for callers to chop the front of the string after every call, so instead we include the ability to ignore offset code units.

Argument summary: …

The return value is the index into the string which the encoder reached (but it might be better to return the number of code units consumed?). If the entire string was converted, the return value will be equal to string.length.

If the output buffer is less than 4 bytes long, it is possible for examples like the one I wrote above to loop forever. Maybe encodeInto() should throw if it can't make progress?
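The no-progress hazard is observable in the API that eventually shipped: encodeInto never splits a Unicode scalar value across the end of the buffer, so an astral character plus a sub-4-byte buffer yields zero progress. A small demonstration using the shipped `{read, written}` result (not the strawman's return value):

```js
const encoder = new TextEncoder();
const tiny = new Uint8Array(3); // too small for a 4-byte UTF-8 sequence

// U+1D11E MUSICAL SYMBOL G CLEF needs 4 bytes; nothing fits, so nothing is
// consumed -- a loop that only checks "offset < string.length" without
// growing the buffer would spin forever.
const { read, written } = encoder.encodeInto("\u{1D11E}", tiny);
console.log(read, written); // 0 0
```

The shipped API chose to report zero progress rather than throw, leaving termination up to the caller.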
Sorry if I'm missing something obvious, but how does the caller of …

If this is indeed missing, one idea is: instead of returning a pair of numbers (which may add call overhead), what if the return value is the number of bytes written to the BufferSource, and if less than the full …

Another comment is that, for Rust-to-JS-like use cases, each new …
Okay, that's embarrassing 🤣. I did say it was a strawman 😄. So, you're proposing something like …

Example (trying to be more realistic to avoid the trap I fell into last time):

```js
function convertString(bufferSource, string, callback) {
  let bufferSize = 256;
  let bufferStart = malloc(bufferSource, bufferSize);
  let bytesWritten = 0;
  let stringOffset = 0;
  while (true) {
    try {
      bytesWritten += cachedEncoder.encodeInto(bufferSource, bufferStart + bytesWritten, bufferSize, string, stringOffset);
      callback(bufferStart, bytesWritten);
      free(bufferSource, bufferStart);
      break;
    } catch (e) {
      if (e instanceof TextEncoderBufferOverflowError) {
        stringOffset += e.codeUnitsConsumed;
        bytesWritten += e.bytesWritten;
        bufferSize *= 2;
        bufferStart = realloc(bufferSource, bufferStart, bufferSize);
      } else {
        free(bufferSource, bufferStart);
        throw e;
      }
    }
  }
}
```

I think …
I'd strongly prefer that we use array buffer views instead of positional arguments, as they're literally designed for that purpose, and that's consistent with their usage in other write-into APIs on the platform.
I generally agree, but I'm willing to be persuaded otherwise.
Adding encoding into an extant buffer is not an expressibility issue -- one can always do a copy to express the same thing -- it is motivated purely by performance. It follows that if performance is our motivation, we should enable the most performant patterns possible, instead of only going half-way, and that means avoiding buffer view object allocations.
I understand that in some current implementations not using first-class objects is faster. I don't think we should do API design based on those limitations, otherwise we'd all be writing assembly instead of JavaScript.
@domenic Ok, I can see that point of view, and I hesitated while writing my final comment for the same reason. Although it requires a bit more optimization magic (which I generally like to avoid), one of the stated goals of wasm Host Bindings is that, when you have a wasm Host Binding coercion that says "create a view from an i32 (begin, end) pair", the engine can optimize away the view allocation and pass the (begin, end) pair directly to an optimized Web IDL internal entry point. With luck, the JS JIT could even hit the same path when the …

@ricea Thanks for the writeup! Yes, that's what I was imagining (s/arg triplet/view/). The surrounding conversion code could avoid exceptions in even more (or all) cases by using …
On exceptions versus tuple returns, I'm a bit unclear on whether the exception encodes an exceptional situation. Is there any way for JS developers to avoid the exception by writing good code?
In theory, if they make an appropriate over-reservation via …
I suppose it all depends on how we see people practically using this; the usage patterns I'm thinking about would avoid it most/all of the time, but I could imagine others more like what @ricea wrote above that hit it frequently.
It seems that if you don't avoid it almost always, you'd better use streams. The exception pattern proposed is quite a bit different from normal web platform API shapes, though. Usually (always?) exceptions throw before side effects, and usually (almost always) they do not carry data that needs to be inspected.
Yeah, I can see how the "effect + info in exception object" is an oddball. I guess we can turn the magic dial up a bit more then, taking views and returning pairs and relying on optimizations to kill both these allocations.
```webidl
partial interface TextEncoder {
  TextEncoderEncodeIntoResults encodeInto(Uint8Array output, USVString input, optional unsigned long long inputOffset = 0);
};

dictionary TextEncoderEncodeIntoResults {
  unsigned long long outputOffset;
  unsigned long long inputOffset;
};
```

(Doing …)
A rewrite of my last example, using @annevk's proposed IDL:

```js
function convertString(buffer, input, callback) {
  let bufferSize = 256;
  let bufferStart = malloc(buffer, bufferSize);
  let bytesWritten = 0;
  let offset = 0;
  while (true) {
    const {outputOffset, inputOffset} = cachedEncoder.encodeInto(
        new Uint8Array(
            buffer, bytesStart + bytesWritten, bufferSize - bytesWritten),
        input, offset);
    offset = inputOffset;
    bytesWritten += outputOffset;
    if (offset === input.length) {
      callback(bufferStart, bytesWritten);
      free(buffer, bufferStart);
      return;
    }
    bufferSize *= 2;
    bufferStart = realloc(buffer, bufferStart, bufferSize);
  }
}
```

I think it's nice and clear. It might be a bit clunkier than it needs to be because I kept the malloc/free/realloc semantics from my last version.
Sorry about the delay.
After thinking about this more, I've come up with a way to accommodate filling output buffers as much as logically possible in a way that adds complexity only in a wrapper for encoding_rs and not in the internals of the encoding_rs converters, so I withdraw my previous concern about filling a caller-provided output buffer as much as possible in the decode case. (Basically: trying the last code units speculatively into a temporary buffer using a clone of the encoding_rs converter's internal state, discarding the clone if it output too much and promoting the clone into the main converter state if the output could still fit into the caller-provided buffer.)

So I'd be OK with us exposing an encoding_rs-like streaming API that, unlike the encoding_rs API, fills buffers as much as logically possible without splitting a Unicode scalar value. The encoding_rs streaming API takes the converter state as …

If performing replacement, the return tuple has a boolean indicating whether replacements were performed. When not performing replacement, the status has a third state that encapsulates the error (when encoding, the unmappable scalar value; when decoding, the length of the illegal byte sequence and how many bytes in the past the illegal sequence was). Identifying the erroneous byte sequence probably shouldn't be done on the Web, because other back ends will have a hard time getting it right. This streaming API never allocates on the heap.
If providing an incremental API, experience with the uconv and encoding_rs APIs (which differ on this design point) strongly indicates that doing the EOF signaling the encoding_rs way is superior compared to either not signaling it (bogus API) or having separate API surface for it (additional API surface that can produce output).
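For comparison, the decode side of the shipped API signals EOF the way described here: the last chunk is marked by omitting the stream flag, rather than through a separate flush method. A sketch:

```js
const decoder = new TextDecoder();
const bytes = new TextEncoder().encode("\u{1D11E}"); // 4-byte UTF-8 sequence

// Split the sequence across two chunks; {stream: true} tells the decoder
// more input may follow, so it buffers the incomplete prefix.
const first = decoder.decode(bytes.subarray(0, 2), { stream: true }); // ""
// The final call (no stream flag) is the EOF signal and flushes the rest.
const second = decoder.decode(bytes.subarray(2)); // "\u{1D11E}"
console.log(JSON.stringify(first), JSON.stringify(second));
```

Because EOF is just a flag on the last call, there is no separate API surface that can itself produce output, which is the design property argued for above.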
(…)

So, to suggest the interface for encoding from a …:

```webidl
partial interface TextEncoder {
  TextEncoderEncodeIntoResult encodeInto(StringView source, Uint8Array destination);
};

dictionary TextEncoderEncodeIntoResult {
  bool inputEmpty; // true if input was exhausted, false if more output space is needed
  unsigned long long read; // number of UTF-16 code units read
  unsigned long long written; // number of UTF-8 code units written
};

dictionary StringView {
  DOMString str;
  unsigned long long start; // index of the first UTF-16 code unit that's logically part of the view
  unsigned long long length; // length of the view in UTF-16 code units
};
```

Unpaired surrogates are replaced with the REPLACEMENT CHARACTER. Unpairedness analysis is done on a per-view basis (i.e. ending a …
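As later comments in this thread note, the shipped API ended up without the StringView dictionary; the same effect is achieved with String.prototype.substring on the caller's side. A sketch of that substitution:

```js
const encoder = new TextEncoder();
const out = new Uint8Array(16);

// Instead of passing { str, start, length }, slice the string in JS and
// rely on the engine to make substring cheap.
const source = "abcdef";
const { read, written } = encoder.encodeInto(source.substring(2), out);
console.log(read, written); // 4 4 -- "cdef" is ASCII, one byte per code unit
```

Whether substring is actually zero-copy is engine-dependent, which is exactly the question raised a few comments below.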
Yesterday, I had my mind too much in the mode of supporting all encoders even though only the UTF-8 encoder would be exposed to JS/Wasm. Sorry.
In the encode case, that bit can be inferred from …

The encoding_rs APIs I linked to in my previous comments are for streaming. The Wasm argument conversion case is closer to what Gecko's in-memory XPCOM string conversions need. For in-memory conversions, encoding_rs provides these simplified APIs: …

When the latter fails to convert the whole string, a reallocation plus use of the first function is needed. In Gecko, the potentially-too-short buffer is allocated by rounding the best case up to the allocator bucket size (leaky abstraction).

Going back to the issue of a string view or a start index into a string: do we need that, or can we trust the JS substring operation not to copy the underlying buffer?
I don't trust it. In particular, in Blink, once you pass a string to an IDL method it turns into an embedder string, which doesn't support zero-copy substrings. Even if the copy is free, it still creates more garbage that has to be collected at some point.

Requiring the caller to allocate 3x the string length might be an option. Blink's implementation of encode() does that internally anyway, so it wouldn't be worse.
Do we need to care about the characteristics of the substring operation on the C++ side of WebIDL? It seems to me the side that matters here is the JS side. Or do you mean that in Blink a JS string that points to the buffer of another JS string gets copied on the IDL layer but a JS string that is the exclusive owner of its buffer doesn't get copied?
I defer to JS engine developers on whether this matters.
Yes, I had misunderstood the situation. The string is copied when accessed by Blink, and the original V8 string is left untouched. So any substring optimisation that V8 performs will still work. I don't know the details, but it appears that V8 will do zero-copy substrings in at least some situations. So the only overhead we have to worry about is GC. As such, maybe a StringView interface is overkill.
This enables converting strings into byte sequences of pre-allocated buffers. Also cleans up TextEncoder a bit. Tests: ... Fixes #69.
I wrote up a formal proposal at #166 and some preliminary tests (that should be easy to add to) at web-platform-tests/wpt#14505. As I mentioned in the PR, I've avoided the string view dictionary for now and you'll instead have to use substring or some such in JavaScript. If over time this turns out to be prohibitive, we can quite easily overload the string argument with a string view dictionary. I'm not requiring 3x string length, as that seems unnecessarily constraining and might result in the caller having to jump through a lot of hoops (basically avoiding this API in a number of cases). The result looks like reasonably idiomatic JavaScript, so hopefully this is satisfactory to everyone as a v0 of this particular API. Thanks to @maciejhirsz for raising this almost two and a half years ago!
@ricea I tried to rewrite your example for inclusion in the standard:

```js
function convertString(buffer, input, callback) {
  let bufferSize = 256;
  let bufferStart = malloc(buffer, bufferSize);
  let bytesWritten = 0;
  let offset = 0;
  while (true) {
    const {read, written} = cachedEncoder.encodeInto(
        input.substring(offset),
        new Uint8Array(buffer, bytesStart + bytesWritten, bufferSize - bytesWritten));
    offset += read;
    bytesWritten += written;
    if (offset === input.length) {
      callback(bufferStart, bytesWritten);
      free(buffer, bufferStart);
      return;
    }
    bufferSize *= 2;
    bufferStart = realloc(buffer, bufferStart, bufferSize);
  }
}
```

For the length argument to Uint8Array, shouldn't bufferStart also be subtracted? (I also changed offset = read to offset += read. Better names welcome.)
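A runnable variant of the same loop against the shipped API, with the wasm-style malloc/realloc replaced by plain Uint8Array doubling (those helpers were hypothetical anyway):

```js
function encodeWithGrowth(input) {
  const encoder = new TextEncoder();
  let buffer = new Uint8Array(4); // deliberately tiny to force growth
  let offset = 0;                 // UTF-16 code units consumed so far
  let bytesWritten = 0;
  while (offset < input.length) {
    const { read, written } = encoder.encodeInto(
        input.substring(offset), buffer.subarray(bytesWritten));
    offset += read;
    bytesWritten += written;
    if (offset < input.length) {
      // Out of room: double the buffer, keeping what was already written.
      const bigger = new Uint8Array(buffer.length * 2);
      bigger.set(buffer);
      buffer = bigger;
    }
  }
  return buffer.subarray(0, bytesWritten);
}

const roundTripped = new TextDecoder().decode(encodeWithGrowth("héllo \u{1D11E}"));
console.log(roundTripped); // "héllo 𝄞"
```

The loop terminates even when a call makes zero progress (e.g. an astral character facing fewer than 4 free bytes), because the buffer is grown on every incomplete pass.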
No, I think it is okay as is. Suppose …

I realised that I wrote …

Maybe …
This enables converting strings into UTF-8 byte sequences reusing a pre-allocated output buffer. Also cleans up TextEncoder a bit. Tests: web-platform-tests/wpt#14505. Fixes #69.
cbor-x raised similar concerns and preferred to stick to using …

```js
let { Buffer } = await import('https://esm.sh/buffer')
let u = new Uint8Array(2048)
let b = Buffer.alloc(2048)
let te = new TextEncoder()

console.time('write')
for (let i = 0; i < 10000000; i++) b.write('hello world', 10)
console.timeEnd('write')

console.time('encodeInto subview')
for (let i = 0; i < 10000000; i++) te.encodeInto('hello world', u.subarray(10))
console.timeEnd('encodeInto subview')
// note that this is on par with buffer.write, but is useless since you can't
// specify a starting index; subarray time needs to be included in a fair comparison

console.time('encodeInto')
for (let i = 0; i < 10000000; i++) te.encodeInto('hello world', u)
console.timeEnd('encodeInto')
```

The browserified buffer module even outperforms the native TextEncoder.
@jimmywarting any comments on a closed issue are easily lost in the noise, especially on very old issues. It might show up in someone's notifications, but that's it; if not acted upon, it's just gone. So I would suggest always checking whether the issue is open before commenting.
For binary protocols, such as msgpack, performance of writing/reading strings can be crucial; however, those strings rarely (virtually never) exist in a binary buffer on their own. When decoding a string from a binary buffer of Uint8Array (obtained over, say, websockets), it's possible to cheaply create a subarray and pass it over to TextDecoder#decode, so the decoding story is pretty straightforward.

The writing story is much more complicated. In order to write a string into an existing buffer, one must take the resulting Uint8Array from TextEncoder#encode and TypedArray#set it at a desired offset into another buffer, which introduces unnecessary copying and, as a result, is slower than plain JavaScript encoding implementations (at least for UTF-8).

A much better solution would be an analog of the Node.js Buffer#write implementation, where the encoder can write a string to a buffer at a desired offset. Example signature could be something like: …