TextEncoder#encode - write to existing Uint8Array #69

Closed
maciejhirsz opened this issue Aug 19, 2016 · 36 comments
@maciejhirsz

maciejhirsz commented Aug 19, 2016

For binary protocols such as msgpack, the performance of writing and reading strings can be crucial; however, those strings rarely (virtually never) exist in a binary buffer on their own. When decoding a string from a Uint8Array binary buffer (obtained over, say, WebSockets), it's possible to cheaply create a subarray and pass it to TextDecoder#decode, so the decoding story is pretty straightforward.

The writing story is much more complicated.

In order to write a string into an existing buffer, one must take the resulting Uint8Array from TextEncoder#encode and TypedArray#set it at the desired offset into another buffer, which introduces unnecessary copying and, as a result, is slower than plain JavaScript encoding implementations (at least for UTF-8).
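For illustration, a minimal sketch of the copy-based workaround described above (the `writeStringAt` helper is made up for this example):

```javascript
// The status quo: encode() allocates a fresh Uint8Array, which must then be
// copied into the existing buffer with set().
const encoder = new TextEncoder();

function writeStringAt(target, str, offset) {
  const bytes = encoder.encode(str); // allocates a new Uint8Array
  target.set(bytes, offset);         // extra copy into the existing buffer
  return bytes.length;               // number of bytes written
}

const buffer = new Uint8Array(64);
const written = writeStringAt(buffer, "msgpack", 8);
// written === 7; buffer[8..14] now holds the UTF-8 bytes of "msgpack"
```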

A much better solution would be analogous to the Node.js Buffer#write implementation, where the encoder can write a string into a buffer at a desired offset. An example signature could be something like:

TextEncoder#write(DOMString source, Uint8Array target [, Number offset [, Number length]]) -> Number bytesWritten
@annevk
Member

annevk commented Aug 19, 2016

Thank you for reporting this, I've heard a similar request aired for fetch() at some point and there might be other APIs where folks want buffer reuse. I wonder if we should adopt some kind of pattern for this.

@domenic? @inexorabletash?

@inexorabletash
Member

Yep, this is a legit request we've heard from potential users before. I don't know what the pattern should be like, whether it should be done at this level (a new/extended method), or whether we should add Stream support and a generic stream-into-buffer mechanism.

With the write() proposal above, we'd want to return both the amount of data consumed (since the encoder won't buffer arbitrary amounts itself) and the amount written. That also sounds like a common element of this sort of pattern. (And also the kind of thing streams are good at, so I'm hoping there's a generic solution.)

@domenic
Member

domenic commented Aug 19, 2016

Right, if we get transform streams for text encoding, then you'd be able to use a BYOB reader to target specific segments of your destination buffer.

A lower-level method might make sense. If so, don't have separate offset/length parameters; those are already part of the Uint8Array (which is a { byteOffset, byteLength, buffer } tuple).
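To spell out the point about views (this snippet just demonstrates the tuple, not any proposed method):

```javascript
// A Uint8Array view already carries the (buffer, byteOffset, byteLength)
// triple, so a caller can target a region of a larger allocation by
// constructing a view over it instead of passing separate offset/length
// arguments.
const backing = new ArrayBuffer(1024);
const region = new Uint8Array(backing, 100, 256); // bytes 100..355 of backing
```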

@shaunc

shaunc commented Dec 5, 2017

To my mind, writing to or reading from a specified position in an existing ArrayBuffer would be a better interface. Alternatively, DataView.getString/setString methods could be added in conjunction with this interface.

@hsivonen
Member

hsivonen commented Dec 7, 2017

Adding this feature would be problematic, because it would expose to the Web the encoder's behavior when approaching the end of the buffer.

Considering the experience with encoding_rs, it's very useful not to try to fill the output buffer exactly but to

  1. Never let a byte sequence representing a single code point be split across output buffer boundaries
  2. Be allowed to signal that the output buffer is full if the worst-case output length for a code point (4 bytes in the case of UTF-8) doesn't fit regardless of what the next input code point is.

Specifying one particular behavior is likely to lead to problems with encoding back ends that don't already exhibit the specified behavior. encoding_rs doesn't even have a single behavior: the ASCII fast path is more willing to write near the end of the buffer, while the handling for non-ASCII wants to see output space for the worst case (an astral character).

Allowing implementation-defined behavior, on the other hand, could easily be an interop problem if Web devs test with an implementation that e.g. is willing to convert a final BMP code point when there are 3 bytes of output space remaining but another implementation wants to see 4 bytes of space even if the remaining input would be a BMP code point.

@hsivonen
Member

hsivonen commented Dec 7, 2017

I believe we could avoid exposing implementation details (while making the feature less useful) by specifying that the conversion fails immediately if the available output space in bytes is not greater than or equal to the number of input UTF-16 code units times three, plus one.

In other words, we wouldn't support incremental encode but would require the worst-case output space (assuming an implementation that wants to see worst-case available space on a per-input-code-point basis, hence the plus one), and would always consume the entire input string if it didn't throw right away.
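A sketch of that worst-case rule (the function and error are hypothetical, not part of any proposed API surface):

```javascript
// Hypothetical guard implementing the rule above: refuse the conversion
// unless the destination has at least (UTF-16 code units) * 3 + 1 bytes free.
// 3 bytes is the UTF-8 worst case per UTF-16 code unit; the + 1 accounts for
// an implementation that checks worst-case space per input code point.
function assertWorstCaseSpace(input, destination) {
  const needed = input.length * 3 + 1;
  if (destination.byteLength < needed) {
    throw new RangeError(`need ${needed} bytes, have ${destination.byteLength}`);
  }
  return needed;
}
```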

@hsivonen
Member

I've thought about this some more. If we are doing the bring your own ArrayBuffer thing only for UTF-8 encode, it's not too onerous to wrap encoding_rs's UTF-8 encoder with something that performs slow but exact buffer filling for the last couple of bytes.

I'd be unhappy to have to do that for all encodings, though, so let's be careful if we ever get a request for decoding to ArrayBuffers.

In any case, the exact buffer filling behavior in the last 3 bytes of the output buffer is an area that is going to need special interop attention to avoid compatibility problems in cases where Web devs misuse a streaming API as a non-streaming API.

@ricea
Collaborator

ricea commented Oct 29, 2018

Strawman API proposal: encodeInto(view, string, offset = 0). Example use:

let offset = 0;
while (offset < string.length) {
  offset = encoder.encodeInto(view, string, offset);
  if (offset < string.length) {
    view = getANewViewSomehow();
  }
}

encodeInto() would always write the maximum number of code points that it can without overflowing view. Incomplete UTF-8 sequences would not be written. Unused bytes at the end of view would not be touched.

As @hsivonen noted, implementations would probably fall back to a slower algorithm when they get close to the end of the output buffer.

@fitzgen

fitzgen commented Oct 29, 2018

This would be very useful for getting JS strings into and out of webassembly memory.

@domenic
Member

domenic commented Oct 29, 2018

@ricea given that a view is already a { buffer, byteOffset, byteLength } tuple, I would think the argument should just be a view, not a (view, offset) tuple.

@ricea
Collaborator

ricea commented Oct 30, 2018

@domenic Sorry, I should have made it clear. offset is an offset into the string, not into the view. It would be inefficient for callers to chop the front of the string after every call, so instead we include the ability to ignore offset code units.

Argument summary:
view: a BufferSource to accept the output of the conversion
string: A USVString to be converted
offset: The number of code units at the start of string to ignore.

The return value is the index into the string that the encoder reached (though it might be better to return the number of code units consumed?). If the entire string was converted, the return value will be equal to string.length.

If the output buffer is less than 4 bytes long, it is possible for examples like I wrote above to loop forever. Maybe encodeInto() should throw if it can't make progress?

@lukewagner

Sorry if I'm missing something obvious but how does the caller of encodeInto learn how many bytes were written into the BufferSource?

If this is indeed missing, one idea: instead of returning a pair of numbers (which may add call overhead), what if the return value is the number of bytes written to the BufferSource, and, if less than the full string is written, an exception is thrown that contains the (code-units-consumed, bytes-written) state? Exceptions aren't super-cheap, but if we made them non-NativeError objects, and if this only comes up for relatively large strings, it might not be too bad when amortized?

Another comment is that, for Rust-to-JS-like use cases, each new encodeInto call will require creating a new view object, so being able to pass the BufferSource + bufferOffset + byteLength triple directly would avoid the costs of constructing, destructuring, and GCing these temporary view objects.

@ricea
Collaborator

ricea commented Oct 31, 2018

Sorry if I'm missing something obvious but how does the caller of encodeInto learn how many bytes were written into the BufferSource?

Okay, that's embarrassing 🤣. I did say it was a strawman 😄.

So, you're proposing something like encodeInto(bufferSource, bufferOffset, byteLength, string, stringOffset)?

Example (trying to be more realistic to avoid the trap I fell into last time):

function convertString(bufferSource, string, callback) {
  let bufferSize = 256;
  let bufferStart = malloc(bufferSource, bufferSize);
  let bytesWritten = 0;
  let stringOffset = 0;
  while (true) {
    try {
      bytesWritten += cachedEncoder.encodeInto(bufferSource, bufferStart + bytesWritten, bufferSize, string, stringOffset);
      callback(bufferStart, bytesWritten);
      free(bufferSource, bufferStart);
      break;
    } catch (e) {
      if (e instanceof TextEncoderBufferOverflowError) {
        stringOffset += e.codeUnitsConsumed;
        bytesWritten += e.bytesWritten;
        bufferSize *= 2;
        bufferStart = realloc(bufferSource, bufferStart, bufferSize);
      } else {
        free(bufferSource, bufferStart);
        throw e;
      }
    }
  }
}

I think encodeInto() having 5 arguments is unwieldy, but since the API is aimed at advanced users it may be acceptable.

@domenic
Member

domenic commented Oct 31, 2018

I'd strongly prefer that we use array buffer views instead of positional arguments, as they're literally designed for that purpose, and that's consistent with their usage in other write-into APIs on the platform.

@ricea
Collaborator

ricea commented Oct 31, 2018

I'd strongly prefer that we use array buffer views instead of positional arguments, as they're literally designed for that purpose, and that's consistent with their usage in other write-into APIs on the platform.

I generally agree, but I'm willing to be persuaded otherwise.

@fitzgen

fitzgen commented Oct 31, 2018

Adding encoding into an extant buffer is not an expressibility issue -- one can always do a copy to express the same thing -- it is motivated purely by performance. It follows that if performance is our motivation, we should enable the most performant patterns possible, instead of only going half-way, and that means avoiding buffer view object allocations.

@domenic
Member

domenic commented Oct 31, 2018

I understand that in some current implementations not using first-class objects is faster. I don't think we should do API design based on those limitations, otherwise we'd all be writing assembly instead of JavaScript.

@lukewagner

lukewagner commented Oct 31, 2018

@domenic Ok, I can see that point of view and I hesitated while writing my final comment for the same reason. Although it requires a bit more optimization magic (which I generally like to avoid), one of the stated goals of wasm Host Bindings is that, when you have a wasm Host Binding coercion that says "create a view from a i32 (begin,end) pair", the engine can optimize away the view allocation and pass the (begin,end) pair directly to an optimized Web IDL internal entry point. With luck, the JS JIT could even hit the same path when the new Uint8Array() call is obviously exclusively consumed by the method call. So I'm fine sticking with the view.

@ricea Thanks for the writeup! Yes, that's what I was imagining (s/arg triplet/view/). The surrounding conversion code could avoid exceptions in even more (or all) cases by using string.length to make the reservation, and probably a stack-y allocator to avoid general-purpose malloc/free overhead making the whole routine rather efficient.

@domenic
Member

domenic commented Oct 31, 2018

On exceptions versus tuple returns, I'm a bit unclear whether the exception is encoding an exceptional situation? Is there any way for JS developers to avoid the exception by writing good code?

@lukewagner

lukewagner commented Oct 31, 2018

In theory, if they make an appropriate over-reservation via str.length, they could avoid it in all cases. Optimizing away tuple returns is also theoretically possible in wasm (using multiple return values), but somewhat more involved to do that optimization in JS.

@lukewagner

I suppose it all depends on how we see people practically using this; the usage patterns I'm thinking about would avoid it most/all of the time, but I could imagine others more like what @ricea wrote above that hit it frequently.

@annevk
Member

annevk commented Oct 31, 2018

It seems that if you don't avoid it almost always, you'd be better off using streams. The exception pattern proposed is quite a bit different from normal web platform API shapes, though. Usually (always?) exceptions are thrown before side effects, and usually (almost always) they do not carry data that needs to be inspected.

@lukewagner

Yeah, I can see how the "effect + info in exception object" is an oddball. I guess we can turn the magic dial up a bit more then, taking views and returning pairs and relying on optimizations to kill both these allocations.

@annevk
Member

annevk commented Nov 1, 2018

partial interface TextEncoder {
  TextEncoderEncodeIntoResults encodeInto(Uint8Array output, USVString input, optional unsigned long long inputOffset = 0);
};

dictionary TextEncoderEncodeIntoResults {
  unsigned long long outputOffset;
  unsigned long long inputOffset;
};

(Doing inputOffset correctly will be a little involved due to code unit -> scalar value conversion, but shouldn't be problematic. https://infra.spec.whatwg.org/#strings has details.)

@ricea
Collaborator

ricea commented Nov 1, 2018

A rewrite of my last example, using @annevk's proposed IDL.

function convertString(buffer, input, callback) {
  let bufferSize = 256;
  let bufferStart = malloc(buffer, bufferSize);
  let bytesWritten = 0;
  let offset = 0;
  while (true) {
    const {outputOffset, inputOffset} = cachedEncoder.encodeInto(
        new Uint8Array(
            buffer, bytesStart + bytesWritten, bufferSize - bytesWritten),
        input, offset);
    offset = inputOffset;
    bytesWritten += outputOffset;
    if (offset === input.length) {
      callback(bufferStart, bytesWritten);
      free(buffer, bufferStart);
      return;
    }

    bufferSize *= 2;
    bufferStart = realloc(buffer, bufferStart, bufferSize);
  }
}

I think it's nice and clear. It might be a bit clunkier than it needs to be because I kept the malloc/free/realloc semantics from my last version.

@hsivonen
Member

hsivonen commented Nov 1, 2018

Sorry about the delay.

I'd be unhappy to have to do that for all encodings, though, so let's be careful if we ever get a request for decoding to ArrayBuffers.

After thinking about this more, I've come up with a way to accommodate filling output buffers as much as logically possible in a way that adds complexity only in a wrapper for encoding_rs and not in the internals of the encoding_rs converters, so I withdraw my previous concern about filling a caller-provided output buffer as much as possible in the decode case. (Basically: converting the last code units speculatively into a temporary buffer using a clone of the encoding_rs converter's internal state, discarding the clone if it output too much, and promoting the clone into the main converter state if the output could still fit into the caller-provided buffer.)

So I'd be OK with us exposing an encoding_rs-like streaming API that, unlike the encoding_rs API, fills buffers as much as logically possible without splitting a Unicode scalar value.

The encoding_rs streaming API takes the converter state as self/this plus three other arguments: an input buffer to read from, an output buffer to write to, and a boolean indicating whether EOF occurs immediately after the input buffer is exhausted. It returns a status, how many code units were read, and how many code units were written. The status can be "input empty" or "output full".

If performing replacement, the return tuple has a boolean indicating whether replacements were performed. When not performing replacement, the status has a third state that encapsulates error (when encoding, the unmappable scalar value; when decoding, the length of the illegal byte sequence and how many bytes in the past the illegal sequence was). Identifying the erroneous byte sequence probably shouldn't be done on the Web, because other back ends will have a hard time getting it right.

This streaming API never allocates on the heap.

If providing an incremental API, experience with the uconv and encoding_rs APIs (which differ on this design point) strongly indicates that signaling EOF the encoding_rs way is superior to either not signaling it (a bogus API) or having a separate API surface for it (additional API surface that can produce output).

@hsivonen
Member

hsivonen commented Nov 1, 2018

(bool last is not needed in the encoder when not supporting encode into ISO-2022-JP.)

So, to suggest an interface for encoding from a DOMString into a UTF-8 (only) Uint8Array:

partial interface TextEncoder {
  TextEncoderEncodeIntoResult encodeInto(StringView source, Uint8Array destination);
};

dictionary TextEncoderEncodeIntoResult {
  bool inputEmpty; // true if input was exhausted. false if more output space is needed
  unsigned long long read; // Number of UTF-16 code units read
  unsigned long long written; // Number of UTF-8 code units written
};

dictionary StringView {
  DOMString str;
  unsigned long long start; // index of the first UTF-16 code unit that's logically part of the view
  unsigned long long length; // length of the view in UTF-16 code units
};

Unpaired surrogates are replaced with the REPLACEMENT CHARACTER. Unpairedness analysis is done on a per view basis (i.e. ending a StringView with an unpaired surrogate means an unpaired surrogate even if the surrogate is paired in the underlying DOMString). Output is filled as much as possible without splitting a UTF-8 sequence in output.

@hsivonen
Member

hsivonen commented Nov 2, 2018

Yesterday, I had my mind too much in the mode of supporting all encoders even though only the UTF-8 encoder would be exposed to JS/Wasm. Sorry.

bool inputEmpty; // true if input was exhausted. false if more output space is needed

In the encode case, that bit can be inferred from read when encode to ISO-2022-JP is not supported.

The encoding_rs APIs I linked to in my previous comments are for streaming. The Wasm argument conversion case is closer to what Gecko's in-memory XPCOM string conversions need.

For in-memory conversions, encoding_rs provides these simplified APIs:

When the latter fails to convert the whole string, a reallocation plus use of the first function is needed. In Gecko, the potentially-too-short buffer is allocated by rounding the best case up to the allocator bucket size (a leaky abstraction).

Going back to the issue of string view or start index into a string: Do we need that or can we trust the JS substring operation not to copy the underlying buffer?

@ricea
Collaborator

ricea commented Nov 2, 2018

Going back to the issue of string view or start index into a string: Do we need that or can we trust the JS substring operation not to copy the underlying buffer?

I don't trust it. In particular, in Blink once you pass a string to an IDL method it turns into an embedder string, which doesn't support zero-copy substrings.

Even if the copy is free, it still creates more garbage that has to be collected at some point.

Requiring the caller to allocate 3x the string length might be an option. Blink's implementation of encode() does that internally anyway, so it wouldn't be worse.

@hsivonen
Member

hsivonen commented Nov 6, 2018

I don't trust it. In particular, in Blink once you pass a string to an IDL method it turns into an embedder string, which doesn't support zero-copy substrings.

Do we need to care about the characteristics of the substring operation on the C++ side of WebIDL? It seems to me the side that matters here is the JS side.

Or do you mean that in Blink a JS string that points to the buffer of another JS string gets copied on the IDL layer but a JS string that is the exclusive owner of its buffer doesn't get copied?

Even if the copy is free, it still creates more garbage that has to be collected at some point.

I defer to JS engine developers on whether this matters.

@ricea
Collaborator

ricea commented Nov 6, 2018

Do we need to care about the characteristics of the substring operation on the C++ side of WebIDL? It seems to me the side that matters here is the JS side.

Yes, I had misunderstood the situation. The string is copied when accessed by Blink, and the original V8 string is left untouched. So any substring optimisation that V8 performs will still work.

I don't know the details, but it appears that V8 will do zero-copy substrings in at least some situations.

So the only overhead we have to worry about is GC. As such, maybe a StringView interface is overkill.

annevk added a commit that referenced this issue Dec 13, 2018
This enables converting strings into byte sequences of pre-allocated buffers.

Also cleanup TextEncoder a bit.

Tests: ...

Fixes #69.
@annevk
Member

annevk commented Dec 13, 2018

I wrote up a formal proposal at #166 and some preliminary tests (that should be easy to add to) at web-platform-tests/wpt#14505.

As I mentioned in the PR I've avoided the string view dictionary for now and you'll instead have to use substring or some such in JavaScript. If over time this turns out to be prohibitive we can quite easily overload the string argument with a string view dictionary.

I'm not requiring 3x string length as that seems unnecessarily constraining and might result in the caller having to jump through a lot of hoops (basically avoiding this API in a number of cases).

The result looks like reasonably idiomatic JavaScript so hopefully this is satisfactory to everyone as a v0 of this particular API.

Thanks to @maciejhirsz for raising this almost two-and-a-half years ago!
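For readers landing here later, the API as it shipped takes (source, destination) and returns { read, written }; reserving source.length * 3 bytes lets the whole string be encoded in one call:

```javascript
// The standardized shape: encodeInto(source, destination) -> { read, written }.
// read counts UTF-16 code units consumed; written counts UTF-8 bytes produced.
// Reserving source.length * 3 bytes covers the UTF-8 worst case per UTF-16
// code unit, so a single call always consumes the entire string.
const encoder = new TextEncoder();
const input = "hey \u{1F600}";                 // 4 ASCII + 2 code units (emoji)
const dest = new Uint8Array(input.length * 3); // 18 bytes, worst case
const { read, written } = encoder.encodeInto(input, dest);
// read === 6 (all code units consumed), written === 8 (4 ASCII + 4-byte emoji)
```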

@annevk
Member

annevk commented Dec 14, 2018

@ricea I tried to rewrite your example for inclusion in the standard:

function convertString(buffer, input, callback) {
  let bufferSize = 256;
  let bufferStart = malloc(buffer, bufferSize);
  let bytesWritten = 0;
  let offset = 0;
  while (true) {
    const {read, written} = cachedEncoder.encodeInto(input.substring(offset), new Uint8Array(buffer, bytesStart + bytesWritten, bufferSize - bytesWritten));
    offset += read;
    bytesWritten += written;
    if (offset === input.length) {
      callback(bufferStart, bytesWritten);
      free(buffer, bufferStart);
      return;
    }
    bufferSize *= 2;
    bufferStart = realloc(buffer, bufferStart, bufferSize);
  }
}

For the length argument to Uint8Array, shouldn't bufferStart also be subtracted? (I also changed offset = read to offset += read. Better names welcome.)

@annevk annevk added the addition/proposal New features or enhancements label Dec 14, 2018
@ricea
Collaborator

ricea commented Dec 15, 2018

For the length argument to Uint8Array, shouldn't bufferStart also be subtracted? (I also changed offset = read to offset += read. Better names welcome.)

No, I think it is okay as is.

Suppose malloc() returns 1000. Then the first call to the constructor will be new Uint8Array(buffer, 1000 + 0, 256 - 0). Assume that all 256 bytes were used, and realloc() returns 1000 again. Then the second call to the constructor will be new Uint8Array(buffer, 1000 + 256, 512 - 256). This all looks good.

I realised that I wrote bytesStart in the constructor arguments where it should have been bufferStart.

Maybe offset would be clearer if it was called readOffset? Possibly bytesWritten could be changed to writeOffset to match.

annevk added a commit that referenced this issue Jan 9, 2019
This enables converting strings into UTF-8 byte sequences reusing a pre-allocated output buffer.

Also cleans up TextEncoder a bit.

Tests: web-platform-tests/wpt#14505.

Fixes #69.
@jimmywarting

cbor-x raised similar concerns and preferred to stick to the node:buffer write method instead of something more cross-environment friendly such as TextEncoder, dragging the whole node:buffer module in with it.

let { Buffer } = await import('https://esm.sh/buffer')
let u = new Uint8Array(2048)
let b = Buffer.alloc(2048)
let te = new TextEncoder()

console.time('write')
for (let i = 0; i < 10000000; i++) b.write('hello world', 10)
console.timeEnd('write')

console.time('encodeInto subview')
for (let i = 0; i < 10000000; i++) te.encodeInto('hello world', u.subarray(10))
console.timeEnd('encodeInto subview')

// note that this is on par with buffer.write, but is useless since you can't
// specify a starting index; subarray time needs to be included in a fair comparison
console.time('encodeInto')
for (let i = 0; i < 10000000; i++) te.encodeInto('hello world', u)
console.timeEnd('encodeInto')

The browserified buffer module even outperforms the native TextEncoder:

Chrome 109
write: 657.280029296875 ms
encodeInto subview: 4531.56005859375 ms
encodeInto: 2966.76904296875 ms

Firefox 109
write: 1463ms
encodeInto subview: 2846ms
encodeInto: 1869ms

Safari 16.2
write: 1750.032ms
encodeInto subview: 1392.487ms
encodeInto: 1011.426ms

@foolip
Member

foolip commented Feb 4, 2023

@jimmywarting any comment on a closed issue is easily lost in the noise, especially on very old issues. It might show up in someone's notifications, but that's it; if not acted upon, it's just gone.

So I would suggest always checking if the issue is open before commenting.
