Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Expose TextEncoder and TextDecoder #11

Open
wants to merge 15 commits into
base: main
Choose a base branch
from

Conversation

fguitton
Copy link

This PR has for vocation to address jsdom/jsdom#2524 by complementing this package as suggested in jsdom/jsdom#2928.

This PR transforms the repo to adopt a structure similar to whatwg-url.

@TimothyGu
Copy link
Member

Hi, thanks for starting this. As a start, I'd say don't bother with some of the changes, like those to the license (LICENSE.txt), ESLint (.eslint*), Travis CI (.travis.yml), and Git configuration files (.git*) unless you have to (e.g., I realize you will have to add lib/TextEncorder.js to the .gitignore file, but you probably shouldn't change /node_modules/ to node_modules/ as it's outside the scope of this PR). You could remove the live viewer for now (live-viewer/) – if you'd like to make one for this package you could always do so after TextEncoder/Decoder gets implemented!

Additionally, I'd like you to remove the package-lock.json. We at jsdom mostly use Yarn over npm (except for whatwg-url), and as you can see we already have a yarn.lock file in this repo. Let's keep that the case.

src/TextDecoder.webidl Outdated Show resolved Hide resolved
src/TextEncoder.webidl Outdated Show resolved Hide resolved
lib/TextDecoder-impl.js Outdated Show resolved Hide resolved
@fguitton
Copy link
Author

I have started an beginning implementation utilising whatwg-encoding.js, please let me know if I'm on the right track.

Regarding your original comment, as far as I can see package-lock.json, is indeed present in this package.
https://github.com/jsdom/whatwg-encoding/blob/master/package-lock.json
I'm happy to change and use Yarm instead, but this seem to be slightly out of scope.

On that same line, I have converted tests using new Buffer() to using Buffer.from(), this also feels a bit out of scope, should I revert ?

Copy link
Member

@TimothyGu TimothyGu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have started an beginning implementation utilising whatwg-encoding.js, please let me know if I'm on the right track.

It's looking really good! I've left some comments in any case.

Regarding your original comment, as far as I can see package-lock.json, is indeed present in this package.
https://github.com/jsdom/whatwg-encoding/blob/master/package-lock.json
I'm happy to change and use Yarm instead, but this seem to be slightly out of scope.

Heh, interesting. In that case it should be fine then :) sorry for the false alarm.

On that same line, I have converted tests using new Buffer() to using Buffer.from(), this also feels a bit out of scope, should I revert ?

You could do that, but since it's in its own commit and relatively self-contained, leaving it in would work for me.

lib/TextDecoder-impl.js Outdated Show resolved Hide resolved
lib/TextDecoder-impl.js Outdated Show resolved Hide resolved
lib/TextDecoder-impl.js Outdated Show resolved Hide resolved
lib/TextDecoder-impl.js Outdated Show resolved Hide resolved
lib/TextDecoder-impl.js Outdated Show resolved Hide resolved
@fguitton fguitton force-pushed the chore/whatwg-url-model branch 3 times, most recently from 3e93928 to e0bf542 Compare March 30, 2020 14:06
@fguitton
Copy link
Author

fguitton commented Mar 30, 2020

@TimothyGu I have been able to look at this in a bit more detail.

There should now be support for streams as well. I have had to stop relying on the decode function defined in whatwg-encoding.js because I could not see how to cleanly utilize it.

There seems to be limitations to bringing a compliant version up. Of those, here are some of the things right now :

  1. It turns out a lot of testing from https://github.com/web-platform-tests/wpt/tree/master/encoding isn't usable as such as it relies on testing functions not present in this repo's (or whatwg-url's) current version of the testharness.js. Those new scripts could be copied in place, what do you suggest ?

  2. The iconv-lite package doesn't support fatal-like functionality. While we could scan the output of the decoding for [0xef, 0xbf, 0xbd], that exact sequence could have been validly input in and the strategy to deal with that is unclear. What are your thoughts ?

  3. There seems to be multiple problems in tests due to iconv-lite not supporting a set of encodings defined in https://encoding.spec.whatwg.org/encodings.json. That is the case for x-mac-cyrillic for example.

  4. iconv-lite seems to fail some of the conversions. For example iconvLite.decode(new Uint8Array([0x00, 0xd8]), 'utf-16le') ought to produce the same as new TextDecoder('utf-16le').decode(new Uint8Array([0x00, 0xd8])) but doesn't. Any pointers ?

@fguitton fguitton requested a review from TimothyGu April 11, 2020 10:27
@TimothyGu
Copy link
Member

Thanks for doing the additional investigation!

There should now be support for streams as well. I have had to stop relying on the decode function defined in whatwg-encoding.js because I could not see how to cleanly utilize it.

That's fine. It doesn't use decode in the spec either.

There seems to be limitations to bringing a compliant version up. Of those, here are some of the things right now :

  1. It turns out a lot of testing from https://github.com/web-platform-tests/wpt/tree/master/encoding isn't usable as such as it relies on testing functions not present in this repo's (or whatwg-url's) current version of the testharness.js. Those new scripts could be copied in place, what do you suggest ?

I think we can copy some of those here.

  1. The iconv-lite package doesn't support fatal-like functionality. While we could scan the output of the decoding for [0xef, 0xbf, 0xbd], that exact sequence could have been validly input in and the strategy to deal with that is unclear. What are your thoughts?

Let's just say no fatal support for now, if that's what's needed to get something working as soon as possible.

  1. There seems to be multiple problems in tests due to iconv-lite not supporting a set of encodings defined in https://encoding.spec.whatwg.org/encodings.json. That is the case for x-mac-cyrillic for example.
  2. iconv-lite seems to fail some of the conversions. For example iconvLite.decode(new Uint8Array([0x00, 0xd8]), 'utf-16le') ought to produce the same as new TextDecoder('utf-16le').decode(new Uint8Array([0x00, 0xd8])) but doesn't. Any pointers ?

It looks like iconv-lite doesn't fully implement the WHATWG Encoding Standard. Too bad I suppose. In the long run we should either implement our own algorithms, or change iconv-lite's conversions to conform to the Encoding Standard, or add additional codecs in iconv-lite that do conform to Encoding. For now, having any functionality is still better than not, I'd say, as long as we document these known discrepancies in the README.

For the UTF-16 issue, as uncomfortable as it is for me to say this, I think we could use Node.js' built-in TextDecoder. TextDecoder has always supported UTF-16, even on older Node.js versions when small-icu was the default, unlike other encodings. You might find the Node.js internal UTF-8 decoder to be more compliant too, in case you find discrepancies in iconv-lite.

@chyzwar
Copy link

chyzwar commented Aug 13, 2020

Why jsdom cannot just expose TextEncoder and TextDecoder from node.js ? Are these do not conform spec ?

peaBerberian added a commit to canalplus/rx-player that referenced this pull request Jan 11, 2021
This commit allows the RxPlayer to use the `TextEncoder` and `TextDecoder`
APIs when available respectively to encode JS strings into an UTF-8
bytes sequence (TextEncoder doesn't seem to be able to encode into any
other encoding) and to decode from either UTF-8, UTF-16BE or UTF-16LE
into a JS string.

Because `TextEncoder` and `TextDecoder` are not defined in old browser
versions we claim to support and in IE11, we still fallback to custom
implementation either if it doesn't exist or if the operation fails.

It is important to note of a sensible difference between using
the `TextDecoder` interface and the previous implementation: when
encountering invalid byte sequences in the correponding encoding,
the `TextDecoder` will replace those by a "REPLACEMENT CHARACTER" (�).

This seems fine and even desirable, but the previous implementation just
threw in that same situation.
This means that we now have two different behaviors, depending on the
current platform / browser.

Those functions using the `TextDecoder` APIs are even directly defined
in the `StringUtils` tools, and thus that new behavior can be directly
noticable by applications using it.
Thankfully, nothing is defined in our API documentation about invalid
sequences.

Even if we can consider that this does not break our API (though it is
still unclear to me), it should be is something to keep in mind as this
might be unexpected for users relying on this API throwing.

Also, I tried to add unit tests, but it appears that "jsdom", on which
relies jest to perform unit test while simulation a browser in node,
does not include either APIs yet. Though it is under way:
jsdom/whatwg-encoding#11
peaBerberian added a commit to canalplus/rx-player that referenced this pull request Jan 11, 2021
This commit allows the RxPlayer to use the `TextEncoder` and `TextDecoder`
APIs when available respectively to encode JS strings into an UTF-8
bytes sequence (TextEncoder doesn't seem to be able to encode into any
other encoding) and to decode from either UTF-8, UTF-16BE or UTF-16LE
into a JS string.

Because `TextEncoder` and `TextDecoder` are not defined in old browser
versions we claim to support and in IE11, we still fallback to custom
implementation either if it doesn't exist or if the operation fails.

It is important to note of a sensible difference between using
the `TextDecoder` interface and the previous implementation: when
encountering invalid byte sequences in the correponding encoding,
the `TextDecoder` will replace those by a "REPLACEMENT CHARACTER" (�).

This seems fine and even desirable, but the previous implementation just
threw in that same situation.
This means that we now have two different behaviors, depending on the
current platform / browser.

Those functions using the `TextDecoder` APIs are even directly defined
in the `StringUtils` tools, and thus that new behavior can be directly
noticable by applications using it.
Thankfully, nothing is defined in our API documentation about invalid
sequences.

Even if we can consider that this does not break our API (though it is
still unclear to me), it should be is something to keep in mind as this
might be unexpected for users relying on this API throwing.

Also, I tried to add unit tests, but it appears that "jsdom", on which
relies jest to perform unit test while simulation a browser in node,
does not include either APIs yet. Though it is under way:
jsdom/whatwg-encoding#11
peaBerberian added a commit to canalplus/rx-player that referenced this pull request Jan 28, 2021
This commit allows the RxPlayer to use the `TextEncoder` and `TextDecoder`
APIs when available respectively to encode JS strings into an UTF-8
bytes sequence (TextEncoder doesn't seem to be able to encode into any
other encoding) and to decode from either UTF-8, UTF-16BE or UTF-16LE
into a JS string.

Because `TextEncoder` and `TextDecoder` are not defined in old browser
versions we claim to support and in IE11, we still fallback to custom
implementation either if it doesn't exist or if the operation fails.

It is important to note of a sensible difference between using
the `TextDecoder` interface and the previous implementation: when
encountering invalid byte sequences in the correponding encoding,
the `TextDecoder` will replace those by a "REPLACEMENT CHARACTER" (�).

This seems fine and even desirable, but the previous implementation just
threw in that same situation.
This means that we now have two different behaviors, depending on the
current platform / browser.

Those functions using the `TextDecoder` APIs are even directly defined
in the `StringUtils` tools, and thus that new behavior can be directly
noticable by applications using it.
Thankfully, nothing is defined in our API documentation about invalid
sequences.

Even if we can consider that this does not break our API (though it is
still unclear to me), it should be is something to keep in mind as this
might be unexpected for users relying on this API throwing.

Also, I tried to add unit tests, but it appears that "jsdom", on which
relies jest to perform unit test while simulation a browser in node,
does not include either APIs yet. Though it is under way:
jsdom/whatwg-encoding#11
@stbrody
Copy link

stbrody commented Apr 6, 2021

Hey, I'm wondering what the status of this PR is? It looks like it's been sitting idle for almost a year now. The lack of support for TextEncoder and TextDecoder within jsdom, which is heavily utilized by jest, is causing a real headache for our project right now. We'll have to jump through a lot of hoops to work around this in all of our various repos, but it would be much nicer if this was just fixed upstream in jsdom

@fguitton
Copy link
Author

fguitton commented Aug 18, 2021

@TimothyGu I was having a look at this again today.
I have aligned the platform tests to the latest of WPT and updated relevant dependencies.

Beyond the elements I mentioned in #11 (comment) I have come to realize that the testing harness instrumentation from WPT simply cannot work here without a serious rewrite. Following your suggestions on #11 (comment) I can see there are far to many APIs they rely on, such as XMLHttpRequest, that are not implemented in Node and even after trying out a couple of high level polyfills the likes of node-XMLHttpRequest I wasn't able to get more than an extra dozen test to pass.

As such, this PR, with all its limitations, at the very least might allow people with less complex requirements to have an easier life. I am happy to write a disclaimer to accompany the current code should an initial "substandard" release be done. (At this point I'd argue for a not-so-elegant release rather than no release at all)

A brave soul could look at improving the test harness and support for more complex use cases later on 🧠.

@fguitton fguitton force-pushed the chore/whatwg-url-model branch 2 times, most recently from 16dfc21 to 80596b7 Compare September 21, 2022 10:52
@ThisIsMissEm
Copy link

@TimothyGu @fguitton are there further changes needed on this branch, or would it in its current state be a good step forwards?

@domenic
Copy link
Member

domenic commented Oct 12, 2022

As discussed on Twitter, what is needed before this is merged is ensuring all tests pass. Currently most are commented out, and none of the ones in subdirectories of encoding/ are even downloaded.

@hawk-ahmed-matar
Copy link

Any updates?

@ThisIsMissEm
Copy link

@domenic yeah, I've not had a chance to work on this, been snowed under with other things.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants