Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

provide a way to introduce new encodings for new Buffer('someString', 'someEncoding') #2835

Closed
Mithgol opened this issue Sep 12, 2015 · 20 comments
Labels
buffer Issues and PRs related to the buffer subsystem. feature request Issues that request new features to be added to Node.js.

Comments

@Mithgol
Copy link

Mithgol commented Sep 12, 2015

In the constructor new Buffer('someString', 'someEncoding') Node.js v4.0.0 itself supports a limited number of encodings: 'ascii', 'utf8', 'utf16le' (aka 'ucs2'), 'base64', 'hex', 'binary'. That's a half of a dozen. That's not plenty.

And that's why a widely used package iconv-lite (450+ direct dependents, ≈240+ thousands of daily downloads) provides a method (.extendNodeEncodings()) that adds a support of many other known encodings to the Buffer API.

However, iconv-lite does not seem to work in Node v4.0.0 well enough. Any use of an iconv-lite-provided encoding in an attempt of new Buffer('someLatinString', 'encodingName') results in some random output such as the following:

(screenshot)

It seems to me that either 0fa6c4a was not enough to fix #1547 or a separate deeper issue exists.

Currently iconv-lite's extend-node.js changes the behaviour of the following methods:

in SlowBuffer in Buffer
SlowBuffer.prototype.toString Buffer.prototype.toString
SlowBuffer.prototype.write Buffer.prototype.write
SlowBuffer.byteLength Buffer.byteLength
  Buffer.isEncoding

Why those changes were enough in Node v0.10 and v0.12 but aren't in Node v4.0.0?

I may be wrong, but… it seems to me that in Node v0.12 the Buffer's constructor have used this.write(subject, encoding) internally but in the current Node v4.0.0 neither the Buffer's constructor nor its fromString helper do that. They seem to use binding.createFromString(string, encoding) (where necessary) or allocPool.write(string, poolOffset, encoding) and both of these come from process.binding('buffer') and aren't replaced by iconv-lite. And it won't be easy to replace them from userland, I suppose.

Is my assumption correct?

What should be done in iconv-lite (or in Node.js, or in both) for a multitude of encodings to work in the new Buffer('someString', 'encodingName') constructor correctly?

@Mithgol
Copy link
Author

Mithgol commented Sep 12, 2015

And a status badge goes to #2798.

Mithgol added a commit to Mithgol/node-fidonet-jam that referenced this issue Sep 12, 2015
Mithgol added a commit to Mithgol/node-twi2fido that referenced this issue Sep 12, 2015
Mithgol added a commit to Mithgol/echolist-csv2hpt that referenced this issue Sep 12, 2015
@Mithgol
Copy link
Author

Mithgol commented Sep 13, 2015

This issue may be seen as a spiritual successor to nodejs/node-v0.x-archive#1772.

Mithgol added a commit to Mithgol/node-fidonet-jam that referenced this issue Sep 13, 2015
Mithgol added a commit to Mithgol/node-twi2fido that referenced this issue Sep 13, 2015
Mithgol added a commit to Mithgol/fido2rss that referenced this issue Sep 13, 2015
Mithgol added a commit to Mithgol/echolist-csv2hpt that referenced this issue Sep 13, 2015
@bnoordhuis
Copy link
Member

I'm inclined to say this is solely on iconv-lite's head for trying to monkey-patch core, and I say this as the author of a module that does something similar. Opinions, anyone?

@ChALkeR
Copy link
Member

ChALkeR commented Sep 13, 2015

Monkey-patching is not supported, and will never be for obvious reasons. That includes both overriding builtin methods and adding new methods to builtins (both Node.js and v8). Such code could and will break in minor or patch versions, or even within the same Node.js version.

It's entirely iconv-lite fault for breaking here.

On the other hand, a supported method of defining new encodings seems like a valid idea, given that overriding an already defined (both built-in or thirdparty) encoding produces an error.

@Fishrock123
Copy link
Contributor

Introducing new encoding options from userland seems valid, so long as it doesn't require any V8 monkey-business.

Patching core, especially in this case, isn't exactly supported. Buffer changes were forced on us via V8, nothing we can really do.

@ChALkeR
Copy link
Member

ChALkeR commented Sep 13, 2015

  • lib/buffer.js#L191.isEncoding. This could be patched to be aware of user-defined encodings.
  • lib/buffer.js#L342.toString. This could be patched to support user-defined .toString for a new encoding.
  • lib/buffer.js#L87.fromString. This could be patched to support user-defined .fromString for a new encoding.
  • lib/buffer.js#L534.write. This could be patched to support user-defined .write for a new encoding.

We could add API like Buffer.registerEncoding(encoding, toBinary, fromBinary), which would register encoding if it doesn't exist yet, and that will throw an error if that encoding is already defined.

@ChALkeR
Copy link
Member

ChALkeR commented Sep 13, 2015

On the other hand, I see no actual profit in the above. I am just saying that doing so is possible.

As always, partial (~⅓) usage of .extendNodeEncodings:

a/alinator-0.2.2.tgz/lib/requests/scheduleRequest.js:9:iconv.extendNodeEncodings();
a/appbuilder-2.9.3-426.tgz/lib/common/mobile/android/android-emulator-services.js:29:        iconv.extendNodeEncodings();
c/ctbc_payment-0.2.1.tgz/lib/ctbc_creditcard.js:5:iconv.extendNodeEncodings();
c/ctbc_payment-0.2.1.tgz/lib/ctbc_unionpay.js:5:iconv.extendNodeEncodings();
d/delatinise-cli-0.1.11.tgz/lib/delatinise.js:159:      iconv.extendNodeEncodings();
f/fiunis-2.1.1.tgz/fiunis.js:1:require('iconv-lite').extendNodeEncodings();
f/fidonet-squish-0.0.39.tgz/fidonet-squish.js:5:require('iconv-lite').extendNodeEncodings();
f/fidonet-jam-3.8.7.tgz/fidonet-jam.js:6:require('iconv-lite').extendNodeEncodings();
h/hzip-1.1.0.tgz/encoding/iconv-lite/lib/extend-node.js:7:    iconv.extendNodeEncodings = function extendNodeEncodings() {
h/hzip-1.1.0.tgz/encoding/iconv-lite/lib/extend-node.js:187:            throw new Error("require('iconv-lite').undoExtendNodeEncodings(): Nothing to undo; extendNodeEncodings() is not called.")
i/iconv-lite-0.4.10.tgz/lib/extend-node.js:8:    iconv.extendNodeEncodings = function extendNodeEncodings() {
i/iconv-lite-0.4.10.tgz/lib/extend-node.js:179:            throw new Error("require('iconv-lite').undoExtendNodeEncodings(): Nothing to undo; extendNodeEncodings() is not called.")
l/linez-3.2.1.tgz/js/linez.js:4:iconv.extendNodeEncodings();
n/node-captions-0.4.2.tgz/captions.js:12:iconv.extendNodeEncodings();
n/node-cmpp-3-1.1.7.tgz/cmppSocket.js:17:iconv.extendNodeEncodings();
n/nativescript-1.0.2.tgz/lib/common/mobile/android/android-emulator-services.js:29:        iconv.extendNodeEncodings();
n/node-ral-0.0.36.tgz/lib/ral.js:25:iconv.extendNodeEncodings();
n/node-ral-0.0.36.tgz/test/protocol/http_protocol_post_test.js:17:iconv.extendNodeEncodings();
r/rastreiojs-0.2.4.tgz/lib/index.js:21:iconv.extendNodeEncodings();
s/simteconf-0.6.10.tgz/simteconf.js:10:require('iconv-lite').extendNodeEncodings();
s/slap-0.1.28.tgz/lib/cli.js:80:    iconv.extendNodeEncodings();
s/sro-0.1.4.tgz/lib/sro.js:13:iconv.extendNodeEncodings();
u/umpay-transfer-1.0.4.tgz/index.js:10:iconv.extendNodeEncodings();
u/umpay-1.1.1.tgz/index.js:10:iconv.extendNodeEncodings();

That's actually pretty low (18 modules). I expect the total count to be around 50-60 modules.

@Mithgol Could you explain how iconv.extendNodeEncodings() is better than manually converting those encodings, for the reference?

@Mithgol
Copy link
Author

Mithgol commented Sep 13, 2015

@ChALkeR

It's better because it's easier to introduce to an existing project. It feels almost infinitely easier.

You just write require('iconv-lite').extendNodeEncodings() once and suddenly it just works everywhere like a charm.

You don't have to do anything else.

For example, you don't have to remember every place where Buffer(string, encoding) was written previously and carefully upgrade them to iconvLite.encode while simultaneously keeping in mind that Buffer(string) (without encoding) defaults to UTF-8 but iconvLite.encode doesn't.

@seishun
Copy link
Contributor

seishun commented Oct 15, 2015

I don't think it should be supported to change the behavior of core functions from userland. We don't do that anywhere else as far as I'm aware.

@danielgindi
Copy link

I think that allowing a module to add more encodings to Buffer - makes so much sense, that I can't understand the resistance.
Take for example a giant project, which has many points where encodings are used. It reads CSVs, it creates log files and Excel files, it is maybe reading outputs of other system modules or remote modules over the web. Encodings may vary! And you don't want to write so much code to handle encodings in so many places, it almost seems like a hack itself.

The proper solution would be to use the native Node.js features for handling Buffers and Encoding. But the thing that is missing is the support for new encodings.

@Mithgol
Copy link
Author

Mithgol commented Oct 21, 2015

@seishun

That's true, Node.js does not support extending the behaviour of its core functions anywhere else, but that's probably because Node.js does not have an obviously limited support of an obviously vast area (such as encodings) anywhere else.

Also, most of other such core modules are wrappers around third-party libraries, and that's their excuse. For example,

  • How should I extend crypto with SHA-3? — crypto is a wrapper around OpenSSL, just wait until that hash becomes supported in the upstream.
  • How should I extend Intl with a support for Tengwar? — We merely wrap around ICU, you should wait until Tengwar becomes supported by the Unicode Consortium according to their roadmap (subject to change).
  • How should I extend zlib with a support for unpacking .ZIP files in addition to its current inflating deflated streams? — zlib is merely a wrapper around the well-known zlib library and thus it's weird to add some features that are missing in the upstream.

@Mithgol
Copy link
Author

Mithgol commented Dec 21, 2015

This comment is a nudge after a couple of months.

@trevnorris
Copy link
Contributor

@Mithgol Basically taking @ChALkeR's API, would
Buffer.registerEncoding(encoding, toBuffer, toString) work? Where toBuffer returns a Buffer and toString return a String. Both of which are propagated to the user.

@Mithgol
Copy link
Author

Mithgol commented Dec 28, 2015

Yes.

@seishun
Copy link
Contributor

seishun commented Dec 28, 2015

I don't think Buffer should be used for encoding/decoding beyond the few common encodings that it provides for convenience. It's simply outside of its scope of responsibility.

For example, you don't have to remember every place where Buffer(string, encoding) was written previously and carefully upgrade them to iconvLite.encode while simultaneously keeping in mind that Buffer(string) (without encoding) defaults to UTF-8 but iconvLite.encode doesn't.

TBH if one ends up with Buffer calls all over the codebase and needs to change the encoding in all of them, it seems like code smell. It's not node's job to help with bad architectural decisions.

And even in your example, you'd still have to carefully "upgrade" all the Buffer(string) calls to Buffer(string, 'whatever-encoding') anyway.

It's such a rare edge case that adding an API that will likely cause many issues down the road for questionable benefit doesn't seem worth it.

@jasnell
Copy link
Member

jasnell commented Mar 16, 2016

Recommending that this be closed. Modules that want to provide support for other encodings can do so easily without monkey patching node or having node provide any kind of extension mechanism. I don't think we should be encouraging this anti-pattern further.

@bnoordhuis
Copy link
Member

Let's close then.

Mithgol added a commit to Mithgol/dauria that referenced this issue Mar 16, 2016
Mithgol added a commit to Mithgol/fiunis that referenced this issue Mar 16, 2016
Mithgol added a commit to Mithgol/simteconf that referenced this issue Mar 16, 2016
Mithgol added a commit to Mithgol/echolist-csv2hpt that referenced this issue Mar 16, 2016
Mithgol added a commit to Mithgol/node-fidonet-jam that referenced this issue Mar 16, 2016
Mithgol added a commit to Mithgol/node-twi2fido that referenced this issue Mar 16, 2016
Mithgol added a commit to Mithgol/node-twi2fido that referenced this issue Mar 16, 2016
@Mithgol
Copy link
Author

Mithgol commented Mar 16, 2016

Quick summary of changes that are necessary to do without require('iconv-lite').extendNodeEncodings():

  • Obvious replacements:
    • Buffer.isEncoding → iconv.encodingExists
    • buf.toString(encoding)iconv.decode(buf, encoding)
    • buf(str, encoding)iconv.encode(str, encoding)
  • Cannot use method chaining; for example, Buffer(str).toString('utf7') becomes iconv.decode(iconv.encode(str, 'utf8'), 'utf7')
  • Cannot use encoding options of fs methods; for example,
    • fs.readFileSync(filename, {encoding: encoding}) → iconv.decode(fs.readFileSync(filename), encoding)
    • fs.writeFileSync(filename, content, {encoding: encoding}) → fs.writeFileSync(filename, iconv.encode(content, encoding))
  • Cannot use simple buf.toString(encoding, start, end) conversion, should nest iconv.decode(buf.slice(start, end), encoding)
  • Sometimes adding iconv-lite to dependencies becomes necessary to check iconv.encodingExists explicitly in places where Buffer.isEncoding silently worked because of require('iconv-lite').extendNodeEncodings() in another dependency.

¯\_(ツ)_/¯

@ChALkeR
Copy link
Member

ChALkeR commented Jul 19, 2017

@Mithgol #13644 seems to solve this in a nicer (and more standardized) way. Does it cover the usecases which you have in mind for this?

@Mithgol
Copy link
Author

Mithgol commented Jul 20, 2017

@ChALkeR Unfortunately, not all of them, because the list of WHATWG-supported encodings seems to be quite limited.

For example, UTF-7 is not supported and hence I cannot use it for Fidonet Unicode substrings.

I'd better stay on iconv-lite.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
buffer Issues and PRs related to the buffer subsystem. feature request Issues that request new features to be added to Node.js.
Projects
None yet
Development

No branches or pull requests

9 participants