[WIP] proposed new unicode module #8075

jasnell · 2016-08-12T01:42:22Z

The Node.js enhancement proposal nodejs/node-eps#15 proposes the addition of a new 'unicode' module that introduces a number of ICU backed APIs, including the ability to transcode a Buffer from one encoding to another. This is the prototype implementation of the new module.

It is a work in progress. This is not intended to be landed at this time.

/cc @trevnorris @addaleax @srl295

mscdex · 2016-08-12T01:57:48Z

doc/api/unicode.md

+Returns the character located at the given offset in the `Buffer`. Works even
+if the offset falls in the middle of a multibyte UTF-8 or UTF-16 character.
+
+## unicode.utf8Slice(buf, start, end)


Why restrict this to just utf8? What about just having general slice() and length() functions that accept an encoding name? I'm thinking this would be beneficial for other multi-byte encodings as well, such as utf16.

It's just UTF8 for now, I'm going to expand this out to support each of our main encodings.

that said, out of the encodings we support out of the box (utf8, utf16le, latin1, and ascii), utf8 is the only one that's difficult here. latin1 and ascii would fallback to Buffer.prototype.split and Buffer.prototype.length, and utf16le is always 2 bytes so length is always Buffer.prototype.length / 2 and split is straightforward.

Where it begins to get complicated is if we decide to support the full range of encodings offered by ICU with full-data.

addaleax · 2016-08-14T00:52:53Z

Nice, I’m excited! I’ve PRed with some of the changes I’m suggesting here at jasnell#10, feel free to do with those whatever you like!

jasnell · 2016-08-14T01:55:40Z

Awesome, thanks all. The impl here is still pretty rough and I already have some updates queued up. I'll work in the suggested changes and I'll take a look at your pr @addaleax. Much appreciated!

targos · 2016-08-14T06:18:22Z

doc/api/unicode.md

+* `encoding` The character encoding of the `Buffer` data. Default = `'utf8'`
+
+Returns the Unicode codepoint located at the given offset in the `Buffer`. Works
+even if the offset falls in the middle of a multibyte UTF-8 or UTF-16 character.


does it mean it returns the same codepoint if you are at the start or in the middle of a UTF-8 character?
If that is the case, it may be good to say that the behavior is different than String.codePointAt

targos · 2016-08-14T07:24:32Z

The name of this module conflicts with https://www.npmjs.com/package/unicode. How do you plan to handle this ?

trevnorris · 2016-08-17T20:58:11Z

lib/icu.js

+    throw new TypeError('First argument must be a Buffer');
+  if (typeof end === 'undefined') end = buf.length;
+  start >>= 0;
+  end >>= 0;


This doesn't operate like String#slice() or Buffer#slice(). If a negative number is passed then it should slice going backwards. There are some other minor argument coercion differences between this and how Typed Array handles arguments. Buffer should also have its argument parsing reviewed and fixed, but might as well start now. Here's an idea of how they should be parsed:

if (!start) start = 0; // This works b/c "start < 0" calls valueOf() from start. Evaluate this // way to be consistent with how Typed Array evaluates. For example: // ui8array.subarray(4.6, 5.1): 4.6 -> 4; 5.1 -> 5 // ui8array.subarray(-3.1, -1.9): -3.1 -> -3; -1.9 -> -1 else start = start < 0 ? Math.ceil(start) : Math.floor(start); if (end === undefined) end = buf.length; else if (!end) end = 0; else end = end < 0 ? Math.ceil(end) : Math.floor(end); if (end <= start) return Buffer.alloc(0);

There's probably be a simpler way to achieve this, but intentionally wanted this to be verbose to easily see what's going on.

Yeah, I've had a standing todo to go back and reconcile this. I'll hit that in the next update.

* Moves transcode and normalize to buffer module statics * Moves getCharacterProperty and getColumnWidth to util

jasnell · 2016-09-27T21:56:59Z

My plan on this is to back off adding a new module. And instead:

a) Have separate PRs that add transcode and normalize as statics on require('buffer') ... NOT as prototype methods on Buffer but as statics on the module itself; and
b) To introduce getCharacterProperty() and getColumnWidth() in separate PRs.

jasnell · 2016-10-11T21:24:47Z

Closed this so I can work on breaking it up into individual PRs

Rather than the pseudo-wcwidth impl used currently, use the ICU character properties database to calculate string width and determine if a character is full width or not. This allows the algorithm to correctly identify emoji's as full width, ensures the algorithm will continue to fucntion properly as new unicode codepoints are added, and it's faster. This was originally part of a proposal to add a new unicode module, but has been split out. Refs: nodejs#8075

Rather than the pseudo-wcwidth impl used currently, use the ICU character properties database to calculate string width and determine if a character is full width or not. This allows the algorithm to correctly identify emoji's as full width, ensures the algorithm will continue to fucntion properly as new unicode codepoints are added, and it's faster. This was originally part of a proposal to add a new unicode module, but has been split out. Refs: #8075 PR-URL: #9040 Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl> Reviewed-By: Steven R Loomis <srloomis@us.ibm.com>

Add buffer.transcode(source, from, to) method. Primarily uses ICU to transcode a buffer's content from one of Node.js' supported encodings to another. Originally part of a proposal to add a new unicode module. Decided to refactor the approach towrds individual PRs without a new module. Refs: nodejs#8075

Add buffer.transcode(source, from, to) method. Primarily uses ICU to transcode a buffer's content from one of Node.js' supported encodings to another. Originally part of a proposal to add a new unicode module. Decided to refactor the approach towrds individual PRs without a new module. Refs: #8075 PR-URL: #9038 Reviewed-By: Anna Henningsen <anna@addaleax.net>

Rather than the pseudo-wcwidth impl used currently, use the ICU character properties database to calculate string width and determine if a character is full width or not. This allows the algorithm to correctly identify emoji's as full width, ensures the algorithm will continue to fucntion properly as new unicode codepoints are added, and it's faster. This was originally part of a proposal to add a new unicode module, but has been split out. Refs: #8075 PR-URL: #9040 Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl> Reviewed-By: Steven R Loomis <srloomis@us.ibm.com>

Add buffer.transcode(source, from, to) method. Primarily uses ICU to transcode a buffer's content from one of Node.js' supported encodings to another. Originally part of a proposal to add a new unicode module. Decided to refactor the approach towrds individual PRs without a new module. Refs: #8075 PR-URL: #9038 Reviewed-By: Anna Henningsen <anna@addaleax.net>

jasnell added wip Issues and PRs that are still a work in progress. i18n-api Issues and PRs related to the i18n implementation. labels Aug 12, 2016

nodejs-github-bot added c++ Issues and PRs that require attention from people who are familiar with C++. build Issues and PRs related to build files or the CI. unicode labels Aug 12, 2016

jasnell mentioned this pull request Aug 12, 2016

00X: Add 'icu' module nodejs/node-eps#15

Closed

mscdex reviewed Aug 12, 2016
View reviewed changes

targos reviewed Aug 14, 2016
View reviewed changes

jasnell force-pushed the icu-module branch from 842e081 to 2b0a869 Compare August 15, 2016 17:51

trevnorris reviewed Aug 17, 2016
View reviewed changes

jasnell force-pushed the icu-module branch 4 times, most recently from 2b35155 to b5457d8 Compare September 12, 2016 20:46

Trott force-pushed the master branch from b0df363 to c5ce7f4 Compare September 21, 2016 00:09

jasnell force-pushed the icu-module branch from 655bcd5 to eba6e33 Compare September 27, 2016 20:58

icu: add icu module

3d261d3

* Moves transcode and normalize to buffer module statics * Moves getCharacterProperty and getColumnWidth to util

jasnell force-pushed the icu-module branch from eba6e33 to 3d261d3 Compare September 27, 2016 21:47

jasnell closed this Oct 11, 2016

jasnell mentioned this pull request Oct 11, 2016

buffer: add buffer.transcode #9038

Closed

4 tasks

jasnell mentioned this pull request Oct 11, 2016

readline: use icu based string width calculation #9040

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] proposed new unicode module #8075

[WIP] proposed new unicode module #8075

jasnell commented Aug 12, 2016 •

edited

Loading

mscdex Aug 12, 2016 •

edited

Loading

jasnell Aug 12, 2016

jasnell Aug 12, 2016

addaleax commented Aug 14, 2016

jasnell commented Aug 14, 2016

targos Aug 14, 2016

targos commented Aug 14, 2016

trevnorris Aug 17, 2016

jasnell Aug 17, 2016

jasnell commented Sep 27, 2016

jasnell commented Oct 11, 2016

[WIP] proposed new unicode module #8075

[WIP] proposed new unicode module #8075

Conversation

jasnell commented Aug 12, 2016 • edited Loading

mscdex Aug 12, 2016 • edited Loading

Choose a reason for hiding this comment

jasnell Aug 12, 2016

Choose a reason for hiding this comment

jasnell Aug 12, 2016

Choose a reason for hiding this comment

addaleax commented Aug 14, 2016

jasnell commented Aug 14, 2016

targos Aug 14, 2016

Choose a reason for hiding this comment

targos commented Aug 14, 2016

trevnorris Aug 17, 2016

Choose a reason for hiding this comment

jasnell Aug 17, 2016

Choose a reason for hiding this comment

jasnell commented Sep 27, 2016

jasnell commented Oct 11, 2016

jasnell commented Aug 12, 2016 •

edited

Loading

mscdex Aug 12, 2016 •

edited

Loading