Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] proposed new unicode module #8075

Closed
wants to merge 1 commit into from
Closed

Conversation

jasnell
Copy link
Member

@jasnell jasnell commented Aug 12, 2016

The Node.js enhancement proposal nodejs/node-eps#15 proposes the addition of a new 'unicode' module that introduces a number of ICU backed APIs, including the ability to transcode a Buffer from one encoding to another. This is the prototype implementation of the new module.

It is a work in progress. This is not intended to be landed at this time.

/cc @trevnorris @addaleax @srl295

@jasnell jasnell added wip Issues and PRs that are still a work in progress. i18n-api Issues and PRs related to the i18n implementation. labels Aug 12, 2016
@nodejs-github-bot nodejs-github-bot added c++ Issues and PRs that require attention from people who are familiar with C++. build Issues and PRs related to build files or the CI. unicode labels Aug 12, 2016
Returns the character located at the given offset in the `Buffer`. Works even
if the offset falls in the middle of a multibyte UTF-8 or UTF-16 character.

## unicode.utf8Slice(buf, start, end)
Copy link
Contributor

@mscdex mscdex Aug 12, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why restrict this to just utf8? What about just having general slice() and length() functions that accept an encoding name? I'm thinking this would be beneficial for other multi-byte encodings as well, such as utf16.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just UTF8 for now, I'm going to expand this out to support each of our main encodings.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that said, out of the encodings we support out of the box (utf8, utf16le, latin1, and ascii), utf8 is the only one that's difficult here. latin1 and ascii would fallback to Buffer.prototype.split and Buffer.prototype.length, and utf16le is always 2 bytes so length is always Buffer.prototype.length / 2 and split is straightforward.

Where it begins to get complicated is if we decide to support the full range of encodings offered by ICU with full-data.

@addaleax
Copy link
Member

Nice, I’m excited! I’ve PRed with some of the changes I’m suggesting here at jasnell#10, feel free to do with those whatever you like!

@jasnell
Copy link
Member Author

jasnell commented Aug 14, 2016

Awesome, thanks all. The impl here is still pretty rough and I already have some updates queued up. I'll work in the suggested changes and I'll take a look at your pr @addaleax. Much appreciated!

* `encoding` The character encoding of the `Buffer` data. Default = `'utf8'`

Returns the Unicode codepoint located at the given offset in the `Buffer`. Works
even if the offset falls in the middle of a multibyte UTF-8 or UTF-16 character.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it mean it returns the same codepoint if you are at the start or in the middle of a UTF-8 character?
If that is the case, it may be good to say that the behavior is different than String.codePointAt

@targos
Copy link
Member

targos commented Aug 14, 2016

The name of this module conflicts with https://www.npmjs.com/package/unicode. How do you plan to handle this ?

throw new TypeError('First argument must be a Buffer');
if (typeof end === 'undefined') end = buf.length;
start >>= 0;
end >>= 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't operate like String#slice() or Buffer#slice(). If a negative number is passed then it should slice going backwards. There are some other minor argument coercion differences between this and how Typed Array handles arguments. Buffer should also have its argument parsing reviewed and fixed, but might as well start now. Here's an idea of how they should be parsed:

if (!start) start = 0;
// This works b/c "start < 0" calls valueOf() from start. Evaluate this
// way to be consistent with how Typed Array evaluates. For example:
//   ui8array.subarray(4.6, 5.1): 4.6 -> 4; 5.1 -> 5
//   ui8array.subarray(-3.1, -1.9): -3.1 -> -3; -1.9 -> -1
else start = start < 0 ? Math.ceil(start) : Math.floor(start);

if (end === undefined) end = buf.length;
else if (!end) end = 0;
else end = end < 0 ? Math.ceil(end) : Math.floor(end);

if (end <= start) return Buffer.alloc(0);

There's probably be a simpler way to achieve this, but intentionally wanted this to be verbose to easily see what's going on.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I've had a standing todo to go back and reconcile this. I'll hit that in the next update.

* Moves transcode and normalize to buffer module statics
* Moves getCharacterProperty and getColumnWidth to util
@jasnell
Copy link
Member Author

jasnell commented Sep 27, 2016

My plan on this is to back off adding a new module. And instead:

a) Have separate PRs that add transcode and normalize as statics on require('buffer') ... NOT as prototype methods on Buffer but as statics on the module itself; and
b) To introduce getCharacterProperty() and getColumnWidth() in separate PRs.

@jasnell jasnell closed this Oct 11, 2016
@jasnell jasnell mentioned this pull request Oct 11, 2016
4 tasks
@jasnell
Copy link
Member Author

jasnell commented Oct 11, 2016

Closed this so I can work on breaking it up into individual PRs

jasnell added a commit to jasnell/node that referenced this pull request Oct 25, 2016
Rather than the pseudo-wcwidth impl used currently, use the ICU
character properties database to calculate string width and
determine if a character is full width or not. This allows the
algorithm to correctly identify emoji's as full width, ensures
the algorithm will continue to fucntion properly as new unicode
codepoints are added, and it's faster.

This was originally part of a proposal to add a new unicode module,
but has been split out.

Refs: nodejs#8075
jasnell added a commit that referenced this pull request Oct 25, 2016
Rather than the pseudo-wcwidth impl used currently, use the ICU
character properties database to calculate string width and
determine if a character is full width or not. This allows the
algorithm to correctly identify emoji's as full width, ensures
the algorithm will continue to fucntion properly as new unicode
codepoints are added, and it's faster.

This was originally part of a proposal to add a new unicode module,
but has been split out.

Refs: #8075
PR-URL: #9040
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: Steven R Loomis <srloomis@us.ibm.com>
jasnell added a commit to jasnell/node that referenced this pull request Oct 25, 2016
Add buffer.transcode(source, from, to) method. Primarily uses ICU
to transcode a buffer's content from one of Node.js' supported
encodings to another.

Originally part of a proposal to add a new unicode module. Decided
to refactor the approach towrds individual PRs without a new module.

Refs: nodejs#8075
jasnell added a commit that referenced this pull request Oct 25, 2016
Add buffer.transcode(source, from, to) method. Primarily uses ICU
to transcode a buffer's content from one of Node.js' supported
encodings to another.

Originally part of a proposal to add a new unicode module. Decided
to refactor the approach towrds individual PRs without a new module.

Refs: #8075
PR-URL: #9038
Reviewed-By: Anna Henningsen <anna@addaleax.net>
evanlucas pushed a commit that referenced this pull request Nov 3, 2016
Rather than the pseudo-wcwidth impl used currently, use the ICU
character properties database to calculate string width and
determine if a character is full width or not. This allows the
algorithm to correctly identify emoji's as full width, ensures
the algorithm will continue to fucntion properly as new unicode
codepoints are added, and it's faster.

This was originally part of a proposal to add a new unicode module,
but has been split out.

Refs: #8075
PR-URL: #9040
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: Steven R Loomis <srloomis@us.ibm.com>
evanlucas pushed a commit that referenced this pull request Nov 3, 2016
Add buffer.transcode(source, from, to) method. Primarily uses ICU
to transcode a buffer's content from one of Node.js' supported
encodings to another.

Originally part of a proposal to add a new unicode module. Decided
to refactor the approach towrds individual PRs without a new module.

Refs: #8075
PR-URL: #9038
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
build Issues and PRs related to build files or the CI. c++ Issues and PRs that require attention from people who are familiar with C++. i18n-api Issues and PRs related to the i18n implementation. wip Issues and PRs that are still a work in progress.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants