-
Notifications
You must be signed in to change notification settings - Fork 29.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] proposed new unicode module #8075
Conversation
Returns the character located at the given offset in the `Buffer`. Works even | ||
if the offset falls in the middle of a multibyte UTF-8 or UTF-16 character. | ||
|
||
## unicode.utf8Slice(buf, start, end) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why restrict this to just utf8? What about just having general slice()
and length()
functions that accept an encoding name? I'm thinking this would be beneficial for other multi-byte encodings as well, such as utf16.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's just UTF8 for now, I'm going to expand this out to support each of our main encodings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that said, out of the encodings we support out of the box (utf8, utf16le, latin1, and ascii), utf8 is the only one that's difficult here. latin1 and ascii would fallback to Buffer.prototype.split and Buffer.prototype.length, and utf16le is always 2 bytes so length is always Buffer.prototype.length / 2 and split is straightforward.
Where it begins to get complicated is if we decide to support the full range of encodings offered by ICU with full-data.
Nice, I’m excited! I’ve PRed with some of the changes I’m suggesting here at jasnell#10, feel free to do with those whatever you like! |
Awesome, thanks all. The impl here is still pretty rough and I already have some updates queued up. I'll work in the suggested changes and I'll take a look at your pr @addaleax. Much appreciated! |
* `encoding` The character encoding of the `Buffer` data. Default = `'utf8'` | ||
|
||
Returns the Unicode codepoint located at the given offset in the `Buffer`. Works | ||
even if the offset falls in the middle of a multibyte UTF-8 or UTF-16 character. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does it mean it returns the same codepoint if you are at the start or in the middle of a UTF-8 character?
If that is the case, it may be good to say that the behavior is different than String.codePointAt
The name of this module conflicts with https://www.npmjs.com/package/unicode. How do you plan to handle this ? |
throw new TypeError('First argument must be a Buffer'); | ||
if (typeof end === 'undefined') end = buf.length; | ||
start >>= 0; | ||
end >>= 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't operate like String#slice()
or Buffer#slice()
. If a negative number is passed then it should slice going backwards. There are some other minor argument coercion differences between this and how Typed Array handles arguments. Buffer should also have its argument parsing reviewed and fixed, but might as well start now. Here's an idea of how they should be parsed:
if (!start) start = 0;
// This works b/c "start < 0" calls valueOf() from start. Evaluate this
// way to be consistent with how Typed Array evaluates. For example:
// ui8array.subarray(4.6, 5.1): 4.6 -> 4; 5.1 -> 5
// ui8array.subarray(-3.1, -1.9): -3.1 -> -3; -1.9 -> -1
else start = start < 0 ? Math.ceil(start) : Math.floor(start);
if (end === undefined) end = buf.length;
else if (!end) end = 0;
else end = end < 0 ? Math.ceil(end) : Math.floor(end);
if (end <= start) return Buffer.alloc(0);
There's probably be a simpler way to achieve this, but intentionally wanted this to be verbose to easily see what's going on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I've had a standing todo to go back and reconcile this. I'll hit that in the next update.
2b35155
to
b5457d8
Compare
* Moves transcode and normalize to buffer module statics * Moves getCharacterProperty and getColumnWidth to util
My plan on this is to back off adding a new module. And instead: a) Have separate PRs that add transcode and normalize as statics on require('buffer') ... NOT as prototype methods on |
Closed this so I can work on breaking it up into individual PRs |
Rather than the pseudo-wcwidth impl used currently, use the ICU character properties database to calculate string width and determine if a character is full width or not. This allows the algorithm to correctly identify emoji's as full width, ensures the algorithm will continue to fucntion properly as new unicode codepoints are added, and it's faster. This was originally part of a proposal to add a new unicode module, but has been split out. Refs: nodejs#8075
Rather than the pseudo-wcwidth impl used currently, use the ICU character properties database to calculate string width and determine if a character is full width or not. This allows the algorithm to correctly identify emoji's as full width, ensures the algorithm will continue to fucntion properly as new unicode codepoints are added, and it's faster. This was originally part of a proposal to add a new unicode module, but has been split out. Refs: #8075 PR-URL: #9040 Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl> Reviewed-By: Steven R Loomis <srloomis@us.ibm.com>
Add buffer.transcode(source, from, to) method. Primarily uses ICU to transcode a buffer's content from one of Node.js' supported encodings to another. Originally part of a proposal to add a new unicode module. Decided to refactor the approach towrds individual PRs without a new module. Refs: nodejs#8075
Add buffer.transcode(source, from, to) method. Primarily uses ICU to transcode a buffer's content from one of Node.js' supported encodings to another. Originally part of a proposal to add a new unicode module. Decided to refactor the approach towrds individual PRs without a new module. Refs: #8075 PR-URL: #9038 Reviewed-By: Anna Henningsen <anna@addaleax.net>
Rather than the pseudo-wcwidth impl used currently, use the ICU character properties database to calculate string width and determine if a character is full width or not. This allows the algorithm to correctly identify emoji's as full width, ensures the algorithm will continue to fucntion properly as new unicode codepoints are added, and it's faster. This was originally part of a proposal to add a new unicode module, but has been split out. Refs: #8075 PR-URL: #9040 Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl> Reviewed-By: Steven R Loomis <srloomis@us.ibm.com>
Add buffer.transcode(source, from, to) method. Primarily uses ICU to transcode a buffer's content from one of Node.js' supported encodings to another. Originally part of a proposal to add a new unicode module. Decided to refactor the approach towrds individual PRs without a new module. Refs: #8075 PR-URL: #9038 Reviewed-By: Anna Henningsen <anna@addaleax.net>
The Node.js enhancement proposal nodejs/node-eps#15 proposes the addition of a new
'unicode'
module that introduces a number of ICU backed APIs, including the ability to transcode a Buffer from one encoding to another. This is the prototype implementation of the new module.It is a work in progress. This is not intended to be landed at this time.
/cc @trevnorris @addaleax @srl295