Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

buffer: add buffer.transcode #9038

Closed
wants to merge 1 commit into from
Closed

Conversation

jasnell
Copy link
Member

@jasnell jasnell commented Oct 11, 2016

Checklist
  • make -j8 test (UNIX), or vcbuild test nosign (Windows) passes
  • tests and/or benchmarks are included
  • documentation is changed or added
  • commit message follows commit guidelines
Affected core subsystem(s)

buffer

Description of change

Add buffer.transcode(source, from, to) method. Primarily uses ICU to transcode a buffer's content from one of Node.js' supported encodings to another.

Originally part of a proposal to add a new unicode module. Decided to refactor the approach towrds individual PRs without a new module.

Refs: #8075
/cc @trevnorris @addaleax

@jasnell jasnell added the buffer Issues and PRs related to the buffer subsystem. label Oct 11, 2016
@nodejs-github-bot nodejs-github-bot added build Issues and PRs related to build files or the CI. c++ Issues and PRs that require attention from people who are familiar with C++. tools Issues and PRs related to the tools directory. labels Oct 11, 2016
@jasnell
Copy link
Member Author

jasnell commented Oct 11, 2016

@jasnell jasnell added the semver-minor PRs that contain new features and should be released in the next minor version. label Oct 11, 2016
const Buffer = require('buffer').Buffer;
const normalizeEncoding = require('internal/util').normalizeEncoding;

if (process.binding('config').hasIntl) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about possibly placing this in lib/internal/buffer-transcode.js and conditionally require()'ing it. purely cosmetic though, to prevent an extra level of indent. or you could just return early. :)

if (!process.binding('config').hasIntl)
  return;

// Buffer instance.
exports.transcode = function transcode(source, from_enc, to_enc) {
if (!source || !(source.buffer instanceof ArrayBuffer))
throw new TypeError('"source" argument must be a Buffer');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We going to have a complaint about not supporting SharedArrayBuffer for this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eventually, perhaps. Not too worried about that for now.

@jasnell
Copy link
Member Author

jasnell commented Oct 12, 2016

Updated

Returns a new `Buffer` instance.

Throws if transcoding is not possible or if one of the specified encodings is
invalid or unknown.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe give an example of transcoding is not possible?

return e;
}

#define THROW_ICU_ERROR(env, status, msg) \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm – this could be a ThrowICUError function, right?


MaybeLocal<Object> AsBuffer(Isolate* isolate,
MaybeStackBuffer<char>* buf,
size_t len) {
Copy link
Member

@addaleax addaleax Oct 11, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine but at some point this might become a member of the MaybeStackBuffer class? I realize that would conflict a bit with the MaybeStackBuffer<UChar> overload, maybe leave a TODO here?

const icu = process.binding('icu');

// Maps the supported transcoding conversions. The top key is the from_enc,
// the child key is the to_enc. The value is the transcoding function to.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that final to residue from editing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, slight brain malfunction there I think ;-)

return source.toString('base64');
},
'hex': (source) => {
return source.toString('hex');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mhhh this returns a string or a Buffer depending on the target encoding? I don’t think binary-to-text encodings should be allowed here, .toString() is the right method for them.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, you're right. I'll pull these back out.

@jasnell
Copy link
Member Author

jasnell commented Oct 12, 2016

@addaleax ... updated! PTAL

Copy link
Member

@addaleax addaleax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, nice!

const uint8_t hi = static_cast<uint8_t>(ts_obj_data[n + 0]);
const uint8_t lo = static_cast<uint8_t>(ts_obj_data[n + 1]);
swapspace[i] = (hi << 8) | lo;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this use SwapBytes16?

const uint8_t hi = static_cast<uint8_t>(ts_obj_data[n + 0]);
const uint8_t lo = static_cast<uint8_t>(ts_obj_data[n + 1]);
swapspace[i] = (hi << 8) | lo;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(ditto)

MaybeStackBuffer<char> buf;
int32_t len;

u_strToUTF8(*buf, 1024, &len, source, length, &status);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 1024 seem kind of magic here, although I realize that is largely my fault. 😄 (Not sure if there’s anything to do about that)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be fixed now!


// Transcodes the Buffer from one encoding to another, returning a new
// Buffer instance.
exports.transcode = function transcode(source, from_enc, to_enc) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style: s/from_enc/fromEncoding/ and s/to_enc/toEncoding/. Ditto for cnv_from and cnv_to.

msg = "Unspecified ICU Exception";
Local<String> cons =
String::Concat(estring, FIXED_ONE_BYTE_STRING(env->isolate(), ", "));
cons = String::Concat(cons, OneByteString(env->isolate(), msg));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize you adapted this code from elsewhere but using snprintf() to format the error message will be much more efficient.

String::Concat(estring, FIXED_ONE_BYTE_STRING(env->isolate(), ", "));
cons = String::Concat(cons, OneByteString(env->isolate(), msg));

Local<Value> e = Exception::Error(cons);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs a if (e.Empty()) return Local<Value>();.

size_t len) {
if (buf->IsAllocated()) {
MaybeLocal<Object> ret = Buffer::New(isolate, buf->out(), len);
buf->Release();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be if (!ret.Empty()) buf->Release(); - it's leaking memory now when the buffer can't be created.

if (buf->IsAllocated()) {
MaybeLocal<Object> ret =
Buffer::New(isolate, reinterpret_cast<char*>(buf->out()), len);
buf->Release();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

UChar* source = nullptr;
MaybeStackBuffer<UChar> swapspace;
if (IsLittleEndian()) {
source = reinterpret_cast<UChar*>(ts_obj_data);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above: strict aliasing violation and prone to crashing.

} else {
ThrowICUError(env, status, "Unable to transcode buffer");
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like there is ample opportunity to share code between Ucs2FromUtf8 and Utf8FromUcs2, they are 80% identical.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps. For now I'm more inclined to keep these separate as it makes finding and tweaking bugs a bit easier. I'll take another pass in a separate PR to condense things down.

}

void Release() {
buf_ = buf_st_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this also reset length_?

return env->ThrowTypeError("argument should be a Buffer"); \
} while (0)

#define SPREAD_ARG(val, name) \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you move this into a common header, you might want to give it a slightly less generic name; e.g. SPREAD_BUFFER_ARG.

@@ -21,7 +21,7 @@
'toolsets': [ 'target' ],
'direct_dependent_settings': {
'defines': [
'UCONFIG_NO_CONVERSION=1',
#'UCONFIG_NO_CONVERSION=1',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just remove the 'defines' block instead of commenting it out.

@jasnell jasnell force-pushed the buffer-transcode branch 3 times, most recently from 9e0464b to 2da087e Compare October 13, 2016 15:11
@jasnell
Copy link
Member Author

jasnell commented Oct 13, 2016

@jasnell
Copy link
Member Author

jasnell commented Oct 13, 2016

Another CI run after cleanups: https://ci.nodejs.org/job/node-test-pull-request/4507/ ... that last run was less than successful....
Trying another: https://ci.nodejs.org/job/node-test-pull-request/4508/

@jasnell
Copy link
Member Author

jasnell commented Oct 13, 2016

CI looks good. @bnoordhuis PTAL... LGTY?

@jasnell
Copy link
Member Author

jasnell commented Oct 17, 2016

ping @bnoordhuis

buf, v8::NewStringType::kNormal,
len).ToLocalChecked());
if (e.IsEmpty())
return Local<Value>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you're returning empty handles as an indicator it's probably be more "V8-ish" have the return signature as a MaybeLocal<Value> instead. been trying to do that in other locations myself.

char buf[kStorageSize];
int len = snprintf(buf, sizeof(buf), "%s [%s]", msg, u_errorName(status));

Local<Value> e =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

creating Local<Value>'s but there's no HandleScope. if this callback is expected to always be called within an existing HandleScope (like MakeCallback), mind putting a comment at the top. also like MakeCallback (see src/node.h).

len).ToLocalChecked());
if (e.IsEmpty())
return Local<Value>();
Local<Object> obj = e->ToObject(env->isolate());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we know this is a v8::Object then can use e.As<Object>(). that's also more explicit that no extra handle is being created.

obj->Set(env->code_string(),
String::NewFromUtf8(env->isolate(),
u_errorName(status), v8::NewStringType::kNormal)
.ToLocalChecked());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new v8::Maybe<T> API for v8::Object::Set() is annoying and ugly, but if if we're going to use some of the new API might as well use all of it.

if (!ret.IsEmpty()) buf->Release();
return ret;
}
return Buffer::Copy(isolate, dst, len);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you're manipulating the original memory, why bother take a copy?

if (U_SUCCESS(status)) {
len = target - *buf;
args.GetReturnValue().Set(
AsBuffer(env->isolate(), &buf, len).ToLocalChecked());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this operation fails, do we want to abort or throw?

for (size_t n = 0, i = 0; i < length; n += 2, i += 1) {
const uint8_t hi = static_cast<uint8_t>(ts_obj_data[n + 0]);
const uint8_t lo = static_cast<uint8_t>(ts_obj_data[n + 1]);
swapspace[i] = (lo << 8) | hi;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for future performance enhancement, detect the alignment of the pointer and perform as many swaps that can be done in a single go.

const conversions = {
'ascii': {
'latin1': (source) => {
return Buffer.from(source);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm a little confused by this whole object, but right here if we're converting ascii to latin1 shouldn't we be passing 'latin1' as the encoding argument to Buffer.from()?

@jasnell jasnell force-pushed the buffer-transcode branch 3 times, most recently from 6f70820 to 65144bb Compare October 18, 2016 21:43
@jasnell
Copy link
Member Author

jasnell commented Oct 18, 2016

@bnoordhuis ... ok, reworked the implementation with an eye towards simplification and reducing duplication. PTAL
@trevnorris and @addaleax ... if I could trouble each of you to take another look also, I'd appreciate it.

@jasnell
Copy link
Member Author

jasnell commented Oct 21, 2016

ping @bnoordhuis @trevnorris @addaleax

@jasnell
Copy link
Member Author

jasnell commented Oct 24, 2016

Thank you for the follow up review @addaleax. Will get this landed tomorrow if there are no further objections.

Add buffer.transcode(source, from, to) method. Primarily uses ICU
to transcode a buffer's content from one of Node.js' supported
encodings to another.

Originally part of a proposal to add a new unicode module. Decided
to refactor the approach towrds individual PRs without a new module.

Refs: nodejs#8075
@jasnell
Copy link
Member Author

jasnell commented Oct 25, 2016

New CI run after squashing: https://ci.nodejs.org/job/node-test-pull-request/4665/

@jasnell
Copy link
Member Author

jasnell commented Oct 25, 2016

green except for unrelated failures. landing

@jasnell jasnell dismissed bnoordhuis’s stale review October 25, 2016 17:11

PR updated after view

jasnell added a commit that referenced this pull request Oct 25, 2016
Add buffer.transcode(source, from, to) method. Primarily uses ICU
to transcode a buffer's content from one of Node.js' supported
encodings to another.

Originally part of a proposal to add a new unicode module. Decided
to refactor the approach towrds individual PRs without a new module.

Refs: #8075
PR-URL: #9038
Reviewed-By: Anna Henningsen <anna@addaleax.net>
@jasnell
Copy link
Member Author

jasnell commented Oct 25, 2016

Landed in e8eaaa7

@jasnell jasnell closed this Oct 25, 2016
@srl295 srl295 mentioned this pull request Oct 25, 2016
4 tasks
evanlucas pushed a commit that referenced this pull request Nov 3, 2016
Add buffer.transcode(source, from, to) method. Primarily uses ICU
to transcode a buffer's content from one of Node.js' supported
encodings to another.

Originally part of a proposal to add a new unicode module. Decided
to refactor the approach towrds individual PRs without a new module.

Refs: #8075
PR-URL: #9038
Reviewed-By: Anna Henningsen <anna@addaleax.net>
@addaleax
Copy link
Member

If this is backported to any of the other release lines, it needs to come with #9838

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
buffer Issues and PRs related to the buffer subsystem. build Issues and PRs related to build files or the CI. c++ Issues and PRs that require attention from people who are familiar with C++. semver-minor PRs that contain new features and should be released in the next minor version. tools Issues and PRs related to the tools directory.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants