Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

src: allow simdutf::convert_* functions to return zero #47471

Merged
merged 6 commits into from
Apr 9, 2023
Merged

src: allow simdutf::convert_* functions to return zero #47471

merged 6 commits into from
Apr 9, 2023

Conversation

lemire
Copy link
Member

@lemire lemire commented Apr 7, 2023

When transcoding with the simdutf library, you first scan the input to determine the size of the output (e.g., you scan the UTF-8 input to determine the size of the UTF-16 output). In a second step, you call a transcoding function. This transcoding function normally returns how many words were written. This number of words should match the size of the output computed during the first scan.

So you get three-line routines like as follow (scan, allocate, transcode):

size_t expected_utf16_length =
        simdutf::utf16_length_from_utf8(string.data(), string.length());
MaybeStackBuffer<char16_t> buffer(expected_utf16_length);
size_t utf16_length = simdutf::convert_utf8_to_utf16(
        string.data(), string.length(), buffer.out());

The scan to determine the size of the output does not validate the Unicode input: the validation occurs during the transcoding. For performance purposes, it will only seek to tell you how much memory you need to allocate, counting on the transcoding step to do the validation.

When the transcoding fails, the simdutf::convert_utf8_to_utf16 and simdutf::convert_utf16_to_utf8 functions return zero by convention, indicating an error. So you either have a successful transcoding (from valid Unicode to valid Unicode) in which case the transcoding function returns the number of written words, which matches exactly the expected number of output words, or you get zero, indicating that the input is invalid Unicode.

Currently, the simdutf library is used within src/inspector/node_string.cc with checks such as CHECK_EQ(expected_utf16_length, utf16_length);. In effect, these checks are true if and only if the inputs are valid Unicode. That should almost always be the case within Node. However, @danpeixoto reports that the check fail in their case, see #47457

I cannot reproduce @danpeixoto's issue. See my comments on the issue. Nevertheless, it seems warranted to make the code more robust in case we do have bad Unicode inputs.

This is what this PR does: it checks whether the transcoding functions return 0, and if it does, then it assumes that the input was invalid.

By convention, the routines return the empty string or a null, when the input was invalid. This could be changed to some other convention.

@nodejs-github-bot nodejs-github-bot added c++ Issues and PRs that require attention from people who are familiar with C++. needs-ci PRs that need a full CI run. labels Apr 7, 2023
@lemire lemire changed the title src: allow simdutf::convert_* functions to return zero in case of invalid unicode inputs src: allow simdutf::convert_* functions to return zero Apr 7, 2023
Copy link
Member

@Trott Trott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we write a test that fails with the current main branch but passes with this change?

@Trott
Copy link
Member

Trott commented Apr 7, 2023

Should we write a test that fails with the current main branch but passes with this change?

I guess the answer depends on what is found once someone is able to reproduce #47457 and find the source of the invalid UTF-8.

@lemire
Copy link
Member Author

lemire commented Apr 7, 2023

Should we write a test that fails with the current main branch

Such a test would trigger a CHECK_EQ in the current node and thus it would abort in the current node, but continue with this PR.

It seems that the current tests, those that stress the functions in src/inspector/node_string.cc, assume that the input is valid Unicode. For all I know, we want exactly this assumption and my PR is undesirable.

It is possible that in issue #47457, we want the user to come to node and complain, rather than hide the issue as the current PR would do.

src/inspector/node_string.cc Outdated Show resolved Hide resolved
src/inspector/node_string.cc Outdated Show resolved Hide resolved
lemire and others added 2 commits April 7, 2023 18:24
Co-authored-by: Yagiz Nizipli <yagiz@nizipli.com>
Co-authored-by: Yagiz Nizipli <yagiz@nizipli.com>
@anonrig anonrig added author ready PRs that have at least one approval, no pending requests for changes, and a CI started. request-ci Add this label to start a Jenkins CI on a PR. commit-queue-squash Add this label to instruct the Commit Queue to squash all the PR commits into the first one. labels Apr 8, 2023
@github-actions github-actions bot removed the request-ci Add this label to start a Jenkins CI on a PR. label Apr 8, 2023
@nodejs-github-bot
Copy link
Collaborator

@nodejs-github-bot
Copy link
Collaborator

@Trott Trott added the commit-queue Add this label to land a pull request using GitHub Actions. label Apr 8, 2023
@nodejs-github-bot nodejs-github-bot removed the commit-queue Add this label to land a pull request using GitHub Actions. label Apr 9, 2023
@nodejs-github-bot nodejs-github-bot merged commit 63ee335 into nodejs:main Apr 9, 2023
@nodejs-github-bot
Copy link
Collaborator

Landed in 63ee335

RafaelGSS pushed a commit that referenced this pull request Apr 13, 2023
PR-URL: #47471
Reviewed-By: Rich Trott <rtrott@gmail.com>
Reviewed-By: Yagiz Nizipli <yagiz@nizipli.com>
Reviewed-By: Luigi Pinca <luigipinca@gmail.com>
danielleadams pushed a commit that referenced this pull request Jul 6, 2023
PR-URL: #47471
Reviewed-By: Rich Trott <rtrott@gmail.com>
Reviewed-By: Yagiz Nizipli <yagiz@nizipli.com>
Reviewed-By: Luigi Pinca <luigipinca@gmail.com>
MoLow pushed a commit to MoLow/node that referenced this pull request Jul 6, 2023
PR-URL: nodejs#47471
Reviewed-By: Rich Trott <rtrott@gmail.com>
Reviewed-By: Yagiz Nizipli <yagiz@nizipli.com>
Reviewed-By: Luigi Pinca <luigipinca@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
author ready PRs that have at least one approval, no pending requests for changes, and a CI started. c++ Issues and PRs that require attention from people who are familiar with C++. commit-queue-squash Add this label to instruct the Commit Queue to squash all the PR commits into the first one. needs-ci PRs that need a full CI run.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants