-
Notifications
You must be signed in to change notification settings - Fork 30k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
wip,src: add utf8 consumer/validator #1319
wip,src: add utf8 consumer/validator #1319
Conversation
#define UNI_SUR_HIGH_START (uint32_t) 0xD800 | ||
#define UNI_SUR_LOW_END (uint32_t) 0xDFFF | ||
#define UNI_REPLACEMENT_CHAR (uint32_t) 0x0000FFFD | ||
#define UNI_MAX_LEGAL_UTF32 (uint32_t) 0x0010FFFF |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style: no C-style casts (here and elsewhere.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. Sorry about that, missed while porting.
return false; | ||
} | ||
case 1: | ||
if (*input >= 0x80 && *input < 0xC2) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
0xC2? Shouldn't that be 0xC0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm. I'll look into that – the original version has it as 0xC2.
The msvc "equivalent" is Edit: you already knew :) sorry |
while (idx < length) { | ||
size_t advance = 0; | ||
uint32_t glyph = 0; | ||
uint8_t extrabytes = (uint8_t)clz(~(static_cast<int>(input[idx])<<24)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what you want here can be implemented portable and reasonably efficient using the following:
inline uint32_t log2(uint8_t v) {
const uint32_t r = (v > 15) << 2;
v >>= r;
const uint32_t s = (v > 3) << 1;
v >>= s;
v >>= 1;
return r | s | v;
}
inline uint32_t clz(uint8_t v) {
// clz(0) == 7. Add a zero check if that's an issue.
return 7 - log2(v);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Forgot to mention, the behavior of static_cast<int>(...) << 24
is implementation-defined on platforms where ints are 32 bits (i.e. all of them.) You're not allowed to shift values into the sign bit.
} else { | ||
switch (extrabytes) { | ||
case 5: | ||
glyph += (uint8_t) input[i++]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this mask off the high bits? Also, when is extrabytes == 5
for valid UTF-8?
glyph -= offsets_from_utf8[extrabytes]; | ||
|
||
if (glyph > UNI_MAX_LEGAL_UTF32 || | ||
(glyph >= UNI_SUR_HIGH_START && glyph <= UNI_SUR_LOW_END)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Real Programmers(TM) write this as (glyph & 0x7800) != 0x5800
:-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Real Programmers(TM) write this as (glyph & 0x7800) != 0x5800 :-)
Compilers are pretty good in replacing idiomatic range comparisons with bit hacks. Have you checked whether it's really necessary to have these ... ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Har, har.
(But it's true that both clang and gcc manage to pull it off.)
Now that #1199 landed -- could we perhaps add unit tests for this? |
Nixed clz support, since a quick benchmark showed that @bnoordhuis' implementation was faster. |
inline size_t Skip( | ||
const size_t remaining, | ||
const uint8_t* input, | ||
const size_t glyph_size) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style issue: the first argument goes on the same line as the function name and the other arguments should line up below it. The only time that's deviated from is when the 80 column limit is exceeded.
I second @jbergstroem's sentiment on unit tests. :-) |
Also always interesting to run on an utf-8 decoder is the utf8 decoder stress test |
So I think I heard at some point v8 has a version of this for TypedArrays? Am I mistaken or will this soon be obsolete? |
@trevnorris does TypedArrays give us any sort of this functionality? |
Since this isn't used anywhere not even sure what it's for, but typed arrays don't have anything like this afaik. |
@chrisdickinson @trevnorris ... what the status on this one? |
Closing this due to inactivity. |
WIP. Opening this now to double check that this is headed in the right direction.
This is based on utf-8-validate. The primary differences are the use of
clz
(vs. a lookup table) to compute the number of extra bytes and the introduction of the error/glyph callbacks.The Utf8Consume function will iterate over valid glyphs, calling a provided
OnGlyph
callback and anOnError
callback as necessary. Provided error strategies are "Halt" and "Skip" – which either halts the consumer at the first error, or skips past them as appropriate.The strategies were added to accommodate the desire to build a utf8-to-utf16 translator as part of this work.