-
-
Notifications
You must be signed in to change notification settings - Fork 30.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bpo-37966: Fully implement the UAX #15 quick-check algorithm. #15558
Changes from 3 commits
4025110
2a222da
26892d3
27e8122
3762787
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
The implementation of :func:`~unicodedata.is_normalized` has been greatly | ||
sped up on strings that aren't normalized, by implementing the full | ||
normalization-quick-check algorithm from the Unicode standard. |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -775,25 +775,42 @@ nfc_nfkc(PyObject *self, PyObject *input, int k) | |
return result; | ||
} | ||
|
||
typedef enum {YES, NO, MAYBE} NormalMode; | ||
|
||
/* Return YES if the input is certainly normalized, NO or MAYBE if it might not be. */ | ||
static NormalMode | ||
is_normalized(PyObject *self, PyObject *input, int nfc, int k) | ||
// This needs to match the logic in makeunicodedata.py | ||
// which constructs the quickcheck data. | ||
typedef enum {YES = 0, MAYBE = 1, NO = 2} QuickcheckResult; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah, good idea! Do you think it'd be cleaner to add that to this PR, or make a small separate one? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Could be done latter, I think. |
||
|
||
/* Run the Unicode normalization "quickcheck" algorithm. | ||
* | ||
* Return YES or NO if quickcheck determines the input is certainly | ||
* normalized or certainly not, and MAYBE if quickcheck is unable to | ||
* tell. | ||
* | ||
* If `yes_only` is true, then return MAYBE as soon as we determine | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, I'm not really a fan of this name; I'd be glad for a better one. The idea is that this flag says the caller doesn't care about NO vs. MAYBE; it only cares whether the answer is YES. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe invert the sense and call it There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The algorithm in the standard is the Another aspect is that there is an asymmetry between a definite YES and a definite NO. In the More ideas just now: perhaps Or There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The API also surprised me. I suggest tho rename is_normalized_quickcheck() to is_normalized_impl() and add 2 functions: is_normalized() (yes_only=false) and is_normalized_quickcheck() (yes_only=true). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Hmm, I like the idea of covering over some of the multiplication of flags with wrapper functions! I don't like quite these names, though. First, I don't think What the function does do is implement a quick but partial check to try to see if the string is normalized. Specifically it implements one that's defined in the standard, which the standard calls More precisely, the yes_only=false case implements that standard quick-check algorithm; the yes_only=true (or yes_or_maybe=true, etc.) case (which is the version in master) is a variation on it. I think the name "quickcheck" or Here's another possible set of names to use with the same idea:
Or if those names feel too long: How does that sound? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Another idea for the flag name (though its name becomes less important if we use Victor's idea of a pair of wrapper functions): There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It doesn't seem like anything is standing out as dramatically better. At least the semantics are clearly documented and easily read from the function. So, let's go with the current state. |
||
* the answer is not YES. | ||
*/ | ||
static QuickcheckResult | ||
is_normalized_quickcheck(PyObject *self, PyObject *input, | ||
int nfc, int k, int yes_only) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. These "boolean int" parameters could be actual There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oh wow it's the future! 🙂 I guess that's a C99 feature, and one of those we now accept using. Good to know. I'm not sure I want to convert the existing two parameters within this PR; it seems like that could muddy the diff a bit with changes that are orthogonal to what it's really doing. But I guess I can see how it looks. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Pushed a change to I was a bit surprised to find that I needed to add an There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I made a quick further commit converting I'm ambivalent about adding it here. It expands the diff a bit -- was: There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What makes the diffstat larger? Aren't you already touching all callers, since you added a parameter? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think we want to avoid polluting public headers with a There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Ah, fair enough. I think that's consistent with putting it in a generic internal-only header, though. Within the tree, the very few existing uses are either stdbool, or (in
Hmm true. I guess IWYU doesn't feel like how this codebase is mostly written... but maybe it's best to make it more so. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Here's the extra commit: gnprice@896c970f5 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Okay, we can fix up the rest in a sequel. |
||
{ | ||
Py_ssize_t i, len; | ||
int kind; | ||
void *data; | ||
unsigned char prev_combining = 0, quickcheck_mask; | ||
/* This is an implementation of the following algorithm: | ||
https://www.unicode.org/reports/tr15/#Detecting_Normalization_Forms | ||
See there for background. | ||
*/ | ||
gnprice marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
/* An older version of the database is requested, quickchecks must be | ||
disabled. */ | ||
if (self && UCD_Check(self)) | ||
return NO; | ||
|
||
/* The two quickcheck bits at this shift mean 0=Yes, 1=Maybe, 2=No, | ||
as described in http://unicode.org/reports/tr15/#Annex8. */ | ||
quickcheck_mask = 3 << ((nfc ? 4 : 0) + (k ? 2 : 0)); | ||
Py_ssize_t i, len; | ||
int kind; | ||
void *data; | ||
unsigned char prev_combining = 0; | ||
|
||
/* The two quickcheck bits at this shift have type QuickcheckResult. */ | ||
int quickcheck_shift = (nfc ? 4 : 0) + (k ? 2 : 0); | ||
|
||
QuickcheckResult result = YES; /* certainly normalized, unless we find something */ | ||
|
||
i = 0; | ||
kind = PyUnicode_KIND(input); | ||
|
@@ -802,16 +819,26 @@ is_normalized(PyObject *self, PyObject *input, int nfc, int k) | |
while (i < len) { | ||
Py_UCS4 ch = PyUnicode_READ(kind, data, i++); | ||
const _PyUnicode_DatabaseRecord *record = _getrecord_ex(ch); | ||
unsigned char combining = record->combining; | ||
unsigned char quickcheck = record->normalization_quick_check; | ||
|
||
if (quickcheck & quickcheck_mask) | ||
return MAYBE; /* this string might need normalization */ | ||
unsigned char combining = record->combining; | ||
if (combining && prev_combining > combining) | ||
return NO; /* non-canonical sort order, not normalized */ | ||
prev_combining = combining; | ||
|
||
unsigned char quickcheck_whole = record->normalization_quick_check; | ||
if (yes_only) { | ||
if (quickcheck_whole & (3 << quickcheck_shift)) | ||
return MAYBE; | ||
} else { | ||
switch ((quickcheck_whole >> quickcheck_shift) & 3) { | ||
case NO: | ||
return NO; | ||
case MAYBE: | ||
result = MAYBE; /* this string might need normalization */ | ||
} | ||
} | ||
} | ||
return YES; /* certainly normalized */ | ||
return result; | ||
} | ||
|
||
/*[clinic input] | ||
|
@@ -844,7 +871,7 @@ unicodedata_UCD_is_normalized_impl(PyObject *self, PyObject *form, | |
PyObject *result; | ||
int nfc = 0; | ||
int k = 0; | ||
NormalMode m; | ||
QuickcheckResult m; | ||
|
||
PyObject *cmp; | ||
int match = 0; | ||
|
@@ -867,7 +894,7 @@ unicodedata_UCD_is_normalized_impl(PyObject *self, PyObject *form, | |
return NULL; | ||
} | ||
|
||
m = is_normalized(self, input, nfc, k); | ||
m = is_normalized_quickcheck(self, input, nfc, k, 0); | ||
|
||
if (m == MAYBE) { | ||
cmp = (nfc ? nfc_nfkc : nfd_nfkd)(self, input, k); | ||
|
@@ -913,28 +940,28 @@ unicodedata_UCD_normalize_impl(PyObject *self, PyObject *form, | |
} | ||
|
||
if (_PyUnicode_EqualToASCIIId(form, &PyId_NFC)) { | ||
if (is_normalized(self, input, 1, 0) == YES) { | ||
if (is_normalized_quickcheck(self, input, 1, 0, 1) == YES) { | ||
Py_INCREF(input); | ||
return input; | ||
} | ||
return nfc_nfkc(self, input, 0); | ||
} | ||
if (_PyUnicode_EqualToASCIIId(form, &PyId_NFKC)) { | ||
if (is_normalized(self, input, 1, 1) == YES) { | ||
if (is_normalized_quickcheck(self, input, 1, 1, 1) == YES) { | ||
Py_INCREF(input); | ||
return input; | ||
} | ||
return nfc_nfkc(self, input, 1); | ||
} | ||
if (_PyUnicode_EqualToASCIIId(form, &PyId_NFD)) { | ||
if (is_normalized(self, input, 0, 0) == YES) { | ||
if (is_normalized_quickcheck(self, input, 0, 0, 1) == YES) { | ||
Py_INCREF(input); | ||
return input; | ||
} | ||
return nfd_nfkd(self, input, 0); | ||
} | ||
if (_PyUnicode_EqualToASCIIId(form, &PyId_NFKD)) { | ||
if (is_normalized(self, input, 0, 1) == YES) { | ||
if (is_normalized_quickcheck(self, input, 0, 1, 1) == YES) { | ||
Py_INCREF(input); | ||
return input; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ugh, I would support merging
test_normalization
into this file for clarity.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah; as it is, it actually caused me to think this function had evaded being tested at all -- I'd gone as far as to write up some tests myself before I spotted that other file.
A bit of context on someone's thinking in having a separate file can be found above at line 178:
I don't see why that means the test code should be in a separate file, though. There's already a try/skip mechanism to deal with the external file being unavailable. (I'm guessing that if I looked in the history I'd find that that mechanism wasn't there when the separate file was first added.)
Happy to send a PR to fold that file in.