Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpo-37966: Fully implement the UAX #15 quick-check algorithm. #15558

Merged
merged 5 commits into from
Sep 4, 2019
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 3 additions & 2 deletions Doc/whatsnew/3.8.rst
Original file line number Diff line number Diff line change
Expand Up @@ -1090,8 +1090,9 @@ unicodedata
<http://blog.unicode.org/2019/05/unicode-12-1-en.html>`_ release.

* New function :func:`~unicodedata.is_normalized` can be used to verify a string
is in a specific normal form. (Contributed by Max Belanger and David Euresti in
:issue:`32285`).
is in a specific normal form, often much faster than by actually normalizing
the string. (Contributed by Max Belanger, David Euresti, and Greg Price in
:issue:`32285` and :issue:`37966`).


unittest
Expand Down
2 changes: 2 additions & 0 deletions Lib/test/test_unicodedata.py
Original file line number Diff line number Diff line change
Expand Up @@ -208,6 +208,8 @@ def test_issue29456(self):
self.assertEqual(self.db.normalize('NFC', u11a7_str_a), u11a7_str_b)
self.assertEqual(self.db.normalize('NFC', u11c3_str_a), u11c3_str_b)

# For tests of unicodedata.is_normalized / self.db.is_normalized ,
# see test_normalization.py .
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ugh, I would support merging test_normalization into this file for clarity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah; as it is, it actually caused me to think this function had evaded being tested at all -- I'd gone as far as to write up some tests myself before I spotted that other file.

A bit of context on someone's thinking in having a separate file can be found above at line 178:

        # The rest can be found in test_normalization.py
        # which requires an external file.

I don't see why that means the test code should be in a separate file, though. There's already a try/skip mechanism to deal with the external file being unavailable. (I'm guessing that if I looked in the history I'd find that that mechanism wasn't there when the separate file was first added.)

Happy to send a PR to fold that file in.


def test_east_asian_width(self):
eaw = self.db.east_asian_width
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
The implementation of :func:`~unicodedata.is_normalized` has been greatly
sped up on strings that aren't normalized, by implementing the full
normalization-quick-check algorithm from the Unicode standard.
73 changes: 50 additions & 23 deletions Modules/unicodedata.c
Original file line number Diff line number Diff line change
Expand Up @@ -775,25 +775,42 @@ nfc_nfkc(PyObject *self, PyObject *input, int k)
return result;
}

typedef enum {YES, NO, MAYBE} NormalMode;

/* Return YES if the input is certainly normalized, NO or MAYBE if it might not be. */
static NormalMode
is_normalized(PyObject *self, PyObject *input, int nfc, int k)
// This needs to match the logic in makeunicodedata.py
// which constructs the quickcheck data.
typedef enum {YES = 0, MAYBE = 1, NO = 2} QuickcheckResult;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe makeunicodedata.py should output this enum (with better name namespacing) then?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good idea!

Do you think it'd be cleaner to add that to this PR, or make a small separate one?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be done latter, I think.


/* Run the Unicode normalization "quickcheck" algorithm.
*
* Return YES or NO if quickcheck determines the input is certainly
* normalized or certainly not, and MAYBE if quickcheck is unable to
* tell.
*
* If `yes_only` is true, then return MAYBE as soon as we determine
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes_only is a bizarre name for a parameter that causes the function to return MAYBE.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'm not really a fan of this name; I'd be glad for a better one.

The idea is that this flag says the caller doesn't care about NO vs. MAYBE; it only cares whether the answer is YES.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe invert the sense and call it want_definite meaning "process the string until we're sure it's not normalized or we reach the end"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The algorithm in the standard is the yes_only = false (or want_definite = true) case, so I think things are simplest to describe if that's the baseline and the other case is described by contrast with it. Could certainly handle that with some added words, though.

Another aspect is that there is an asymmetry between a definite YES and a definite NO. In the yes_only / !want_definite case, what the caller really most wants is a definite YES... it's just a definite NO that is no more helpful than a MAYBE. Or another angle on that: if the caller really didn't care about getting a definite answer, it could just assume MAYBE and not bother calling the helper 😉 It's in the hope of getting a YES that it makes the call.

More ideas just now: perhaps maybe_is_no, to mean "a MAYBE is no worse than a NO"?

Or yes_or_maybe, for "there are only two distinct answers, YES and MAYBE". That's basically what I was going for with yes_only, but probably a less paradoxical way of putting it 🙂

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The API also surprised me. I suggest tho rename is_normalized_quickcheck() to is_normalized_impl() and add 2 functions: is_normalized() (yes_only=false) and is_normalized_quickcheck() (yes_only=true).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to rename is_normalized_quickcheck() to is_normalized_impl() and add 2 functions: is_normalized() (yes_only=false) and is_normalized_quickcheck() (yes_only=true).

Hmm, I like the idea of covering over some of the multiplication of flags with wrapper functions!

I don't like quite these names, though. First, I don't think is_normalized is a good name, for either value of the flag, because the function doesn't actually take full responsibility for finding out if the string is normalized -- it can return MAYBE, and then it's up to the caller to go and normalize the string and compare.

What the function does do is implement a quick but partial check to try to see if the string is normalized. Specifically it implements one that's defined in the standard, which the standard calls quickCheck in its sample code, and which uses a property in the UCD called Quick_Check.

More precisely, the yes_only=false case implements that standard quick-check algorithm; the yes_only=true (or yes_or_maybe=true, etc.) case (which is the version in master) is a variation on it. I think the name "quickcheck" or is_normalized_quickcheck, with no further qualifications, sounds like it will mean the standard's algorithm and it's confusing if it means a different algorithm instead.

Here's another possible set of names to use with the same idea:

  • rename this function to is_normalized_quickcheck_impl
  • is_normalized_quickcheck is the standard's quickcheck (yes_only/yes_or_maybe=false)
  • is_normalized_quickercheck is the variant that's more impatient to return and hand responsibility back to the caller (yes_or_maybe=true)
    • Or is_normalized_quickcheck_quicker

Or if those names feel too long: is_normalized_qc_impl, is_normalized_qc, is_normalized_qc_quicker.

How does that sound?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or yes_or_maybe, for "there are only two distinct answers, YES and MAYBE".

Another idea for the flag name (though its name becomes less important if we use Victor's idea of a pair of wrapper functions): yes_vs_maybe.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't seem like anything is standing out as dramatically better. At least the semantics are clearly documented and easily read from the function.

So, let's go with the current state.

* the answer is not YES.
*/
static QuickcheckResult
is_normalized_quickcheck(PyObject *self, PyObject *input,
int nfc, int k, int yes_only)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These "boolean int" parameters could be actual bools.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wow it's the future! 🙂 I guess that's a C99 feature, and one of those we now accept using. Good to know.

I'm not sure I want to convert the existing two parameters within this PR; it seems like that could muddy the diff a bit with changes that are orthogonal to what it's really doing. But I guess I can see how it looks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pushed a change to bool for the new parameter.

I was a bit surprised to find that I needed to add an #include <stdbool.h> line to the file! Perhaps that should be in a central include file, in the same way as limits.h and inttypes.h are. That's certainly orthogonal to this PR, though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a quick further commit converting nfc and k to bool too (including in the callers, and using true/false rather than 1/0 for literal values.)

I'm ambivalent about adding it here. It expands the diff a bit -- was:
1 file changed, 51 insertions(+), 24 deletions(-)
and becomes:
1 file changed, 57 insertions(+), 30 deletions(-)
Slightly inclined to leave it out, but if you'd rather include it I'll do that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What makes the diffstat larger? Aren't you already touching all callers, since you added a parameter?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want to avoid polluting public headers with a stdbool.h include—bool is a bit of a name grab in arbitrary mature C codebases. Using stdbool.h where it's needed is also consistent with IWYU.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want to avoid polluting public headers with a stdbool.h include—bool is a bit of a name grab in arbitrary mature C codebases.

Ah, fair enough. I think that's consistent with putting it in a generic internal-only header, though. Within the tree, the very few existing uses are either stdbool, or (in Modules/expat/xmltok.c) a sort of manual stdbool.

Using stdbool.h where it's needed is also consistent with IWYU.

Hmm true. I guess IWYU doesn't feel like how this codebase is mostly written... but maybe it's best to make it more so.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What makes the diffstat larger? Aren't you already touching all callers, since you added a parameter?

Here's the extra commit: gnprice@896c970f5
The lines not otherwise touched are where we declare and then fiddle with the nfc and k locals in unicodedata_UCD_is_normalized_impl.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, we can fix up the rest in a sequel.

{
Py_ssize_t i, len;
int kind;
void *data;
unsigned char prev_combining = 0, quickcheck_mask;
/* This is an implementation of the following algorithm:
https://www.unicode.org/reports/tr15/#Detecting_Normalization_Forms
See there for background.
*/
gnprice marked this conversation as resolved.
Show resolved Hide resolved

/* An older version of the database is requested, quickchecks must be
disabled. */
if (self && UCD_Check(self))
return NO;

/* The two quickcheck bits at this shift mean 0=Yes, 1=Maybe, 2=No,
as described in http://unicode.org/reports/tr15/#Annex8. */
quickcheck_mask = 3 << ((nfc ? 4 : 0) + (k ? 2 : 0));
Py_ssize_t i, len;
int kind;
void *data;
unsigned char prev_combining = 0;

/* The two quickcheck bits at this shift have type QuickcheckResult. */
int quickcheck_shift = (nfc ? 4 : 0) + (k ? 2 : 0);

QuickcheckResult result = YES; /* certainly normalized, unless we find something */

i = 0;
kind = PyUnicode_KIND(input);
Expand All @@ -802,16 +819,26 @@ is_normalized(PyObject *self, PyObject *input, int nfc, int k)
while (i < len) {
Py_UCS4 ch = PyUnicode_READ(kind, data, i++);
const _PyUnicode_DatabaseRecord *record = _getrecord_ex(ch);
unsigned char combining = record->combining;
unsigned char quickcheck = record->normalization_quick_check;

if (quickcheck & quickcheck_mask)
return MAYBE; /* this string might need normalization */
unsigned char combining = record->combining;
if (combining && prev_combining > combining)
return NO; /* non-canonical sort order, not normalized */
prev_combining = combining;

unsigned char quickcheck_whole = record->normalization_quick_check;
if (yes_only) {
if (quickcheck_whole & (3 << quickcheck_shift))
return MAYBE;
} else {
switch ((quickcheck_whole >> quickcheck_shift) & 3) {
case NO:
return NO;
case MAYBE:
result = MAYBE; /* this string might need normalization */
}
}
}
return YES; /* certainly normalized */
return result;
}

/*[clinic input]
Expand Down Expand Up @@ -844,7 +871,7 @@ unicodedata_UCD_is_normalized_impl(PyObject *self, PyObject *form,
PyObject *result;
int nfc = 0;
int k = 0;
NormalMode m;
QuickcheckResult m;

PyObject *cmp;
int match = 0;
Expand All @@ -867,7 +894,7 @@ unicodedata_UCD_is_normalized_impl(PyObject *self, PyObject *form,
return NULL;
}

m = is_normalized(self, input, nfc, k);
m = is_normalized_quickcheck(self, input, nfc, k, 0);

if (m == MAYBE) {
cmp = (nfc ? nfc_nfkc : nfd_nfkd)(self, input, k);
Expand Down Expand Up @@ -913,28 +940,28 @@ unicodedata_UCD_normalize_impl(PyObject *self, PyObject *form,
}

if (_PyUnicode_EqualToASCIIId(form, &PyId_NFC)) {
if (is_normalized(self, input, 1, 0) == YES) {
if (is_normalized_quickcheck(self, input, 1, 0, 1) == YES) {
Py_INCREF(input);
return input;
}
return nfc_nfkc(self, input, 0);
}
if (_PyUnicode_EqualToASCIIId(form, &PyId_NFKC)) {
if (is_normalized(self, input, 1, 1) == YES) {
if (is_normalized_quickcheck(self, input, 1, 1, 1) == YES) {
Py_INCREF(input);
return input;
}
return nfc_nfkc(self, input, 1);
}
if (_PyUnicode_EqualToASCIIId(form, &PyId_NFD)) {
if (is_normalized(self, input, 0, 0) == YES) {
if (is_normalized_quickcheck(self, input, 0, 0, 1) == YES) {
Py_INCREF(input);
return input;
}
return nfd_nfkd(self, input, 0);
}
if (_PyUnicode_EqualToASCIIId(form, &PyId_NFKD)) {
if (is_normalized(self, input, 0, 1) == YES) {
if (is_normalized_quickcheck(self, input, 0, 1, 1) == YES) {
Py_INCREF(input);
return input;
}
Expand Down