bpo-38043: Move unicodedata.normalize tests into test_unicodedata. #15712

gnprice · 2019-09-06T05:24:23Z

Having these in a separate file from the one that's named after the
module in the usual way makes it very easy to miss them when looking
for tests for these two functions.

(In fact when working recently on is_normalized, I'd been surprised to
see no tests for it here and concluded the function had evaded being
tested at all. I'd gone as far as to write up some tests myself
before I spotted this other file.)

Mostly this just means moving all the one file's code into the other,
and moving code from the module toplevel to inside the test class to
keep it tidily separate from the rest of the file's code.

There's one substantive change, which reduces by a bit the amount of
code to be moved: we drop the x > sys.maxunicode conditional and all
the RangeError logic behind it. Now if that condition ever occurs
it will cause an error at chr(x), and a test failure. That's the
right result because, since PEP 393 in Python 3.3, there is no longer
such a thing as an "unsupported character".

https://bugs.python.org/issue38043

Having these in a separate file from the one that's named after the module in the usual way makes it very easy to miss them when looking for tests for these two functions. (In fact when working recently on is_normalized, I'd been surprised to see no tests for it here and concluded the function had evaded being tested at all. I'd gone as far as to write up some tests myself before I spotted this other file.) Mostly this just means moving all the one file's code into the other, and moving code from the module toplevel to inside the test class to keep it tidily separate from the rest of the file's code. There's one substantive change, which reduces by a bit the amount of code to be moved: we drop the `x > sys.maxunicode` conditional and all the `RangeError` logic behind it. Now if that condition ever occurs it will cause an error at `chr(x)`, and a test failure. That's the right result because, since PEP 393 in Python 3.3, there is no longer such a thing as an "unsupported character".

aeros

Thanks for the PR @gnprice. As far as I'm aware, we typically don't require news entries for changes that exclusively involve the tests, so for now I've added the skip news label. I'll also CC the experts for unicodedata.

/cc @malemburg @ezio-melotti

aeros

I have a few minor suggestions that wouldn't be worth opening their own PR for, but might be worth addressing while we're moving the tests.

aeros · 2019-09-07T10:08:50Z

Lib/test/test_ucn.py

@@ -172,6 +171,9 @@ def test_named_sequences_sample(self):

    def test_named_sequences_full(self):
        # Check all the named sequences
+        def check_version(testfile):
+            hdr = testfile.readline()


Suggested change

hdr = testfile.readline()

header = testfile.readline()

Is "hdr" short for header here? I don't recall seeing this particular abbreviation used before, so it's not entirely clear to me. But, it seems to be the only thing that would make sense in this context. Unless this is a very common and well known abbreviation that I wasn't aware of, it doesn't seem to be worth saving the extra three characters. It looks like it's only used a total of 3 times within these tests, so changing the variable name shouldn't be an issue. Explicit > Implicit.

Yep, I'm pretty sure it's intended to mean "header". I'd write the full word in this context if I were writing this code new.

(General reason I've left it this way is in the top-level comment above.)

Fortunately because this variable is local to this small and simple bit of code, there isn't much the reader has to look at to see what it does. Even if the variable were named "x" or "var31", the code wouldn't be made that much harder to read. Maybe even easier, as you wouldn't be tempted to try to guess what the name meant 😉

aeros · 2019-09-07T10:11:23Z

Lib/test/test_unicodedata.py

+            self.assertTrue(unicodedata.is_normalized("NFD", c5))
+
+            self.assertTrue(unicodedata.is_normalized("NFKC", c4))
+            self.assertTrue(unicodedata.is_normalized("NFKD", c5))


Trivial nit: The amount of whitespace in this section seems a bit excessive. I would recommend condensing it down to something like this:

self.assertTrue(unicodedata.is_normalized("NFC", c2)) self.assertTrue(unicodedata.is_normalized("NFC", c4)) self.assertTrue(unicodedata.is_normalized("NFD", c3)) self.assertTrue(unicodedata.is_normalized("NFD", c5)) self.assertTrue(unicodedata.is_normalized("NFKC", c4)) self.assertTrue(unicodedata.is_normalized("NFKD", c5))

aeros · 2019-09-07T10:17:43Z

Lib/test/test_unicodedata.py

@@ -175,8 +176,7 @@ def test_normalize(self):
        self.assertRaises(TypeError, self.db.normalize)
        self.assertRaises(ValueError, self.db.normalize, 'unknown', 'xx')
        self.assertEqual(self.db.normalize('NFKC', ''), '')
-        # The rest can be found in test_normalization.py
-        # which requires an external file.
+        # See also comprehensive tests in NormalizationTest below.


Suggested change

# See also comprehensive tests in NormalizationTest below.

I agree with the removal of the previous comment since this has been condensed into one file, but I don't think the new comment is needed.

Not strictly necessary, but I think there's a possible confusion where you're looking through all these tests which test one function after another, and this fits right in, and so you might naturally infer that this is the test (or the main test) for this function, and not go looking for others. So I think a bit of explicitness remains helpful.

Not strictly necessary, but I think there's a possible confusion where you're looking through all these tests which test one function after another, and this fits right in, and so you might naturally infer that this is the test (or the main test) for this function, and not go looking for others.

Hmm, fair point. That could potentially be useful to point out to readers.

I'm not completely against this comment remaining, but I think the other two (especially "Perform tests" on line 370) should be removed. Comments that don't provide additional information or context end up adding unneeded noise/clutter to the code. Each individual unneeded comment isn't much of an issue, but it can definitely add up over time.

I would suggest completing the unification by putting this case in NormalizationCases (and renaming it test_edge_cases or something; that's what it seems to be about).

Sure -- done.

aeros · 2019-09-07T10:30:52Z

Lib/test/test_unicodedata.py

+                continue
+            c1,c2,c3,c4,c5 = [self.unistr(x) for x in line.split(';')[:-1]]
+
+            # Perform tests


Suggested change

# Perform tests

I don't see much of a purpose in this comment, it doesn't seem to provide any additional information or context for the assertions below it.

aeros · 2019-09-07T10:33:49Z

Lib/test/test_unicodedata.py

+        TESTDATAFILE = "NormalizationTest.txt"
+        TESTDATAURL = f"http://www.pythontest.net/unicode/{unicodedata.unidata_version}/{TESTDATAFILE}"
+
+        # Hit the exception early


Suggested change

# Hit the exception early

Similar to my other recommendation, this comment doesn't seem to provide any particularly useful information or context.

gnprice

Thanks @aeros167 !

I agree with most of your comments in that if I were writing this code new, or reviewing it when newly-written, I'd prefer the code be the way you suggest.

I've left things like a short local-variable name, and the couple of unnecessary small comments, the way they are because I've largely taken a policy of leaving the code unchanged wherever possible. That's because I think that's helpful in reducing the work needed to read and review the change.

Especially when moving a good-sized chunk of code around, if a reader wants to 100% check my work for themselves, they'll have to go comparing the deleted and added code bit by bit and spot where it differs -- then study each of those differences to understand what they do. (Tools like git diff --color-moved help in skimming past what didn't change, but when something did change you still have to study that.) In this case there were a few things that did change, as described in the PR description. But it helps if that's kept to a minimum.

gnprice · 2019-09-09T03:45:50Z

Lib/test/test_ucn.py

@@ -172,6 +171,9 @@ def test_named_sequences_sample(self):

    def test_named_sequences_full(self):
        # Check all the named sequences
+        def check_version(testfile):
+            hdr = testfile.readline()


Yep, I'm pretty sure it's intended to mean "header". I'd write the full word in this context if I were writing this code new.

(General reason I've left it this way is in the top-level comment above.)

Fortunately because this variable is local to this small and simple bit of code, there isn't much the reader has to look at to see what it does. Even if the variable were named "x" or "var31", the code wouldn't be made that much harder to read. Maybe even easier, as you wouldn't be tempted to try to guess what the name meant 😉

gnprice · 2019-09-09T03:52:07Z

Lib/test/test_unicodedata.py

@@ -175,8 +176,7 @@ def test_normalize(self):
        self.assertRaises(TypeError, self.db.normalize)
        self.assertRaises(ValueError, self.db.normalize, 'unknown', 'xx')
        self.assertEqual(self.db.normalize('NFKC', ''), '')
-        # The rest can be found in test_normalization.py
-        # which requires an external file.
+        # See also comprehensive tests in NormalizationTest below.


Not strictly necessary, but I think there's a possible confusion where you're looking through all these tests which test one function after another, and this fits right in, and so you might naturally infer that this is the test (or the main test) for this function, and not go looking for others. So I think a bit of explicitness remains helpful.

gnprice · 2019-09-09T04:25:03Z

(also for ease of cross-reference here's where @benjaminp originally suggested this change: #15558 (comment) )

aeros · 2019-09-09T07:13:26Z

@gnprice:

I've left things like a short local-variable name, and the couple of unnecessary small comments, the way they are because I've largely taken a policy of leaving the code unchanged wherever possible. That's because I think that's helpful in reducing the work needed to read and review the change.

I can definitely understand the logic here, the less changes involved in a PR the easier it is to review. However, I think some of the minor refactoring is needed over time. If everyone were to follow the mentality of leaving the surrounding code as unchanged as possible, these minor issues would add up to be more significant over time.

I wouldn't mind attempting to apply these minor changes in a separate PR, but it can be incredibly difficult to find a core developer with the time to review those types of changes by themselves. From a review perspective, it's usually easier to group them in with other changes. That's why I usually try to suggest for several of them to be addressed when a PR happens to be modifying (or moving in this case) the surrounding code.

It's a fine balance though, and I certainly don't expect all of my suggestions to make it to the final merged PR. As long as some of them are addressed, it provides a gradual improvement in the long term.

benjaminp

I think it's fine to submit a PR that does the minimal move because it eases review and clarifies version history.

benjaminp · 2019-09-09T10:32:05Z

Lib/test/test_unicodedata.py

@@ -175,8 +176,7 @@ def test_normalize(self):
        self.assertRaises(TypeError, self.db.normalize)
        self.assertRaises(ValueError, self.db.normalize, 'unknown', 'xx')
        self.assertEqual(self.db.normalize('NFKC', ''), '')
-        # The rest can be found in test_normalization.py
-        # which requires an external file.
+        # See also comprehensive tests in NormalizationTest below.


I would suggest completing the unification by putting this case in NormalizationCases (and renaming it test_edge_cases or something; that's what it seems to be about).

gnprice · 2019-09-10T07:00:02Z

but it can be incredibly difficult to find a core developer with the time to review those types of changes by themselves. From a review perspective, it's usually easier to group them in with other changes.

Yeah, you've put your finger on a real tension here.

Thing is, while it is easier to get small code cleanups in when they're squashed into some bigger, otherwise-independent change... I think it's actually more review work that way. (For one-line cleanups like deleting those unhelpful comments, the stakes are low all around, but I think this is clearer for cleanups that are just a little more complex -- and often also higher-value.)

So the current equilibrium of how people write and send PRs, how they review them, and how they merge them, naturally pushes you in that direction; but I think it's actually one of the contributing factors in the shortage we have of review/merge effort. The thing about being an equilibrium is that it's infeasible to totally opt out -- this PR already contained what would have been at least three separate changes if we were in a workflow equilibrium where many, small, focused changes was the norm -- but I try to lean in that direction.

Which also includes sending some small cleanups as PRs of their own. One such was def97c9 (in this same file, as it happens). One merged just today was 3cbc23a -- waited a few weeks, but it's just as good now as then, and I expect it was very little work for the reviewers/merger. If you run git log --stat -p --author=gnprice in your clone of the repo, you'll see more examples. Looking through the logs in general, there are others as small as 3aa48b8.

aeros · 2019-09-10T07:03:49Z

@benjaminp:

I think it's fine to submit a PR that does the minimal move because it eases review and clarifies version history.

Would you suggest having code comment cleanups in their own PR then? My main concern was that it would end up getting buried. Code comments would generally fall under documentation, and the currently active doc experts are stretched thin as is, in comparison to the volume of doc-only PRs.

In theory, any core dev could review doc PRs, but most tend to be focused on their areas of expertise/interest. This completely makes sense and I definitely don't blame anyone for the situation, but it makes it rather difficult to have them reviewed/merged.

This also applies to rather small PRs, see #14698 as an example. I currently have 3 doc-only PRs open and have several other ideas for ones that I'd like to open, but I wanted to avoid adding further to the list of doc PRs that are in need review from a core developer.

We could use some more active documentation experts, but that's definitely easier said than done. I would be interested in helping with that issue, but I think I need more experience and mentoring first. I only recently became a member of the Triage team last month and started actively contributing in June.

aeros · 2019-09-10T07:23:50Z

@gnprice:

Thing is, while it is easier to get small code cleanups in when they're squashed into some bigger, otherwise-independent change... I think it's actually more review work that way.

So the current equilibrium of how people write and send PRs, how they review them, and how they merge them, naturally pushes you in that direction; but I think it's actually one of the contributing factors in the shortage we have of review/merge effort

Hmm, interesting. I hadn't considered that but I think you might be correct. I think a significant part of my push in that direction has been the slower response times to documentation only PRs (which I detailed more in my comment above). I've tried to assist with alleviating this issue by focusing my efforts on PR review, but the main bottleneck seems to be on the final core dev review. Perhaps smaller and more specific PRs could help with that.

I'm uncertain though if I should try to leave the code cleanup and minor documentation PRs on the backlog for now while I wait for the others to be merged and continue to focus on review, or try to do more of a balance of both. Do you have any recommendations?

benjaminp · 2019-09-10T09:28:12Z

@benjaminp:

I think it's fine to submit a PR that does the minimal move because it eases review and clarifies version history.

Would you suggest having code comment cleanups in their own PR then? My main concern was that it would end up getting buried. Code comments would generally fall under documentation, and the currently active doc experts are stretched thin as is, in comparison to the volume of doc-only PRs.

I don't think this problem is specific to doc-only PRs. In general, review time ≪ amount of PRs to review.

aeros · 2019-09-10T15:12:39Z

@benjaminp:

I don't think this problem is specific to doc-only PRs. In general, review time ≪ amount of PRs to review.

Oh yeah definitely, it's not an exclusive issue to doc-only PRs, but from my experience thus far it seems to be a bit more of an issue for them than it is for most other PRs. At least from my personal experience, I usually see significantly faster response times from CCing other experts than I do when trying to CC the documentation experts. There are some other areas which have longer response times than the documentation experts, but that's usually due to a lack of an active expert, such as for argparse.

gnprice · 2019-09-11T05:03:38Z

Thanks @benjaminp for the merge!

At least from my personal experience, I usually see significantly faster response times from CCing other experts than I do when trying to CC the documentation experts.

@aeros167 One thought this brings to mind: when submitting an issue or PR in the documentation for some area, you might get a response more easily by CCing experts in that area rather than in documentation generally.

I'd expect documentation experts to be most able to help for things in the docs infrastructure (e.g. the stuff in Doc/tools/), and things that cut across large parts of the documentation. (Plus of course all the other things the same person happens to be an expert on.) On the other hand if someone's taking care of e.g. a given stdlib module, I think they'd typically see that module's docs as part of what they're taking care of and are well-prepared to review.

…ythonGH-15712) Having these in a separate file from the one that's named after the module in the usual way makes it very easy to miss them when looking for tests for these two functions. (In fact when working recently on is_normalized, I'd been surprised to see no tests for it here and concluded the function had evaded being tested at all. I'd gone as far as to write up some tests myself before I spotted this other file.) Mostly this just means moving all the one file's code into the other, and moving code from the module toplevel to inside the test class to keep it tidily separate from the rest of the file's code. There's one substantive change, which reduces by a bit the amount of code to be moved: we drop the `x > sys.maxunicode` conditional and all the `RangeError` logic behind it. Now if that condition ever occurs it will cause an error at `chr(x)`, and a test failure. That's the right result because, since PEP 393 in Python 3.3, there is no longer such a thing as an "unsupported character".

gnprice requested a review from a team as a code owner September 6, 2019 05:24

the-knights-who-say-ni added the CLA signed label Sep 6, 2019

bedevere-bot added the awaiting review label Sep 6, 2019

aeros added skip news tests Tests in the Lib/test dir labels Sep 7, 2019

aeros reviewed Sep 7, 2019

View reviewed changes

aeros suggested changes Sep 7, 2019

View reviewed changes

bedevere-bot added awaiting core review and removed awaiting review labels Sep 7, 2019

gnprice commented Sep 9, 2019

View reviewed changes

benjaminp reviewed Sep 9, 2019

View reviewed changes

Fold old test_normalize function in too, as test_edge_cases.

8803950

benjaminp merged commit 1ad0c77 into python:master Sep 10, 2019

bedevere-bot removed the awaiting core review label Sep 10, 2019

gnprice deleted the pr-normalize-test-file branch September 11, 2019 04:51

gnprice mentioned this pull request Apr 10, 2022

small cleanups in Unicode normalization code #82224

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bpo-38043: Move unicodedata.normalize tests into test_unicodedata. #15712

bpo-38043: Move unicodedata.normalize tests into test_unicodedata. #15712

gnprice commented Sep 6, 2019 •

edited by bedevere-bot

Loading

aeros left a comment •

edited

Loading

aeros left a comment

aeros Sep 7, 2019

gnprice Sep 9, 2019

aeros Sep 7, 2019

aeros Sep 7, 2019

gnprice Sep 9, 2019

aeros Sep 9, 2019 •

edited

Loading

benjaminp Sep 9, 2019

gnprice Sep 10, 2019

aeros Sep 7, 2019

aeros Sep 7, 2019

gnprice left a comment

gnprice Sep 9, 2019

gnprice Sep 9, 2019

gnprice commented Sep 9, 2019

aeros commented Sep 9, 2019

benjaminp left a comment

benjaminp Sep 9, 2019

gnprice commented Sep 10, 2019

aeros commented Sep 10, 2019

aeros commented Sep 10, 2019

benjaminp commented Sep 10, 2019

aeros commented Sep 10, 2019

gnprice commented Sep 11, 2019

bpo-38043: Move unicodedata.normalize tests into test_unicodedata. #15712

bpo-38043: Move unicodedata.normalize tests into test_unicodedata. #15712

Conversation

gnprice commented Sep 6, 2019 • edited by bedevere-bot Loading

aeros left a comment • edited Loading

Choose a reason for hiding this comment

aeros left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

aeros Sep 9, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gnprice left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gnprice commented Sep 9, 2019

aeros commented Sep 9, 2019

benjaminp left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gnprice commented Sep 10, 2019

aeros commented Sep 10, 2019

aeros commented Sep 10, 2019

benjaminp commented Sep 10, 2019

aeros commented Sep 10, 2019

gnprice commented Sep 11, 2019

gnprice commented Sep 6, 2019 •

edited by bedevere-bot

Loading

aeros left a comment •

edited

Loading

aeros Sep 9, 2019 •

edited

Loading