-
-
Notifications
You must be signed in to change notification settings - Fork 30.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bpo-38043: Move unicodedata.normalize tests into test_unicodedata. #15712
Conversation
Having these in a separate file from the one that's named after the module in the usual way makes it very easy to miss them when looking for tests for these two functions. (In fact when working recently on is_normalized, I'd been surprised to see no tests for it here and concluded the function had evaded being tested at all. I'd gone as far as to write up some tests myself before I spotted this other file.) Mostly this just means moving all the one file's code into the other, and moving code from the module toplevel to inside the test class to keep it tidily separate from the rest of the file's code. There's one substantive change, which reduces by a bit the amount of code to be moved: we drop the `x > sys.maxunicode` conditional and all the `RangeError` logic behind it. Now if that condition ever occurs it will cause an error at `chr(x)`, and a test failure. That's the right result because, since PEP 393 in Python 3.3, there is no longer such a thing as an "unsupported character".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR @gnprice. As far as I'm aware, we typically don't require news entries for changes that exclusively involve the tests, so for now I've added the skip news
label. I'll also CC the experts for unicodedata
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a few minor suggestions that wouldn't be worth opening their own PR for, but might be worth addressing while we're moving the tests.
@@ -172,6 +171,9 @@ def test_named_sequences_sample(self): | |||
|
|||
def test_named_sequences_full(self): | |||
# Check all the named sequences | |||
def check_version(testfile): | |||
hdr = testfile.readline() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hdr = testfile.readline() | |
header = testfile.readline() |
Is "hdr" short for header here? I don't recall seeing this particular abbreviation used before, so it's not entirely clear to me. But, it seems to be the only thing that would make sense in this context. Unless this is a very common and well known abbreviation that I wasn't aware of, it doesn't seem to be worth saving the extra three characters. It looks like it's only used a total of 3 times within these tests, so changing the variable name shouldn't be an issue. Explicit > Implicit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I'm pretty sure it's intended to mean "header". I'd write the full word in this context if I were writing this code new.
(General reason I've left it this way is in the top-level comment above.)
Fortunately because this variable is local to this small and simple bit of code, there isn't much the reader has to look at to see what it does. Even if the variable were named "x" or "var31", the code wouldn't be made that much harder to read. Maybe even easier, as you wouldn't be tempted to try to guess what the name meant 😉
self.assertTrue(unicodedata.is_normalized("NFD", c5)) | ||
|
||
self.assertTrue(unicodedata.is_normalized("NFKC", c4)) | ||
self.assertTrue(unicodedata.is_normalized("NFKD", c5)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Trivial nit: The amount of whitespace in this section seems a bit excessive. I would recommend condensing it down to something like this:
self.assertTrue(unicodedata.is_normalized("NFC", c2))
self.assertTrue(unicodedata.is_normalized("NFC", c4))
self.assertTrue(unicodedata.is_normalized("NFD", c3))
self.assertTrue(unicodedata.is_normalized("NFD", c5))
self.assertTrue(unicodedata.is_normalized("NFKC", c4))
self.assertTrue(unicodedata.is_normalized("NFKD", c5))
Lib/test/test_unicodedata.py
Outdated
@@ -175,8 +176,7 @@ def test_normalize(self): | |||
self.assertRaises(TypeError, self.db.normalize) | |||
self.assertRaises(ValueError, self.db.normalize, 'unknown', 'xx') | |||
self.assertEqual(self.db.normalize('NFKC', ''), '') | |||
# The rest can be found in test_normalization.py | |||
# which requires an external file. | |||
# See also comprehensive tests in NormalizationTest below. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# See also comprehensive tests in NormalizationTest below. |
I agree with the removal of the previous comment since this has been condensed into one file, but I don't think the new comment is needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not strictly necessary, but I think there's a possible confusion where you're looking through all these tests which test one function after another, and this fits right in, and so you might naturally infer that this is the test (or the main test) for this function, and not go looking for others. So I think a bit of explicitness remains helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not strictly necessary, but I think there's a possible confusion where you're looking through all these tests which test one function after another, and this fits right in, and so you might naturally infer that this is the test (or the main test) for this function, and not go looking for others.
Hmm, fair point. That could potentially be useful to point out to readers.
I'm not completely against this comment remaining, but I think the other two (especially "Perform tests" on line 370) should be removed. Comments that don't provide additional information or context end up adding unneeded noise/clutter to the code. Each individual unneeded comment isn't much of an issue, but it can definitely add up over time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest completing the unification by putting this case in NormalizationCases
(and renaming it test_edge_cases
or something; that's what it seems to be about).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure -- done.
continue | ||
c1,c2,c3,c4,c5 = [self.unistr(x) for x in line.split(';')[:-1]] | ||
|
||
# Perform tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Perform tests |
I don't see much of a purpose in this comment, it doesn't seem to provide any additional information or context for the assertions below it.
TESTDATAFILE = "NormalizationTest.txt" | ||
TESTDATAURL = f"http://www.pythontest.net/unicode/{unicodedata.unidata_version}/{TESTDATAFILE}" | ||
|
||
# Hit the exception early |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Hit the exception early |
Similar to my other recommendation, this comment doesn't seem to provide any particularly useful information or context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @aeros167 !
I agree with most of your comments in that if I were writing this code new, or reviewing it when newly-written, I'd prefer the code be the way you suggest.
I've left things like a short local-variable name, and the couple of unnecessary small comments, the way they are because I've largely taken a policy of leaving the code unchanged wherever possible. That's because I think that's helpful in reducing the work needed to read and review the change.
Especially when moving a good-sized chunk of code around, if a reader wants to 100% check my work for themselves, they'll have to go comparing the deleted and added code bit by bit and spot where it differs -- then study each of those differences to understand what they do. (Tools like git diff --color-moved
help in skimming past what didn't change, but when something did change you still have to study that.) In this case there were a few things that did change, as described in the PR description. But it helps if that's kept to a minimum.
@@ -172,6 +171,9 @@ def test_named_sequences_sample(self): | |||
|
|||
def test_named_sequences_full(self): | |||
# Check all the named sequences | |||
def check_version(testfile): | |||
hdr = testfile.readline() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I'm pretty sure it's intended to mean "header". I'd write the full word in this context if I were writing this code new.
(General reason I've left it this way is in the top-level comment above.)
Fortunately because this variable is local to this small and simple bit of code, there isn't much the reader has to look at to see what it does. Even if the variable were named "x" or "var31", the code wouldn't be made that much harder to read. Maybe even easier, as you wouldn't be tempted to try to guess what the name meant 😉
Lib/test/test_unicodedata.py
Outdated
@@ -175,8 +176,7 @@ def test_normalize(self): | |||
self.assertRaises(TypeError, self.db.normalize) | |||
self.assertRaises(ValueError, self.db.normalize, 'unknown', 'xx') | |||
self.assertEqual(self.db.normalize('NFKC', ''), '') | |||
# The rest can be found in test_normalization.py | |||
# which requires an external file. | |||
# See also comprehensive tests in NormalizationTest below. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not strictly necessary, but I think there's a possible confusion where you're looking through all these tests which test one function after another, and this fits right in, and so you might naturally infer that this is the test (or the main test) for this function, and not go looking for others. So I think a bit of explicitness remains helpful.
(also for ease of cross-reference here's where @benjaminp originally suggested this change: #15558 (comment) ) |
I can definitely understand the logic here, the less changes involved in a PR the easier it is to review. However, I think some of the minor refactoring is needed over time. If everyone were to follow the mentality of leaving the surrounding code as unchanged as possible, these minor issues would add up to be more significant over time. I wouldn't mind attempting to apply these minor changes in a separate PR, but it can be incredibly difficult to find a core developer with the time to review those types of changes by themselves. From a review perspective, it's usually easier to group them in with other changes. That's why I usually try to suggest for several of them to be addressed when a PR happens to be modifying (or moving in this case) the surrounding code. It's a fine balance though, and I certainly don't expect all of my suggestions to make it to the final merged PR. As long as some of them are addressed, it provides a gradual improvement in the long term. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine to submit a PR that does the minimal move because it eases review and clarifies version history.
Lib/test/test_unicodedata.py
Outdated
@@ -175,8 +176,7 @@ def test_normalize(self): | |||
self.assertRaises(TypeError, self.db.normalize) | |||
self.assertRaises(ValueError, self.db.normalize, 'unknown', 'xx') | |||
self.assertEqual(self.db.normalize('NFKC', ''), '') | |||
# The rest can be found in test_normalization.py | |||
# which requires an external file. | |||
# See also comprehensive tests in NormalizationTest below. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest completing the unification by putting this case in NormalizationCases
(and renaming it test_edge_cases
or something; that's what it seems to be about).
Yeah, you've put your finger on a real tension here. Thing is, while it is easier to get small code cleanups in when they're squashed into some bigger, otherwise-independent change... I think it's actually more review work that way. (For one-line cleanups like deleting those unhelpful comments, the stakes are low all around, but I think this is clearer for cleanups that are just a little more complex -- and often also higher-value.) So the current equilibrium of how people write and send PRs, how they review them, and how they merge them, naturally pushes you in that direction; but I think it's actually one of the contributing factors in the shortage we have of review/merge effort. The thing about being an equilibrium is that it's infeasible to totally opt out -- this PR already contained what would have been at least three separate changes if we were in a workflow equilibrium where many, small, focused changes was the norm -- but I try to lean in that direction. Which also includes sending some small cleanups as PRs of their own. One such was def97c9 (in this same file, as it happens). One merged just today was 3cbc23a -- waited a few weeks, but it's just as good now as then, and I expect it was very little work for the reviewers/merger. If you run |
Would you suggest having code comment cleanups in their own PR then? My main concern was that it would end up getting buried. Code comments would generally fall under documentation, and the currently active doc experts are stretched thin as is, in comparison to the volume of doc-only PRs. In theory, any core dev could review doc PRs, but most tend to be focused on their areas of expertise/interest. This completely makes sense and I definitely don't blame anyone for the situation, but it makes it rather difficult to have them reviewed/merged. This also applies to rather small PRs, see #14698 as an example. I currently have 3 doc-only PRs open and have several other ideas for ones that I'd like to open, but I wanted to avoid adding further to the list of doc PRs that are in need review from a core developer. We could use some more active documentation experts, but that's definitely easier said than done. I would be interested in helping with that issue, but I think I need more experience and mentoring first. I only recently became a member of the Triage team last month and started actively contributing in June. |
Hmm, interesting. I hadn't considered that but I think you might be correct. I think a significant part of my push in that direction has been the slower response times to documentation only PRs (which I detailed more in my comment above). I've tried to assist with alleviating this issue by focusing my efforts on PR review, but the main bottleneck seems to be on the final core dev review. Perhaps smaller and more specific PRs could help with that. I'm uncertain though if I should try to leave the code cleanup and minor documentation PRs on the backlog for now while I wait for the others to be merged and continue to focus on review, or try to do more of a balance of both. Do you have any recommendations? |
I don't think this problem is specific to doc-only PRs. In general, review time ≪ amount of PRs to review. |
Oh yeah definitely, it's not an exclusive issue to doc-only PRs, but from my experience thus far it seems to be a bit more of an issue for them than it is for most other PRs. At least from my personal experience, I usually see significantly faster response times from CCing other experts than I do when trying to CC the documentation experts. There are some other areas which have longer response times than the documentation experts, but that's usually due to a lack of an active expert, such as for |
Thanks @benjaminp for the merge!
@aeros167 One thought this brings to mind: when submitting an issue or PR in the documentation for some area, you might get a response more easily by CCing experts in that area rather than in documentation generally. I'd expect documentation experts to be most able to help for things in the docs infrastructure (e.g. the stuff in |
…ythonGH-15712) Having these in a separate file from the one that's named after the module in the usual way makes it very easy to miss them when looking for tests for these two functions. (In fact when working recently on is_normalized, I'd been surprised to see no tests for it here and concluded the function had evaded being tested at all. I'd gone as far as to write up some tests myself before I spotted this other file.) Mostly this just means moving all the one file's code into the other, and moving code from the module toplevel to inside the test class to keep it tidily separate from the rest of the file's code. There's one substantive change, which reduces by a bit the amount of code to be moved: we drop the `x > sys.maxunicode` conditional and all the `RangeError` logic behind it. Now if that condition ever occurs it will cause an error at `chr(x)`, and a test failure. That's the right result because, since PEP 393 in Python 3.3, there is no longer such a thing as an "unsupported character".
Having these in a separate file from the one that's named after the
module in the usual way makes it very easy to miss them when looking
for tests for these two functions.
(In fact when working recently on is_normalized, I'd been surprised to
see no tests for it here and concluded the function had evaded being
tested at all. I'd gone as far as to write up some tests myself
before I spotted this other file.)
Mostly this just means moving all the one file's code into the other,
and moving code from the module toplevel to inside the test class to
keep it tidily separate from the rest of the file's code.
There's one substantive change, which reduces by a bit the amount of
code to be moved: we drop the
x > sys.maxunicode
conditional and allthe
RangeError
logic behind it. Now if that condition ever occursit will cause an error at
chr(x)
, and a test failure. That's theright result because, since PEP 393 in Python 3.3, there is no longer
such a thing as an "unsupported character".
https://bugs.python.org/issue38043