Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpo-38043: Move unicodedata.normalize tests into test_unicodedata. #15712

Merged
merged 2 commits into from
Sep 10, 2019

Conversation

gnprice
Copy link
Contributor

@gnprice gnprice commented Sep 6, 2019

Having these in a separate file from the one that's named after the
module in the usual way makes it very easy to miss them when looking
for tests for these two functions.

(In fact when working recently on is_normalized, I'd been surprised to
see no tests for it here and concluded the function had evaded being
tested at all. I'd gone as far as to write up some tests myself
before I spotted this other file.)

Mostly this just means moving all the one file's code into the other,
and moving code from the module toplevel to inside the test class to
keep it tidily separate from the rest of the file's code.

There's one substantive change, which reduces by a bit the amount of
code to be moved: we drop the x > sys.maxunicode conditional and all
the RangeError logic behind it. Now if that condition ever occurs
it will cause an error at chr(x), and a test failure. That's the
right result because, since PEP 393 in Python 3.3, there is no longer
such a thing as an "unsupported character".

https://bugs.python.org/issue38043

Having these in a separate file from the one that's named after the
module in the usual way makes it very easy to miss them when looking
for tests for these two functions.

(In fact when working recently on is_normalized, I'd been surprised to
see no tests for it here and concluded the function had evaded being
tested at all.  I'd gone as far as to write up some tests myself
before I spotted this other file.)

Mostly this just means moving all the one file's code into the other,
and moving code from the module toplevel to inside the test class to
keep it tidily separate from the rest of the file's code.

There's one substantive change, which reduces by a bit the amount of
code to be moved: we drop the `x > sys.maxunicode` conditional and all
the `RangeError` logic behind it.  Now if that condition ever occurs
it will cause an error at `chr(x)`, and a test failure.  That's the
right result because, since PEP 393 in Python 3.3, there is no longer
such a thing as an "unsupported character".
@gnprice gnprice requested a review from a team as a code owner September 6, 2019 05:24
@aeros aeros added skip news tests Tests in the Lib/test dir labels Sep 7, 2019
Copy link
Contributor

@aeros aeros left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @gnprice. As far as I'm aware, we typically don't require news entries for changes that exclusively involve the tests, so for now I've added the skip news label. I'll also CC the experts for unicodedata.

/cc @malemburg @ezio-melotti

Copy link
Contributor

@aeros aeros left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few minor suggestions that wouldn't be worth opening their own PR for, but might be worth addressing while we're moving the tests.

@@ -172,6 +171,9 @@ def test_named_sequences_sample(self):

def test_named_sequences_full(self):
# Check all the named sequences
def check_version(testfile):
hdr = testfile.readline()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
hdr = testfile.readline()
header = testfile.readline()

Is "hdr" short for header here? I don't recall seeing this particular abbreviation used before, so it's not entirely clear to me. But, it seems to be the only thing that would make sense in this context. Unless this is a very common and well known abbreviation that I wasn't aware of, it doesn't seem to be worth saving the extra three characters. It looks like it's only used a total of 3 times within these tests, so changing the variable name shouldn't be an issue. Explicit > Implicit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I'm pretty sure it's intended to mean "header". I'd write the full word in this context if I were writing this code new.

(General reason I've left it this way is in the top-level comment above.)

Fortunately because this variable is local to this small and simple bit of code, there isn't much the reader has to look at to see what it does. Even if the variable were named "x" or "var31", the code wouldn't be made that much harder to read. Maybe even easier, as you wouldn't be tempted to try to guess what the name meant 😉

self.assertTrue(unicodedata.is_normalized("NFD", c5))

self.assertTrue(unicodedata.is_normalized("NFKC", c4))
self.assertTrue(unicodedata.is_normalized("NFKD", c5))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trivial nit: The amount of whitespace in this section seems a bit excessive. I would recommend condensing it down to something like this:

            self.assertTrue(unicodedata.is_normalized("NFC", c2))
            self.assertTrue(unicodedata.is_normalized("NFC", c4))
            self.assertTrue(unicodedata.is_normalized("NFD", c3))
            self.assertTrue(unicodedata.is_normalized("NFD", c5))
            self.assertTrue(unicodedata.is_normalized("NFKC", c4))
            self.assertTrue(unicodedata.is_normalized("NFKD", c5))

@@ -175,8 +176,7 @@ def test_normalize(self):
self.assertRaises(TypeError, self.db.normalize)
self.assertRaises(ValueError, self.db.normalize, 'unknown', 'xx')
self.assertEqual(self.db.normalize('NFKC', ''), '')
# The rest can be found in test_normalization.py
# which requires an external file.
# See also comprehensive tests in NormalizationTest below.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# See also comprehensive tests in NormalizationTest below.

I agree with the removal of the previous comment since this has been condensed into one file, but I don't think the new comment is needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not strictly necessary, but I think there's a possible confusion where you're looking through all these tests which test one function after another, and this fits right in, and so you might naturally infer that this is the test (or the main test) for this function, and not go looking for others. So I think a bit of explicitness remains helpful.

Copy link
Contributor

@aeros aeros Sep 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not strictly necessary, but I think there's a possible confusion where you're looking through all these tests which test one function after another, and this fits right in, and so you might naturally infer that this is the test (or the main test) for this function, and not go looking for others.

Hmm, fair point. That could potentially be useful to point out to readers.

I'm not completely against this comment remaining, but I think the other two (especially "Perform tests" on line 370) should be removed. Comments that don't provide additional information or context end up adding unneeded noise/clutter to the code. Each individual unneeded comment isn't much of an issue, but it can definitely add up over time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest completing the unification by putting this case in NormalizationCases (and renaming it test_edge_cases or something; that's what it seems to be about).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure -- done.

continue
c1,c2,c3,c4,c5 = [self.unistr(x) for x in line.split(';')[:-1]]

# Perform tests
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Perform tests

I don't see much of a purpose in this comment, it doesn't seem to provide any additional information or context for the assertions below it.

TESTDATAFILE = "NormalizationTest.txt"
TESTDATAURL = f"http://www.pythontest.net/unicode/{unicodedata.unidata_version}/{TESTDATAFILE}"

# Hit the exception early
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Hit the exception early

Similar to my other recommendation, this comment doesn't seem to provide any particularly useful information or context.

Copy link
Contributor Author

@gnprice gnprice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @aeros167 !

I agree with most of your comments in that if I were writing this code new, or reviewing it when newly-written, I'd prefer the code be the way you suggest.

I've left things like a short local-variable name, and the couple of unnecessary small comments, the way they are because I've largely taken a policy of leaving the code unchanged wherever possible. That's because I think that's helpful in reducing the work needed to read and review the change.

Especially when moving a good-sized chunk of code around, if a reader wants to 100% check my work for themselves, they'll have to go comparing the deleted and added code bit by bit and spot where it differs -- then study each of those differences to understand what they do. (Tools like git diff --color-moved help in skimming past what didn't change, but when something did change you still have to study that.) In this case there were a few things that did change, as described in the PR description. But it helps if that's kept to a minimum.

@@ -172,6 +171,9 @@ def test_named_sequences_sample(self):

def test_named_sequences_full(self):
# Check all the named sequences
def check_version(testfile):
hdr = testfile.readline()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I'm pretty sure it's intended to mean "header". I'd write the full word in this context if I were writing this code new.

(General reason I've left it this way is in the top-level comment above.)

Fortunately because this variable is local to this small and simple bit of code, there isn't much the reader has to look at to see what it does. Even if the variable were named "x" or "var31", the code wouldn't be made that much harder to read. Maybe even easier, as you wouldn't be tempted to try to guess what the name meant 😉

@@ -175,8 +176,7 @@ def test_normalize(self):
self.assertRaises(TypeError, self.db.normalize)
self.assertRaises(ValueError, self.db.normalize, 'unknown', 'xx')
self.assertEqual(self.db.normalize('NFKC', ''), '')
# The rest can be found in test_normalization.py
# which requires an external file.
# See also comprehensive tests in NormalizationTest below.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not strictly necessary, but I think there's a possible confusion where you're looking through all these tests which test one function after another, and this fits right in, and so you might naturally infer that this is the test (or the main test) for this function, and not go looking for others. So I think a bit of explicitness remains helpful.

@gnprice
Copy link
Contributor Author

gnprice commented Sep 9, 2019

(also for ease of cross-reference here's where @benjaminp originally suggested this change: #15558 (comment) )

@aeros
Copy link
Contributor

aeros commented Sep 9, 2019

@gnprice:

I've left things like a short local-variable name, and the couple of unnecessary small comments, the way they are because I've largely taken a policy of leaving the code unchanged wherever possible. That's because I think that's helpful in reducing the work needed to read and review the change.

I can definitely understand the logic here, the less changes involved in a PR the easier it is to review. However, I think some of the minor refactoring is needed over time. If everyone were to follow the mentality of leaving the surrounding code as unchanged as possible, these minor issues would add up to be more significant over time.

I wouldn't mind attempting to apply these minor changes in a separate PR, but it can be incredibly difficult to find a core developer with the time to review those types of changes by themselves. From a review perspective, it's usually easier to group them in with other changes. That's why I usually try to suggest for several of them to be addressed when a PR happens to be modifying (or moving in this case) the surrounding code.

It's a fine balance though, and I certainly don't expect all of my suggestions to make it to the final merged PR. As long as some of them are addressed, it provides a gradual improvement in the long term.

Copy link
Contributor

@benjaminp benjaminp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine to submit a PR that does the minimal move because it eases review and clarifies version history.

@@ -175,8 +176,7 @@ def test_normalize(self):
self.assertRaises(TypeError, self.db.normalize)
self.assertRaises(ValueError, self.db.normalize, 'unknown', 'xx')
self.assertEqual(self.db.normalize('NFKC', ''), '')
# The rest can be found in test_normalization.py
# which requires an external file.
# See also comprehensive tests in NormalizationTest below.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest completing the unification by putting this case in NormalizationCases (and renaming it test_edge_cases or something; that's what it seems to be about).

@gnprice
Copy link
Contributor Author

gnprice commented Sep 10, 2019

but it can be incredibly difficult to find a core developer with the time to review those types of changes by themselves. From a review perspective, it's usually easier to group them in with other changes.

Yeah, you've put your finger on a real tension here.

Thing is, while it is easier to get small code cleanups in when they're squashed into some bigger, otherwise-independent change... I think it's actually more review work that way. (For one-line cleanups like deleting those unhelpful comments, the stakes are low all around, but I think this is clearer for cleanups that are just a little more complex -- and often also higher-value.)

So the current equilibrium of how people write and send PRs, how they review them, and how they merge them, naturally pushes you in that direction; but I think it's actually one of the contributing factors in the shortage we have of review/merge effort. The thing about being an equilibrium is that it's infeasible to totally opt out -- this PR already contained what would have been at least three separate changes if we were in a workflow equilibrium where many, small, focused changes was the norm -- but I try to lean in that direction.

Which also includes sending some small cleanups as PRs of their own. One such was def97c9 (in this same file, as it happens). One merged just today was 3cbc23a -- waited a few weeks, but it's just as good now as then, and I expect it was very little work for the reviewers/merger. If you run git log --stat -p --author=gnprice in your clone of the repo, you'll see more examples. Looking through the logs in general, there are others as small as 3aa48b8.

@aeros
Copy link
Contributor

aeros commented Sep 10, 2019

@benjaminp:

I think it's fine to submit a PR that does the minimal move because it eases review and clarifies version history.

Would you suggest having code comment cleanups in their own PR then? My main concern was that it would end up getting buried. Code comments would generally fall under documentation, and the currently active doc experts are stretched thin as is, in comparison to the volume of doc-only PRs.

In theory, any core dev could review doc PRs, but most tend to be focused on their areas of expertise/interest. This completely makes sense and I definitely don't blame anyone for the situation, but it makes it rather difficult to have them reviewed/merged.

This also applies to rather small PRs, see #14698 as an example. I currently have 3 doc-only PRs open and have several other ideas for ones that I'd like to open, but I wanted to avoid adding further to the list of doc PRs that are in need review from a core developer.

We could use some more active documentation experts, but that's definitely easier said than done. I would be interested in helping with that issue, but I think I need more experience and mentoring first. I only recently became a member of the Triage team last month and started actively contributing in June.

@aeros
Copy link
Contributor

aeros commented Sep 10, 2019

@gnprice:

Thing is, while it is easier to get small code cleanups in when they're squashed into some bigger, otherwise-independent change... I think it's actually more review work that way.

So the current equilibrium of how people write and send PRs, how they review them, and how they merge them, naturally pushes you in that direction; but I think it's actually one of the contributing factors in the shortage we have of review/merge effort

Hmm, interesting. I hadn't considered that but I think you might be correct. I think a significant part of my push in that direction has been the slower response times to documentation only PRs (which I detailed more in my comment above). I've tried to assist with alleviating this issue by focusing my efforts on PR review, but the main bottleneck seems to be on the final core dev review. Perhaps smaller and more specific PRs could help with that.

I'm uncertain though if I should try to leave the code cleanup and minor documentation PRs on the backlog for now while I wait for the others to be merged and continue to focus on review, or try to do more of a balance of both. Do you have any recommendations?

@benjaminp
Copy link
Contributor

@benjaminp:

I think it's fine to submit a PR that does the minimal move because it eases review and clarifies version history.

Would you suggest having code comment cleanups in their own PR then? My main concern was that it would end up getting buried. Code comments would generally fall under documentation, and the currently active doc experts are stretched thin as is, in comparison to the volume of doc-only PRs.

I don't think this problem is specific to doc-only PRs. In general, review time ≪ amount of PRs to review.

@benjaminp benjaminp merged commit 1ad0c77 into python:master Sep 10, 2019
@aeros
Copy link
Contributor

aeros commented Sep 10, 2019

@benjaminp:

I don't think this problem is specific to doc-only PRs. In general, review time ≪ amount of PRs to review.

Oh yeah definitely, it's not an exclusive issue to doc-only PRs, but from my experience thus far it seems to be a bit more of an issue for them than it is for most other PRs. At least from my personal experience, I usually see significantly faster response times from CCing other experts than I do when trying to CC the documentation experts. There are some other areas which have longer response times than the documentation experts, but that's usually due to a lack of an active expert, such as for argparse.

@gnprice gnprice deleted the pr-normalize-test-file branch September 11, 2019 04:51
@gnprice
Copy link
Contributor Author

gnprice commented Sep 11, 2019

Thanks @benjaminp for the merge!

At least from my personal experience, I usually see significantly faster response times from CCing other experts than I do when trying to CC the documentation experts.

@aeros167 One thought this brings to mind: when submitting an issue or PR in the documentation for some area, you might get a response more easily by CCing experts in that area rather than in documentation generally.

I'd expect documentation experts to be most able to help for things in the docs infrastructure (e.g. the stuff in Doc/tools/), and things that cut across large parts of the documentation. (Plus of course all the other things the same person happens to be an expert on.) On the other hand if someone's taking care of e.g. a given stdlib module, I think they'd typically see that module's docs as part of what they're taking care of and are well-prepared to review.

websurfer5 pushed a commit to websurfer5/cpython that referenced this pull request Jul 20, 2020
…ythonGH-15712)

Having these in a separate file from the one that's named after the
module in the usual way makes it very easy to miss them when looking
for tests for these two functions.

(In fact when working recently on is_normalized, I'd been surprised to
see no tests for it here and concluded the function had evaded being
tested at all.  I'd gone as far as to write up some tests myself
before I spotted this other file.)

Mostly this just means moving all the one file's code into the other,
and moving code from the module toplevel to inside the test class to
keep it tidily separate from the rest of the file's code.

There's one substantive change, which reduces by a bit the amount of
code to be moved: we drop the `x > sys.maxunicode` conditional and all
the `RangeError` logic behind it.  Now if that condition ever occurs
it will cause an error at `chr(x)`, and a test failure.  That's the
right result because, since PEP 393 in Python 3.3, there is no longer
such a thing as an "unsupported character".
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
skip news tests Tests in the Lib/test dir
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants