Faster default runs #246
Conversation
And set max examples default to 20
* Set max examples default to 20
* For faster default runs, over more test quality
* Remove unreachable loading default settings path
(Seems my special-case changes broke some things; couldn't figure it out now, so I'll have a look next week 😅)
```diff
@@ -15,7 +15,6 @@
from reporting import pytest_metadata, pytest_json_modifyreport, add_extra_json_metadata  # noqa

settings.register_profile("xp_default", deadline=800)
```
What is the deadline without this here?
200. Good catch, I'll make it default to 800 again!
We might need to think about deadlines a bit more (but in a separate issue). Deadline errors are very common, even with 800. It's not just with JAX, I see them with every library. I disabled the deadline in the array-api-compat and array-api-strict CI. It looks like numpy does too. I think there's just a fundamental incongruity with the hypothesis deadline and the arrays strategy and the test suite.
> I think there's just a fundamental incongruity with the hypothesis deadline and the arrays strategy and the test suite.
Yep, that's a sound observation. For now I've made it default to 800 like before, and we can mull it over.
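For reference, a minimal sketch of how the hypothesis settings-profile mechanism behaves here (the profile name `"xp_default"` and the 800 ms value come from this PR's diff; the rest is standard hypothesis API):

```python
# Sketch of hypothesis settings profiles, assuming the values from the diff.
from datetime import timedelta

from hypothesis import settings

# Register a named profile; a numeric deadline is per-example milliseconds.
settings.register_profile("xp_default", deadline=800)

# Loading the profile makes it the active default for all tests.
settings.load_profile("xp_default")

# hypothesis normalizes the number to a timedelta internally.
assert settings.default.deadline == timedelta(milliseconds=800)
```

Without the `register_profile` call, tests fall back to hypothesis's built-in default deadline (200 ms), which is what prompted the question above.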
Need to update the README for the new max-examples default.
Nothing should be controversial now, so I'll merge, as it'd be nice to get this in.
Hmm, the tests run suspiciously fast now (under a minute). Are we really still testing everything as rigorously as we used to? I'm all for faster runs, but we also want to make sure that actual failures are caught, ideally without having to rerun the tests multiple times.
I think we're striking a nice balance. The CI job is an extreme example as
Is this primarily coming from vectorization? If we're really running this fast, I don't see the need to reduce the default number of examples.
Vectorization definitely helps; it's also that we're now reducing the max examples for the tests which aren't vectorized, plus only using a single example for the special-case tests.
I think the slowest adopter should be our lowest common denominator for how we set default runs of the test suite, i.e. JAX is taking 7 mins (which even then is longer than ideal for a default run IMO). Then we can leave it to users—and ourselves—to bump the max examples for CI jobs. On a general note on examples: in my experience it's often only a few simple examples that cause failures. So the super interesting examples that come with a larger max are not that rewarding for the significant time add.
Agreed with @honno's comments above - the default number of examples should be optimized for fast test suite execution time, and having a high number of examples really isn't very interesting anyway. I'd honestly set it to 1, not 20. 1 is regular CI: "is everything implemented and does it all work?". 20/200/inf is mostly repeating the same thing over and over, and for when it does actually add value - that is fuzz testing. 7 min is still slow for regular CI usage.
Very strong disagree. A very low number of examples does not actually "run the tests" in the way that people think it does, and is very misleading.

If you look at the tests in the test suite, each test is testing 10 or 20 different things. It's very hard for a single, randomly selected example to be representative enough to check everything. Consider that different "examples" mean different combinations of things like shapes and dtypes. Testing one example, as Ralf suggests, would mean just checking an array with just one of the supported dtypes, with some shape, and some values. I tested running

This is how hypothesis works. It generates interesting examples, where "interesting" means either corner cases (like this one) or general cases (like a "random" array). But it can only do this if you give it a chance to. There's a reason the hypothesis default is 100 examples per run.

If we set max-examples to 1, `test_add` would not test broadcasting, behavior on floating-point dtypes, complex number support, or support for other integer dtypes. In effect, the one run would be equivalent to a test like:

```python
def test_add():
    x = array(0., dtype=uint8)
    y = array(0., dtype=uint8)
    assert_all(add(x, y), array(0, dtype=uint8))
```

This would obviously not be a very good test for the `add` function.
Strong disagree here too. We should optimize for the common case, not the worst case. JAX can manually set their example count lower in their CI if they want to. It's easy to do that with the flags we've provided. But the default behavior should be something reasonable for most libraries, who aren't going to set that flag (and probably don't even realize that it is there).

I agree that we need to make the CI times reasonable, but the other improvements like vectorization have done this enough that I don't see the need to lower the default to 20 any more. With the latest version of the test suite, NumPy 2.0 runs in 16 seconds with max-examples set to 100. If that's not fast enough for you, I really don't know what to say.

Really, if anything, we should take this as an opportunity to increase the number of examples that are run by default and improve the default test coverage. Again, tests that don't get covered in the first (or tenth or hundredth) run of the tests are very common. Just look at how often I have to update the XFAILs file for array-api-compat because some rare error only pops up sometimes on CI: https://github.com/data-apis/array-api-compat/commits/main/numpy-1-21-xfails.txt

Note that I am only talking about the normal tests here. I am fine with keeping the special case tests at a lower number of examples. Although even there we should be careful that some special cases are not fully covered by one example, and we should be cautious of the apparent hypothesis behavior of giving zero variability when max-examples is set to 1.
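The coverage point above can be demonstrated directly: with `max_examples=1`, hypothesis draws a single value from the strategy, so only one of the possible dtype combinations is ever exercised. A minimal sketch (the dtype names are illustrative, not the test suite's actual strategies):

```python
# Sketch: why one example cannot cover multiple dtype combinations.
from hypothesis import given, settings, strategies as st

seen = set()

# database=None disables example replay so only the generate phase runs.
@settings(max_examples=1, database=None, deadline=None)
@given(dtype=st.sampled_from(["uint8", "int32", "float64", "complex128"]))
def check_one_example(dtype):
    seen.add(dtype)

check_one_example()
assert len(seen) == 1  # only one of the four dtypes ever gets exercised
```

Raising `max_examples` gives hypothesis the chance to cover the remaining combinations, which is the trade-off being debated here.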
I do not believe a rigorous test suite could take less than 1 min to run. This is simply impossible. Please do not over-optimize. I second what Aaron said.
Well okay, that is then not going to be possible with 1 example; I didn't know that (and it doesn't have to be that way, it's a choice made per test by the test author, it looks like). This is a little problematic, I think, for a test suite design if you care about runtime, because it also means that 20 examples isn't enough, since you have no guarantees that the first 20 examples are testing the 20 different things. I guess it becomes a stochastic thing - 25 is probably still not enough to reliably test everything; 30 or 40 has a high probability of being comprehensive, etc.
This is definitely not impossible. The test suite for numpy runs in O(100 sec), and tests (a) 10x more functions, (b) more computationally expensive functions on average, and (c) actually needs to compare numerical accuracy (unlike this test suite, where there's no expected answer that must be met). So if the test suite were written with similar techniques/efficiency as for
Yes indeed. This is a problem. This is not regular test suite usage. Instead, this is fuzz testing. Random failures like that should not happen for regular regression testing.
Zero variability is what you want in regular usage. "Flaky tests" are an issue for pretty much every large library. I know
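For what it's worth, hypothesis does offer exactly this zero-variability mode via its `derandomize` setting: examples are drawn from a fixed seed, so every run sees the identical sequence. A small sketch:

```python
# Sketch: derandomize=True makes hypothesis draws fully reproducible.
from hypothesis import given, settings, strategies as st

def draw_examples():
    drawn = []

    @settings(derandomize=True, max_examples=5, deadline=None)
    @given(st.integers())
    def record(x):
        drawn.append(x)

    record()
    return drawn

# Two independent runs produce the identical sequence of examples.
assert draw_examples() == draw_examples()
```

Whether the suite should default to that (regression-style) or to varying examples (fuzz-style) is the disagreement in this thread.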
The tests run fast enough now that there's no need to lower this to 20. See the discussion at data-apis#246.
Ralf, it sounds like you are more discussing the merits of property-based testing in general. I don't know if it's really worth my time to try to convince you of its value. The fact is, hypothesis tests are the type of tests we chose to use for the array-api-tests test suite. It's my personal experience that you get much more value out of hypothesis testing if you embrace its paradigm instead of trying to fight it or shoehorn it into a traditional testing paradigm. I've made a PR resetting the default max-examples back to 100: #254. You can see at data-apis/array-api-strict#34 I set the max-examples to double that (200), and the whole CI runs in a minute (that includes installing dependencies). 100 is the default value that hypothesis developers themselves chose, so I think we shouldn't change that, especially to something lower, unless there's a good reason.
Regarding special-cases, is hypothesis used at all now? I ran
(to pick a random example), and hypothesis didn't output anything. It might be worth continuing to use hypothesis at least for special cases that refer to multiple values, like this one. And also, as I said, ensuring that it actually varies the example across runs.
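One option here (an assumption on my part, not something the suite currently does) is hypothesis's explicit `@example` decorator: pinned special values always run in their own phase, in addition to whatever the generate phase produces, even when `max_examples` is 1. A sketch:

```python
# Sketch: @example guarantees special values are hit even at max_examples=1.
import math

from hypothesis import example, given, settings, strategies as st

seen = []

@settings(max_examples=1, database=None, deadline=None)
@given(st.floats())
@example(math.nan)
@example(math.inf)
def check_special(x):
    seen.append(x)

check_special()
assert any(math.isnan(v) for v in seen)  # nan case always covered
assert math.inf in seen                  # inf case always covered
```

That would keep special-case runs cheap while removing the luck element for the values the spec actually names.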
Yes, but not in a typical
Some changes that significantly speed up the run of the test suite:
unvectorized