move and enable visual tests #9961
Conversation
Thank you for opening a new pull request. Before your PR can be merged it will first need to pass continuous integration tests and be reviewed. Sometimes the review process can be slow, so please be patient. While you're waiting, please feel free to review other open PRs. While only a subset of people are authorized to approve pull requests for merging, everyone is encouraged to review open pull requests. Doing reviews helps reduce the burden on the core team and helps make the project's code better for everyone. One or more of the following people are requested to review this:
Pull Request Test Coverage Report for Build 5522051947
💛 - Coveralls
Moved all of the failures to one spot, so there should be less friction for artifact creation and better readability. Looks like it might need an approval to run the tests now? Once the new artifact creation is tested in the pipeline we can revert, and it should be good to merge.
Thanks for looking at this. I think the Azure run actually just got lost in the webhooks aether - it happens sometimes. The "authorisation required" thing is about a couple of the tests that run on GitHub Actions, and won't actually be affected by these changes at all. Pushing a new commit to this branch should retrigger the webhook to get Azure going again.
It'll definitely be good to get these tests running again in CI. I haven't been paying attention through the process of developing this PR, but just to check: have you seen any flakiness at all in any of the tests? I don't know whether there are any stochastic elements to any of the visualisations, but hopefully not.
@staticmethod
def _image_path(image_name):
    return os.path.join(RESULTDIR, image_name)

@staticmethod
def _reference_path(image_name):
    return os.path.join(TEST_REFERENCE_PATH, image_name)
Minor: I'd roughly prefer these just to be inlined into the call sites, with the constants becoming pathlib.Path instances. It'd be the same or less typing, even at the point of call:

self._reference_path("my_image.png")   # current
TEST_REFERENCE_PATH / "my_image.png"   # alternate
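For concreteness, a minimal sketch of what this suggestion might look like, assuming the constants are defined at module level in the test file; the directory names here are illustrative, not the real layout:

from pathlib import Path

# Assumed locations; the real test module would define these relative to its own path.
RESULTDIR = Path("test") / "visual" / "results"
TEST_REFERENCE_PATH = Path("test") / "visual" / "references"

# At the call site the helper methods disappear entirely:
result = RESULTDIR / "my_image.png"
reference = TEST_REFERENCE_PATH / "my_image.png"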
Chose not to inline these for the moment, but I can see the reasoning.
ratio = self._similarity_ratio(self._image_path(fname), self._reference_path(fname), fname)
assert ratio == 1
If we're using this exact same form everywhere (and it looks like we are), perhaps we could factor it out into a def assertImageSimilarity(self, name: str, similarity: int) method? Looking at what self._similarity_ratio does, I'd expect some of its work to be about handling the assertion. We already effectively require that our test images and the references have the same file name, so it doesn't seem too disruptive to factor that out into the assertion method.

At the least, here and everywhere, please can we use unittest assertions rather than bare assert, so we can see the full error messages in the logs if there's a failure.
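A rough sketch of what that factored-out assertion might look like, assuming the existing path helpers stay in place; the body of _similarity_ratio isn't shown in this thread, so this is illustrative only:

def assertImageSimilarity(self, name, similarity):
    # Compare the generated image against its same-named reference and fail
    # with a descriptive unittest message rather than a bare assert.
    ratio = self._similarity_ratio(self._image_path(name), self._reference_path(name), name)
    self.assertGreaterEqual(ratio, similarity, f"{name} does not match its reference image")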
If you were aiming to have the assertions fire only after all the tests ran, I'd suggest using ddt to parametrise the multi-circuit tests so there's only one assertion, or using with self.subTest to make internally separate test contexts.
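As an illustration of the subTest option, a hypothetical multi-image test could look like this (the file names and loop are invented for the example):

def test_multiple_circuit_drawings(self):
    for fname in ("circuit_a.png", "circuit_b.png", "circuit_c.png"):
        # Each comparison runs in its own context, so one mismatch is
        # reported without hiding failures in the remaining images.
        with self.subTest(image=fname):
            ratio = self._similarity_ratio(self._image_path(fname), self._reference_path(fname), fname)
            self.assertEqual(ratio, 1)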
Changed to use the unittest assertEqual.

Left as an assert after the fact, after changing _similarity_ratio to _save_diff; I think it makes sense that the function saves the diff and returns how different it is.
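For readers following along, a sketch of what a _save_diff helper of this shape could look like; the PIL-based implementation and the FAILURE_DIFF_DIR constant here are assumptions, not the PR's actual code:

import os
from PIL import Image, ImageChops

def _save_diff(self, current_path, expected_path, image_name):
    # Write a visual diff alongside the other failure artifacts so CI can
    # publish it, then return a similarity ratio for the caller to assert on.
    current = Image.open(current_path).convert("RGB")
    expected = Image.open(expected_path).convert("RGB")
    diff = ImageChops.difference(current, expected)
    diff.save(os.path.join(FAILURE_DIFF_DIR, image_name))
    # Fraction of pixels that are identical; 1.0 means an exact match.
    histogram = diff.convert("L").histogram()
    return histogram[0] / sum(histogram)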
Thanks so much for sticking with this. The change from _similarity_ratio to _save_diff does help a lot with legibility, thanks. It's still maybe not ideal that the ratio is a magic number shared between two places, but it's not the end of the world.

I pushed up a couple of minor commits to tidy up a file permission and remove old documentation references to Binder (which were a bit useless anyway). Thanks for sticking through this to the end!
In the recently merged #9961 the visual comparison tests were re-added to run a visual diff of the visualization output against a known reference image. These improved tests give us missing coverage and have already found potential bugs in proposed changes, see: #10208 (comment)

However, the comparisons were looking for a 100% match between the reference image and the generated output. In an ideal world we'd be able to rely on this, but in practice exact comparisons of the images are quite flaky. This is because the actual visualization output is a function of not only our Python code in Qiskit, but also upstream Python dependencies (such as matplotlib, seaborn, etc.), and more importantly those libraries leverage many system libraries and packages. This includes the C libraries for image formats (e.g. libpng), but also things like fonts. A version bump in our CI image's package versions can cause subtle differences in the output of visualizations. The CI images are controlled by Microsoft/GitHub and aren't something we can depend on being constant forever (even if we used Docker for a base test image, the upstream images we would build on aren't 100% fixed either and could cause similar issues). We've had numerous issues with this in the past with image comparison tests (especially when running them cross-platform, which isn't the case here), and it's why we've oscillated between having them in the test suite and not throughout the development history of Qiskit.

This commit adds a 1% tolerance to the comparison ratio, so instead of needing a 100% match a 99% match is good enough. While this does technically reduce the coverage, and a potentially invalid image could slip in under that 1% difference, the chance of that happening is fairly low, especially weighed against the likelihood of a system change causing a CI blockage (which happened within one day of merging #9961). The only other option is to make these tests non-voting in the CI job, which would all but remove their utility because they would likely go stale (as the testing harness before this did). We may still end up making them non-voting anyway, despite the coverage gains, if we can't reliably run the tests in CI.
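In test terms, the tolerance change amounts to loosening the assertion threshold, roughly as in the following sketch (the constant name is invented for illustration):

# 99% of pixels matching is accepted as a pass, absorbing small rendering
# differences from font or library version bumps in the CI images.
SIMILARITY_TOLERANCE = 0.99

ratio = self._save_diff(self._image_path(fname), self._reference_path(fname), fname)
self.assertGreaterEqual(ratio, SIMILARITY_TOLERANCE, f"{fname} differs from its reference")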
Summary
Moved and enabled the visual tests
Details and comments
Need to verify artifact publishing occurs on failure/missing; may need to write some purposeful failures.
Fixes #9892