Warts of the Answer Testing Framework #307
Comments
We had a discussion about this back in San Diego. I thought I would write down some of the conclusions we reached (to my memory, at least), as I'm now looking at PR #318. I think we concluded that moving from tracking the gold standard with a git tag to tracking it with a changeset was the way to go, as it should basically solve the first issue mentioned here. In enzo-dev, we run the tests straight out of the repo, so it does require updating the gold standard when new tests are added. If we don't want to do that, I can see a couple of options.
One question from a non-expert: I always thought that a tag was just a convenient way to name a changeset, but it seems that pushing and pulling git tags is not really straightforward. In particular, `git push` doesn't transfer the tags. Is this true for both lightweight and annotated tags? Assuming that's right, it seems like we might as well just use the changeset as Britton suggests. The only downside I see is that changesets are not obviously ordered, so it's not immediately obvious that one changeset supersedes another (as it is for tags: gold-standard-004 is obviously intended to supersede gold-standard-003). But that doesn't seem like a game changer to me. The other issue (requiring a new tag every time a new answer test is added) seems less straightforward. One might argue this is a feature since it forces the updating of the gold-standard once new tests are added, but the correct behavior doesn't seem obvious to me.
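Since the tag-pushing behavior came up: here is a small self-contained sketch (using throwaway repo and tag names) demonstrating the point that a plain `git push` transfers neither lightweight nor annotated tags, and that each tag has to be pushed explicitly:

```shell
# Demonstrate that a plain `git push` does not transfer tags.
# A throwaway bare repo stands in for the real remote.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q --bare remote.git
git init -q work
cd work
git -c user.email=a@b -c user.name=a commit --allow-empty -m init -q
git remote add origin ../remote.git
git tag gold-standard-001                                  # lightweight tag
git -c user.email=a@b -c user.name=a tag -a gold-standard-002 -m "annotated"
git push -q origin HEAD:main          # plain push: no tags are transferred
git ls-remote --tags origin           # prints nothing
git push -q origin gold-standard-001 gold-standard-002     # explicit tag push
git ls-remote --tags origin           # now lists both tags
```

If transferring all tags at once is acceptable, `git push --tags` (or `git push --follow-tags`, which sends annotated tags reachable from the pushed commits) avoids naming each tag individually.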
Since there's some activity, I wanted to briefly share some more recent thoughts I've had about this. But I'll circle back in a few days to give a more complete response. (I think you both raise some very interesting points that I want to address.)
That's what I remember too. And we discussed making a text file that listed gold-standard commit hashes (from oldest to newest). I actually prototyped this solution and I'm not so sure that this is the optimal choice. I'll try to circle back on Monday or so to provide some more details. My concerns are somewhat related to Greg's point about the ordering of the gold-standard and the fact that updating the gold-standard updates the answers to ALL answer-tests. Essentially, I can imagine a scenario where 2 unrelated PRs each update the gold-standard because they want to update answers to tests of unrelated functionality. If they each lightly touch common functionality, it's possible that the combination of PRs could silently break that common functionality (even if that common functionality has an answer-test of its own). I'll explain more details about this scenario in the follow-up message.
I didn't know about annotated tags. I like the fact that you can add information to them. The more I think about it, the more I can see value in preserving order information (i.e., gold-standard-004). There is still the matter of the procedure for updating them in a PR, but I'll stop here since it looks like Matthew has something coming.
I gave the "problem scenario" I was talking about more thought and came up with a simpler case. At this point, the problem that arises in this scenario doesn't occur because of any of the points discussed here. But I do think that going to a commit-based approach may make this scenario harder for a reviewer to catch. I discuss that down below. I'm coming around to Greg's argument that requiring a gold-standard update once new tests are added is a feature, since it requires the developer/reviewer to manually engage with updating the gold-standard. But maybe we can do something clever to automate part of the process? Maybe we could define a GitHub action to help? Or maybe we just introduce a simple python script that people can invoke...

Setting Up the Scenario

Consider 2 code-features A and B, which are mostly independent and are independently tested by answer-tests called test-A & test-B, but have some light coupling. (If it helps, one could imagine that A is a hydro-solver while B is something like gravity, which stands on its own, but introduces source terms that must be considered in the hydro-solver.) Suppose 2 developers simultaneously decide to modify features A and B in separate branches called update-A and update-B:
Let's further suppose that the changes from update-A get merged into the main branch first. Finally, let's assume that merging the main branch into update-B at this point WILL break feature A (and consequently cause test-A to start failing). This scenario is sketched out by the image hidden by this spoiler tag. (The nodes in the image represent discrete commits, while the checkmarks/x's denote the status of the answer test while using the most recent gold-standard.)

The Problem

If we ignore rebasing, there are essentially 3 control-flows in which changes from update-B get merged into the main branch and all of the answer-tests "pass".
While all of the gold-standard tests "pass" at the end of control-flows 2 & 3, feature A remains broken. This is obviously bad because the tests lead us to believe that it works.

Relevance of this Problem to this Discussion

Fundamentally, the problem that arises in this scenario boils down to the fact that updating the gold-standard updates the test answers for all answer-tests, regardless of whether we expect the test answers to change.
More than anything, this highlights the need for human intervention, particularly on the part of whoever is managing the review of a PR. I'm in favor of establishing and documenting a procedure for updating gold standards. My opinion as of now is to go with annotated tags and for the PR manager to do the gold standard update following the merge of the PR into main. This is simple enough that we probably don't need any scripts, just documentation. One thing we could do is ask the PR manager to list the relevant tests whose answers needed updating in a comment on the PR. Or, we could store a yaml file that lists the last tag/commit to change answers for each test. Just spitballing here.
I wanted to pick this conversation back up. A while back, @gregbryan and I discussed recording a gold-standard commit/tag on a per-test basis (or for a group of tests). I think Britton also highlighted this approach while spitballing. I think this would be the optimal approach, since it's much more explicit which test-answers are updated by a change (under such an approach we would be much less likely to sweep a broken test-answer into the gold-standard). Therefore, I think it's probably worth discussing this approach a little further (even if we don't go with it). I think there are a few points worth highlighting that dictate whether this is actually a viable approach. (For the sake of simplicity, let's assume that the gold-standards -- whether they're tags or commits -- are sequenced.)
Let me know your thoughts. When I started writing up this response, I was initially hesitant to take this approach (recording gold-standards on a per-test basis), but I'm actually inclined to go this route.
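To make the per-test bookkeeping concrete, here is one purely hypothetical sketch of what such a record could look like as a YAML file kept in the repo (the test names reuse the test-A/test-B examples from the scenario above; the keys and layout are illustrative, not a proposal for a specific schema):

```yaml
# Hypothetical per-test gold-standard record (illustrative only).
# Each entry names the most recent gold standard whose answers
# apply to that test (or group of tests).
gold_standards:
  test-A: gold-standard-004
  test-B: gold-standard-003
```

A PR that changes the answers for test-A would then only bump the test-A entry, making it explicit in the diff which answers moved.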
I just wanted to briefly highlight 2 warts with the answer-testing framework and ask if people with more experience with this style of testing have recommendations.
Procedure for updating the gold-standard tag
The primary point I wanted to raise is that the procedure for updating the gold-standard tag is unclear. I've now updated the gold-standard tag two times. Both times I have done the following after a PR gets approved that requires an update to the gold-standard tag:
1. I push an update to `.circleci/config.yml` that changes the gold-standard tag to a new number (that doesn't exist yet). Generally, one or more answer-tests was failing before this step. After this step, all answer tests will fail.
2. I then merge the PR (at this point the answer tests are still failing).
3. Then I pull the updated `main` branch locally, check it out on my machine, add a new lightweight tag to the most recent commit on the `main` branch (e.g. by calling `git tag gold-standard-003`), and then push the new tag to the main enzo-e repository (e.g. by calling `git push enzo-e gold-standard-003`).
4. Finally, I tell circleci to rerun the tests on the main branch.
I think I'm probably doing something wrong here. Is there a better way to do this? Should I be pushing the gold-standard tag to my branch before pulling in the PR? (I haven't done this because I haven't been sure whether the tag from my local branch would transfer to the main branch)
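For what it's worth, steps 3 and 4 above could be condensed into something like the following sketch, using an annotated tag (as discussed elsewhere in this thread) instead of a lightweight one. The remote name `enzo-e`, the tag number, and the messages are all illustrative, and a throwaway bare repo stands in for the real remote so the commands can run end to end:

```shell
# Sketch of the tag-update procedure with an annotated tag.
# A throwaway bare repo stands in for the real enzo-e remote.
set -e
tmp=$(mktemp -d)
git init -q --bare "$tmp/enzo-e.git"
git init -q "$tmp/work"
cd "$tmp/work"
git remote add enzo-e "$tmp/enzo-e.git"
git -c user.email=ci@example -c user.name=ci commit --allow-empty -m "merged PR" -q
# Annotated tags record a tagger, date, and message -- useful context
# for why the gold standard moved.
git -c user.email=ci@example -c user.name=ci tag -a gold-standard-003 \
    -m "update gold standard after merging answer-test changes"
git push -q enzo-e HEAD:main
git push -q enzo-e gold-standard-003   # tags must be pushed explicitly
git tag -l -n1 gold-standard-003       # shows the tag with its message
```

The tag message is a natural place to note which PR (and which tests' answers) motivated the update.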
Tag update required every time a new PR introduces new answer tests
Currently, when circleci generates the "reference answers" for the gold-standard, it uses the answer test machinery that was shipped with that version of Enzo-E. That means that it will not run newly introduced answer tests. Is this what enzo-classic does? Or should we change things so that we are always using the latest version of the answer-test machinery when generating the "reference-answers"?