
Warts of the Answer Testing Framework #307

Open

mabruzzo opened this issue May 11, 2023 · 7 comments
@mabruzzo
Contributor

mabruzzo commented May 11, 2023

I just wanted to briefly highlight two warts of the answer-testing framework and ask whether people with more experience with this style of testing have recommendations.

Procedure for updating the gold-standard tag

The primary point I wanted to raise is that the procedure for updating the gold-standard tests is unclear. I've now updated the gold-standard tag twice. Both times, after a PR requiring a gold-standard update was approved, I did the following:

  1. I push an update to .circleci/config.yml that changes the gold-standard tag to a new number (one that doesn't exist yet). Generally, one or more answer tests were failing before this step; after it, all answer tests will fail.

  2. I then merge the PR (at this point the answer tests are still failing).

  3. Then I pull the updated main branch locally, check it out on my machine, add a new lightweight tag to the most recent commit on the main branch (e.g. by calling git tag gold-standard-003), and push the new tag to the main enzo-e repository (e.g. by calling git push enzo-e gold-standard-003).

  4. Finally, I tell CircleCI to rerun the tests on the main branch.
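Condensed into shell commands, steps 3–4 look something like this (assuming, as above, that the remote for the main enzo-e repository is named enzo-e and the new tag is gold-standard-003):

```sh
# after the PR has been merged (steps 3-4 above)
git checkout main
git pull enzo-e main                # grab the freshly merged PR
git tag gold-standard-003           # lightweight tag on the merge commit
git push enzo-e gold-standard-003   # a plain "git push" does not send tags
# then re-run the CircleCI workflow on the main branch
```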

I think I'm probably doing something wrong here. Is there a better way to do this? Should I be pushing the gold-standard tag to my branch before pulling in the PR? (I haven't done this because I wasn't sure whether a tag on my local branch would transfer to the main branch.)

Tag update required every time a new PR introduces new answer tests

Currently, when CircleCI generates the "reference answers" for the gold standard, it uses the answer-test machinery that shipped with that version of Enzo-E. That means it will not run newly introduced answer tests. Is this what enzo-classic does? Or should we change things so that we always use the latest version of the answer-test machinery when generating the reference answers?

@brittonsmith
Contributor

We had a discussion about this back in San Diego. I thought I would write down some of the conclusions we reached (to my memory, at least), as I'm now looking at PR #318.

I think we concluded that moving from tracking the gold standard with a git tag to tracking it with a changeset was the way to go, as it should basically solve the first issue mentioned here.

In enzo-dev, we run the tests straight out of the repo, so it does require updating the gold standard when new tests are added. If we don't want to do that, I can see a couple of options (a rough sketch of the first follows the list).

  1. Make a copy of the testing directory to run from on circleci.
  2. Make the test framework a separate repository that exists inside the main repo as a submodule.
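To make option 1 concrete, here is a rough sketch of what it might look like in the CI script (the directory and script names here are hypothetical placeholders, not our actual layout):

```sh
# option 1, sketched: save a copy of the *current* testing machinery,
# then build the reference answers from the gold-standard checkout
# (directory and script names are hypothetical)
cp -r test /tmp/latest-testing
git checkout gold-standard-003
# ... build Enzo-E at the gold standard and generate reference answers ...
# finally, run the latest tests from the copy against those answers
/tmp/latest-testing/run_answer_tests.sh --reference /tmp/reference-answers
```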

@gregbryan
Contributor

One question from a non-expert: I always thought that a tag was just a convenient way to name a changeset, but it seems that pushing and pulling git tags is not really straightforward. In particular, "git push" doesn't transfer the tags. Is this true for both lightweight and annotated tags? Assuming that's right, it seems like we might as well just use the changeset, as Britton suggests. The only downside I see is that changesets are not obviously ordered, so it's not immediately obvious that one changeset supersedes another (as it is for tags: gold-standard-004 is obviously intended to supersede gold-standard-003). But that doesn't seem like a game changer to me.
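(For reference, the behavior is the same for both kinds of tags: neither lightweight nor annotated tags are sent by a plain "git push", so both must be pushed explicitly:

```sh
git tag gold-standard-004                       # a lightweight tag, or...
git tag -a gold-standard-004 -m "new answers"   # ...an annotated one
git push origin gold-standard-004               # push a single tag
git push origin --tags                          # or push all local tags
```
)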

The other issue (requiring a new tag every time a new answer test is added) seems less straightforward. One might argue this is a feature, since it forces the gold standard to be updated once new tests are added, but the correct behavior doesn't seem obvious to me.

@mabruzzo
Contributor Author

Since there's some activity, I wanted to briefly share some more recent thoughts I've had about this. I'll circle back in a few days to give a more complete response (I think you both raise some very interesting points that I want to address).

I think we concluded that moving from tracking the gold standard with a git tag to tracking it with a changeset was the way to go, as it should basically solve the first issue mentioned here.

That's what I remember too. And we discussed making a text file that listed gold-standard commit hashes (from oldest to newest). I actually prototyped this solution, and I'm not so sure it's the optimal choice. I'll try to circle back on Monday or so to provide some more details.
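For reference, the file we discussed would look something like this (the file name and layout here are hypothetical):

```
# gold-standards.txt (hypothetical): one commit hash per line,
# ordered oldest to newest; CI would treat the last entry as current
<hash-of-gold-standard-001-commit>
<hash-of-gold-standard-002-commit>
<hash-of-gold-standard-003-commit>
```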

My concerns are somewhat related to Greg's point about the ordering of the gold standards and the fact that updating the gold standard updates the answers to ALL answer tests. Essentially, I can imagine a scenario where two unrelated PRs each update the gold standard because they want to update answers to tests of unrelated functionality. If they each lightly touch common functionality, it's possible that the combination of PRs could silently break that common functionality (even if that common functionality has an answer test of its own). I'll explain the details of this scenario in the follow-up message.

@brittonsmith
Contributor

I didn't know about annotated tags. I like the fact that you can add information to them. The more I think about it, the more I can see value in preserving order information (i.e., gold-standard-004). There is still the matter of the procedure for updating them in a PR, but I'll stop here since it looks like Matthew has something coming.

@mabruzzo
Contributor Author

mabruzzo commented Aug 3, 2023

I gave the "problem scenario" I mentioned more thought and came up with a simpler case. As it stands, the problem that arises in this scenario isn't caused by any of the points discussed here. But I do think that moving to a commit-based approach may make this scenario harder for a reviewer to catch. I discuss that below.

I'm coming around to Greg's argument that requiring a gold-standard update whenever new tests are added is a feature, since it forces the developer/reviewer to manually engage with updating the gold standard. But maybe we can do something clever to automate part of the process? Maybe we could define a GitHub action to help? Or maybe we just introduce a simple python script that people can invoke...

Setting Up the Scenario

Consider two code features A and B, which are mostly independent and are independently tested by answer tests called test-A and test-B, but which have some light coupling. (If it helps, one could imagine that A is a hydro solver while B is something like gravity, which stands on its own but introduces source terms that must be considered in the hydro solver.)

Suppose two developers simultaneously decide to modify features A and B in separate branches called update-A and update-B:

  • The modifications in update-A do NOT require any changes to any gold-standard answers.
  • The modifications in update-B DO require an (expected) update to the gold-standard answer for test-B.

Let’s further suppose that the changes from update-A get merged into the main branch first. Finally, let’s assume that merging the main branch into update-B at this point WILL break feature A (and consequently cause test-A to start failing).

This scenario is sketched in the figure below. (The nodes in the image represent discrete commits, while the checkmarks/x’s denote the status of the answer tests against the most recent gold standard.)

[figure: scenarios_part1]

The Problem

If we ignore rebasing, there are essentially three control flows in which the changes from update-B get merged into the main branch with all of the answer tests "passing".

  1. Consider the “proper” sequence of events (illustrated below as CONTROL FLOW 1):

    1. First, the latest version of the main branch is merged into the update-B branch (test-A fails)
    2. Next, another commit is introduced to the update-B branch fixing the problems in feature A (causing test-A to pass again)
    3. Finally, the update-B branch is merged into the main branch and the gold-standard is updated (all tests now "pass")
  2. Now let’s consider an incorrect sequence of events (illustrated below as CONTROL FLOW 2):

    1. First, the latest version of the main branch is merged into the update-B branch (test-A fails)
    2. Next, the update-B branch is merged into the main branch and the gold standard is updated (all tests now "pass").
  3. Alternatively, if there aren’t merge conflicts, the update-B branch can be merged directly into the main branch and the gold standard updated. (This isn’t illustrated, since the outcome is identical to CONTROL FLOW 2.)

While all of the gold-standard tests "pass" at the end of control flows 2 and 3, feature A remains broken. This is obviously bad, because the tests lead us to believe that it works.

[figure: scenarios_part2 — sketch of the subsequent control flows]

Relevance of this Problem to this discussion

Fundamentally, the problem that arises in this scenario boils down to the fact that updating the gold-standard updates the test answers for all answer-tests, regardless of whether we expect the test answers to change.

  • I don’t really see a practical way to design the test suite to prevent this problem. The true way to solve it is to have each test track its own gold standard, but that doesn’t seem particularly practical from a CI perspective (you would need to build multiple versions of the code to construct all of the test answers).

  • Consequently, the responsibility falls on the developers and reviewers to avoid these scenarios. While this scenario can arise whether we track the gold standard with tags or with a list of commit hashes, I suspect the latter would make it easier for this sort of situation to occur (e.g. if a new developer updates the gold standard too quickly).

@brittonsmith
Contributor

More than anything, this highlights the need for human intervention, particularly on the part of whoever is managing the review of a PR. I'm in favor of establishing and documenting a procedure for updating gold standards. My opinion as of now is to go with annotated tags, and for the PR manager to do the gold-standard update following the merge of the PR into main. This is simple enough that we probably don't need any scripts, just documentation.

One thing we could do is ask the PR manager to list, in a comment on the PR, the relevant tests whose answers needed updating. Or, we could store a yaml file that lists the last tag/commit to change answers for each test. Just spitballing here.
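To illustrate the second idea, such a file might look something like this (the file name is a placeholder, and the test names are borrowed from Matthew's scenario above):

```yaml
# hypothetical answer_test_versions.yaml: the last gold standard whose
# merge changed the reference answers for each answer test
test-A: gold-standard-002
test-B: gold-standard-004
```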

@mabruzzo
Contributor Author

mabruzzo commented Oct 5, 2023

I wanted to pick this conversation back up.

A while back, @gregbryan and I discussed recording a gold-standard commit/tag on a per-test basis (or for a group of tests). I think Britton also highlighted this approach while spitballing. I think this would be the optimal approach, since it makes it much more explicit which test answers are updated by a change (under such an approach, we would be much less likely to sweep a broken test answer into the gold standard). Therefore, I think it's probably worth discussing this approach a little further (even if we don't go with it).

I think there are a few points worth highlighting that dictate whether this is actually a viable approach. (For the sake of simplicity, let's assume that the gold standards -- whether they're tags or commits -- are sequenced.)

  1. Most importantly: how do we have CI build the test answers for a given code configuration? If we assume that our gold-standard tags/commits are sequenced, then there are essentially three main approaches:

    1. most robust solution: build the answers from the gold standard associated with each test.

      • Unfortunately, this is intractable with the current CI setup.
      • This would be easier if we had control over the hardware where CI runs. We could then cache the gold-standard answers to our tests (either on the hardware where the tests are run OR in a separate git repository).
      • It might be worth looking into pricing for this... maybe it would be doable with AWS or something on free credits.
      • Alternatively, we could increase the tolerances on our gold-standard testing to make test answers valid across machines/compilers (in this case, we could cache the answers in a separate git repository).
    2. simple solution: build all test answers from the latest gold-standard. This is essentially what we are currently doing; unfortunately, it's vulnerable to the same failure modes.

    3. compromise: for a given code configuration (i.e. choice of precision), we essentially run the answer tests twice, with two different configurations. First, check the tests in the current changeset against the latest gold standard. Then, confirm that answers to tests don't unexpectedly change between the most recent gold standard and the previous one (obviously, we'd only compare answers that shouldn't change). A rough sketch of this appears after this list.

      • By ensuring in every PR that answers to tests don't unexpectedly change between the latest and previous gold standards, I think you can prove by induction that the answers to tests haven't changed since the associated gold-standard tag. (This assumes there is at most one update to the gold standard in each PR, which we could probably enforce.)

      • I was initially hesitant to take this approach because I was a little concerned about how exactly we would handle hard-to-anticipate scenarios where tests start unexpectedly failing (maybe from a race condition OR a dependency update). But I think the solution is probably simple: just update the associated gold standard or rewrite the tests (while it's not elegant, it's what we would do now).

      • We could make things more efficient by automatically detecting whether the gold standard changes in a given PR. If it doesn't change, we can probably skip the consistency check between the latest and previous gold standards... (but this is an optimization).

  2. It might be tedious to handle cases where a change (e.g. changing the precision of the timestep from double to enzo_float) requires the gold standard to be updated for all answer tests. But I'm not that concerned; we could probably do something with search-and-replace...

  3. How do we record the gold-standard commit/tag associated with a given test?

    • I was initially tempted to store this info within the pytest scripts (maybe as a class-attribute)

    • But, I think Britton's suggestion is probably better:

      Or, we could store a yaml file that lists the last tag/commit to change answers for each test.

      @brittonsmith is this approach similar to what yt does?
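As promised under the compromise option above, here is a very rough sketch of that double check as CI shell steps. (All of the helper-script names here are hypothetical; CIRCLE_SHA1 is CircleCI's built-in variable for the commit under test.)

```sh
# compromise approach (1.iii), sketched; helper scripts are hypothetical
LATEST=gold-standard-004      # most recent gold standard
PREVIOUS=gold-standard-003    # the one it superseded

# (a) check the current changeset against answers built from the
#     latest gold standard
git checkout "$LATEST"
./build_and_generate_answers.sh reference-latest
git checkout "$CIRCLE_SHA1"
./run_answer_tests.sh --reference reference-latest

# (b) consistency check: answers that should NOT have changed between
#     the previous and latest gold standards must agree
git checkout "$PREVIOUS"
./build_and_generate_answers.sh reference-previous
./compare_answers.sh reference-previous reference-latest --only-unchanged-tests
```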


Let me know your thoughts. When I started writing up this response, I was hesitant about this approach (recording gold standards on a per-test basis), but I'm now actually inclined to go this route.
