Verify correctness of Taskcluster PR checks #13194
Forgot to mention the goals:
I changed the title to clarify that this issue is mostly about PR checks on Taskcluster. Separately, I'm verifying the master builds on Taskcluster against BuildBot.
@jugglinmike does the script for this exist yet? I'm sort of eager to turn off ci_stability.py in Travis :)
The script is available in a standalone project which includes some documentation on installation and running: https://github.com/bocoup/wpt-validate-taskcluster

The initial results were a little shaky, so I reported them to the Taskcluster maintainers. They got back to me yesterday. The biggest takeaway is that many of those "shaky" results were due to stragglers--pull requests that were based on a commit that pre-dated our change to enable Taskcluster. A primitive way to keep that problem from influencing the data is to limit the query to the past 3 weeks. That also limits the data, but we'll still have close to 500 commits to work with.

All discrepancies
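For context, a hedged sketch of how such a time-boxed query could be expressed against GitHub's search API (the exact query the validation script uses is an assumption; the dates match the window reported later in this thread):

```bash
# Count the wpt pull requests opened in the three-week window under study.
# The search-qualifier syntax is GitHub's; jq is used only to extract the count.
curl -s "https://api.github.com/search/issues?q=repo:web-platform-tests/wpt+is:pr+created:2018-10-15..2018-11-06" \
  | jq '.total_count'
```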
In summary:
In other words, of the commits which were validated by TravisCI and/or Taskcluster, 17% had discrepancies between the two systems. I've classified those 81 discrepancies, but doing so involved some additional interpretation, so bear with me:
✅ TravisCI missed the commit (1 commit)
❌ GitHub missed the results (2 commits)
❌ Taskcluster timed out (5 commits)
✅ Missing the "event filtering" fix (4 commits)
✅ Closed before Taskcluster completed (4 commits)
✅ Taskcluster not enabled on base branch (3 commits)
❌ Taskcluster missed the commit (26 commits)
✅ Stability discrepancy - Taskcluster **found** instability that TravisCI missed (26 commits)
✅ Stability discrepancy - Taskcluster **missed** instability that TravisCI found (10 commits)
So if you buy my hand-wavy excuses, only 33 commits over the past three weeks (i.e. 7%) are concerning.

Next steps

7% seems pretty good, but we probably want to do better than that.

"GitHub missed the results": This is rare but still worth understanding. It didn't come up in my original report to the Taskcluster team, so I've updated that bug with some information along those lines.

"Taskcluster timed out": We set our own limits for maximum build duration. We can certainly set it higher, but I personally don't have a sense for what value is appropriate.

"Taskcluster missed the commit": These are the most concerning, but they're also the hardest to investigate. Missed commits were my primary motivation for bugging the Taskcluster maintainers, since I assumed they would have access to additional logging information. Their feedback has already helped us understand some of the discrepancies, but they're not done! They classified 3 of the commits I originally reported as "mysterious." They're improving their logging and continuing the investigation, so we could stand to wait a little longer to see if that turns up any new information. (A sketch for spotting such missed commits follows the scripts below.)

Script used to identify "stragglers" (for enabling validation)

#!/bin/bash
pr_validation_enabled=1ede22e333195e47e13260a58ce89ef3220c19c7
values='
13907,af0b3812c550cc3fabd7c39da5f057a0d4a94556
13906,1937a04bf6b3c183d0fcd53e552fe9ecb969119b
13905,ff33b7ebc5772eed2bbe415f832cb9173cebb8f5
13904,92297ec317f6be7222ccca4a095e345e8cb9dd14
13886,34a8296fc37dc68681334aaded3f5609eee5b555
13884,bf194907e3d1f2676bc0314bc2dc1020b8fefe70
13862,fc92447525ca55ea8f21d26f327f541d5d169025
13842,38deacc55a3424561011ba5e981c30ddf449758b
13842,b9a47dc778f80c237cbbdede6243b29b470e2948
13823,5bf797d1753108033e6f8694cefee901ce240ac4
13819,1a925d7b1c2eaf33db23cbde57ccd12c8b49629c
13781,0a36538808bea9c77624a43219091880d908a578
13781,7548d3c76a9e8c05573dcb9898eb0911d851d971
13748,3c50079362e3dc0855fafd30b5e1d66056910e0d
13744,6d259c91b1498f145ed6fbbf6795fd46ba17e007
13738,874b6a0c4247acf57f4fd8d4ba8f75fc3766b819
13681,a0438c4b7bf34e22dc2dacd34c20f160a15fa55e
13677,18477d46836c59d24aa40259822af622f367264d
13664,6bf7422c8aaae6c86d6d4b65a2e4ba8dcb686c25
13627,a5199e740165877f316430a1fa67e0b8bdde7d88
13616,e28b31f5671fa68392b47dd33895bf3fb34f1c6a
13616,a823191597d9aca7b96091bd66c88f007ef711ef
13616,ed322df2d11feec5946baa9fa115cd60118021dc
13616,659a97e7c1f7ca640ebfdc147c514825f83e56bf
13616,2e263dfb07b25a88e66d76994e8d04da1786a4a8
13606,267f7dcb3019ca3a4483bc81de7270e37e0d7421
13589,267f7dcb3019ca3a4483bc81de7270e37e0d7421
13556,40dfbac0dd1d7d75cf06665430b41b810dcf8dfe
13518,78f5a144ac27066228d688a347608b0775c12d1c
'
for value in ${values}; do
  # Each entry is "PR number,head SHA"
  pr_number=$(echo ${value} | cut -d ',' -f 1)
  sha=$(echo ${value} | cut -d ',' -f 2)
  # Fetch the pull request head so the SHA is available locally
  git fetch upstream refs/pull/${pr_number}/head > /dev/null 2>&1
  # A PR is a "straggler" if its head does not contain the commit that
  # enabled Taskcluster PR validation
  if ! git merge-base --is-ancestor ${pr_validation_enabled} ${sha}; then
    echo Pull request based on outdated master: https://github.com/web-platform-tests/wpt/pull/${pr_number} $sha
  fi
done

Script used to identify "stragglers" (for filtering events)

#!/bin/bash
events_filtered=4f2038853851cb7c58986bc0c93478efa9ce1a62
values='
13827,c0592a41661e70dde422e5626d21375fc578e3c3
13767,995cdc7ad9f5beeaf35fc0ea01e52e5417feca56
13603,f01a6815810145bbef6f41b9467bb00a750e09b8
13566,2ee886e56916352276a7674ebcdaf8f8dbd561e5
13550,dffb31352e8110fd65d7ce247e7fcafe4326ad5f
13548,ade02ef5c6f917d6d1d491f62304804704a040fc
13517,7495dc272f0abe8c610f3376d776d388d784dba0
13511,2d873920755ce9ccbd07bfc7ea2b9cafbdf9f193
'
for value in ${values}; do
  pr_number=$(echo ${value} | cut -d ',' -f 1)
  sha=$(echo ${value} | cut -d ',' -f 2)
  git fetch upstream refs/pull/${pr_number}/head > /dev/null 2>&1
  # A PR is a "straggler" here if its head does not contain the event-filtering fix
  if ! git merge-base --is-ancestor ${events_filtered} ${sha}; then
    echo Pull request based on outdated master: https://github.com/web-platform-tests/wpt/pull/${pr_number} $sha
  fi
done
@jugglinmike, do you think your conclusions support #14033, or at least reversing the "dry run" vs. "for realz" roles of Travis vs. Taskcluster as suggested in #14033 (comment)?
There are a few unknowns that make me reluctant to answer:
I was content to wait for an answer to the former, but we can answer the latter ourselves, and that may be all we need to know.

...which made me realize my report above made a faulty assumption: that when TravisCI fails, it is always for legitimate reasons. We know that's not true, so I've updated the script to report build errors distinctly from stability failures. (A sketch of one way to draw that distinction follows the list below.) For the same set of commits (i.e. those tested by TravisCI and/or Taskcluster between 2018-10-15 and 2018-11-06), 40 produced build errors. I've included them in a new version of the report.

When judging the correctness of the stability job, we've been using TravisCI as our source of truth. To err on the side of caution, I've removed the relevant commits from the results regarding "stability discrepancies". However, I've kept those commits in the other results (e.g. "Taskcluster missed the commit") because they are an issue regardless of the corresponding behavior in TravisCI.

✅ TravisCI reported a build error (40 commits)
✅ TravisCI missed the commit (1 commit)
❌ GitHub missed the results (2 commits)
❌ Taskcluster timed out (5 commits)
✅ Missing the "event filtering" fix (4 commits)
✅ Closed before Taskcluster completed (4 commits)
✅ Taskcluster not enabled on base branch (3 commits)
❌ Taskcluster missed the commit (26 commits)
✅ Stability discrepancy - Taskcluster **found** instability that TravisCI missed (19 commits -- down from 26)
✅ Stability discrepancy - Taskcluster **missed** instability that TravisCI found (5 commits -- down from 10)
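For reference, a hedged sketch of one way to separate build errors from stability failures, as referenced above: TravisCI reports errored builds with the commit-status state "error" and failed builds with "failure", so the states attached to a commit can be inspected directly (the context filter below is an assumption):

```bash
# List the state of each TravisCI status on a commit: "error" indicates a
# build error, "failure" indicates a (stability) failure.
sha="$1"
curl -s "https://api.github.com/repos/web-platform-tests/wpt/commits/${sha}/status" \
  | jq -r '.statuses[]
           | select(.context | startswith("continuous-integration/travis-ci"))
           | "\(.context)\t\(.state)"'
```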
This means:
Keep in mind that the failure mode is substantially different between the two systems. When TravisCI fails, it does so with a discoverable log file (good) that takes a very long time to produce (bad). When Taskcluster fails, it most commonly does so silently (bad) but in a way that is immediately apparent (good). Contributors can recover from Taskcluster failures by pushing empty commits to their feature branch, but they have no recourse for TravisCI failures.

That doesn't sway me, though. Not only is a silent CI failure a poor experience for contributors, it's a risk to the health of the project. Reviewers may not recognize that one of the project's "checks" is not running, but they will be able to merge as soon as the others have passed. This will be more concerning when/if we move more work to Taskcluster (e.g. linting, infrastructure tests).

At this point, it doesn't look like Taskcluster is better or worse than TravisCI. "Differently Concerning" happens to be the title of my debut LP, but it also seems fitting here. Personally, I think we need a more significant reduction in error rate to justify the new risk.
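As a minor practical note on the recovery path mentioned above, re-triggering CI with an empty commit looks something like this (the branch name is illustrative):

```bash
# Push an empty commit to the feature branch to re-trigger the CI checks.
git commit --allow-empty -m "Re-trigger CI"
git push origin my-feature-branch
```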
Taskcluster missed 3 commits over the past seven days:
This was over 112 commits, so it represents a 3% failure rate. The sample size is smaller, though, and I don't have the math to determine statistical significance. Still, it's encouraging, especially because shortly after I reported those results last week, the Taskcluster team improved their internal logging. I've submitted a new bug to the Taskcluster team--fingers crossed that we'll get some more clarity from the new data.
@jugglinmike other than the 3% failure rate, which I think we can tolerate if it's being improved, is there anything else that would favor continuing with Travis instead of making the Taskcluster stability runs blocking?
With #7660 closed, this task should also include finally removing the stability jobs from Travis entirely, which should free up some capacity.
3% only describes the performance over the past seven days. We have no reason to expect any improvements yet, so a more accurate picture should include the previously-reported data. That indicates a 6% failure rate.
I'm not convinced that silently allowing 6% of commits to go unverified is acceptable. The acceptance of gh-14096 shows a difference of opinion. I was hoping to get some clarity around this from the new project governance (it's why I requested a status update earlier this week), but it looks like we need to move on this before that's active.
I think we can find out what happens if a required status check is missing; if that blocks landing, then we don't get 5±2% of PRs landing whilst missing status checks, but we do need action in those cases. I think that's comparable to the rate of PRs that need admin access due to Travis having errors that Taskcluster won't have (and in almost all cases the admin action for Travis problems has been "merge and assume things are fine", so we already have some percentage of PRs missing status checks).
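For anyone who wants to verify the current configuration, the checks that are required on master can be inspected through GitHub's branch-protection API (a hedged sketch; it requires a token with admin access to the repository):

```bash
# List the status checks currently required on the master branch.
curl -s -H "Authorization: token ${GITHUB_TOKEN}" \
  "https://api.github.com/repos/web-platform-tests/wpt/branches/master/protection/required_status_checks"
```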
In my view, #7660 has been a low-level emergency since I filed it over a year ago. I've had to work around it and apologize about it in blocked PRs over and over and over again. It has occasionally detected flaky tests--I haven't counted them or kept a long list--but my experience has certainly been more occurrences of trouble than catches of real problems. Given that, as long as Taskcluster isn't much worse than Travis at catching real problems, it seems like a great trade. I'm glad I got two people agreeing on #14096 :)
I think the difference here is how valuable we're guessing the stability checks have been. I have felt them to be a net negative for the past year but knew I couldn't get away with just disabling them. Assuming that flakiness is detected on something like 0.1% or 1% of PRs, if we did miss 6% of those it wouldn't end up causing much trouble in practice--94% of the trouble would still be averted! There's also the empirical question of whether making a check required blocks the PR if the check never starts. Hopefully we'll learn the answer very soon. If it doesn't block, I guess we should pester both the Taskcluster team to react to the webhooks more reliably, and GitHub to block PRs if expected required checks haven't started.
I agree that if the Checks API requires a response, then the behavior we're seeing from Taskcluster is far more acceptable. And I recognize that I have not had to help folks understand irrelevant TravisCI failures, so I probably underestimate the value of disabling those jobs.
@foolip I've seen that we once did those stability tests using Sauce Labs. Is there any value in bringing this back now? I can provide the necessary account capabilities so we have enough concurrency to run these tests efficiently.
@christian-bromann unfortunately we had to remove that when JWT was deprecated, see #9903. The trouble we'd face if trying to add it back is that we can't put any secrets in our CI setup, because they can be extracted by anyone with write access to the repo. Or, to put it differently, however we authenticate, we have to be OK with that credential being available to hundreds of people who could extract and use it elsewhere. I have no reason to think anyone would, but we'd have to accept the risk.
Also, managing our own infrastructure has generally worked better and been more flexible; our needs are not always straightforward and it's useful to be able to run arbitrary code rather than just having WebDriver access. With a combination of TaskCluster and Azure we covered all our requirements except Edge. If we could make Edge work on Sauce without the problems @foolip mentioned, that would be helpful, although I expect we would still prefer to move Edge to another system (likely Azure) that gives us OS-level access, if we can get Microsoft to arrange cloud access to Windows 10.
…is and report them on Taskcluster, a=testonly

Automatic update from web-platform-tests
Allow stability-checker failures on Travis and report them on Taskcluster (#14096)

Fixes web-platform-tests/wpt#7660. web-platform-tests/wpt#13194 remains open to verify that Taskcluster isn't failing to detect flakiness that Travis would.
--
wpt-commits: bcd2471a16325fada550e05dad2b17895a232c6a
wpt-pr: 14096
I've filed #14165; it turns out missing required checks are blocking, and Taskcluster isn't reliably starting.
gh-14096 changes the way Taskcluster statuses should be interpreted. I've updated the script to account for this and run it over the period since my previous report (from 2018-11-15 to 2018-11-21).

Full listing of discrepancies
Summary
Taskcluster missed 13 of the 111 commits which were validated by Taskcluster. While a failure rate of 12% is much higher than our previous estimate of 6%, bear in mind that our sample size continues to shrink.
I spoke too soon. The following commits were validated correctly:
Further, one of the missing results is due to a commit that was force-pushed about an hour ago, so the pending task is not a concern:
This makes for a failure rate of 6% over the past 6 days, which is in line with our expectations. I've updated the Taskcluster bug that I opened last week with the list of new misses. (Verifying Taskcluster results for commits created in this time frame is more complicated than it might seem. We can't use a simple time-based heuristic because Taskcluster's behavior is determined by the state of the …)
Closing this; we've already switched to Taskcluster. A number of issues have already been found and many fixed, and we should file more if we see things being missed.
In gh-12657, we configured TaskCluster to validate patches submitted via pull request on GitHub.com. Typically, failing validation jobs prevent reviewers from merging pull requests. The new TaskCluster integration was unproven, and we recognized that it might have bugs. To avoid interrupting ongoing work with spuriously-failing jobs, we configured the integration to allow failures.

TravisCI has been running equivalent jobs for far longer; we kept those jobs running so that we would continue to receive notifications about errors in pull requests.
The TaskCluster integration has been running for over a week, so we should have enough information to judge its correctness.
Collect the results of the TaskCluster tasks and TravisCI "pull request" jobs since gh-12657, and compare them. Note that due to subtle differences in the tasks, a certain kind of discrepancy is acceptable.