Show history of a specific test #1495
Open
foolip opened this issue Sep 18, 2019 · 9 comments

foolip (Member) commented Sep 18, 2019

If a test is failing, or seems to be flaky, then seeing the history of a test going far back in time would be very useful. Only transitions are really interesting, so visually you'd want to collapse lots of runs that have the same results.

This would be extra important if we can upload results of manual tests, where one might not even know where the most recent run can be found; looking at the history starting from a MISSING result could be helpful.

Aside: following renames would be nice, to not create a disincentive to renaming.

foolip (Member, Author) commented Feb 21, 2022

This was incredibly painful when filing web-platform-tests/wpt#32925. We have a "show history" button on wpt.fyi, but it doesn't really work. I ended up having to tweak a command-line script to output run IDs and hunt down where a test first began failing.

What I would like is something akin to GitHub's "show history" listing, but for test results rather than the test content. This could be a single list of changes, which could be filtered to a specific browser.

One challenge is what to do with flaky tests and whether it will be necessary to detect and collapse flakiness. This is a nice-to-have, though; a history view would be useful without it.

foolip added a commit that referenced this issue Feb 21, 2022
This changes the overall scores. Chrome/Firefox/Safari scores were
72/74/72 before, and are now 71/74/73. This is because dialog and forms
scores have changed.

For dialog, it's because two tests have been dropped:
web-platform-tests/wpt-metadata#2519

For forms, it's because of a flaky test:
#1495
gsnedders (Member) commented

By way of comparison, https://results.webkit.org/?suite=layout-tests&test=imported%2Fw3c%2Fweb-platform-tests%2Fhtml%2Fsemantics%2Fforms%2Fthe-input-element%2Fshow-picker-cross-origin-iframe.html

In some ways, the current history view is much too small to be particularly useful. A larger view, with more history, would probably help?

foolip (Member, Author) commented Feb 22, 2022

That's pretty nice! And it only makes 29 requests, unlike wpt.fyi which has to make hundreds of requests, basically one per red/green square. That's because we don't have the data in the right form to make this efficient.

To solve the use cases I usually have, a view would have to go back at least a few months. That's hundreds of runs in common cases, so a big grid with one cell per run might be too much. But maybe it would be good enough as a start, even without trying to collapse long runs of the same result.

gsnedders (Member) commented

> That's pretty nice! And it only makes 29 requests, unlike wpt.fyi which has to make hundreds of requests, basically one per red/green square. That's because we don't have the data in the right form to make this efficient.

Right, any sane approach here requires server-side support. Thankfully, we also control the server-side. 🙃

foolip (Member, Author) commented Feb 23, 2022

I was thinking about what the storage here has to look like. One way to represent all of our results would be in a giant denormalized database with many billions of rows, one per test/subtest result. (Ecosystem Infra investigated that and found it wouldn't be fast enough to query with any db we tried.)

One could transpose that database to be able to query all the information about a specific test/subtest for all runs, but that would be an equally gigantic database.

So is there a way to exploit the massive duplication (similarity) of results over time, similar to wpt-results? I think that would have to be a graph of test paths (similar to the MANIFEST.json trie) where the leaves represent a set of (run_id, status) pairs in some clever way.

What is that clever way? The first idea that comes to mind is to map run_ids to consecutive integers, in the order that would produce the longest runs of the same status. Then the (run_id, status) pairs could be turned into (run_id_range, status) pairs, a kind of run-length encoding. Flaky tests would make this data bigger.

Every time I think about these things I figure there must be off-the-shelf solutions that are better since our use cases aren't that unique, but I'm not clever enough to find them.
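
As an illustration of the run-length-encoding idea above, here is a minimal sketch in Python. It assumes results arrive as (run_id, status) pairs for one test/browser pair and that a separate mapping assigns each run_id a consecutive sequence number; all names here are hypothetical, not an existing wpt.fyi API.

```python
from itertools import groupby

def encode_history(results, run_seq):
    """Collapse per-run statuses into (first_seq, last_seq, status) ranges.

    results: iterable of (run_id, status) for one test/browser pair.
    run_seq: dict mapping run_id -> consecutive integer (ordered by run date).
    """
    # Order results by sequence number so identical statuses become adjacent runs.
    ordered = sorted((run_seq[run_id], status) for run_id, status in results)
    ranges = []
    for status, group in groupby(ordered, key=lambda pair: pair[1]):
        seqs = [seq for seq, _ in group]
        ranges.append((seqs[0], seqs[-1], status))
    return ranges

# A stable test collapses to one range; a flaky one produces more rows.
run_seq = {"r100": 0, "r101": 1, "r102": 2, "r103": 3}
results = [("r100", "PASS"), ("r101", "PASS"), ("r102", "FAIL"), ("r103", "FAIL")]
print(encode_history(results, run_seq))  # [(0, 1, 'PASS'), (2, 3, 'FAIL')]
```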

tabatkins commented

Yeah, so long as we maintain a mapping from run-id to an incrementing integer ordered by date of generation, we can definitely collapse long runs of data into a range, so we're only storing one entry per change in status for a given subtest/browser pair, which for most tests will mean just a handful of rows. I don't think we even need to be clever about the mapping; we can just make a list.

This, by itself, will drop the data size by a factor of thousands. We should just try this out and see if it's sufficient for our needs before we try getting cleverer.
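
To show how a history view could read such data back, here is a hypothetical lookup over the (first_seq, last_seq, status) rows from the earlier sketch, using binary search. The row format and function names are assumptions for illustration only.

```python
import bisect

def status_at(ranges, seq):
    """Return the status recorded for sequence number `seq`, or None.

    ranges: list of (first_seq, last_seq, status), sorted and non-overlapping.
    """
    # Find the last range whose first_seq is <= seq.
    starts = [first for first, _, _ in ranges]
    i = bisect.bisect_right(starts, seq) - 1
    if i >= 0 and ranges[i][0] <= seq <= ranges[i][1]:
        return ranges[i][2]
    return None  # no result recorded for that run

ranges = [(0, 41, "PASS"), (42, 57, "FAIL"), (58, 120, "PASS")]
print(status_at(ranges, 50))   # FAIL
print(status_at(ranges, 121))  # None
```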

luser commented Feb 24, 2022

Mozilla's Treeherder has an intermittent failures dashboard that tracks flaky tests, the display for a single issue is a graph + table of failures.

jgraham (Contributor) commented Mar 17, 2022

Mozilla's infrastructure is doing something a bit different here. It's basically a human-curated mapping between observed log lines (often, but not always, corresponding to test failures) and bugs. It allows answering questions like "how often did we see the failure line corresponding to this bug in the last week". But it has a number of problems; it doesn't really know about tests at all and can't answer questions like "how often has this test actually been run". Also, the manual curation doesn't really scale, so if you have an intermittent test that you can't fix immediately your options are basically: a) disable the test (which in practice means "forever") or b) find a solution to avoid printing a failure line for the intermittent failures.

For web-platform-tests the frequent syncs with upstream mean we're often in the situation described where there's a test that we know is flaky but can't reliably fix before landing the sync. Historically we used option a) above: disable tests in that case. But that's pretty heavy-handed; if the test is fixed to no longer be flaky we'll never notice. More recently we switched to option b) above: list the known intermittent statuses and don't produce a failure line in the logs as long as we get one of those. That's already better; it means that if a test that previously failed now starts to crash we'll notice. And it also means that if the test gets e.g. more subtests added that aren't flaky we'll still run those even if the overall test has an intermittent failure. But it's far from perfect. Without data about how often the test is run, and what the outcomes were, we can't automatically remove flaky annotations, even if the flake never happens. Given that any test might flake very occasionally due to external factors, the long-term consequence is likely to be many tests marked as intermittent which are in fact producing stable results the vast majority of the time. This will then stop us detecting real regressions.

So to fix that it would be ideal to have a system that could consume the results of test runs (maybe over a limited time window like 30 days) and, in conjunction with the expectation metadata, answer the question "which tests that are marked as flaky in fact had consistent results over the given time period". Then we could update the expectation metadata to remove flaky annotations corresponding to things that no longer happened. In the bad case where things went from [PASS, FAIL] to FAIL this would give us a regression window to investigate and figure out if it was caused by a test change or a product change.

I note that the above requirements could pretty much be satisfied if you had a sequence of runs and for each test you just stored something like {status: last_run_id}; i.e. instead of having a way to compute the status of each run, you'd just store enough to know whether you saw a particular status in a given time interval. Of course that would be less flexible and unsuited to answering other questions like "how flaky is this test".

Another relevant concern for gecko is that we aren't running the tests in a single configuration; we have tens of different configurations that can have different associated statuses. For example a test might be flaky, but only on Linux 32, or only on Windows 64 with fission disabled, or similar. So each result also has to be associated with a specific run configuration.
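
A rough sketch of that idea, under the stated assumptions: results within the time window are stored per (test, configuration) as a map of status to the last run id where it was seen, and the expectation metadata lists the statuses each test is annotated as allowing. All names and data shapes below are hypothetical illustrations, not an existing API.

```python
def stale_flaky_annotations(seen, annotations):
    """Find (test, config) pairs whose flaky annotation was never exercised.

    seen: dict mapping (test, config) -> {status: last_run_id seen in the window}.
    annotations: dict mapping (test, config) -> set of statuses allowed by the
        expectation metadata (e.g. {"PASS", "FAIL"} for a known intermittent).
    """
    stale = {}
    for key, allowed in annotations.items():
        observed = set(seen.get(key, {}))
        if len(allowed) > 1 and len(observed) == 1:
            # Only one status actually occurred in the window; the flaky
            # annotation can be narrowed to that status (and investigated if
            # it settled on the bad one).
            stale[key] = observed.pop()
    return stale

seen = {
    ("a.html", "linux32"): {"PASS": 1740, "FAIL": 1502},  # still flaky
    ("b.html", "win64-no-fission"): {"FAIL": 1741},       # settled on FAIL
}
annotations = {
    ("a.html", "linux32"): {"PASS", "FAIL"},
    ("b.html", "win64-no-fission"): {"PASS", "FAIL"},
}
print(stale_flaky_annotations(seen, annotations))
# {('b.html', 'win64-no-fission'): 'FAIL'}
```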
