A big list of observability ideas #3845
Labels: interop (how to play nicely with other packages), legibility (make errors helpful and Hypothesis grokable), new-feature (entirely novel capabilities or strategies)
Since we added our experimental observability features, there has been a steady stream of ideas tossed around. This issue tracks ideas for future work in Hypothesis itself, and/or downstream (e.g. in Tyche). Of course, there's no timeline for these ideas, nor any strong reason to think that we'll ever implement them - I just don't want to forget them!
Inside Hypothesis
- `Phase` on `ConjectureRunner`: currently, the active phase is defined by "what part of `ConjectureRunner` is currently executing", which is fine - but it means that our `how_generated` string just guesses the phase. We should track this as an instance variable and then report a precise string in our observability output.
  - Avoid pointless discards during the `reuse` and `target` phases #3862
- `assume()` calls or stateful preconditions or filters which never pass: previously Hypothesis doesn't tell me when a rule never assumes something successfully #213. The best approach is probably to track the number sat/unsat in `metadata`, and then aggregate and have Tyche report iff we're confident there's a problem (see the sat/unsat sketch after this list).
- The `timing` key: the mean calltime is probably more informative... maybe put mean (and max) calltime into `features` and sum in `timing`? (sketched after this list)
- `status_reason` to ensure reports are actionable
- `status_reason` for `status: gave_up` test cases
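
For the sat/unsat idea above, a minimal sketch of the Tyche-side aggregation, assuming each observation's `metadata` carries a hypothetical `predicates` mapping of location -> counters; the `predicates` key, the 1% pass-rate cutoff, and `min_attempts` are illustrative rather than proposed API:

```python
from collections import defaultdict


def rarely_satisfied(observations, min_attempts=100, max_pass_rate=0.01):
    """Flag assume()/precondition/filter locations that (almost) never pass.

    ``observations`` is an iterable of test-case observations whose
    ``metadata`` carries a hypothetical ``predicates`` mapping of
    location -> {"satisfied": int, "unsatisfied": int}.
    """
    totals = defaultdict(lambda: [0, 0])  # location -> [sat, unsat]
    for obs in observations:
        for loc, counts in obs.get("metadata", {}).get("predicates", {}).items():
            totals[loc][0] += counts.get("satisfied", 0)
            totals[loc][1] += counts.get("unsatisfied", 0)

    # Only report once we've seen enough attempts to be reasonably confident
    # that the low pass rate isn't just noise from a short run.
    return [
        (loc, sat, sat + unsat)
        for loc, (sat, unsat) in totals.items()
        if sat + unsat >= min_attempts and sat / (sat + unsat) <= max_pass_rate
    ]
```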
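And a sketch of the calltime aggregation; the key names are invented placeholders, and the real `timing`/`features` sub-keys would need to match whatever the observability format settles on:

```python
def calltime_summary(calltimes):
    """Summarise per-test-case call durations (seconds) for one test function.

    Key names below are placeholders for wherever these values would live
    in the observability output.
    """
    total = sum(calltimes)
    return {
        "timing:execute_test_sum": total,
        "features:calltime_mean": total / len(calltimes) if calltimes else 0.0,
        "features:calltime_max": max(calltimes, default=0.0),
    }
```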
Downstream interfaces

- Discover the time-complexity of your tests: fit various classes (log, linear, n-log-n, quadratic, exponential, ...) to the per-strategy timings from Pull out timing observations, more jsonable arguments #3834 and the json-ified `arguments`. Cool info line, or maybe an alert if it's really slow (see the fitting sketch after this list).
  - This paper suggests that a two-parameter power-law fit is sufficient, but they're dealing with substantially larger inputs than Hypothesis will generate - in addition to counting basic blocks rather than durations (which we could do with `sys.monitoring` in Python 3.12+). Conversely, this preprint and R package just fits a few known classes to observed durations.
- Better user interface to data from the `explain` phase: Reporting improvements for Scrutineer #3551 should not be the best we can do. Or go further, and use the coverage information to provide in-editor highlighting (example from debuggingbook.com) - but note that further techniques don't seem to help (note to self: finish my essay on that).
- In-editor interface to apply explicit `@example` patches: it'd be neat to surface this feature to more users, and 'GUI to pick which chunks to commit' is a common tool for patch management. If needed, we could emit these as an info message as well as writing them to disk.
- Configuration feedback to help tune settings: 'please autotune `max_examples`' is a fairly common user request. I've declined because run-for-duration risks testing far less than expected (if the test is slower than believed), but providing information for manual tuning would still be very helpful. Following Estimating Residual Risk in Greybox Fuzzing, we can estimate the number of inputs required to saturate coverage and features, and show the distribution of that over test functions (a rough sketch follows this list). Although maybe this is better left to HypoFuzz?
- Explorable inputs with UMAP: dimensionality-reduction tools are a great way to explore data. Embed a coverage vector¹ (or anything else!) for each example, show details on hover, color by status or runtime or arbitrary other classifications... (sketched after this list)
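
For the time-complexity item, a sketch of the naive fit-known-classes approach (closer to the preprint/R-package style than the power-law fit), assuming we can already pair an input-size measure from the json-ified arguments with a duration from the timing data:

```python
import numpy as np

# Candidate growth classes; exponential and power-law fits could be added too.
CLASSES = {
    "constant": lambda n: np.ones_like(n),
    "log n": lambda n: np.log(n + 1.0),
    "linear": lambda n: n,
    "n log n": lambda n: n * np.log(n + 1.0),
    "quadratic": lambda n: n**2,
}


def best_fit_class(sizes, durations):
    """Return the growth class whose scaled curve best matches (size, duration) data."""
    n = np.asarray(sizes, dtype=float)
    t = np.asarray(durations, dtype=float)
    best_name, best_residual = None, np.inf
    for name, f in CLASSES.items():
        x = f(n)
        if not np.any(x):  # degenerate column, e.g. log(0 + 1) everywhere
            continue
        scale = float(x @ t) / float(x @ x)  # least-squares scale for t ~= scale * x
        residual = float(np.sum((t - scale * x) ** 2))
        if residual < best_residual:
            best_name, best_residual = name, residual
    return best_name
```

In practice we'd probably only surface this when the best fit is both clearly better than the alternatives and super-linear.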
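For the configuration-feedback item, the kind of estimate that line of work builds on is (as I understand it) a Good-Turing style bound: the chance that the next input discovers a never-before-seen coverage feature is roughly the fraction of features observed exactly once. A rough sketch, with the 1% stopping threshold purely illustrative:

```python
from collections import Counter


def discovery_probability(features_per_input):
    """Good-Turing style estimate of P(next input hits a brand-new feature).

    ``features_per_input`` is a list of per-input sets of coverage features
    (branches, behaviour labels, ...).  Features seen exactly once are the
    "singletons" that drive the estimate.
    """
    if not features_per_input:
        return 1.0
    counts = Counter(f for features in features_per_input for f in features)
    singletons = sum(1 for c in counts.values() if c == 1)
    return min(1.0, singletons / len(features_per_input))


# e.g. suggest raising max_examples while discovery_probability(...) > 0.01
```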
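And for the UMAP item, a sketch of the vectorisation from footnote 1 plus the embedding call, assuming `umap-learn` is installed; the hyperparameters are placeholders rather than the tuned-for-purpose values the footnote imagines:

```python
import numpy as np
import umap  # pip install umap-learn


def embed_coverage(coverage_per_example):
    """2D embedding of per-example coverage, for an explorable scatter plot.

    ``coverage_per_example`` is a list of sets of covered locations.
    """
    n = len(coverage_per_example)
    all_locations = sorted(set().union(*coverage_per_example))
    # Keep locations covered by some but not all examples...
    informative = [
        loc
        for loc in all_locations
        if 0 < sum(loc in cov for cov in coverage_per_example) < n
    ]
    matrix = np.array(
        [[loc in cov for loc in informative] for cov in coverage_per_example],
        dtype=float,
    )
    # ...and drop columns that are redundant with (identical to) an earlier one.
    _, keep = np.unique(matrix, axis=1, return_index=True)
    matrix = matrix[:, sorted(keep)]
    # Placeholder hyperparameters; "jaccard" is a natural metric for 0/1 vectors.
    return umap.UMAP(n_neighbors=15, min_dist=0.1, metric="jaccard").fit_transform(matrix)
```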
Footnotes
1. Find the set of locations which are covered by some but not all examples, and are not redundant with any other such location. Convert each unique coverage observation to a zero-or-one vector of 'covered each location', and then run UMAP with some tuned-for-this-purpose hyperparameters.