A big list of observability ideas #3845
Labels: interop (how to play nicely with other packages), legibility (make errors helpful and Hypothesis grokable), new-feature (entirely novel capabilities or strategies)
Since we added our experimental observability features, there has been a steady stream of ideas tossed around. This issue tracks ideas for future work in Hypothesis itself, and/or downstream (e.g. in Tyche). Of course, there's no timeline for these ideas, nor any strong reason to think that we'll ever implement them - I just don't want to forget them!
Inside Hypothesis
- `Phase` on `ConjectureRunner`: currently, the active phase is defined by "what part of `ConjectureRunner` is currently executing", which is fine - but it means that our `how_generated` string just guesses the phase. We should track this as an instance variable and then report a precise string in our observability output.
  - Avoid pointless discards during the `reuse` and `target` phases #3862
- `assume()` calls or stateful preconditions or filters which never pass: previously Hypothesis doesn't tell me when a rule never assumes something successfully #213. The best approach is probably to track the number sat/unsat in `metadata`, and then aggregate and have Tyche report iff we're confident there's a problem (see the sat/unsat sketch after this list).
- The `timing` key: the mean calltime is probably more informative... maybe put mean (and max) calltime into `features` and sum in `timing`? (sketched after this list)
- `status_reason` to ensure reports are actionable
- `status_reason` for `status: gave_up` test cases
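
For the sat/unsat idea above, a minimal sketch of the Tyche-side aggregation, assuming each observation's `metadata` carries a hypothetical `predicates` mapping of location -> counters; the `predicates` key, the 1% pass-rate cutoff, and `min_attempts` are illustrative rather than proposed API:

```python
from collections import defaultdict


def rarely_satisfied(observations, min_attempts=100, max_pass_rate=0.01):
    """Flag assume()/precondition/filter locations that (almost) never pass.

    ``observations`` is an iterable of test-case observations whose
    ``metadata`` carries a hypothetical ``predicates`` mapping of
    location -> {"satisfied": int, "unsatisfied": int}.
    """
    totals = defaultdict(lambda: [0, 0])  # location -> [sat, unsat]
    for obs in observations:
        for loc, counts in obs.get("metadata", {}).get("predicates", {}).items():
            totals[loc][0] += counts.get("satisfied", 0)
            totals[loc][1] += counts.get("unsatisfied", 0)

    # Only report once we've seen enough attempts to be reasonably confident
    # that the low pass rate isn't just noise from a short run.
    return [
        (loc, sat, sat + unsat)
        for loc, (sat, unsat) in totals.items()
        if sat + unsat >= min_attempts and sat / (sat + unsat) <= max_pass_rate
    ]
```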
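And a sketch of the calltime aggregation; the key names are invented placeholders, and the real `timing`/`features` sub-keys would need to match whatever the observability format settles on:

```python
def calltime_summary(calltimes):
    """Summarise per-test-case call durations (seconds) for one test function.

    Key names below are placeholders for wherever these values would live
    in the observability output.
    """
    total = sum(calltimes)
    return {
        "timing:execute_test_sum": total,
        "features:calltime_mean": total / len(calltimes) if calltimes else 0.0,
        "features:calltime_max": max(calltimes, default=0.0),
    }
```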
Downstream interfaces

- Discover the time-complexity of your tests: fit various classes (log, linear, n-log-n, quadratic, exponential, ...) to the per-strategy timings from Pull out timing observations, more jsonable arguments #3834 and the json-ified `arguments`. Cool info line, or maybe an alert if it's really slow (see the fitting sketch after this list).
  - This paper suggests that a two-parameter power-law fit is sufficient, but they're dealing with substantially larger inputs than Hypothesis will generate - in addition to counting basic blocks rather than durations (which we could do with `sys.monitoring` in Python 3.12+). Conversely, this preprint and R package just fits a few known classes to observed durations.
- Better user interface to data from the `explain` phase: Reporting improvements for Scrutineer #3551 should not be the best we can do. Or go further, and use the coverage information to provide in-editor highlighting (example from debuggingbook.com) - but note that further techniques don't seem to help (note to self: finish my essay on that).
- In-editor interface to apply explicit `@example` patches: it'd be neat to surface this feature to more users, and 'GUI to pick which chunks to commit' is a common tool for patch management. If needed, we could emit these as an info message as well as writing them to disk.
- Configuration feedback to help tune settings: 'please autotune `max_examples`' is a fairly common user request. I've declined because run-for-duration risks testing far less than expected (if the test is slower than believed), but providing information for manual tuning would still be very helpful. Following Estimating Residual Risk in Greybox Fuzzing, we can estimate the number of inputs required to saturate coverage and features, and show the distribution of that over test functions (a rough sketch follows this list). Although maybe this is better left to HypoFuzz?
- Explorable inputs with UMAP: dimensionality-reduction tools are a great way to explore data. Embed a coverage vector¹ (or anything else!) for each example, show details on hover, color by status or runtime or arbitrary other classifications... (sketched after this list)
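
For the time-complexity item, a sketch of the naive fit-known-classes approach (closer to the preprint/R-package style than the power-law fit), assuming we can already pair an input-size measure from the json-ified arguments with a duration from the timing data:

```python
import numpy as np

# Candidate growth classes; exponential and power-law fits could be added too.
CLASSES = {
    "constant": lambda n: np.ones_like(n),
    "log n": lambda n: np.log(n + 1.0),
    "linear": lambda n: n,
    "n log n": lambda n: n * np.log(n + 1.0),
    "quadratic": lambda n: n**2,
}


def best_fit_class(sizes, durations):
    """Return the growth class whose scaled curve best matches (size, duration) data."""
    n = np.asarray(sizes, dtype=float)
    t = np.asarray(durations, dtype=float)
    best_name, best_residual = None, np.inf
    for name, f in CLASSES.items():
        x = f(n)
        if not np.any(x):  # degenerate column, e.g. log(0 + 1) everywhere
            continue
        scale = float(x @ t) / float(x @ x)  # least-squares scale for t ~= scale * x
        residual = float(np.sum((t - scale * x) ** 2))
        if residual < best_residual:
            best_name, best_residual = name, residual
    return best_name
```

In practice we'd probably only surface this when the best fit is both clearly better than the alternatives and super-linear.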
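For the configuration-feedback item, the kind of estimate that line of work builds on is (as I understand it) a Good-Turing style bound: the chance that the next input discovers a never-before-seen coverage feature is roughly the fraction of features observed exactly once. A rough sketch, with the 1% stopping threshold purely illustrative:

```python
from collections import Counter


def discovery_probability(features_per_input):
    """Good-Turing style estimate of P(next input hits a brand-new feature).

    ``features_per_input`` is a list of per-input sets of coverage features
    (branches, behaviour labels, ...).  Features seen exactly once are the
    "singletons" that drive the estimate.
    """
    if not features_per_input:
        return 1.0
    counts = Counter(f for features in features_per_input for f in features)
    singletons = sum(1 for c in counts.values() if c == 1)
    return min(1.0, singletons / len(features_per_input))


# e.g. suggest raising max_examples while discovery_probability(...) > 0.01
```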
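And for the UMAP item, a sketch of the vectorisation from footnote 1 plus the embedding call, assuming `umap-learn` is installed; the hyperparameters are placeholders rather than the tuned-for-purpose values the footnote imagines:

```python
import numpy as np
import umap  # pip install umap-learn


def embed_coverage(coverage_per_example):
    """2D embedding of per-example coverage, for an explorable scatter plot.

    ``coverage_per_example`` is a list of sets of covered locations.
    """
    n = len(coverage_per_example)
    all_locations = sorted(set().union(*coverage_per_example))
    # Keep locations covered by some but not all examples...
    informative = [
        loc
        for loc in all_locations
        if 0 < sum(loc in cov for cov in coverage_per_example) < n
    ]
    matrix = np.array(
        [[loc in cov for loc in informative] for cov in coverage_per_example],
        dtype=float,
    )
    # ...and drop columns that are redundant with (identical to) an earlier one.
    _, keep = np.unique(matrix, axis=1, return_index=True)
    matrix = matrix[:, sorted(keep)]
    # Placeholder hyperparameters; "jaccard" is a natural metric for 0/1 vectors.
    return umap.UMAP(n_neighbors=15, min_dist=0.1, metric="jaccard").fit_transform(matrix)
```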
Footnotes
1. Find the set of locations which are covered by some but not all examples, and are not redundant with any other such location. Convert each unique coverage observation to a zero-or-one vector of 'covered each location', and then run UMAP with some tuned-for-this-purpose hyperparameters.