Migrate our core representation to the typed choice sequence #3921

Open
4 tasks
tybug opened this issue Mar 15, 2024 · 17 comments · Fixed by #3949
Labels
enhancement it's not broken, but we want it to be better internals Stuff that only Hypothesis devs should ever see performance go faster! use less memory! test-case-reduction about efficiently finding smaller failing examples

Comments

@tybug
Member

tybug commented Mar 15, 2024

This epic-style issue tracks our work on migrating the internals of Hypothesis from the byte sequence to the typed choice sequence.

Byte Sequence

Hypothesis currently works at the level of a "sequence of bytes".

  • Strategies generate values by parsing bytes from the sequence when making random (arbitrary) decisions.
  • The database stores inputs as their byte sequence representation.
  • The shrinker shrinks inputs by shrinking their corresponding byte sequence ("internal shrinking").
  • To generate not-yet-seen inputs, Hypothesis generates a not-yet-seen prefix of bytes (via DataTree internally).

But, the byte sequence has its limitations.

  • Redundancy. The mapping of bytes ↦ input is not injective, so an input may have many byte representations. For instance, 0 is represented by many different byte sequences, so any strategy using st.integers() effectively wastes some number of inputs (except for detecting flakiness). See also generate_novel_prefix interacts poorly with biased_coin (and lists) #1574.
  • Precision. The shrinker (and other operations over the byte sequence) lacks detailed information about the input. For instance, we would like to be able to "naturally" shrink floats, e.g. by exponentially ramping up division or truncation. Doing this in the byte representation is quite nasty. In fact, we currently hack around this by parsing bytes which look like they could represent floats into a float, shrinking that float, and serializing it back into the byte sequence.
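To make the redundancy point concrete, here is a toy byte-sequence parser (purely illustrative, not Hypothesis's actual decoding logic) in which many distinct byte strings decode to the same integer:

```python
def parse_integer(buf: bytes, max_value: int) -> int:
    """Toy parser: scan bytes, rejection-sampling until one is in range.

    Hypothetical sketch -- Hypothesis's real parsing is more involved,
    but has the same non-injective shape.
    """
    for byte in buf:
        if byte <= max_value:
            return byte
    return 0  # out of bytes: fall back to the simplest value

# All three byte strings decode to the same input, so novel-prefix
# generation over bytes can "waste" attempts on duplicate inputs.
assert parse_integer(b"\x00", max_value=5) == 0
assert parse_integer(b"\xff\x00", max_value=5) == 0
assert parse_integer(b"\xfe\xfa\x00", max_value=5) == 0
```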

These limitations extend equally to alternative backends over the byte sequence. For example, for CrossHair integration (#3086), bytes are simply too low-level to support any efficient algorithm.

Typed Choice Sequence

Enter the typed choice sequence, which replaces the byte sequence. We lift the representation from raw bytes to a slightly higher level, built on five types: boolean, integer, float, string, and bytes.

from __future__ import annotations

import abc
import math
from typing import Sequence

from hypothesis.internal.intervalsets import IntervalSet


class PrimitiveProvider(abc.ABC):
    @abc.abstractmethod
    def draw_boolean(
        self,
        p: float = 0.5,
    ) -> bool:
        ...

    @abc.abstractmethod
    def draw_integer(
        self,
        min_value: int | None = None,
        max_value: int | None = None,
        *,
        # weights are for choosing an element index from a bounded range
        weights: Sequence[float] | None = None,
        shrink_towards: int = 0,
    ) -> int:
        ...

    @abc.abstractmethod
    def draw_float(
        self,
        *,
        min_value: float = -math.inf,
        max_value: float = math.inf,
        allow_nan: bool = True,
        smallest_nonzero_magnitude: float,
    ) -> float:
        ...

    @abc.abstractmethod
    def draw_string(
        self,
        intervals: IntervalSet,
        *,
        min_size: int = 0,
        max_size: int | None = None,
    ) -> str:
        ...

    @abc.abstractmethod
    def draw_bytes(
        self,
        min_size: int = 0,
        max_size: int | None = None,
    ) -> bytes:
        ...

which improves redundancy (as DataTree operates at this higher level) and precision (as we retain type and shape information about what were previously opaque spans of the byte sequence).
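As a sketch of how a backend might satisfy this interface, here is a toy provider backed by the `random` module. This is hypothetical: real providers also record choices for replay and shrinking, and `intervals` is a genuine IntervalSet rather than the `(lo, hi)` codepoint pairs assumed here.

```python
import math
import random

class RandomProvider:
    """Toy backend implementing the five typed draws with `random`.

    Illustrative sketch only, not Hypothesis's implementation.
    `intervals` is simplified to a list of (lo, hi) codepoint ranges.
    """

    def __init__(self, seed: int = 0) -> None:
        self.rng = random.Random(seed)

    def draw_boolean(self, p: float = 0.5) -> bool:
        return self.rng.random() < p

    def draw_integer(self, min_value=None, max_value=None, *, shrink_towards=0):
        # unbounded sides get an arbitrary cap for the sketch
        lo = shrink_towards - 2**64 if min_value is None else min_value
        hi = shrink_towards + 2**64 if max_value is None else max_value
        return self.rng.randint(lo, hi)

    def draw_float(self, *, min_value=-math.inf, max_value=math.inf):
        lo = max(min_value, -1e308)
        hi = min(max_value, 1e308)
        return self.rng.uniform(lo, hi)

    def draw_string(self, intervals, *, min_size=0, max_size=None):
        max_size = min_size + 8 if max_size is None else max_size
        chars = []
        for _ in range(self.rng.randint(min_size, max_size)):
            lo, hi = self.rng.choice(intervals)
            chars.append(chr(self.rng.randint(lo, hi)))
        return "".join(chars)

    def draw_bytes(self, min_size=0, max_size=None):
        max_size = min_size + 8 if max_size is None else max_size
        return self.rng.randbytes(self.rng.randint(min_size, max_size))
```

Usage would look like `RandomProvider(seed=1).draw_integer(0, 10)`; strategies would then compose values entirely out of these five typed draws.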

Roadmap

Completed:

Future work, roughly in order of expected completion:

  • finish lower_blocks_together in the shrinker
  • migrate ParetoOptimiser
  • define and use a shrink ordering for the ir
  • migrate database to serialized ir
@tybug tybug added the internals Stuff that only Hypothesis devs should ever see label Mar 15, 2024
@JonathanPlasse
Contributor

This is super interesting!
Thank you for writing this detailed issue.
I would like to get involved with hypothesis.
What would constitute a good first contribution here?

@Zac-HD
Member

Zac-HD commented Mar 15, 2024

Welcome, Jonathan! We'd love to have you continue contributing - I already really appreciate the type-annotation improvements for our numpy and pandas extras, so this would be a third contribution 😻

@tybug might have some ideas here, but my impression is that the "refactor for an IR" project in this issue is more-or-less a serialized set of tasks and so adding a second person is unlikely to help much - even with just one we've had a few times where there were two or three PRs stacked up and accumulating merge conflicts between them.

As an alternative, #3764 should be a fairly self-contained bugfix. On the more ambitious side, #3914 would also benefit from ongoing work on that - testing, observability, reporting whatever bugs you surface, etc. Or of course you're welcome to work on any other open issue which appeals to you!

@JonathanPlasse
Contributor

Thanks, I will start with #3764 and then take on the different issues in #3914.

@Zac-HD
Member

Zac-HD commented Mar 15, 2024

We may still use the bitstream representation for some things (database?).

I was thinking that we'd still serialize to a bytestring - that's the ultimate interop format, and when we need to handle weird unicode and floats like subnormals or non-standard bitpatterns for nan I don't want to trust whatever database backend our users cook up to round-trip correctly. Existing formats like protobuf or msgpack all have constraints like "unicode strings must be valid utf-8" or "numbers are limited to a fixed number of bits", so I wrote a custom serializer instead 🙂

@tybug
Member Author

tybug commented Mar 15, 2024

yeah, this is a hard one to parallelize 😄. Some of the steps may subtly depend on others in ways that aren't obvious until one is knee deep in implementing it.

so I wrote a custom serializer instead 🙂

Nice! I agree with the reasoning here. Added a task for this. This probably needs to be the absolute last thing to switch to the ir.

@Zac-HD
Member

Zac-HD commented Mar 16, 2024

Definitely the last thing to switch, I just got nerdsniped 😅


@tybug
Member Author

tybug commented Mar 20, 2024

I'm working on migrating shrinker block programs. Our upweighting for large integer ranges is giving the shrinker trouble, because it means that a simpler tree can result in a longer buffer: the buffer runs through the weighted distribution and draws n bits from some small bucket, while the tree runs through the uniform distribution (as a result of forced=True) and draws m > n bits, where the difference between m and n is large enough that it offsets whatever simplification is made by the tree.

Real example of this:

b1 = b'\x01\x00\x01\x00\x00\x00\x01\x00\x01\x00\x00\x00\x01\x00\x01\x00\x00\x00\x01\x00\x01\x00\x00\x00\x01\x00\x01\x00\x00\x00\x00'
b2 = b'\x01\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00'
s = st.lists(st.integers(0, 2**40))

print("complex result, smaller buffer", ConjectureData.for_buffer(b1).draw(s))
# complex result, smaller buffer [0, 0, 0, 0, 0]
print("simpler result, larger buffer", ConjectureData.for_buffer(b2).draw(s))
# simpler result, larger buffer [0, 0, 0, 0]

As a result I'd like to look at moving that weighting logic into IntegerStrategy, which imo is where it logically belongs anyway, not at the ir layer. To accommodate this with weights, we'll need a structure that can express weights for entire ranges, not just "weight points and everything else is uniform". What do you think of weights=[(a, b, p), ...] where union((a, b), ...) == [min_value, max_value], sum(p) == 1, and len((a, b), ...) <= 255?
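A minimal sketch of drawing from the proposed `weights=[(a, b, p), ...]` structure (hypothetical helper name; assumes the ranges cover `[min_value, max_value]` and the `p` values sum to 1): pick a subrange with probability `p`, then draw uniformly within it.

```python
import random

def draw_weighted_ranges(rng, weights):
    """Sketch of the proposed weights=[(a, b, p), ...] structure.

    Hypothetical helper, not Hypothesis's API: choose a subrange with
    probability p, then draw uniformly inside that subrange.
    """
    ranges = [(a, b) for a, b, p in weights]
    probabilities = [p for a, b, p in weights]
    a, b = rng.choices(ranges, weights=probabilities, k=1)[0]
    return rng.randint(a, b)

rng = random.Random(0)
# 99% of draws land in [0, 10], 1% anywhere in [11, 2**40]
weights = [(0, 10, 0.99), (11, 2**40, 0.01)]
draws = [draw_weighted_ranges(rng, weights) for _ in range(1000)]
assert all(0 <= d <= 2**40 for d in draws)
```

A single such draw corresponds to exactly one ir draw, which is what makes expressing the whole distribution in `weights` attractive.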

@Zac-HD
Member

Zac-HD commented Mar 20, 2024

What if we forced even more instead?

If we choose a smaller bits size, instead of drawing the main value from a narrower range we draw a value-to-force from the narrower range, and then force-draw it from the full range. The choice of fewer bits is then cleanly deletable without changing the interpretation of subsequent bits.
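A sketch of this idea with hypothetical helper names (`draw_from_full_range` stands in for an IR-level draw that supports forcing):

```python
import random

def draw_from_full_range(rng, lo, hi, forced=None):
    # stand-in for an IR-level draw that can be forced to a given value
    return forced if forced is not None else rng.randint(lo, hi)

def upweighted_draw(rng, lo, hi):
    """Choose a smaller bit size, draw from the narrower range, then
    force-draw that value over the full range. The bit-size choice is
    then cleanly deletable without reinterpreting subsequent draws."""
    bits = rng.choice([8, 16, 32, 64])
    narrow_hi = min(hi, lo + 2**bits - 1)
    value = rng.randint(lo, narrow_hi)
    return draw_from_full_range(rng, lo, hi, forced=value)

rng = random.Random(0)
assert all(0 <= upweighted_draw(rng, 0, 2**40) <= 2**40 for _ in range(100))
```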

@tybug
Member Author

tybug commented Mar 20, 2024

We could do that! I'm fairly confident exactly what you stated, or some small variation, would work.

I was thinking of killing two birds with one stone here, though. Do you think the upweighting belongs in the ir or in st.integers()? If we're going to move it out of the ir eventually anyway, I think now is the right time to do it, both while it's causing problems and we're changing the weights interface.

@Zac-HD
Member

Zac-HD commented Mar 20, 2024

I think doing it 'below' the IR, so we just represent a single integer value with a minimum of redundancy, is the principled approach here. "Literally just give me an integer" feels like it should be bijective 😅

@tybug
Member Author

tybug commented Mar 20, 2024

The concern is that moving the weighting to st.integers() will result in drawing an integer corresponding to more than one ir draw? I think we can avoid this via weights (and wouldn't want to move the weighting if we couldn't). I was thinking of something like this, where we combine the probability distributions upfront and pass the result to weights. We wouldn't need to draw a boolean with p=7/8. Probability computations are pseudocode for whatever representation we use.

class IntegersStrategy(SearchStrategy):

    ...

    def do_draw(self, data):

        weights = None
        if self.end is not None and self.start is not None:
            bits = (self.end - self.start).bit_length()

            # For large ranges, we combine the uniform random distribution from draw_bits
            # with a weighting scheme with moderate chance.  Cutoff at 2 ** 24 so that our
            # choice of unicode characters is uniform but the 32bit distribution is not.
            if bits > 24:
                def weighted():
                    # INT_SIZES = (8, 16, 32, 64, 128)
                    # INT_SIZES_SAMPLER = Sampler((4.0, 8.0, 1.0, 1.0, 0.5), observe=False)
                    total = 4.0 + 8.0 + 1.0 + 1.0 + 0.5
                    return (
                        (4.0 / total) * (-2**8, 2**8),
                        # ...except split these into two ranges to avoid double counting bits=8
                        (8.0 / total) * (-2**16, 2**16),
                        (1.0 / total) * (-2**32, 2**32),
                        (1.0 / total) * (-2**64, 2**64),
                        (0.5 / total) * (-2**128, 2**128),
                    )
                weights = (
                    (7 / 8) * weighted()
                    + (1 / 8) * uniform()
                )

            # for bounded integers, make the near-bounds more likely
            weights = (
                weights
                + (2 / 128) * self.start
                + (1 / 64) * self.end
                + (1 / 128) * (self.start + 1)
                + (1 / 128) * (self.end - 1)
            )
            # ... also renormalize weights to p=1, or have the ir do that

        return data.draw_integer(
            min_value=self.start, max_value=self.end, weights=weights
        )

Now the ir draw_integer is truly uniform, but st.integers() keeps the same distribution as before.

@Zac-HD
Member

Zac-HD commented Mar 20, 2024

That would work! I'm also fine with the IR draw_integer remaining non-uniform above 24 bits, if that's easier.

@tybug
Member Author

tybug commented Sep 9, 2024

I'm working on a native ordering for the IR (wip branch). My current plan is to have a bijective map ir_ordering: (value: IRType) <-> (order: int). The order depends on the kwargs of the node and order = 0 indicates the simplest value for that node.

This will replace some ad-hoc constructs:

  • node.trivial becomes ir_ordering(node, to="index") == 0
  • all_children becomes for i in range(compute_max_children(node)): yield ir_ordering(i, to="value")

We can also take advantage of this ordering as a unified representation to work over when convenient, just like the bytestring was. I plan to use this ordering to migrate Optimiser to the IR until/if we add ir-specific mutations, and to replace our shrinker misalignment logic, which currently uses the bytestring as an intermediary.
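For intuition, here is a sketch of such a bijection for unbounded integer nodes (illustrative only; the real ordering must also account for bounds, weights, and the other four types): zigzag around `shrink_towards` so that order 0 is the simplest value.

```python
def value_to_order(value: int, shrink_towards: int = 0) -> int:
    """Sketch of a bijective ordering for unbounded integers:
    zigzag around shrink_towards, so order 0 is the simplest value."""
    d = value - shrink_towards
    return 2 * d - 1 if d > 0 else -2 * d

def order_to_value(order: int, shrink_towards: int = 0) -> int:
    """Inverse of value_to_order."""
    return shrink_towards + ((order + 1) // 2 if order % 2 else -(order // 2))

# order 0 is the simplest value, mirroring node.trivial
assert [order_to_value(i) for i in range(5)] == [0, 1, -1, 2, -2]
# the map is a bijection, so it can also enumerate all_children
assert all(value_to_order(order_to_value(i)) == i for i in range(100))
```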

Two things:

  • I'm very open to feedback on this structure or plan in general, though I'm also relatively confident in this plan in the absence of any feedback
  • I have stopped work here at defining the ordering on floats (and likely won't return for a few weeks). This is our opportunity to redefine the ordering on floats, free from any shrinking or byte restrictions. I think there are some things here that don't make sense:
>>> sorted([0.01 * n for n in range(100)], key=float_to_lex)
[0.0, 0.5, 0.75, 0.8300000000000001, 0.54, 0.79, 0.71, 0.96, 0.52, 0.77, 0.56, 0.81, 0.73, 0.98, 0.51, 0.76, 0.72, 0.97, 0.55, 0.8, 0.53, 0.78, 0.74, 0.99, 0.5700000000000001, 0.8200000000000001, 0.59, 0.84, 0.67, 0.92, 0.63, 0.88, 0.61, 0.86, 0.6900000000000001, 0.9400000000000001, 0.65, 0.9, 0.68, 0.93, 0.6, 0.85, 0.64, 0.89, 0.7000000000000001, 0.9500000000000001, 0.62, 0.87, 0.58, 0.66, 0.91, 0.25, 0.27, 0.48, 0.26, 0.28, 0.49, 0.38, 0.36, 0.4, 0.39, 0.37, 0.41000000000000003, 0.42, 0.46, 0.44, 0.43, 0.47000000000000003, 0.45, 0.34, 0.3, 0.32, 0.35000000000000003, 0.31, 0.29, 0.33, 0.24, 0.13, 0.14, 0.19, 0.18, 0.2, 0.21, 0.23, 0.22, 0.17, 0.15, 0.16, 0.12, 0.07, 0.09, 0.1, 0.11, 0.08, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01]

for instance 0.54 ordering as simpler than 0.1, and the last ~10 values (the smallest decimals) not being ordered as simplest.

I guess a starting point is, how should we order [0, 1)? We don't have to go crazy with this – the ordering only has an impact insofar as the shrinker can intelligently match it1, there are just too many floats – but @Zac-HD I'm curious if you've thought about a good ordering before 🙂. I'm tempted to say [0] + [0.1 * n for n in range(10)] + [0.05 * n for n in range(20)] + [0.025 * n for n in range(40)] + ... (ignoring duplicates), but I don't know how this would hold up against exponent/mantissa realities of floats.

Footnotes

  1. I'm realizing that our ordering and shrinker are strongly decoupled in the IR, and to see benefits both need to be updated. There's no point to defining an intelligent and complicated ordering on floats if the shrinker never tries (ordering-)smaller floats.

@Zac-HD
Member

Zac-HD commented Sep 11, 2024

  • this plan sounds good to me overall
  • I'd prefer having a pair of functions ir_node_to_index and ir_index_to_node rather than accepting a kwarg to specify the direction
  • gosh, doing ~anything with floats is going to suck in some ways. I think trying to work with base-ten values is basically futile. Instead, I'd represent the float x as an improper fraction a + (b / 2**c) and shrink as for the tuple (c, b, a) - meaning integers are simplest, then multiples of a half, a quarter, an eighth, etc. The interval 0 .. 2 would then be, in increasing shrink order, 0, 1, 2, 1/2, 1+1/2, 1/4, 3/4, 1+1/4, 1+3/4, 1/8, .... This jumps around the number line a bit but is fairly well-behaved in the binary-representation space, and we can make the shrinker aware of a couple of different notions of locality as needed.
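A sketch of the proposed decomposition for nonnegative finite floats (every finite float is a dyadic rational, so `Fraction` recovers `a + b / 2**c` exactly; NaN, infinities, and negatives would need separate handling):

```python
from fractions import Fraction

def dyadic_key(x: float):
    """Write the nonnegative finite float x as a + b / 2**c in lowest
    terms and compare as the tuple (c, b, a): integers are simplest,
    then multiples of a half, a quarter, an eighth, ...
    Illustrative sketch of the idea, not a full ordering."""
    f = Fraction(x)  # exact: every finite float is a dyadic rational
    c = f.denominator.bit_length() - 1  # denominator is a power of two
    a, b = divmod(f.numerator, f.denominator)
    return (c, b, a)

assert dyadic_key(2.0) == (0, 0, 2)
assert dyadic_key(1.5) == (1, 1, 1)
assert dyadic_key(0.75) == (2, 3, 0)
# increasing shrink order on a sample of [0, 2]
assert sorted([0.0, 1.5, 0.5, 2.0, 1.0], key=dyadic_key) == [0.0, 1.0, 2.0, 0.5, 1.5]
```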

@tybug tybug changed the title Migrate our core representation to an IR layer Migrate our core representation to the typed choice sequence Nov 16, 2024
@tybug
Member Author

tybug commented Nov 16, 2024

An update: we've steadily been working towards this goal over the past 6 months. The current status is that input generation and redundancy tracking (via DataTree) and the shrinking algorithm are comfortably on the typed choice sequence. What remains is moving the database and the definition of a "simpler" input (currently lexicographic ordering over bytes). While these two steps still encompass a substantial amount of work, we expect that most of the major complications involved in the migration have been solved.

The typed choice sequence has been incrementally worsening performance (#4076 (comment)) as we maintain two internal representations simultaneously. We expect this to improve roughly back to the status quo once Hypothesis is fully on the TCS. And it may well already be a wash due to the increased bug-finding power of the TCS (because of e.g. less duplication).
