Random distribution of 2-variable constraints not uniform #161

Open · alwilson opened this issue Aug 18, 2022 · 10 comments

@alwilson (Contributor)

[ Using pyvsc-0.7.6 ]

I've been curious about how uniform the random distributions for SV and pyvsc constraints are and cooked up some 2D examples to help visualize the distribution with heatmaps. The heatmaps show pyvsc making some interesting patterns rather than being flat and I was curious whether that was expected. The same constraints in SV appear to be fairly uniform. The heatmap goes from 0/black to the max value / white.

The constraints for a triangle, used in the examples below:

import vsc

@vsc.randobj
class my_item_c(object):
    def __init__(self):
        self.a = vsc.rand_bit_t(32)
        self.b = vsc.rand_bit_t(32)

    # Triangle: a + b <= 50
    @vsc.constraint
    def ab_c(self):
        self.a >= 0
        self.b >= 0
        self.a <= 60 # Prevent overflow solutions
        self.b <= 60 # Prevent overflow solutions
        self.a + self.b <= 50

Triangle using pyvsc (100k randomizations, ranging from 14 to 831 hits per bucket):
(Image) triangle_pyvsc_100k_180persec

Triangle using SV (100k randomizations, ranging from 49 to 105 hits per bucket):
(Image) triangle_other_100k_116kpersec

Source for the above SV and pyvsc examples of a simple triangle constraint (x + y <= N). The dist.py script is pretty simple and has two functions to choose from for either a live animation of the randomizations or a performance measurement. The SV file and plot.py are just the same triangle and circle constraints I was testing out.
https://github.com/alwilson/pyvsc_testing
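
For a rough idea of what the heatmap script does, using the my_item_c class above (sketch only; the actual dist.py/plot.py in the repo differ in the details):

import numpy as np
import matplotlib.pyplot as plt

# Sketch: randomize the item many times, bucket (a, b) pairs into a 2D
# histogram, and show it as a heatmap (0 = black, max hits = white).
N_SAMPLES = 100_000
SIDE = 61  # a and b are constrained to [0, 60]

counts = np.zeros((SIDE, SIDE), dtype=int)
item = my_item_c()
for _ in range(N_SAMPLES):
    item.randomize()
    counts[item.a, item.b] += 1

plt.imshow(counts.T, origin='lower', cmap='gray')
plt.xlabel('a')
plt.ylabel('b')
plt.colorbar(label='hits per bucket')
plt.show()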

Note: I've also found that SV constraint solving is very fast (100k simple 2-var randomizations in ~2s) and that pyvsc is much slower (~2000s), which I should say I'm not surprised by, although I'd love to better understand why and what the differences between the solvers/generators are.

I also tried putting a "hole" in the triangle constraints, which you can see in my source, but chaining constraints with or/and doesn't seem to do the same thing as it does in SV. Not sure if it's a pyvsc bug or if there's a different way to write out those constraints.

Thanks for developing this project! I've been having fun with it. >:)

@mballance (Member)

Hi @alwilson, thanks for putting the work into creating this way to look at 2d distributions. Thus far, I've primarily been looking at 1d distributions -- and that's difficult enough in text format given a reasonably-sized domain.
I'm a bit surprised that the distribution is as skewed as it is, and will need to dig in a bit more to understand why. I find it very interesting that there appear to be hot spots at (min a, max b) and (max a, min b) -- hopefully a clue that leads in a productive direction!

Again, thanks for putting together this way to visualize PyVSC distributions. Hopefully I'll shortly have some insights on how to improve the results!

@qzcx commented Aug 18, 2022

The great part about Python is all the built-in graphing and data-crunching tools!

I've seen similar imbalances in my results, but they've been good enough for my uses up to this point.

@alwilson (Contributor, Author)

Had some fun with histograms and thought I'd post them here, too. I'll upload my changes once I clean things up. A good excuse for playing around with matplotlib and numpy!

pyvsc (100k samples)
(Image) pyvsc_100k

python random.randrange calls (100k samples)
I rewrote the constraints as a conditional and used the random library to get something fast that's similar to what SV outputs to compare with.
(Image) python_rand_100k
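
Roughly what that comparison does (sketch only; the actual script differs in the details):

import random

# Sketch of the pure-Python baseline: draw a and b with randrange and keep
# only the samples that satisfy the triangle constraint.
def sample_triangle():
    while True:
        a = random.randrange(0, 61)
        b = random.randrange(0, 61)
        if a + b <= 50:
            return a, b

samples = [sample_triangle() for _ in range(100_000)]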

mballance added a commit that referenced this issue Aug 18, 2022
- (#161) Adjust swizzling algorithm to improve multi-variable distribution

Signed-off-by: Matthew Ballance <matt.ballance@gmail.com>
@mballance (Member)

After reviewing the 'swizzling' implementation, I have a better understanding of the hot spots at (min,max) and (max,min). While the variables to swizzle in each solve set are randomly selected, the bits to force in each were being applied sequentially. This increased the probability that few of the bit-force constraints on the second variable would hold, allowing Boolector to repeatedly select the same solution for the second variable.
Release 0.7.7 randomizes the order in which bit-force constraints are applied. From a visual inspection of text output, this appears to provide a good improvement in multi-variable distribution. Please give the new release a try as you're able.
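
As a rough illustration of the change (sketch only, not the actual PyVSC code):

import random

# Sketch only: previously the per-bit force constraints were applied in a
# fixed sequential order; 0.7.7 randomizes that order before they are handed
# to the solver.
def bit_force_order(width):
    order = list(range(width))   # previously visited sequentially
    random.shuffle(order)        # 0.7.7: random order per randomization
    return [(bit, random.getrandbits(1)) for bit in order]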

@alwilson (Contributor, Author) commented Aug 18, 2022

That looks better! On the histogram the counts look a bit more spread out, and the hotspots have spread out and lowered. It still doesn't have that normal distribution that I think we would want, although it's been a while since I've thought about probability...

pyvsc triangle (100k samples)
(Image) pyvsc_triangle_100k_7_7

I did want to point out that with a quarter-circle constraint in SV (1M samples) the distribution does not appear to be normal. Or at least it tends to have hotspots as well. So I'm not sure what the gold standard for pyvsc should be. Adding multiplication constraints into the mix perhaps makes it harder to guarantee?
(Image) SV quarter-circle distribution, 1M samples

@minwooim commented Jul 9, 2024

Hello :)

Any news on this issue?

@alwilson (Contributor, Author)

I reran some distribution tests on 5f79289. I don't believe much has changed with pyvsc's randomization and swizzler.

Here is the triangle distribution from before, and I ran a donut and a square one as well. I was curious whether those hot spots would show up for constraints without addition, and they do. The donut constraints are just some x,y constraints that make a hole in the middle, and the hot spots still show up there. The square constraints don't have an issue, but I think that boils down to rand-range calls that don't invoke the swizzler. The square constraints' count/hit histogram has the normal distribution that I think we would want the others to have.

Triangle constraints (100k samples):
(Image) pyvsc_triangle_dist_100k_july11_2024

Donut constraints (100k samples):
(Image) pyvsc_donut_dist_100k_july11_2024

Square constraints (100k samples):
(Image) pyvcs_square_dist_100k_july11_2024

This reminds me that I've never really seen a good description of what distribution SystemVerilog should return. The closest I've found is 1800-2017 Section 18.5.10 "Variable ordering":

The solver shall assure that the random values are selected to give a uniform value distribution over legal value combinations (that is, all combinations of legal values have the same probability of being the solution). This important property guarantees that all legal value combinations are equally probable, which allows randomization to better explore the whole design space.

I would think having a uniform distribution across a solution space would require somehow exploring that whole solution space beforehand. That may be possible for simple constraints, but I imagine it's always possible to construct more and more complex constraints until it's impossible to tell what the solution space might be.

I wonder if randomizing all variables across their valid ranges once and then checking them against the constraints would be enough to get closer to a uniform distribution. Although I think that would require obeying solve-before constraints somehow to get the ordering and distribution right. And it would only really work for simpler constraints.

@alwilson (Contributor, Author)

Here's an interesting 1d distribution example that explores the swizzler behavior. Pyvsc uses constraints to find the ranges of each variable when randomize is called, but those ranges can change as variables get solved, can be missing, or can be much wider than the true range.
For example, here a is unconstrained when b=1, but is limited to [0,4] when b=0.

rand unsigned [32] a
rand unsigned [1] b
constraint ab_c
    (b == 0) -> (a < 5);

If I constrain b == 0 and give a various new maximum values, the distribution goes from uniform to very biased.

new_range=5           : a = [211, 204, 184, 211, 190]
new_range=1000        : a = [119, 195, 181, 198, 307]
new_range=1000000     : a = [120, 117, 118, 129, 516]
new_range=1000000000  : a = [42, 23, 35, 34, 866]
new_range=4294967295  : a = [35, 30, 26, 29, 880]
Source
import vsc

TRUE_RANGE=5
NUM_SAMPLES=1_000

@vsc.randobj
class my_c(object):
    def __init__(self):
        self.a = vsc.rand_bit_t(32)
        self.b = vsc.rand_bit_t(1)
    @vsc.constraint
    def ab_c(self):
        vsc.solve_order(self.b, self.a)
        with vsc.implies(self.b == 0):
            self.a < TRUE_RANGE

obj = my_c()
for new_range in [TRUE_RANGE, 1_000, 1_000_000, 1_000_000_000, 2**32-1]:
    bins = [0] * TRUE_RANGE
    for _ in range(NUM_SAMPLES):
        # Inline constraints: force b == 0 (so the a < TRUE_RANGE implication
        # applies) and give a a looser upper bound of new_range.
        with obj.randomize_with(debug=0):
            obj.b == 0
            obj.a < new_range
        bins[obj.a] += 1
    new_range = str(new_range)
    print(f'{obj.tname} - {new_range=: <12}: {bins}')

I think most of this bias is due to the swizzler switching from randomizing a single bit at a time to 6 bits at a time; if a isn't constrained tightly enough, it becomes more unlikely that the forced bits will still solve. If I limit the swizzler to 1 bit at a time this isn't as bad, but then it has to run the solver many more times, and randomizing bits one at a time means it's essentially asking the solver for a value in [0, (2**N)-1] folded mod max, where max is the actual maximum range. I think most randomization libraries use methods like rejection sampling to reduce this power-of-2 sampling bias.
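
For illustration, here's that bias and the usual fix in plain Python (nothing pyvsc-specific):

import random

# Folding a full-width random value into a smaller range with mod over-weights
# the low end whenever the power-of-2 range isn't a multiple of max_value.
def biased_below(max_value, bits=32):
    return random.getrandbits(bits) % max_value

# Rejection sampling: draw only as many bits as needed and retry until the
# value lands in [0, max_value). (Python's random.randrange works this way.)
def unbiased_below(max_value):
    bits = max_value.bit_length()
    while True:
        v = random.getrandbits(bits)
        if v < max_value:
            return v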

If you look at the previous 2d triangle examples I've shared, there are hot spots, and I think those are due to the swizzler randomizing the first variable and then using the same range, [0,50], for the second variable, which isn't accurate most of the time and causes that power-of-2 sampling bias. If you mess with the range constraints of that triangle, you eventually hit the 6-bits-at-a-time swizzler and it can get funky.

(Image) triangle distribution, x,y range of 600 instead of 60, 10k samples


Here are some possible improvements I've thought of and implemented locally:

  • Solve for the actual min/max of each variable using the solver (see the sketch after this list)
    • Pros:
      • Gives the swizzler a much better range, avoids bias
      • Can be done once and reused via cache
    • Cons:
      • Somewhat expensive, requires log2(max) solver calls
      • Boolector can only save the model to a file or print on stdout, no direct way to record it
  • Attempt randomizing all variables across range and solving without swizzling
    • Pros:
      • Can give a uniform distribution and skip expensive swizzling
      • At minimum only requires a single solver call
      • Can use cached min/max ranges for better accuracy
    • Cons: May not always help
  • Solve min/max for each variable after each swizzling step to avoid even more bias
    • Pros: Good distribution and even less bias
    • Cons:
      • Not a uniform distribution
      • Lots of solver calls for each variable; testing seems to make it ~50% slower and double the number of solver calls

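Here's roughly what the min/max search from the first bullet looks like (hypothetical helper names, not the code in my branch); it binary-searches the largest value the solver will still accept, which is where the log2(max) solver calls come from:

# Sketch only: can_be_at_least(v) stands in for a solver call that checks
# whether "var >= v" is satisfiable together with the other constraints.
def solve_max(can_be_at_least, width=32):
    lo, hi = 0, (1 << width) - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if can_be_at_least(mid):
            lo = mid         # the variable can reach mid, so max >= mid
        else:
            hi = mid - 1     # max is below mid
    return lo                # ~width solver calls in the worst case

# In the 1d example above with b == 0, this recovers a's true max of 4 even
# when the inline constraint only says a < 1_000_000_000.
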
I implemented each of these features with env vars to control them for now. Here's that 1d example with the cached min/max calculations. On my (older) laptop this reduces runtime by ~15% (13.5s to 11.8s), which I assume is because the swizzler has fewer bits to try (3 bits vs 5x6-bit fields?).

$ PYVSC_CACHE_RANGE=1 ./161_line_clean.py 
my_c - new_range=5           : [172, 196, 204, 219, 209]
my_c - new_range=1000        : [216, 174, 218, 204, 188]
my_c - new_range=1000000     : [182, 211, 198, 186, 223]
my_c - new_range=1000000000  : [206, 200, 192, 211, 191]
my_c - new_range=4294967295  : [218, 196, 212, 202, 172]

Here are my changes. I'll hold off making a PR until I get some feedback and also figure out a cleaner way to dump the Boolector model.
alwilson@6472fb6

@mballance (Member)

Thanks, @alwilson, for all the investigation and analysis here. I'll need some time to think this over.

One question I had on the conclusions was on this point:

  • Solve for the actual min/max of each variable using the solver
    • Pros:
      • Gives the swizzler a much better range, avoids bias
      • Can be done once and reused via cache
    • Cons:
      • Somewhat expensive, requires log2(max) solver calls
      • Boolector can only save the model to a file or print on stdout, no direct way to record it

What sort of constraint problems did you find were most helped by this approach? I'm guessing it didn't improve the implies case, since the domains there were conditional.

@alwilson (Contributor, Author)

Hey @mballance, for the implies case I was also setting b == 0 with an inline constraint. So maybe a little contrived for a 1d example... I guess I could've also written it with multiple ranges, which may be a little more believable.

(b == 0) -> (a < 5)
(b == 1) -> (a < 30)

I've tried chaining constraints together before and I don't think those get broken up. I guess it would be trivial for pyvsc to break down AND constraints though.
(x > 5) & (x < 100)
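
Something like this is what I mean by breaking an AND down into a range (hypothetical sketch, not pyvsc's actual constraint parsing):

# Sketch: reduce (x > 5) & (x < 100) style conjunctions to a [lo, hi] range.
def bounds_from_and(terms, width=32):
    lo, hi = 0, (1 << width) - 1
    for op, value in terms:
        if op == '>':
            lo = max(lo, value + 1)
        elif op == '>=':
            lo = max(lo, value)
        elif op == '<':
            hi = min(hi, value - 1)
        elif op == '<=':
            hi = min(hi, value)
    return lo, hi

print(bounds_from_and([('>', 5), ('<', 100)]))  # -> (6, 99)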

When I first started exploring this, I thought it would help most when a range isn't specified, as with x + y < 50. There's no traditional range for pyvsc to parse, although some constraint-parsing smarts could pull it out.

I should go through the unit tests and riscv-dv and see if there are any real-world examples there this would help. I'll admit that I don't think this will help in all cases. Most of this is driven by me exploring that initial 2d case I came up with, so maybe it's tailored a little closely to what I've explored so far. ;)
