Random distribution of 2-variable constraints not uniform #161

Open · alwilson opened this issue Aug 18, 2022 · 10 comments

@alwilson (Contributor)

[ Using pyvsc-0.7.6 ]

I've been curious about how uniform the random distributions for SV and pyvsc constraints are and cooked up some 2D examples to help visualize the distribution with heatmaps. The heatmaps show pyvsc making some interesting patterns rather than being flat and I was curious whether that was expected. The same constraints in SV appear to be fairly uniform. The heatmap goes from 0/black to the max value / white.

The constraints for a triangle, used in the examples below:

import vsc

@vsc.randobj
class my_item_c(object):
    def __init__(self):
        self.a = vsc.rand_bit_t(32)
        self.b = vsc.rand_bit_t(32)

    # Triangle: a + b <= 50
    @vsc.constraint
    def ab_c(self):
        self.a >= 0
        self.b >= 0
        self.a <= 60 # Prevent overflow solutions
        self.b <= 60 # Prevent overflow solutions
        self.a + self.b <= 50

Triangle using pyvsc (100k randomizations, ranging from 14 to 831 hits per bucket):
(Image) triangle_pyvsc_100k_180persec

Triangle using SV (100k randomizations, ranging from 49 to 105 hits per bucket):
(Image) triangle_other_100k_116kpersec

Source for the above SV and pyvsc examples of a simple triangle constraint (x + y <= N). The dist.py script is pretty simple and has two functions to choose from for either a live animation of the randomizations or a performance measurement. The SV file and plot.py are just the same triangle and circle constraints I was testing out.
https://github.com/alwilson/pyvsc_testing
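
For a rough idea of what the heatmap script does, using the my_item_c class above (sketch only; the actual dist.py/plot.py in the repo differ in the details):

import numpy as np
import matplotlib.pyplot as plt

# Sketch: randomize the item many times, bucket (a, b) pairs into a 2D
# histogram, and show it as a heatmap (0 = black, max hits = white).
N_SAMPLES = 100_000
SIDE = 61  # a and b are constrained to [0, 60]

counts = np.zeros((SIDE, SIDE), dtype=int)
item = my_item_c()
for _ in range(N_SAMPLES):
    item.randomize()
    counts[item.a, item.b] += 1

plt.imshow(counts.T, origin='lower', cmap='gray')
plt.xlabel('a')
plt.ylabel('b')
plt.colorbar(label='hits per bucket')
plt.show()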

Note: I've also found that SV constraint solving is very fast (100k simple 2-var randomizations in ~2s) and that pyvsc is much slower (~2000s), which I should say I'm not surprised by, although I'd love to better understand why and what the differences between the solvers/generators are.

I also tried putting a "hole" in the triangle constraints, which you can see in my source, but chaining constraints with or/and doesn't seem to do the same thing as it does in SV. Not sure if it's a pyvsc bug or if there's a different way to write out those constraints.

Thanks for developing this project! I've been having fun with it. >:)

@mballance (Member)

Hi @alwilson, thanks for putting the work into creating this way to look at 2d distributions. Thus far, I've primarily been looking at 1d distributions -- and that's difficult enough in text format given a reasonably-sized domain.
I'm a bit surprised that the distribution is as skewed as it is, and will need to dig in a bit more to understand why. I find it very interesting that there appear to be hot spots at (min a, max b) and (max a, min b) -- hopefully a clue that leads in a productive direction!

Again, thanks for putting together this way to visualize PyVSC distributions. Hopefully I'll shortly have some insights on how to improve the results!

@qzcx commented Aug 18, 2022

The great part about Python is all the built-in graphing and data-crunching tools!

I've seen similar imbalances in my results, but they've been good enough for my uses up to this point.

@alwilson (Contributor, Author)

Had some fun with histograms and thought I'd post them here, too. I'll upload my changes once I clean things up. A good excuse for playing around with matplotlib and numpy!

pyvsc (100k samples)
(Image) pyvsc_100k

python random.randrange calls (100k samples)
I rewrote the constraints as a conditional and used the random library to get something fast that's similar to what SV outputs to compare with.
(Image) python_rand_100k
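
Roughly what that comparison does (sketch only; the actual script differs in the details):

import random

# Sketch of the pure-Python baseline: draw a and b with randrange and keep
# only the samples that satisfy the triangle constraint.
def sample_triangle():
    while True:
        a = random.randrange(0, 61)
        b = random.randrange(0, 61)
        if a + b <= 50:
            return a, b

samples = [sample_triangle() for _ in range(100_000)]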

mballance added a commit that referenced this issue Aug 18, 2022
- (#161) Adjust swizzling algorithm to improve multi-variable distribution

Signed-off-by: Matthew Ballance <matt.ballance@gmail.com>
@mballance (Member)

After reviewing the 'swizzling' implementation, I have a better understanding of the hot spots at (min,max) and (max,min). While the variables to swizzle in each solve set are randomly selected, the bits to force in each were being applied sequentially. This increased the probability that few of the bit-force constraints on the second variable would hold, allowing Boolector to repeatedly select the same solution for the second variable.
Release 0.7.7 randomizes the order in which bit-force constraints are applied. From a visual inspection of text output, this appears to provide a good improvement in multi-variable distribution. Please give the new release a try as you're able.
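
As a rough illustration of the change (sketch only, not the actual PyVSC code):

import random

# Sketch only: previously the per-bit force constraints were applied in a
# fixed sequential order; 0.7.7 randomizes that order before they are handed
# to the solver.
def bit_force_order(width):
    order = list(range(width))   # previously visited sequentially
    random.shuffle(order)        # 0.7.7: random order per randomization
    return [(bit, random.getrandbits(1)) for bit in order]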

@alwilson (Contributor, Author) commented Aug 18, 2022

That looks better! On the histogram the counts look a bit more spread out, and the hotspots have spread out and lowered. It still doesn't have that normal distribution that I think we would want, although it's been a while since I've thought about probability...

pyvsc triangle (100k samples)
(Image) pyvsc_triangle_100k_7_7

I did want to point out that with a quarter-circle constraint in SV (1M samples) the distribution does not appear to be normal. Or at least it tends to have hotspots as well. So I'm not sure what the gold standard for pyvsc should be. Adding multiplication constraints into the mix perhaps makes it harder to guarantee?
(Image) SV quarter-circle distribution, 1M samples

@minwooim commented Jul 9, 2024

Hello :)

Any news on this issue?

@alwilson (Contributor, Author)

I reran some distribution tests on 5f79289. I don't believe much has changed with pyvsc's randomization and swizzler.

Here is the triangle distribution from before, and I ran a donut and a square one as well. I was curious whether those hot spots would show up for constraints without addition, and they do. The donut constraints are just some x,y constraints that make a hole in the middle, and the hot spots still show up there. The square constraints don't have an issue, but I think that boils down to rand-range calls that don't invoke the swizzler. The square constraints' count/hit histogram has the normal distribution that I think we would want the others to have.

Triangle constraints (100k samples):
(Image) pyvsc_triangle_dist_100k_july11_2024

Donut constraints (100k samples):
(Image) pyvsc_donut_dist_100k_july11_2024

Square constraints (100k samples):
(Image) pyvcs_square_dist_100k_july11_2024

This reminds me that I've never really seen a good description of what distribution SystemVerilog should return. The closest I've found is 1800-2017 Section 18.5.10 "Variable ordering":

The solver shall assure that the random values are selected to give a uniform value distribution over legal value combinations (that is, all combinations of legal values have the same probability of being the solution). This important property guarantees that all legal value combinations are equally probable, which allows randomization to better explore the whole design space.

I would think having a uniform distribution across a solution space would require somehow exploring that whole solution space beforehand. That may be possible for simple constraints, but I imagine it's always possible to construct more and more complex constraints until it's impossible to tell what the solution space might be.

I wonder if randomizing all variables across their valid ranges once and then checking them against the constraints would be enough to get closer to a uniform distribution. Although I think that would require obeying solve-before constraints somehow to get the ordering and distribution right. And it would only really work for simpler constraints.

@alwilson (Contributor, Author)

Here's an interesting 1d distribution example that explores the swizzler behavior. Pyvsc uses constraints to find the ranges of each variable when randomize is called, but those ranges can change as variables get solved, can be missing, or can be much wider than the true range.
For example, here a is unconstrained when b=1, but is limited to [0,4] when b=0.

rand unsigned [32] a
rand unsigned [1] b
constraint ab_c
    (b == 0) -> (a < 5);

If I constrain b == 0 and give a various new maximum values, the distribution goes from uniform to very biased.

new_range=5           : a = [211, 204, 184, 211, 190]
new_range=1000        : a = [119, 195, 181, 198, 307]
new_range=1000000     : a = [120, 117, 118, 129, 516]
new_range=1000000000  : a = [42, 23, 35, 34, 866]
new_range=4294967295  : a = [35, 30, 26, 29, 880]
Source
import vsc

TRUE_RANGE=5
NUM_SAMPLES=1_000

@vsc.randobj
class my_c(object):
    def __init__(self):
        self.a = vsc.rand_bit_t(32)
        self.b = vsc.rand_bit_t(1)
    @vsc.constraint
    def ab_c(self):
        vsc.solve_order(self.b, self.a)
        with vsc.implies(self.b == 0):
            self.a < TRUE_RANGE

obj = my_c()
for new_range in [TRUE_RANGE, 1_000, 1_000_000, 1_000_000_000, 2**32-1]:
    bins = [0] * TRUE_RANGE
    for _ in range(NUM_SAMPLES):
        # Inline constraints: force b == 0 (so the a < TRUE_RANGE implication
        # applies) and give a a looser upper bound of new_range.
        with obj.randomize_with(debug=0):
            obj.b == 0
            obj.a < new_range
        bins[obj.a] += 1
    new_range = str(new_range)
    print(f'{obj.tname} - {new_range=: <12}: {bins}')

I think most of this bias is due to the swizzler switching from randomizing a single bit at a time to 6 bits at a time; if a isn't constrained tightly enough, it becomes more unlikely that the forced bits will still solve. If I limit the swizzler to 1 bit at a time this isn't as bad, but then it has to run the solver many more times, and randomizing bits one at a time means it's essentially asking the solver for a value in [0, (2**N)-1] folded mod max, where max is the actual maximum range. I think most randomization libraries use methods like rejection sampling to reduce this power-of-2 sampling bias.
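
For illustration, here's that bias and the usual fix in plain Python (nothing pyvsc-specific):

import random

# Folding a full-width random value into a smaller range with mod over-weights
# the low end whenever the power-of-2 range isn't a multiple of max_value.
def biased_below(max_value, bits=32):
    return random.getrandbits(bits) % max_value

# Rejection sampling: draw only as many bits as needed and retry until the
# value lands in [0, max_value). (Python's random.randrange works this way.)
def unbiased_below(max_value):
    bits = max_value.bit_length()
    while True:
        v = random.getrandbits(bits)
        if v < max_value:
            return v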

If you look at the previous 2d triangle examples I've shared, there are hot spots, and I think those are due to the swizzler randomizing the first variable and then using the same range, [0,50], for the second variable, which isn't accurate most of the time and causes that power-of-2 sampling bias. If you mess with the range constraints of that triangle, you eventually hit the 6-bits-at-a-time swizzler and it can get funky.

(Image) triangle distribution, x,y range of 600 instead of 60, 10k samples


Here are some possible improvements I've thought of and implemented locally:

  • Solve for the actual min/max of each variable using the solver (see the sketch after this list)
    • Pros:
      • Gives the swizzler a much better range, avoids bias
      • Can be done once and reused via cache
    • Cons:
      • Somewhat expensive, requires log2(max) solver calls
      • Boolector can only save the model to a file or print on stdout, no direct way to record it
  • Attempt randomizing all variables across range and solving without swizzling
    • Pros:
      • Can give a uniform distribution and skip expensive swizzling
      • At minimum only requires a single solver call
      • Can use cached min/max ranges for better accuracy
    • Cons: May not always help
  • Solve min/max for each variable after each swizzling step to avoid even more bias
    • Pros: Good distribution and even less bias
    • Cons:
      • Not a uniform distribution
      • Lots of solver calls for each variable; testing seems to make it ~50% slower and double the number of solver calls

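Here's roughly what the min/max search from the first bullet looks like (hypothetical helper names, not the code in my branch); it binary-searches the largest value the solver will still accept, which is where the log2(max) solver calls come from:

# Sketch only: can_be_at_least(v) stands in for a solver call that checks
# whether "var >= v" is satisfiable together with the other constraints.
def solve_max(can_be_at_least, width=32):
    lo, hi = 0, (1 << width) - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if can_be_at_least(mid):
            lo = mid         # the variable can reach mid, so max >= mid
        else:
            hi = mid - 1     # max is below mid
    return lo                # ~width solver calls in the worst case

# In the 1d example above with b == 0, this recovers a's true max of 4 even
# when the inline constraint only says a < 1_000_000_000.
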
I implemented each of these features with env vars to control them for now. Here's that 1d example with the cached min/max calculations. On my (older) laptop this reduces runtime by ~15% (13.5s to 11.8s), which I assume is because the swizzler has fewer bits to try (3 bits vs 5x6-bit fields?).

$ PYVSC_CACHE_RANGE=1 ./161_line_clean.py 
my_c - new_range=5           : [172, 196, 204, 219, 209]
my_c - new_range=1000        : [216, 174, 218, 204, 188]
my_c - new_range=1000000     : [182, 211, 198, 186, 223]
my_c - new_range=1000000000  : [206, 200, 192, 211, 191]
my_c - new_range=4294967295  : [218, 196, 212, 202, 172]

Here are my changes. I'll hold off making a PR until I get some feedback and also figure out a cleaner way to dump the Boolector model.
alwilson@6472fb6

@mballance (Member)

Thanks, @alwilson, for all the investigation and analysis here. I'll need some time to think this over.

One question I had on the conclusions was on this point:

  • Solve for the actual min/max of each variable using the solver
    • Pros:
      • Gives the swizzler a much better range, avoids bias
      • Can be done once and reused via cache
    • Cons:
      • Somewhat expensive, requires log2(max) solver calls
      • Boolector can only save the model to a file or print on stdout, no direct way to record it

What sort of constraint problems did you find were most helped by this approach? I'm guessing it didn't improve the implies case, since the domains there were conditional.

@alwilson (Contributor, Author)

Hey @mballance, for the implies case I was also setting b == 0 with an inline constraint. So maybe a little contrived for a 1d example... I guess I could've also written it with multiple ranges, which may be a little more believable.

(b == 0) -> (a < 5)
(b == 1) -> (a < 30)

I've tried chaining constraints together before and I don't think those get broken up. I guess it would be trivial for pyvsc to break down AND constraints though.
(x > 5) & (x < 100)
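
Something like this is what I mean by breaking an AND down into a range (hypothetical sketch, not pyvsc's actual constraint parsing):

# Sketch: reduce (x > 5) & (x < 100) style conjunctions to a [lo, hi] range.
def bounds_from_and(terms, width=32):
    lo, hi = 0, (1 << width) - 1
    for op, value in terms:
        if op == '>':
            lo = max(lo, value + 1)
        elif op == '>=':
            lo = max(lo, value)
        elif op == '<':
            hi = min(hi, value - 1)
        elif op == '<=':
            hi = min(hi, value)
    return lo, hi

print(bounds_from_and([('>', 5), ('<', 100)]))  # -> (6, 99)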

When I first started exploring this, I thought it would help most when a range isn't specified, as with x + y < 50. There's no traditional range for pyvsc to parse, although some constraint-parsing smarts could pull it out.

I should go through the unit tests and riscv-dv and see if there are any real-world examples there this would help. I'll admit that I don't think this will help in all cases. Most of this is driven by me exploring that initial 2d case I came up with, so maybe it's tailored a little closely to what I've explored so far. ;)
