Add tests that grow containers until running out of memory budget. #1242
Conversation
I've also added some simple time measurements for eye-balling the runtime; at least on my machine it is a bit suspiciously slow.
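For context, here is a minimal std-only sketch of the shape such a test and timing measurement could take. This is not the PR's code: the byte limit is a hypothetical stand-in for the host's memory budget, and the real tests exercise host containers and expect a budget-exceeded error rather than breaking out of a loop.

```rust
use std::time::Instant;

// Hypothetical stand-in for the host's memory budget limit.
const MEM_LIMIT_BYTES: u64 = 40 * 1024 * 1024;

fn main() {
    let start = Instant::now();
    let mut charged: u64 = 0;
    let mut container: Vec<Vec<u8>> = Vec::new();
    loop {
        // Grow the container in small increments, charging the stand-in budget.
        let entry = vec![0u8; 1024];
        charged += entry.len() as u64;
        if charged > MEM_LIMIT_BYTES {
            break; // the real test expects a budget-exceeded error here
        }
        container.push(entry);
    }
    // Timing is for eye-balling only; no assertion is made on the runtime.
    println!(
        "grew to {} entries in {:?}",
        container.len(),
        start.elapsed()
    );
}
```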
We should probably separate out the timing measurement part. These tests typically aren't built with …
This is just for sanity checking (as you can see, this doesn't make any assertions on the runtime); I'm still interested in results on a different machine. We can clean this up later.
@dmkozh From what I can tell, the following factors are at work in the over-measurement: …
I'm not entirely sure about the last point -- I've struggled this afternoon to really understand the calibration code and figure out how to get it to measure …
@graydon yes, completely agree on 1 and 2. @dmkozh's main concern (from our conversation yesterday) is that he is observing a ratio of about 1~2 insn/nsec when running these tests on his machine, instead of the 10-20 that the calibration (and the dashboard data) suggests. Which suggests we might be underestimating the cpu cost of memory allocation.

The current calibration allocates contiguous data up to some size N and extracts the linear coefficient, and it is quite low, ~1/128 cpu insn per byte. I was thinking we instead allocate a small chunk of memory (size k) a varying number of times (N/k) to increase the average cost per byte. This might be more aligned with the realistic use case where many small objects are created instead of multi-kb ones (as Dima's example here shows)? I am going to do some experiments around it tomorrow and report back.
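A rough illustration of the two calibration shapes being compared (plain Rust and wall-clock time only, not the actual calibration harness; the N and k values are arbitrary):

```rust
use std::hint::black_box;
use std::time::Instant;

// One contiguous N-byte allocation, as the current calibration does.
fn one_contiguous(n: usize) -> std::time::Duration {
    let start = Instant::now();
    let buf = black_box(vec![0u8; n]);
    black_box(&buf);
    start.elapsed()
}

// N/k allocations of k bytes each, the proposed alternative shape.
fn many_small(n: usize, k: usize) -> std::time::Duration {
    let start = Instant::now();
    let mut chunks = Vec::with_capacity(n / k);
    for _ in 0..(n / k) {
        chunks.push(black_box(vec![0u8; k]));
    }
    black_box(&chunks);
    start.elapsed()
}

fn main() {
    let n = 1 << 20; // 1 MiB total in both cases
    for k in [64, 1024, 16 * 1024] {
        println!(
            "contiguous: {:?}, {} x {} bytes: {:?}",
            one_contiguous(n),
            n / k,
            k,
            many_small(n, k)
        );
    }
}
```

The quantity of interest is the time per byte in each case: if many small allocations come out noticeably more expensive per byte, a calibration based on a single contiguous buffer would underestimate workloads like the one in this PR.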
I went back and double checked the mem alloc calibration … What this means is both … We have not yet done a detailed analysis of real time vs measured cpu cost at the function level, which will be a necessary next step after excluding the VmInstantiation cost.
Thanks for doing this benchmark, but it kind of raises more questions than it answers. If the mem alloc calibration is fine, then where could the discrepancy come from? I did a very quick and dirty analysis of pushing an element into vecs of different sizes. The instruction execution time seemingly grows linearly (R^2=0.98) until it plateaus at about ~40kB. Recalibration doesn't change the dependency, but reduces the absolute range between the 'best' and the 'worst' cases. I'm not sure if we should be overly concerned, but I can't quite explain this to myself using the reasoning above.
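For reference, a quick sketch of that experiment in plain Rust. It assumes that each push copies the whole vector (roughly what an immutable host vec does on push), which is an assumption about where the size-dependence comes from; the ~40kB plateau figure above is from the actual host measurement, not from this sketch.

```rust
use std::hint::black_box;
use std::time::Instant;

fn main() {
    // Vectors of exponentially growing size, 1 to ~1M elements.
    for size in (0..=20).map(|i| 1usize << i) {
        let base: Vec<u64> = (0..size as u64).collect();
        // Clone then push: mimics copy-on-push of an immutable host vec.
        let start = Instant::now();
        let mut v = base.clone();
        v.push(black_box(42u64));
        black_box(&v);
        println!("size {:>8}: {:?}", size, start.elapsed());
    }
}
```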
A few additional points: …
That would be bad, right? If we set the network limit based on the assumption that 10 instructions take 1 ns (say, 1s == 1e10 insns ledger-wide limit), then if the code actually executes at 1 insn/ns, we would get 10s of execution time. But if we set the limit based on the most pessimistic/worst case (say, 1s == 1e9 insns), then most of the time we'd waste 90% of the available ledger close time (at least looking at the current insn/ns ratio).

So while I agree that the exact ratio between insns and wall time doesn't matter too much, discrepancies like this make it much harder to answer the question of 'what is the reasonable ledger-wide instructions limit, given that the target ledger close time is X ms?'. Probably this particular case isn't too important because it hits memory limits fast, but in general the less discrepancy we have, the better.
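The arithmetic spelled out, using the same illustrative numbers (1e10 insn limit, actual rates of 10 vs 1 insn/ns):

```rust
// Wall time implied by a ledger-wide instruction limit at a given rate.
fn wall_time_secs(limit_insns: f64, insns_per_ns: f64) -> f64 {
    limit_insns / insns_per_ns / 1e9
}

fn main() {
    let limit = 1e10_f64; // a "1 second" budget assuming 10 insns/ns
    println!("{} s at 10 insn/ns", wall_time_secs(limit, 10.0)); // 1 s
    println!("{} s at  1 insn/ns", wall_time_secs(limit, 1.0)); // 10 s
}
```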
It depends on which of the disagreeing signals we use to set the budget limit: 1 or 10. In one case we risk slow ledgers, in another we risk underutilized ledgers. I believe we will never be able to get this bound tight -- a "model instruction" is really just a "virtual time period" and there are just too many things that influence time being collapsed together to get it tight -- so I recommend setting our limits on the side that risks underutilized ledgers. We can continue to explore and refine sources of inaccuracy over time (especially once we have more signal from validators about real perf costs) but I know which failure mode I'm more comfortable with to start.

I think probably the best thing we can do to get better signal is plumb through an additional metric that counts "virtual instructions excluding VM instantiation", because we know that: …
It might be that if you take that term out, the signal tightens up. That'd be nice! But we don't know.
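A sketch of what such a metric could look like. The cost-type names and the plain map are illustrative stand-ins for the budget's internal per-cost-type instruction counters, not the actual host API.

```rust
use std::collections::BTreeMap;

// Hypothetical cost-type labels, for illustration only.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
enum CostType {
    VmInstantiation,
    MemAlloc,
    WasmInsnExec,
}

/// Sum modeled instructions across cost types, leaving out VmInstantiation,
/// which is charged well above its real-time cost.
fn insns_excluding_vm_instantiation(per_type: &BTreeMap<CostType, u64>) -> u64 {
    per_type
        .iter()
        .filter(|(ty, _)| **ty != CostType::VmInstantiation)
        .map(|(_, n)| *n)
        .sum()
}

fn main() {
    let mut per_type = BTreeMap::new();
    per_type.insert(CostType::VmInstantiation, 5_000_000);
    per_type.insert(CostType::MemAlloc, 200_000);
    per_type.insert(CostType::WasmInsnExec, 1_300_000);
    // Prints 1500000: the total with the VM instantiation term removed.
    println!("{}", insns_excluding_vm_instantiation(&per_type));
}
```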
What
Add tests that grow containers until running out of memory budget.
I've also added some simple time measurements for eye-balling the runtime; at least on my machine it is a bit suspiciously slow.
Why
Increasing test coverage.
Known limitations
N/A