Add tests that grow containers until running out of memory budget. #1242

Merged (7 commits, Nov 22, 2023)

Conversation

@dmkozh (Contributor) commented Nov 21, 2023

What

Add tests that grow containers until running out of memory budget.

I've also added some simple time measurements for eye-balling the runtime; at least on my machine it is a bit suspiciously slow.

Why

Increasing test coverage.

Known limitations

N/A
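
For illustration, the shape of such a test can be sketched as follows. This is a minimal, self-contained Rust sketch, not the host's actual API: `MemBudget` and `grow_until_exhausted` are hypothetical stand-ins for the host's budget and the `vec_new`/`vec_push_back` host functions the real tests exercise.

```rust
// Hypothetical sketch: grow a container, charging a memory budget for each
// element, until the budget is exhausted (the real tests go through the host).
struct MemBudget {
    used: u64,
    limit: u64,
}

impl MemBudget {
    fn charge(&mut self, bytes: u64) -> Result<(), &'static str> {
        self.used = self.used.saturating_add(bytes);
        if self.used > self.limit {
            Err("memory budget exceeded")
        } else {
            Ok(())
        }
    }
}

fn grow_until_exhausted(limit: u64) -> usize {
    let mut budget = MemBudget { used: 0, limit };
    let mut v: Vec<u64> = Vec::new();
    loop {
        // Charge for the element we are about to push; stop as soon as the
        // charge fails, mimicking the host returning a budget-exceeded error.
        if budget.charge(std::mem::size_of::<u64>() as u64).is_err() {
            return v.len();
        }
        v.push(v.len() as u64);
    }
}

#[test]
fn grows_vec_until_out_of_memory_budget() {
    // With an 8 KiB limit and 8-byte elements, growth stops after 1024 pushes.
    assert_eq!(grow_until_exhausted(8 * 1024), 1024);
}
```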

@jayz22 (Contributor) commented Nov 21, 2023

We should probably separate out the timing measurement part. These tests typically aren't built with --release, so the timing results they produce aren't accurate. We have the metering_benchmark.rs tests, where the timing, actual cpu/mem, and modeled cpu/mem are compared, with a comment on top stating that they should be run separately in a proper setup.
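
As a side note, the kind of eyeball timing under discussion boils down to something like the sketch below (a hypothetical helper, not the metering_benchmark.rs harness); the numbers it prints are only meaningful in a --release build.

```rust
use std::time::Instant;

// Hypothetical timing helper: wraps a closure and prints its wall-clock time,
// flagging debug builds where the measurement is not representative.
fn time_it<T>(label: &str, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let out = f();
    if cfg!(debug_assertions) {
        eprintln!("{label}: {:?} (debug build -- not representative)", start.elapsed());
    } else {
        eprintln!("{label}: {:?}", start.elapsed());
    }
    out
}
```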

@jayz22 (Contributor) commented Nov 21, 2023

Also FYI, we have #1148, which I'm currently working on (continuing from @sisuresh's PR). It mentions oversized map and vec instantiation from various paths. I see you've covered one path here -- from a host function call -- which is totally fine, but we need to cover the other paths too.

@dmkozh (Contributor, Author) commented Nov 21, 2023

We should probably separate out the timing measurement part.

This is just for sanity checking (as you can see, it doesn't make any assertions on the runtime); I'm still interested in results on a different machine. We can clean this up later.

@graydon (Contributor) commented Nov 22, 2023

@dmkozh From what I can tell, the following factors are at work in the over-measurement:

  • On testnet / futurenet: we're measuring real contracts and to a first approximation those are almost entirely spent in VM instantiation (they're neither microbenchmarks nor malicious) and we over-count the cost of VM instantiation by a lot because it has major variability in its worst case. So the model thinks we're doing as much as 10x more work than we really are. This will get better once we speed up VM instantiation (caching and such) but it's always going to be a bit heavy. A good option for sniff-testing metrics would be to collect a second metric that excludes the VM instantiation.
  • We calibrated (so far) on mac M1 rather than Intel, and that means we over-count instructions a bit. Possibly a factor of 2 or less, but it adds noise. We're intending to recalibrate on Intel before final.
  • Your experiment in varying costs between vec_new and vec_push reveals some significant non-linearity in the CPU costs of memory allocation, at least on my machine. As far as I can tell the allocator is just a lot faster at allocating 0-byte, 1-byte, 8-byte, and a few other "small" sized objects (presumably those it has small freelists for) than it is at allocating larger ones. The MemAlloc cost model is calibrated using allocations up in the multiple-KiB range (which, again, it kind of has to be because people can allocate things that large), so we overcount the number of instructions in a small allocation like vec_new. You can observe this if you change MeteredVec::new from Vec::new to Vec::with_capacity(256) or so: your instructions-per-nanosecond will drop from 10 to below 1 (see the sketch after this list). We might be able to minimize this overcounting if we move the host-object population (and/or other dynamically allocated host values) to an arena allocator, as we've occasionally discussed.
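
A rough way to observe the effect described in the last point (illustrative only, not the host's calibration code): time many zero-capacity constructions against many 256-element pre-allocations and compare the per-call cost.

```rust
use std::time::Instant;

// Zero-capacity vectors perform no allocation at all, while with_capacity(256)
// for u64 elements requests a 2 KiB block from the allocator on every call.
fn time_allocs(cap: usize, iters: usize) -> std::time::Duration {
    let start = Instant::now();
    let mut keep = Vec::with_capacity(iters);
    for _ in 0..iters {
        let v: Vec<u64> = Vec::with_capacity(cap);
        keep.push(v); // keep the vectors alive so the allocations actually happen
    }
    start.elapsed()
}

fn main() {
    let iters = 100_000;
    println!("cap 0:   {:?}", time_allocs(0, iters));
    println!("cap 256: {:?}", time_allocs(256, iters));
}
```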

I'm not entirely sure about the last point -- I've struggled this afternoon to really understand the calibration code and figure out how to get it to measure MemAlloc on small sizes at all -- and I would appreciate input from @jayz22 on this. But I think basically this is within the realm of "the best we're going to be able to do".

@graydon added this pull request to the merge queue Nov 22, 2023
Merged via the queue into stellar:main with commit 8cc4b43 Nov 22, 2023
10 checks passed
@jayz22 (Contributor) commented Nov 22, 2023

@graydon yes, completely agree on 1 and 2. VmInstantiation and M1 are the two factors of overestimation. I'm currently working on recalibrating on a non-M1 machine, so the latter will go away soon.

@dmkozh's main concern (from our conversation yesterday) is that he is observing a ratio of about 1~2 insn/nsec when running these tests on his machine, instead of the 10-20 that the calibration (and the dashboard data) suggests. That implies we might be underestimating the cpu cost of memalloc by an order of magnitude, at least on some architectures.

The current calibration allocates a contiguous chunk of data up to some size N and extracts the linear coefficient, and it is quite low, ~1/128 cpu insn per byte. I was thinking we could instead allocate a small chunk of memory (size k) a varying number of times (N/k) to increase the average cost per byte. This might be more aligned with realistic use cases, where many small objects are created instead of multi-KB ones (as Dima's example here shows)?
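
The proposed change in calibration shape can be sketched as follows (hypothetical code, not the actual calibration harness): instead of one contiguous allocation of N bytes, perform N/k allocations of k bytes each, so the fixed per-allocation overhead is not amortized away by a single large allocation.

```rust
// Proposed shape: many small allocations summing to the same total size.
fn alloc_many_small(total_bytes: usize, chunk_bytes: usize) -> Vec<Vec<u8>> {
    let n_chunks = total_bytes / chunk_bytes;
    let mut chunks = Vec::with_capacity(n_chunks);
    for _ in 0..n_chunks {
        // Each small allocation pays the allocator's fixed overhead again.
        chunks.push(vec![0u8; chunk_bytes]);
    }
    chunks
}

// Current shape: one contiguous chunk of `total_bytes`.
fn alloc_one_large(total_bytes: usize) -> Vec<u8> {
    vec![0u8; total_bytes]
}

fn main() {
    // e.g. 16 KiB in 64-byte chunks vs. one contiguous 16 KiB allocation
    let small = alloc_many_small(16 * 1024, 64);
    let large = alloc_one_large(16 * 1024);
    assert_eq!(small.len() * 64, large.len());
}
```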

I am going to do some experiments around it tomorrow and report back.

@jayz22 (Contributor) commented Nov 22, 2023

I went back and double-checked the MemAlloc calibration, and played around with a few tweaks.
In conclusion, I didn't find anything wrong with the calibration.
The cpu cost for allocating a chunk of memory grows very slowly w.r.t. the number of bytes. Attached below is the result of sweeping through 128 bytes x (0..200). Focus on the ave_net_input and ave_net_insns columns, which are the x and y axes.
output.txt (This is done on an x86 machine)

What this means is that both vec_new (which allocates an empty container) and vec_push_back (which allocates a container of size vec.len) will pay a similar amount of MemAlloc cost.
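
To make that concrete, here is a toy evaluation of a linear cost model using the ~1/128 insn-per-byte coefficient quoted above; the constant term is invented for illustration and the real model's parameters differ.

```rust
// Toy linear cost model: constant per-allocation cost plus ~1/128 insn per byte.
fn mem_alloc_insns(const_term: u64, bytes: u64) -> u64 {
    const_term + bytes / 128
}

fn main() {
    let const_term = 1_000; // hypothetical fixed cost per allocation
    println!("vec_new (0 bytes):     {}", mem_alloc_insns(const_term, 0)); // 1000
    println!("vec_push_back (1 KiB): {}", mem_alloc_insns(const_term, 1024)); // 1008
}
```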

We do not have time to do a detailed analysis of real time vs. measured cpu cost at the function level, which will be a necessary next step after excluding the VmInstantiation cost.

@dmkozh (Contributor, Author) commented Nov 23, 2023

The cpu cost for allocating a chunk of memory grows very slowly w.r.t the number of bytes.

Thanks for doing this benchmark, but it kind of raises more questions than it answers. If the mem alloc calibration is fine, then where could the discrepancy come from?

I did a very quick and dirty analysis of pushing an element into vecs of different sizes. The instruction execution time seemingly grows linearly (R^2=0.98) until it plateaus at about ~40kB. Recalibration doesn't change the dependency, but it reduces the absolute range between the 'best' and the 'worst' cases. I'm not sure we should be overly concerned, but I can't quite explain this to myself using the reasoning above.
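
A rough reconstruction of that kind of quick-and-dirty check might look like the sketch below (hypothetical, not the actual measurement code). It mimics a copy-then-append per push, which is one plausible shape for the host's vec_push_back; the actual host code may differ.

```rust
use std::hint::black_box;
use std::time::Instant;

// Time a copy-then-append "push" into vectors of increasing size; the cost
// should grow roughly linearly with the vector length.
fn time_push_at_size(len: usize, iters: u32) -> std::time::Duration {
    let base: Vec<u64> = (0..len as u64).collect();
    let start = Instant::now();
    for _ in 0..iters {
        let mut v = base.clone();
        v.push(0);
        black_box(&v); // keep the result alive so the work isn't optimized away
    }
    start.elapsed() / iters
}

fn main() {
    for len in [0usize, 1 << 8, 1 << 10, 1 << 12, 1 << 14, 1 << 16] {
        println!("len {:>6}: {:?}", len, time_push_at_size(len, 1_000));
    }
}
```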

@graydon (Contributor) commented Nov 23, 2023

A few additional points:

  • Starting at 128 bytes is not small enough (as you can see in @dmkozh's tests and as I observed when I did some tests myself). Mallocs almost always have "small object" freelists that go much faster for allocations that are much smaller: 8 bytes, 16, 24, 32, etc. Zero-sized objects (such as Vec::new) may be even faster; they might not enter malloc at all, just store a null pointer. And we do actually allocate small objects (8, 16, 32 bytes) fairly often.
  • Mallocs differ between operating systems and toolchain builds, often a lot. I think Rust defaults to using the system allocator these days (which is radically different between linux and macOS) but I think it might also have an option (maybe even a default in some configs) to use jemalloc. It's worth figuring out what you're actually measuring!
  • Benchmarking via vector operations is going to be misleading if you do anything other than Vec::with_capacity (you might even want to avoid that, and just benchmark Box::new or something; see the sketch after this list). Vec::push definitely doesn't tell you what an allocation costs -- it's doing a memcpy as well as possibly a few other things.
  • I think I am confused about the concern here. In my reasoning, if the model says it's executing 10 insns / nsec but CPU counters say the machine is executing closer to 1 insn / nsec, then we're overestimating, not underestimating, right? Like .. instructions don't actually matter. We care about time. So if we want execution to finish in (say) 1msec, and we give a budget of N "instructions" which we think will be depleted in 1msec, but our model counts some operations as incurring too many "instructions" and those N "instructions" therefore all get depleted in 0.1msec, then .. we just made more time for ourselves, right? It's maybe under-utilizing the machine and over-throttling traffic, but not a DoS risk, right?
  • I think it's quite reasonable to assume we're overcounting because we benchmarked the slow paths for large allocations in a malloc, but we mostly execute the fast paths for small allocations in a malloc. Many of both the slow and fast paths should actually be constant time (they don't fill the data they're returning or anything, just pull it out of a data structure) but .. it depends on the accounting structure used. Freelists: constant. Bitmaps: linear. Buddy-allocators: logarithmic maybe? Anyway I suspect if there's any observable "linear factor" in allocation perf, it's just as likely the model interpolating between totally different behaviours: special cases hitting the tail end of the fast-paths, and non-special cases that hit the slow paths (or those that are big enough to be multi-page allocations, which may become linear-cost again due to the paged nature of memory). I'm really not sure there's an easy way to benchmark this accurately in a way that'll give you a linear equation that represents all allocations correctly. I think mallocs do very different things for different size classes.
  • I don't think we're likely to make the bound especially tight in any case, because cache effects are ignored in the model, but in reality (i.e. by the wall clock that we actually care about) cache effects cause easily 2 or even 3 orders of magnitude in time-variation.
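
A minimal version of the isolated-allocation benchmark suggested in the third point above could look like this (illustrative only): measuring Vec::with_capacity keeps the number focused on the allocator itself rather than Vec::push's growth logic and memcpy.

```rust
use std::hint::black_box;
use std::time::Instant;

// Time raw allocations of various sizes; with_capacity allocates (without
// initializing) at least `bytes` bytes, and black_box keeps the allocation
// from being optimized away.
fn time_raw_allocs(bytes: usize, iters: u32) -> std::time::Duration {
    let start = Instant::now();
    for _ in 0..iters {
        black_box(Vec::<u8>::with_capacity(bytes));
    }
    start.elapsed() / iters
}

fn main() {
    for bytes in [8usize, 64, 1024, 64 * 1024] {
        println!("{:>6} bytes: {:?}", bytes, time_raw_allocs(bytes, 100_000));
    }
}
```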

@dmkozh (Contributor, Author) commented Nov 27, 2023

if the model says it's executing 10 insns / nsec but CPU counters say the machine is executing closer to 1 insn / nsec, then we're over estimating, not underestimating,

That would be bad, right? If we set the network limit based on the assumption that 10 instructions take 1 ns (say, a ledger-wide limit of 1s == 1e10 insns), then if the code actually executes at 1 insn/ns, we would have a 10s execution time. But if we set the limit based on the most pessimistic/worst case (say, 1s == 1e9 insns), then most of the time we'd waste 90% of the available ledger close time (at least looking at the current insn/ns ratio). So while I agree that the exact ratio between insns and wall time doesn't matter too much, discrepancies like this make it much harder to answer the question of 'what is the reasonable ledger-wide instruction limit, given that the target ledger close time is X ms?'. Probably this particular case isn't too important because it hits memory limits fast, but in general the less discrepancy we have, the better.
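
Spelled out with the numbers from the comment above (purely illustrative):

```rust
fn main() {
    let optimistic_limit = 1e10; // ledger-wide budget sized assuming 10 insn/ns
    let pessimistic_limit = 1e9; // ledger-wide budget sized assuming 1 insn/ns
    let ns_per_sec = 1e9;

    // If real code runs at 1 insn/ns, the optimistic budget takes 10x the intended second.
    println!("optimistic limit at 1 insn/ns:   {} s", optimistic_limit / 1.0 / ns_per_sec); // 10
    // If real code runs at 10 insn/ns, the pessimistic budget uses only 10% of the second.
    println!("pessimistic limit at 10 insn/ns: {} s", pessimistic_limit / 10.0 / ns_per_sec); // 0.1
}
```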

@graydon (Contributor) commented Nov 27, 2023

That would be bad, right? If we set the network limit based on the assumption that 10 instructions take 1 ns (say, 1s == 1e10 insns ledger-wide limit), then if the code actually executes at 1 insn/ns, we would have 10s execution time

It depends on which of the disagreeing signals we use to set the budget limit: 1 or 10. In one case we risk slow ledgers; in the other we risk underutilized ledgers.

I believe we will never be able to get this bound tight -- a "model instruction" is really just a "virtual time period" and there are just too many things that influence time being collapsed together to get it tight -- so I recommend setting our limits on the side that risks underutilized ledgers. We can continue to explore and refine sources of inaccuracy over time (especially once we have more signal from validators about real perf costs) but I know which failure mode I'm more comfortable with to start.

I think probably the best thing we can do to get better signal is plumb through an additional metric that counts "virtual instructions excluding VM instantiation" because we know that:

  • we're massively over-estimating that for safety's sake
  • it makes up the vast majority of the virtual instruction budget

It might be that if you take that term out, the signal tightens up. That'd be nice! But we don't know.
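
A sketch of what such a derived metric could look like (the struct and field names here are hypothetical; the real budget/metrics plumbing in the host differs):

```rust
// Report total modeled instructions minus the portion attributed to VM
// instantiation, so the remaining signal can be compared against real time.
struct BudgetSnapshot {
    total_insns: u64,
    vm_instantiation_insns: u64, // modeled cost charged to VM instantiation
}

impl BudgetSnapshot {
    fn insns_excluding_vm_instantiation(&self) -> u64 {
        self.total_insns.saturating_sub(self.vm_instantiation_insns)
    }
}
```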
