xs performance investigation (array, Map/Set optimizations) #3012

Closed
warner opened this issue May 1, 2021 · 20 comments · Fixed by #3560
Labels: bug (Something isn't working), xsnap (the XS execution tool)

warner commented May 1, 2021

We're looking at nonlinear performance characteristics of the core datatypes (specifically Array, Map, Set) in XS.

We originally focused on WeakMap, but that part has been addressed.

When we run our load-generator test, we see the zoe vat taking longer and longer with each load cycle. Our current implementation has the following properties:

  • zoe has a number of WeakMaps used to keep track of seats, invitations, etc.
    • zoe was written with the assumption that the weakness of WeakMap is sufficient to remove the entries when they are no longer active, so there are no explicit wm.delete() calls
  • swingset does not yet provide GC (sufficiently-deterministic GC #2615), so when liveslots creates a Presence and gives it to zoe, or when zoe creates a Remotable and sends it off-vat, that object will be kept alive by liveslots forever
  • as a result, zoe's WeakMaps grow larger and larger over time, filled with a small number of actively-used items and a much larger number of items which would be garbage collected if only liveslots could drop them
  • on each cycle of the load generator, zoe will add a few items to the WeakMap, and retrieve those same items
    • these items will always be the most recently added ones
  • moreover (and probably more significantly), our SES shim uses a single shared WeakMap to implement harden (specifically, to track which objects have already been frozen). Since our programming/safety style is to harden() everything, this WeakMap will include everything that Zoe puts in the WeakMaps listed above, plus far more. Many of those objects will go away (they are not held by liveslots), but everything that liveslots is holding will probably also be present in the harden WeakMap.

We don't specifically know that WeakMap is involved, but given how Zoe works, it's a reasonable hypothesis for something that might grow over time.
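
For concreteness, here is a minimal sketch (not the real ses shim) of how a weak-collection-based harden ends up remembering every hardened object, so its bookkeeping grows along with whatever liveslots keeps alive:

const alreadyHardened = new WeakSet();

function hardenSketch(root) {
  // Simplified: the real shim also walks prototypes and accessor functions,
  // and requires the shared intrinsics to be handled separately.
  const toFreeze = [root];
  while (toFreeze.length > 0) {
    const obj = toFreeze.pop();
    if (Object(obj) !== obj || alreadyHardened.has(obj)) continue; // skip primitives and repeats
    Object.freeze(obj);
    alreadyHardened.add(obj); // every hardened object is remembered forever
    for (const name of Reflect.ownKeys(obj)) {
      const desc = Object.getOwnPropertyDescriptor(obj, name);
      if ('value' in desc) toFreeze.push(desc.value); // recurse into own data properties
    }
  }
  return root;
}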

Our current working hypotheses are:

  • the XS WeakMap implementation appears to use a simple linked list, which means read time for the first-added entry will be constant, while read time for the most-recently-added entry will be O(num_entries)
    • zoe only interacts with the most-recently added entries, making this the worst case
  • when adding a WeakMap entry, XS must get a new "slot" structure with fxNewSlot
    • if there is free space available (the->freeHeap holds a linked-list of free slots), no allocations are necessary
    • if not, fxNewSlot might trigger GC (fxCollect) to make some available, and/or it might call fxGrowSlots to allocate more by calling malloc()
    • because we aren't deleting items, we call fxNewSlot a lot
    • because the total number of objects in the heap is large, GC takes a long time, O(num_objects)

Together, this would explain a linear slowdown in the time each load-generator cycle takes. It would not explain the superlinear slowdown that we've measured.
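
One way to test the linked-list hypothesis from plain JavaScript is to time reads of the most-recently-added key as the map grows. A minimal sketch follows (not part of the issue's actual instrumentation; it assumes either console.log or an xsnap-style print global, and falls back to Date.now() if performance.now() is unavailable):

const now = () =>
  typeof performance !== 'undefined' ? performance.now() : Date.now();
const log = typeof console !== 'undefined' ? console.log : globalThis.print;

function benchNewestKeyGet(size, reads = 100000) {
  const keys = new Array(size);
  const wm = new WeakMap();
  for (let i = 0; i < size; i += 1) {
    keys[i] = {}; // keep every key strongly reachable so no entry is collected
    wm.set(keys[i], i);
  }
  const newest = keys[size - 1]; // zoe's access pattern: the newest entry
  const t0 = now();
  let sink = 0;
  for (let i = 0; i < reads; i += 1) {
    sink += wm.get(newest);
  }
  const t1 = now();
  return { usPerGet: ((t1 - t0) * 1000) / reads, sink };
}

for (const size of [1024, 4096, 16384, 65536]) {
  log(size, benchNewestKeyGet(size).usPerGet.toFixed(3), 'us per get');
}

If the per-get time grows roughly linearly with size, the O(num_entries) hypothesis holds; a hash-table implementation should stay roughly flat.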

@warner warner added the bug Something isn't working label May 1, 2021
@warner warner self-assigned this May 1, 2021
@warner warner added the xsnap the XS execution tool label May 1, 2021

dtribble commented May 1, 2021

Change the allocation/GC policy to keep growing memory up to a GC threshold, and trigger a GC only past that point.

index 70a9288b..267db5ba 100644
--- a/xs/sources/xsMemory.c
+++ b/xs/sources/xsMemory.c
@@ -1175,7 +1175,8 @@ again:
                }
                aBlock = aBlock->nextBlock;
        }
-       if (once) {
+       txSize gcThreshold = the->allocationLimit * 0.7;
+       if (once && the->allocatedSpace > gcThreshold) {
                fxCollect(the, 1);
                once = 0;
        }
@@ -1210,7 +1211,8 @@ again:
                        the->peakHeapCount = the->currentHeapCount;
                return aSlot;
        }
-       if (once) {
+       txSize gcThreshold = the->allocationLimit * 0.7;
+       if (once && the->allocatedSpace > gcThreshold) {
                txBoolean wasThrashing = ((the->collectFlag & XS_TRASHING_FLAG) != 0), isThrashing;

                fxCollect(the, 0);

This should work well with the strategy of manually triggering GC at the end of cranks.
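
For illustration, a rough sketch of that end-of-crank pairing (names invented, not the actual xsnap/SwingSet delivery code; it assumes the host exposes a gc() hook, as Node does under --expose-gc):

async function runCrank(deliver, delivery) {
  const result = await deliver(delivery); // process the delivery normally
  // With the relaxed policy above, organic GC mostly stops firing mid-crank,
  // so the host forces one deliberate collection per crank instead.
  if (typeof globalThis.gc === 'function') {
    globalThis.gc();
  }
  return result;
}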


warner commented May 1, 2021

Measurement tasks we're interested in:

  • (@dckc): how many times do fxGrowSlots, fxGrowChunks, and fxCollect get called during each crank?
  • how much runtime does harden consume during a crank? how many times is it called? does this grow as the number of known objects grows?
  • how large are our various WeakMap/WeakSets getting? Does the harden() WeakSet dominate the rest?
  • what is the performance characteristic of a WeakMap/WeakSet, as a function of size?
    • are we asking too much of the engine, and/or should we expect the engine to perform better?

Mitigations we're looking at:

  • change XS to decrease the frequency of GC and malloc by allocating more space at a time (fxGrowSlots, fxGrowChunks): this is a simple config parameter, and adjusting it has a noticeable effect on vat growth curves
  • change XS to decrease the frequency of GC by changing the policy: currently XS runs GC before asking malloc for more memory; @dtribble's patch changes that to ask malloc before automatically running GC
    • in conjunction with manual GC at end-of-crank, this might reduce the number of GC calls to an ideal of 1 per crank
    • @dckc suggested we explore running GC in parallel with other operations, by starting it after end-of-crank; this might interact badly with our liveslots-driven end-of-crank syscall.dropImports process, but there might be a way to pull it off, especially for busy vats which we can reasonably expect to be called again soon
  • make a deeper change to XS to use a hash table on WeakMap/WeakSet, to bring the access times down to O(1)/O(logN) instead of O(N)
  • finish kernel/liveslots-side GC, to reduce the number of otherwise-unused objects filling those WeakMaps
  • change Zoe to explicitly delete unused entries from its own WeakMaps
    • if the harden() WeakSet dominates the rest, this probably won't help much
  • change WeakMap usage to pre-populate a lot of dummy entries first (doing all the allocation up front), then delete one of them just before each set() call, so there will (almost) always be a free slot available (a sketch follows below)

Given harden(), we expect the biggest improvements to come from kernel-side GC and from making WeakSet more efficient. The easiest mitigations to apply are the GC policy changes, followed by (maybe) the Zoe explicit-deletion changes.
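
A minimal sketch of the pre-population idea (the wrapper class and reserve size are invented for illustration; whether delete() actually leaves a reusable slot behind depends on the engine's free-list behavior):

class PreallocatedWeakMap {
  constructor(reserve = 1000) {
    this.wm = new WeakMap();
    this.dummies = [];
    for (let i = 0; i < reserve; i += 1) {
      const dummy = {};
      this.dummies.push(dummy); // hold the dummies so their entries stay live
      this.wm.set(dummy, null); // pay for the slot allocations up front
    }
  }
  set(key, value) {
    const dummy = this.dummies.pop();
    if (dummy !== undefined) {
      this.wm.delete(dummy); // free a slot just before consuming one
    }
    this.wm.set(key, value);
    return this;
  }
  get(key) { return this.wm.get(key); }
  has(key) { return this.wm.has(key); }
  delete(key) { return this.wm.delete(key); }
}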


dckc commented May 1, 2021

Measurement tasks we're interested in:

  • (@dckc): how many times do fxGrowSlots, fxGrowChunks, and fxCollect get called during each crank?

The submodule part might need some manual intervention, but...

https://github.com/Agoric/agoric-sdk/tree/3012-xsnap-meters

  • 62628d3 feat(xsnap): meter garbageCollectionCount
  • 9f7ddbf chore(xsnap): set moddable submodule for garbageCollectionCount
  • 59823e1 feat(xsnap): meter calls to allocateChunks, allocateSlots

https://github.com/agoric-labs/moddable/tree/xsnap-gc-count 6291555


dtribble commented May 2, 2021

We are seeing O(n^2) behavior in some collections in XS. The implementations are pretty simple because that's well-suited to small memory. In some cases we may need a different algorithm or allocation policy. We have some investigations underway into Arrays and WeakMaps.

Array usage (e.g., push) appears to extend the array storage by only 1 element at a time (via fxSetIndexSize), which means growing an array with a sequence of push calls is O(n^2). So I have some allocator questions for @phoddie et al (a growth-policy sketch follows the questions below):

  • are chunks allocated with fxNewChunk always the requested size or are they larger such that a fxRenewChunk would grow an array in place after the first push?
  • Does fxRenewChunk allow the array to grow in place?
  • fxRenewChunk appears to iterate through the allocated chunks, starting from the->firstBlock. What prevents that from being O(number of allocated chunks)?
  • Does a chunk enable arrays to have a capacity that is different from their length? If so, how do we allocate that? My current theory is to change line 423 in fxSetIndexSize, chunk = (txSlot*)fxNewChunk(the, size);, to allocate a bigger size.
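
To make the growth-policy point concrete, here is a small illustration (not XS code) that counts how many element copies n pushes cost under a grow-by-1 policy versus a doubling policy; grow-by-1 copies on the order of n^2/2 elements in total, while doubling copies fewer than 2n:

function copiesForPushes(n, grow) {
  let capacity = 0;
  let length = 0;
  let copies = 0;
  for (let i = 0; i < n; i += 1) {
    if (length === capacity) {
      capacity = grow(capacity); // reallocate to the new capacity
      copies += length; // reallocation moves every existing element
    }
    length += 1;
  }
  return copies;
}

const log = typeof console !== 'undefined' ? console.log : globalThis.print;
for (const n of [1000, 10000, 100000]) {
  log(n,
    'grow-by-1:', copiesForPushes(n, c => c + 1),
    'doubling:', copiesForPushes(n, c => Math.max(1, c * 2)));
}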


erights commented May 2, 2021

Hi @phoddie @patrick-soquet, are XS class private instance fields constant time, even with the return override? I ask in order to get a sense of whether endojs/endo#704 might help as a short-term kludge --- to be retired asap, of course. But is it worth trying?


dckc commented May 2, 2021

@warner: additions to and removals from each WeakMap share an implementation with (strong) Map and Set; I hope counting all of them suffices.


erights commented May 2, 2021

Once we're in the bad performance regime, if we snapshot and then restore from snapshot, does the restored one perform as badly as just continuing would have? I would suspect that snapshot and restore at least defragments memory.


warner commented May 2, 2021

Recent measurements:

  • GC in XS (for a simple linked-list benchmark) is a good linear fit for 80ns/object
  • with @dckc's last changes, we can measure additions to and removals from all Map/Set/WeakMap/WeakSet collections combined
    • I assume that the harden() WeakSet dominates those counts
  • our tap-faucet load generator (zoe vat) is adding 6524 persistent objects per cycle, plus another 71 that get collected
  • the zoe vat seems to start with about 285k objects
    • after 376 load cycles, we're up to 2.74M objects, so at 80ns/object GC could be expected to take about 220ms

My current theory: we're creating far more objects than I expected; the lack of kernel-level GC (most immediately represented by the liveslots "safety pins") keeps a higher fraction of them alive than I expected; as a result we're subjecting XS to an unreasonable number of objects, so XS GC takes a long time (and doesn't reclaim much).

To gather evidence for/against that theory, I'd like to know:

  • how much time do we spend in GC (fxCollect()) per block
    • we can subtract this from the overall block time, to see if we should continue to focus on GC, or if Map/etc lookup time is also significant
  • how many times do we GC per block (we already have this instrumented)
    • total GC time is num(calls) * time_per_call
    • we've experimented with influencing the number of GC calls by changing the policies in fxGrowSlots and fxGrowChunks, with obvious tradeoffs

Current mitigation ideas:

  • change the GC policies to minimize the number of GC calls
    • we believe GC isn't doing much good, so until it is, let's do it less
    • as long as a single GC call takes less than roughly 3s, we can still afford to make one per block
    • if the vat grows to the point that a single GC call takes 30s, we can't really afford to ever call GC
    • maybe we should just disable GC entirely for a while
  • have zoe use vatPowers.disown() on something it imported but no longer needs, to tell liveslots to delete the entry, which would let GC collect it (at the cost of errors if it actually uses it again)
    • that would let the Presence be collected, and if that's used as a WeakMap key, it would allow the value to be collected
    • maybe hack together some form of revoke() (revoke a Remotable #2070) to do the same thing to an exported Remotable

Deeper fixes that we can't implement in time but should be prioritised:

  • moving harden() into C, rather than using a user-level WeakMap
    • probably adding a flag to the XSSlot structure, next to the MARK flag used for mark-and-sweep GC
  • changing xsMapSet.c to use @erights 's "transposed representation"
  • kernel-level GC (sufficiently-deterministic GC #2615), of course
  • maybe have zoe explicitly delete finished objects, to accelerate their collection. might be irrelevant once kernel GC is working, or might be a marginal improvement


dckc commented May 2, 2021

some slog analysis using pandas and such... https://gist.github.com/dckc/975d57a93a52f3f1f8485f38a87b5c5f


erights commented May 2, 2021

Additional mitigation ideas:


warner commented May 3, 2021

One problem identified: I think XS is not allocating the heap in large chunks as it was supposed to. I filed Moddable-OpenSource/moddable#636 to address it. Fixing that should reduce our GC calls to one per 256kB of chunk allocation, which I think will mean once every 100-ish loadgen cycles. At 1800 loadgen cycles, I estimate each GC call takes about 450ms, which should let us get to 20k cycles before GC causes blocks to be delayed.

Fixing that in XS is an easier/more-direct/more-correct mitigation than any other changes we might make to the GC policy.


dckc commented May 3, 2021

I'm inclined to investigate harden() performance in isolation. Adding performance.now() should help. p.s. done in e29b680

side note: using jq to filter the slog should make a wider variety of techniques feasible in that ipython notebook.

@dckc dckc self-assigned this May 3, 2021

dckc commented May 3, 2021

I tried this out (the endojs/endo#704 approach); hardening looked ~2x slower:

 size    dur       pr704
 1024    0.005422  0.012111
 2048    0.012071  0.024621
 2048    0.012596  0.021057
 4096    0.022986  0.052297
 4096    0.021754  0.050533
 8192    0.054163  0.134920
 8192    0.054778  0.121982
16384    0.120454  0.379237
16384    0.126119  0.251657
32768    0.340179  1.198049
32768    0.319713  0.715910

the numbers above are the durations between the 'start hardening keys' and 'done hardening' events:

  function bench() {
    // print, event, and harden come from the surrounding xsnap/SES test environment
    // 12 iterations (65,536 objects) ~never finishes, so stop after 11
    for (let iter = 0; iter < 11; iter += 1) {
      const size = 1024 * (1 << Math.round(iter / 2));
      print('iter', iter, size);
      event('prep iter', iter, size);
      const keys = new Array(size);
      for (let ix = 0; ix < size; ix += 1) {
        keys[ix] = {};
      }
      event('start loop over keys', iter, size);
      for (const _key of keys) {
        // how long does looping take?
      }
      event('start hardening keys', iter, size);
      for (const key of keys) {
        harden(key);
      }
      event('done hardening', iter, size);
    }
  }


dckc commented May 3, 2021

brainstorm / plan

  • restart solo node for data gathering @warner
  • test load gen tasks (AMM, ...) @warner
  • write load gen plan (@warner)
    • should be able to see problems coming
    • monitoring service? block time > 6sec? warner / ski check every 6hrs
    • expose slog file (e.g. over http)
      • summarize... per block...
    • slow down load generator (w/http)
      • @warner to expose beyond localhost
  • review code for keepers (@dckc)
    • find out where testnet branch will be cut from? master? beta?
    • PR feat(xsnap): performance monitoring / metering #3023
      • yes: bfa06e7d | * labs/grow-larger-chunks XS: fxGrowChunks() should respect the->minimumChunksSize
      • metrics in general? yes
      • yes: gc metric
      • Performance.now: yes, but called gettimeofday
      • maxBucketLength: yes
      • get review by, e.g., @kriskowal
    • 127 -> 503 hash buckets: no
  • remains on loadgen branch: 4c4f32437 HACK: solo delivery: update tempfile for each helper invocation?
  • @FUDCo fixing a comms vat assert: Disable comms vat termination via remote comms errors #3024

brainstorm

hypothesis: problem is overlap... trend line... ~1gb to O(10k)

  • raise array issue? let's hear from Moddable 1st
    • emergency upgrade
  • reduce load to 1 per minute
    • issue: ever-growing memory usage of zoe vat: how long can we last at this rate
      • at what rate does zoe vat memory grow?
      • faucet causes activity over 4 blocks; if 1 dominates, and 1/minute, survivable
      • yet to investigate: non-overlapping rate?
  • large chunk gc fix only helps with overlapping load?
  • impact of zoe perf from data structure algorithms?
    • can't simulate faster without triggering those [?] other problems
  • what happened to ag-solo?
    • work-around: restart with --reset (uses shared zoe but not other shared resources)
    • new load-generator each day, for example
    • contract vat 1 goes away, but a 2nd comes around


dckc commented May 4, 2021

@warner concurs: we'll track remaining work in future milestones.


dckc commented May 5, 2021

On second thought, we're actively monitoring this issue in phase 3, so let's track it there for a while longer...


dckc commented May 13, 2021

@erights for WeakMap performance testing, the high resolution timer test in test-xs-perf.js should make a good template.

@dckc dckc changed the title xs performance investigation xs performance investigation (array growth) Jun 21, 2021

dckc commented Jun 22, 2021

@michaelfig don't wait for XS array growth optimizations for today's release.

p.s. from Dean Fri, Jun 11, 1:05 PM KC time:
array growth spreadsheet

@dckc dckc removed this from the testnet4-b milestone Jun 30, 2021

dckc commented Jul 16, 2021

based on our July 12 discussion, I expect this to go in our upcoming metering milestone.

a peek at the xs commit log shows no news just yet (last change: July 10 bigint fix 1cab0125b0e)

@dckc dckc assigned dckc and unassigned warner Jul 16, 2021
@dckc dckc added this to the Testnet: Metering Phase milestone Jul 16, 2021
@dckc dckc changed the title xs performance investigation (array growth) xs performance investigation (array, Map/Set optimizations) Jul 22, 2021

dckc commented Jul 29, 2021

looks like the latest from moddable 14e252e8 has relevant stuff...

in particular, in xsMapSet.c:

void fxResizeEntries(txMachine* the, txSlot* table, txSlot* list)

dckc added a commit that referenced this issue Jul 29, 2021
dckc added a commit that referenced this issue Jul 30, 2021
dckc added a commit that referenced this issue Jul 30, 2021
XS implementation now has conventional O(n log(n)) algorithm.

 - xs-meter-9 represents XS Map/Set optimizations
 - style: update indentation in test-xs-perf.js

fixes #3012
dckc added a commit that referenced this issue Jul 30, 2021
*BREAKING CHANGE:* XS Map/Set optimizations lead to compute meter changes; hence `METER_TYPE = 'xs-meter-9'`

fixes #3012