xs performance investigation (array, Map/Set optimizations) #3012

Closed
warner opened this issue May 1, 2021 · 20 comments · Fixed by #3560
Labels: bug (Something isn't working), xsnap (the XS execution tool)

warner commented May 1, 2021

We're looking at nonlinear performance characteristics of the core datatypes (specifically Array, Map, Set) in XS.

We originally focused on WeakMap, but that part has been addressed.

When we run our load-generator test, we see the zoe vat taking longer and longer with each load cycle. Our current implementation has the following properties:

  • zoe has a number of WeakMaps used to keep track of seats, invitations, etc.
    • zoe was written with the assumption that the weakness of WeakMap is sufficient to remove the entries when they are no longer active, so there are no explicit wm.delete() calls
  • swingset does not yet provide GC (sufficiently-deterministic GC #2615), so when liveslots creates a Presence and gives it to zoe, or when zoe creates a Remotable and sends it off-vat, that object will be kept alive by liveslots forever
  • as a result, zoe's WeakMaps grow larger and larger over time, filled with a small number of actively-used items and a much larger number of items which would be garbage collected if only liveslots could drop them
  • on each cycle of the load generator, zoe will add a few items to the WeakMap, and retrieve those same items
    • these items will always be the most recently added ones
  • moreover (and probably more significantly), our SES shim uses a single shared WeakMap to implement harden (specifically, to track which objects have already been frozen). Since our programming/safety style is to harden() everything, this WeakMap will include everything that Zoe puts in the WeakMaps listed above, plus far more. Many of those objects will go away (they are not held by liveslots), but everything that liveslots is holding will probably also be present in the harden WeakMap.

We don't specifically know that WeakMap is involved, but given how Zoe works, it's a reasonable hypothesis for something that might grow over time.
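
For concreteness, here is a minimal sketch (not the real ses shim) of how a weak-collection-based harden ends up remembering every hardened object, so its bookkeeping grows along with whatever liveslots keeps alive:

const alreadyHardened = new WeakSet();

function hardenSketch(root) {
  // Simplified: the real shim also walks prototypes and accessor functions,
  // and requires the shared intrinsics to be handled separately.
  const toFreeze = [root];
  while (toFreeze.length > 0) {
    const obj = toFreeze.pop();
    if (Object(obj) !== obj || alreadyHardened.has(obj)) continue; // skip primitives and repeats
    Object.freeze(obj);
    alreadyHardened.add(obj); // every hardened object is remembered forever
    for (const name of Reflect.ownKeys(obj)) {
      const desc = Object.getOwnPropertyDescriptor(obj, name);
      if ('value' in desc) toFreeze.push(desc.value); // recurse into own data properties
    }
  }
  return root;
}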

Our current working hypotheses are:

  • the XS WeakMap implementation appears to use a simple linked list, which means read time for the first-added entry will be constant, while read time for the most-recently-added entry will be O(num_entries)
    • zoe only interacts with the most-recently added entries, making this the worst case
  • when adding a WeakMap entry, XS must get a new "slot" structure with fxNewSlot
    • if there is free space available (the->freeHeap holds a linked-list of free slots), no allocations are necessary
    • if not, fxNewSlot might trigger GC (fxCollect) to make some available, and/or it might call fxGrowSlots to allocate more by calling malloc()
    • because we aren't deleting items, we call fxNewSlot a lot
    • because the total number of objects in the heap is large, GC takes a long time, O(num_objects)

Together, this would explain a linear slowdown in the time each load-generator cycle takes. It would not explain the superlinear slowdown that we've measured.
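
One way to test the linked-list hypothesis from plain JavaScript is to time reads of the most-recently-added key as the map grows. A minimal sketch follows (not part of the issue's actual instrumentation; it assumes either console.log or an xsnap-style print global, and falls back to Date.now() if performance.now() is unavailable):

const now = () =>
  typeof performance !== 'undefined' ? performance.now() : Date.now();
const log = typeof console !== 'undefined' ? console.log : globalThis.print;

function benchNewestKeyGet(size, reads = 100000) {
  const keys = new Array(size);
  const wm = new WeakMap();
  for (let i = 0; i < size; i += 1) {
    keys[i] = {}; // keep every key strongly reachable so no entry is collected
    wm.set(keys[i], i);
  }
  const newest = keys[size - 1]; // zoe's access pattern: the newest entry
  const t0 = now();
  let sink = 0;
  for (let i = 0; i < reads; i += 1) {
    sink += wm.get(newest);
  }
  const t1 = now();
  return { usPerGet: ((t1 - t0) * 1000) / reads, sink };
}

for (const size of [1024, 4096, 16384, 65536]) {
  log(size, benchNewestKeyGet(size).usPerGet.toFixed(3), 'us per get');
}

If the per-get time grows roughly linearly with size, the O(num_entries) hypothesis holds; a hash-table implementation should stay roughly flat.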

@warner warner added the bug Something isn't working label May 1, 2021
@warner warner self-assigned this May 1, 2021
@warner warner added the xsnap the XS execution tool label May 1, 2021

dtribble commented May 1, 2021

Change the allocation/GC policy to keep growing memory up to a GC threshold, and trigger a GC only past that point.

index 70a9288b..267db5ba 100644
--- a/xs/sources/xsMemory.c
+++ b/xs/sources/xsMemory.c
@@ -1175,7 +1175,8 @@ again:
                }
                aBlock = aBlock->nextBlock;
        }
-       if (once) {
+       txSize gcThreshold = the->allocationLimit * 0.7;
+       if (once && the->allocatedSpace > gcThreshold) {
                fxCollect(the, 1);
                once = 0;
        }
@@ -1210,7 +1211,8 @@ again:
                        the->peakHeapCount = the->currentHeapCount;
                return aSlot;
        }
-       if (once) {
+       txSize gcThreshold = the->allocationLimit * 0.7;
+       if (once && the->allocatedSpace > gcThreshold) {
                txBoolean wasThrashing = ((the->collectFlag & XS_TRASHING_FLAG) != 0), isThrashing;

                fxCollect(the, 0);

This should work well with the strategy of manually triggering GC at the end of cranks.
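
For illustration, a rough sketch of that end-of-crank pairing (names invented, not the actual xsnap/SwingSet delivery code; it assumes the host exposes a gc() hook, as Node does under --expose-gc):

async function runCrank(deliver, delivery) {
  const result = await deliver(delivery); // process the delivery normally
  // With the relaxed policy above, organic GC mostly stops firing mid-crank,
  // so the host forces one deliberate collection per crank instead.
  if (typeof globalThis.gc === 'function') {
    globalThis.gc();
  }
  return result;
}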


warner commented May 1, 2021

Measurement tasks we're interested in:

  • (@dckc): how many times do fxGrowSlots, fxGrowChunks, and fxCollect get called during each crank?
  • how much runtime does harden consume during a crank? how many times is it called? does this grow as the number of known objects grows?
  • how large are our various WeakMap/WeakSets getting? Does the harden() WeakSet dominate the rest?
  • what is the performance characteristic of a WeakMap/WeakSet, as a function of size?
    • are we asking too much of the engine, and/or should we expect the engine to perform better?

Mitigations we're looking at:

  • change XS to decrease the frequency of GC and malloc by allocating more space at a time (fxGrowSlots, fxGrowChunks): this is a simple config parameter, and adjusting it has a noticeable effect on vat growth curves
  • change XS to decrease the frequency of GC by changing the policy: currently XS runs GC before asking malloc for more memory; @dtribble's patch changes that to ask malloc before automatically running GC
    • in conjunction with manual GC at end-of-crank, this might reduce the number of GC calls to an ideal of 1 per crank
    • @dckc suggested we explore running GC in parallel with other operations, by starting it after end-of-crank; this might interact badly with our liveslots-driven end-of-crank syscall.dropImports process, but there might be a way to pull it off, especially for busy vats which we can reasonably expect to be called again soon
  • make a deeper change to XS to use a hash table on WeakMap/WeakSet, to bring the access times down to O(1)/O(logN) instead of O(N)
  • finish kernel/liveslots-side GC, to reduce the number of otherwise-unused objects filling those WeakMaps
  • change Zoe to explicitly delete unused entries from its own WeakMaps
    • if the harden() WeakSet dominates the rest, this probably won't help much
  • change WeakMap usage to pre-populate a lot of dummy entries first (doing all the allocation up front), then delete one of them just before each set() call, so there will (almost) always be a free slot available (a sketch follows below)

Given harden(), we expect the biggest improvements to come from kernel-side GC and from making WeakSet more efficient. The easiest mitigations to apply are the GC policy changes, followed by (maybe) the Zoe explicit-deletion changes.
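
A minimal sketch of the pre-population idea (the wrapper class and reserve size are invented for illustration; whether delete() actually leaves a reusable slot behind depends on the engine's free-list behavior):

class PreallocatedWeakMap {
  constructor(reserve = 1000) {
    this.wm = new WeakMap();
    this.dummies = [];
    for (let i = 0; i < reserve; i += 1) {
      const dummy = {};
      this.dummies.push(dummy); // hold the dummies so their entries stay live
      this.wm.set(dummy, null); // pay for the slot allocations up front
    }
  }
  set(key, value) {
    const dummy = this.dummies.pop();
    if (dummy !== undefined) {
      this.wm.delete(dummy); // free a slot just before consuming one
    }
    this.wm.set(key, value);
    return this;
  }
  get(key) { return this.wm.get(key); }
  has(key) { return this.wm.has(key); }
  delete(key) { return this.wm.delete(key); }
}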


dckc commented May 1, 2021

Measurement tasks we're interested in:

  • (@dckc): how many times do fxGrowSlots, fxGrowChunks, and fxCollect get called during each crank?

The submodule part might need some manual intervention, but...

https://github.com/Agoric/agoric-sdk/tree/3012-xsnap-meters

  • 62628d3 feat(xsnap): meter garbageCollectionCount
  • 9f7ddbf chore(xsnap): set moddable submodule for garbageCollectionCount
  • 59823e1 feat(xsnap): meter calls to allocateChunks, allocateSlots

https://github.com/agoric-labs/moddable/tree/xsnap-gc-count 6291555


dtribble commented May 2, 2021

We are seeing O(n^2) behavior in some collections in XS. The implementations are pretty simple because that's well-suited to small memory. In some cases we may need a different algorithm or allocation policy. We have some investigations underway into Arrays and WeakMaps.

Array usage (e.g., push) appears to extend the array storage by only 1 element at a time (via fxSetIndexSize), which means growing an array with a sequence of push calls is O(n^2). So I have some allocator questions for @phoddie et al (a growth-policy sketch follows the questions below):

  • are chunks allocated with fxNewChunk always the requested size or are they larger such that a fxRenewChunk would grow an array in place after the first push?
  • Does fxRenewChunk allow the array to grow in place?
  • fxRenewChunk appears to iterate through the allocated chunks, starting from the->firstBlock. What prevents that from being O(number of allocated chunks)?
  • Does a chunk enable arrays to have a capacity that is different from their length? If so, how do we allocate that? My current theory is to change line 423 in fxSetIndexSize, chunk = (txSlot*)fxNewChunk(the, size);, to allocate a bigger size.
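
To make the growth-policy point concrete, here is a small illustration (not XS code) that counts how many element copies n pushes cost under a grow-by-1 policy versus a doubling policy; grow-by-1 copies on the order of n^2/2 elements in total, while doubling copies fewer than 2n:

function copiesForPushes(n, grow) {
  let capacity = 0;
  let length = 0;
  let copies = 0;
  for (let i = 0; i < n; i += 1) {
    if (length === capacity) {
      capacity = grow(capacity); // reallocate to the new capacity
      copies += length; // reallocation moves every existing element
    }
    length += 1;
  }
  return copies;
}

const log = typeof console !== 'undefined' ? console.log : globalThis.print;
for (const n of [1000, 10000, 100000]) {
  log(n,
    'grow-by-1:', copiesForPushes(n, c => c + 1),
    'doubling:', copiesForPushes(n, c => Math.max(1, c * 2)));
}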


erights commented May 2, 2021

Hi @phoddie @patrick-soquet, are XS class private instance fields constant time, even with the return override? I ask in order to get a sense of whether endojs/endo#704 might help as a short-term kludge --- to be retired asap, of course. But is it worth trying?


dckc commented May 2, 2021

@warner: additions to and removals from each WeakMap share an implementation with (strong) Map and Set; I hope counting all of them suffices.


erights commented May 2, 2021

Once we're in the bad performance regime, if we snapshot and then restore from snapshot, does the restored one perform as badly as just continuing would have? I would suspect that snapshot and restore at least defragments memory.


warner commented May 2, 2021

Recent measurements:

  • GC in XS (for a simple linked-list benchmark) is a good linear fit for 80ns/object
  • with @dckc's last changes, we can measure additions to and removals from all Map/Set/WeakMap/WeakSet collections combined
    • I assume that the harden() WeakSet dominates those counts
  • our tap-faucet load generator (zoe vat) is adding 6524 persistent objects per cycle, plus another 71 that get collected
  • the zoe vat seems to start with about 285k objects
    • after 376 load cycles, we're up to 2.74M objects, so at 80ns/object GC could be expected to take about 220ms

My current theory: we're creating far more objects than I expected; the lack of kernel-level GC (most immediately represented by the liveslots "safety pins") keeps a higher fraction of them alive than I expected; as a result we're subjecting XS to an unreasonable number of objects, so XS GC takes a long time (and doesn't reclaim much).

To gather evidence for/against that theory, I'd like to know:

  • how much time do we spend in GC (fxCollect()) per block
    • we can subtract this from the overall block time, to see if we should continue to focus on GC, or if Map/etc lookup time is also significant
  • how many times do we GC per block (we already have this instrumented)
    • total GC time is num(calls) * time_per_call
    • we've experimented with influencing the number of GC calls by changing the policies in fxGrowSlots and fxGrowChunks, with obvious tradeoffs

Current mitigation ideas:

  • change the GC policies to minimize the number of GC calls
    • we believe GC isn't doing much good, so until it is, let's do it less
    • as long as a single GC call takes less than roughly 3s, we can still afford to make one per block
    • if the vat grows to the point that a single GC call takes 30s, we can't really afford to ever call GC
    • maybe we should just disable GC entirely for a while
  • have zoe use vatPowers.disown() on something it imported but no longer needs, to tell liveslots to delete the entry, which would let GC collect it (at the cost of errors if it actually uses it again)
    • that would let the Presence be collected, and if that's used as a WeakMap key, it would allow the value to be collected
    • maybe hack together some form of revoke() (revoke a Remotable #2070) to do the same thing to an exported Remotable

Deeper fixes that we can't implement in time but should be prioritised:

  • moving harden() into C, rather than using a user-level WeakMap
    • probably adding a flag to the XSSlot structure, next to the MARK flag used for mark-and-sweep GC
  • changing xsMapSet.c to use @erights 's "transposed representation"
  • kernel-level GC (sufficiently-deterministic GC #2615), of course
  • maybe have zoe explicitly delete finished objects, to accelerate their collection. might be irrelevant once kernel GC is working, or might be a marginal improvement


dckc commented May 2, 2021

some slog analysis using pandas and such... https://gist.github.com/dckc/975d57a93a52f3f1f8485f38a87b5c5f


erights commented May 2, 2021

Additional mitigation ideas:


warner commented May 3, 2021

One problem identified: I think XS is not allocating the heap in large chunks as it was supposed to. I filed Moddable-OpenSource/moddable#636 to address it. Fixing that should reduce our GC calls to one per 256kB of chunk allocation, which I think will mean once every 100-ish loadgen cycles. At 1800 loadgen cycles, I estimate each GC call takes about 450ms, which should let us get to 20k cycles before GC causes blocks to be delayed.

Fixing that in XS is an easier/more-direct/more-correct mitigation than any other changes we might make to the GC policy.


dckc commented May 3, 2021

I'm inclined to investigate harden() performance in isolation. Adding performance.now() should help. p.s. done in e29b680

side note: using jq to filter the slog should make a wider variety of techniques feasible in that ipython notebook.

@dckc dckc self-assigned this May 3, 2021

dckc commented May 3, 2021

I tried this out (the endojs/endo#704 approach); hardening looked ~2x slower:

 size    dur       pr704
 1024    0.005422  0.012111
 2048    0.012071  0.024621
 2048    0.012596  0.021057
 4096    0.022986  0.052297
 4096    0.021754  0.050533
 8192    0.054163  0.134920
 8192    0.054778  0.121982
16384    0.120454  0.379237
16384    0.126119  0.251657
32768    0.340179  1.198049
32768    0.319713  0.715910

the numbers above are the durations between the 'start hardening keys' and 'done hardening' events:

  function bench() {
    // print, event, and harden come from the surrounding xsnap/SES test environment
    // 12 iterations (65,536 objects) ~never finishes, so stop after 11
    for (let iter = 0; iter < 11; iter += 1) {
      const size = 1024 * (1 << Math.round(iter / 2));
      print('iter', iter, size);
      event('prep iter', iter, size);
      const keys = new Array(size);
      for (let ix = 0; ix < size; ix += 1) {
        keys[ix] = {};
      }
      event('start loop over keys', iter, size);
      for (const _key of keys) {
        // how long does looping take?
      }
      event('start hardening keys', iter, size);
      for (const key of keys) {
        harden(key);
      }
      event('done hardening', iter, size);
    }
  }


dckc commented May 3, 2021

brainstorm / plan

  • restart solo node for data gathering @warner
  • test load gen tasks (AMM, ...) @warner
  • write load gen plan (@warner)
    • should be able to see problems coming
    • monitoring service? block time > 6sec? warner / ski check every 6hrs
    • expose slog file (e.g. over http)
      • summarize... per block...
    • slow down load generator (w/http)
      • @warner to expose beyond localhost
  • review code for keepers (@dckc)
    • find out where testnet branch will be cut from? master? beta?
    • PR feat(xsnap): performance monitoring / metering #3023
      • yes: bfa06e7d | * labs/grow-larger-chunks XS: fxGrowChunks() should respect the->minimumChunksSize
      • metrics in general? yes
      • yes: gc metric
      • Performance.now: yes, but called gettimeofday
      • maxBucketLength: yes
      • get review by, e.g., @kriskowal
    • 127 -> 503 hash buckets: no
  • remains on loadgen branch: 4c4f32437 HACK: solo delivery: update tempfile for each helper invocation?
  • @FUDCo fixing a comms vat assert: Disable comms vat termination via remote comms errors #3024

brainstorm

hypothesis: problem is overlap... trend line... ~1gb to O(10k)

  • raise array issue? let's hear from Moddable 1st
    • emergency upgrade
  • reduce load to 1 per minute
    • issue: ever-growing memory usage of zoe vat: how long can we last at this rate
      • at what rate does zoe vat memory grow?
      • faucet causes activity over 4 blocks; if 1 dominates, and 1/minute, survivable
      • yet to investigate: non-overlapping rate?
  • large chunk gc fix only helps with overlapping load?
  • impact of zoe perf from data structure algorithms?
    • can't simulate faster without triggering those [?] other problems
  • what happened to ag-solo?
    • work-around: restart with --reset (uses shared zoe but not other shared resources)
    • new load-generator each day, for example
    • contract vat 1 goes away, but a 2nd comes around


dckc commented May 4, 2021

@warner concurs: we'll track remaining work in future milestones.


dckc commented May 5, 2021

On second thought, we're actively monitoring this issue in phase 3, so let's track it there for a while longer...


dckc commented May 13, 2021

@erights for WeakMap performance testing, the high resolution timer test in test-xs-perf.js should make a good template.

@dckc dckc changed the title xs performance investigation xs performance investigation (array growth) Jun 21, 2021

dckc commented Jun 22, 2021

@michaelfig don't wait for XS array growth optimizations for today's release.

p.s. from Dean Fri, Jun 11, 1:05 PM KC time:
array growth spreadsheet

@dckc dckc removed this from the testnet4-b milestone Jun 30, 2021

dckc commented Jul 16, 2021

based on our July 12 discussion, I expect this to go in our upcoming metering milestone.

a peek at the xs commit log shows no news just yet (last change: July 10 bigint fix 1cab0125b0e)

@dckc dckc assigned dckc and unassigned warner Jul 16, 2021
@dckc dckc added this to the Testnet: Metering Phase milestone Jul 16, 2021
@dckc dckc changed the title xs performance investigation (array growth) xs performance investigation (array, Map/Set optimizations) Jul 22, 2021

dckc commented Jul 29, 2021

looks like the latest from moddable 14e252e8 has relevant stuff...

in particular, in xsMapSet.c:

void fxResizeEntries(txMachine* the, txSlot* table, txSlot* list)

dckc added a commit that referenced this issue Jul 29, 2021
dckc added a commit that referenced this issue Jul 30, 2021
dckc added a commit that referenced this issue Jul 30, 2021
XS implementation now has conventional O(n log(n)) algorithm.

 - xs-meter-9 represents XS Map/Set optimizations
 - style: update indentation in test-xs-perf.js

fixes #3012
dckc added a commit that referenced this issue Jul 30, 2021
*BREAKING CHANGE:* XS Map/Set optimizations lead to compute meter changes; hence `METER_TYPE = 'xs-meter-9'`

fixes #3012