Benchmark latency and storage of new events DB design #191

Closed
Tracked by #115

psheth9 opened this issue May 30, 2024 · 11 comments
@psheth9 (Contributor) commented May 30, 2024

TODO:

  • Ideal load of events for 1 week: compare with ETH RPC
  • Benchmarking code
  • Run benchmark for various configs e.g. total events, N (ledger restriction param), filter
  • Ideally, if we can figure out N for the worst-case fetch query, it should work for all other cases.

Ideally we should run the benchmark on a standard EC2 instance, or find a way to limit cores and memory while running it (e.g. passing cpu_limit to the benchmark commands).
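One option for the core/memory-limiting part (an assumption on my side, not something decided in this thread): the Go runtime's GOMAXPROCS environment variable caps the number of cores the benchmark can use, and GOMEMLIMIT (Go 1.19+) puts a soft cap on heap memory, e.g.:

	GOMAXPROCS=2 GOMEMLIMIT=2GiB go test -bench=BenchmarkGetEvents -benchmem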

@psheth9 (Contributor, Author) commented Jul 18, 2024

Doc to track all benchmarking datapoints: https://docs.google.com/document/d/14PTIC1dTNZBbuAtI91baFSPv9se4rO9sqcphV-81Yo8/edit

@psheth9 (Contributor, Author) commented Jul 18, 2024

Crux of benchmarking code:

Total events = 1 million * 3 (different event types)

	// Ingestion: write ~1M ledgers, each carrying the three event types from txMeta.
	for i := 1; i < 1000000; i++ {
		ledgerCloseMeta := ledgerCloseMetaWithEvents(uint32(i), now.Unix(), txMeta...)
		require.NoError(b, ledgerW.InsertLedger(ledgerCloseMeta), "ingestion failed for ledger")
		require.NoError(b, eventW.InsertEvents(ledgerCloseMeta), "ingestion failed for events")
	}

	// Iterations: the measured section runs getEvents b.N times.
	for i := 0; i < b.N; i++ {
		_, err := handler.getEvents(ctx, request)
		if err != nil {
			b.Errorf("getEvents failed: %v", err)
		}
	}

Entire code is in this gist.

Most likely we will throw this code away, since we do not want to ingest this much data as part of running tests.
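For reference, a minimal sketch (not the actual code in the gist) of how the two loops above could sit inside a single Go benchmark, using b.ResetTimer from the standard testing package so that ingestion does not count towards the reported numbers; ingestTestEvents is a hypothetical helper wrapping the ingestion loop, and the setup of handler, ctx and request is project-specific and omitted:

	func BenchmarkGetEvents(b *testing.B) {
		// Project-specific setup: open the test DB and build ledgerW, eventW,
		// handler, ctx and request (omitted here).

		ingestTestEvents(b) // hypothetical helper wrapping the ingestion loop above

		b.ResetTimer() // exclude ingestion from the measured ns/op, B/op, allocs/op
		for i := 0; i < b.N; i++ {
			if _, err := handler.getEvents(ctx, request); err != nil {
				b.Errorf("getEvents failed: %v", err)
			}
		}
	}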

@psheth9 (Contributor, Author) commented Jul 18, 2024

Some initial benchmarks:

Command: cpulimit -l 50 -- go test -bench=BenchmarkGetEvents -benchmem

Here N = the ledger restriction number, i.e. how many ledgers we look at for a given fetch query.

 N = 100 and events = 1 million events * 3
511	   2474137 ns/op	 2550335 B/op	   28160 allocs/op

latency -> 2.47 ms

N = 4000 and events = 1 million events * 3
12	  86983101 ns/op	101725765 B/op	 1104593 allocs/op

latency -> 86.98 ms

N = 7000 and events = 1 million events * 3
7	 152827661 ns/op	177867984 B/op	 1932595 allocs/op

latency -> 152.82 ms

As we can see, latency goes up as N increases. Ideally we want a sweet spot where N is neither too high nor too low and latency stays within an upper bound of roughly 20-50 ms.

Note: the current average latency of getEvents is around 4-6 ms, but with the new DB design it will go up.

@2opremio (Contributor) commented:

Uhm, I am sure the allocations can be reduced.

@psheth9 (Contributor, Author) commented Jul 18, 2024

@2opremio What would be an acceptable latency for getEvents? Right now it is super fast (~6 ms) because of the in-memory design. Wanted to get some ballpark numbers. Does 50-70 ms sound okay?

@2opremio (Contributor) commented:

I think we should try to get it down as much as possible (within reason, by timeboxing it). My guess is that the allocations can be reduced by quite a bit, which should help a lot.

@psheth9 (Contributor, Author) commented Jul 19, 2024

Thanks for the input. I need to do some detailed memory profiling to analyze the allocs here.
Also, I am removing the ingestion part from the benchmark test so that we get accurate numbers just for fetching events.

N = 2000 and events = 1 million events * 3

Benchmark result: --> 45 ms (not too bad)
 24	  45592375 ns/op	50731471 B/op	  552590 allocs/op

FYI, we will not go lower than a 2000-ledger limit.
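For context, a memory profile like the one below can be generated with the stock Go tooling along these lines (standard go test and pprof commands, not copied from the issue):

	go test -bench=BenchmarkGetEvents -benchmem -memprofile mem.prof
	go tool pprof mem.prof
	(pprof) top
	(pprof) list getEvents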

pprof mem.prof:

Time: Jul 19, 2024 at 11:06am (PDT)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 30510.93MB, 75.86% of 40221.98MB total
Dropped 144 nodes (cum <= 201.11MB)
Showing top 10 nodes out of 142
      flat  flat%   sum%        cum   cum%
12592.35MB 31.31% 31.31% 12592.35MB 31.31%  bytes.growSlice
 6149.86MB 15.29% 46.60% 10586.80MB 26.32%  github.com/stellar/soroban-rpc/cmd/soroban-rpc/internal/methods.ledgerCloseMetaWithEvents
 4742.51MB 11.79% 58.39%  4742.51MB 11.79%  github.com/lann/ps.(*tree).clone (inline)
 1728.20MB  4.30% 62.68%  1890.73MB  4.70%  github.com/Masterminds/squirrel.(*insertData).appendValuesToSQL
 1154.13MB  2.87% 65.55%  4624.07MB 11.50%  github.com/stellar/go/ingest.(*LedgerTransactionReader).storeTransactions
 1093.94MB  2.72% 68.27%  9223.11MB 22.93%  database/sql.driverArgsConnLocked
  878.64MB  2.18% 70.46%   878.64MB  2.18%  reflect.packEface
  793.13MB  1.97% 72.43% 14250.60MB 35.43%  github.com/Masterminds/squirrel.InsertBuilder.Exec
  715.13MB  1.78% 74.21%   715.13MB  1.78%  reflect.New
  663.03MB  1.65% 75.86% 15872.01MB 39.46%  github.com/stellar/soroban-rpc/cmd/soroban-rpc/internal/db.(*eventHandler).InsertEvents

Profile of getEvents: list getEvents

ROUTINE ======================== github.com/stellar/soroban-rpc/cmd/soroban-rpc/internal/methods.eventsRPCHandler.getEvents in /Users/pritsheth/sdf/soroban-rpc/cmd/soroban-rpc/internal/methods/get_events.go
         0     1.24GB (flat, cum)  3.16% of Total
         .          .    339:func (h eventsRPCHandler) getEvents(ctx context.Context, request GetEventsRequest) (GetEventsResponse, error) {
         .          .    340:	if err := request.Valid(h.maxLimit); err != nil {
         .          .    341:		return GetEventsResponse{}, &jrpc2.Error{
         .          .    342:			Code:    jrpc2.InvalidParams,
         .          .    343:			Message: err.Error(),
.
.
.
         .          .    416:		}
         .          .    417:		return uint(len(found)) < limit
         .          .    418:	}
         .          .    419:
         .     1.24GB    420:	err = h.dbReader.GetEvents(ctx, cursorRange, contractIDs, f) // DB call -- This is a lot
         .          .    421:	if err != nil {
         .          .    422:		return GetEventsResponse{}, &jrpc2.Error{
         .          .    423:			Code:    jrpc2.InvalidRequest,

@psheth9 (Contributor, Author) commented Jul 19, 2024

Much faster with contract_ids in the filter (indexed): reduction in allocs as well.

Case: requested contract_id and topics not present in the DB.

Also removed the ingestion part from the benchmark test and got the following --> 0.17 ms

N = 2000 and events = 1 million events * 3
9820 178685 ns/op 35591 B/op 626 allocs/op

@psheth9 (Contributor, Author) commented Jul 22, 2024

Updated Benchmark doc: https://docs.google.com/document/d/14PTIC1dTNZBbuAtI91baFSPv9se4rO9sqcphV-81Yo8/edit

IMO it is safe to start with a 4000-ledger restriction limit. Just FYI, we also have the following things in place to protect us from intensive fetch requests:

  • MaxGetEventsExecutionDuration - 10s
  • Maximum number of events per response - 10K
  • Default number of events per response - 100

Allocs are still high for the worst-case query; one possible reason for that is we load all the rows from the (x to x + N) range. I don't know if there is a more efficient way of doing it to reduce allocs. cc @2opremio

@2opremio (Contributor) commented:

one possible reason for that is we load all the rows from (x to x + N) range

Maybe you can stream the rows into a smaller slice to build the reply instead of allocating all of them at once?
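A rough sketch of that streaming idea, assuming database/sql and an illustrative schema (the names events, ledger_sequence and event_xdr are made up for the example, not the actual soroban-rpc schema):

	import (
		"context"
		"database/sql"
	)

	// streamEvents scans one row at a time and hands it to the callback f, so memory
	// stays bounded by a single row instead of the whole (x to x + N) ledger range.
	func streamEvents(ctx context.Context, db *sql.DB, start, end uint32, f func(eventXDR []byte) bool) error {
		rows, err := db.QueryContext(ctx,
			"SELECT event_xdr FROM events WHERE ledger_sequence BETWEEN ? AND ?", start, end)
		if err != nil {
			return err
		}
		defer rows.Close()

		for rows.Next() {
			var eventXDR []byte
			if err := rows.Scan(&eventXDR); err != nil {
				return err
			}
			if !f(eventXDR) { // callback returns false once the response limit is reached
				break
			}
		}
		return rows.Err()
	}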

@2opremio (Contributor) commented:

You may even be able to use a sync.Pool, since the concurrency and maximums are known.
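A minimal sketch of the sync.Pool idea, assuming a known per-request cap on rows (maxEventsPerResponse is a hypothetical constant mirroring the 10K maximum mentioned above):

	import "sync"

	const maxEventsPerResponse = 10000 // hypothetical cap; mirrors the 10K max above

	// Pool of reusable row buffers, so each getEvents call does not reallocate a large slice.
	var rowBufPool = sync.Pool{
		New: func() any {
			buf := make([][]byte, 0, maxEventsPerResponse)
			return &buf
		},
	}

	func withRowBuffer(process func(buf [][]byte) [][]byte) {
		bufPtr := rowBufPool.Get().(*[][]byte)
		*bufPtr = process((*bufPtr)[:0]) // reuse the backing array; keep any growth for next time
		rowBufPool.Put(bufPtr)
	}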
