Benchmark latency and storage of new events DB design #191

Closed
Tracked by #115

psheth9 opened this issue May 30, 2024 · 11 comments
@psheth9 (Contributor) commented May 30, 2024

TODO:

  • Ideal load of events for 1 week: compare with ETH RPC
  • Benchmarking code
  • Run benchmark for various configs e.g. total events, N (ledger restriction param), filter
  • Ideally, if we can figure out N for the worst-case fetch query, it should work for all other cases.

Ideally we should run the benchmark on a standard EC2 instance, or find a way to limit cores and memory while running it (e.g. passing cpu_limit to the benchmark commands).
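One option for the core/memory-limiting part (an assumption on my side, not something decided in this thread): the Go runtime's GOMAXPROCS environment variable caps the number of cores the benchmark can use, and GOMEMLIMIT (Go 1.19+) puts a soft cap on heap memory, e.g.:

	GOMAXPROCS=2 GOMEMLIMIT=2GiB go test -bench=BenchmarkGetEvents -benchmem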

@psheth9 (Contributor, Author) commented Jul 18, 2024

Doc to track all benchmarking datapoints: https://docs.google.com/document/d/14PTIC1dTNZBbuAtI91baFSPv9se4rO9sqcphV-81Yo8/edit

@psheth9 (Contributor, Author) commented Jul 18, 2024

Crux of benchmarking code:

Total events = 1 million * 3 (different event types)

	// Ingestion: write ~1M ledgers, each carrying the three event types from txMeta.
	for i := 1; i < 1000000; i++ {
		ledgerCloseMeta := ledgerCloseMetaWithEvents(uint32(i), now.Unix(), txMeta...)
		require.NoError(b, ledgerW.InsertLedger(ledgerCloseMeta), "ingestion failed for ledger")
		require.NoError(b, eventW.InsertEvents(ledgerCloseMeta), "ingestion failed for events")
	}

	// Iterations: the measured section runs getEvents b.N times.
	for i := 0; i < b.N; i++ {
		_, err := handler.getEvents(ctx, request)
		if err != nil {
			b.Errorf("getEvents failed: %v", err)
		}
	}

Entire code is in this gist.

Most likely we will throw this code away, since we do not want to ingest this much data as part of running tests.
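For reference, a minimal sketch (not the actual code in the gist) of how the two loops above could sit inside a single Go benchmark, using b.ResetTimer from the standard testing package so that ingestion does not count towards the reported numbers; ingestTestEvents is a hypothetical helper wrapping the ingestion loop, and the setup of handler, ctx and request is project-specific and omitted:

	func BenchmarkGetEvents(b *testing.B) {
		// Project-specific setup: open the test DB and build ledgerW, eventW,
		// handler, ctx and request (omitted here).

		ingestTestEvents(b) // hypothetical helper wrapping the ingestion loop above

		b.ResetTimer() // exclude ingestion from the measured ns/op, B/op, allocs/op
		for i := 0; i < b.N; i++ {
			if _, err := handler.getEvents(ctx, request); err != nil {
				b.Errorf("getEvents failed: %v", err)
			}
		}
	}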

@psheth9 (Contributor, Author) commented Jul 18, 2024

Some initial benchmarks:

Command: cpulimit -l 50 -- go test -bench=BenchmarkGetEvents -benchmem

Here N = the ledger restriction number, i.e. how many ledgers we look at for a given fetch query.

 N = 100 and events = 1 million events * 3
511	   2474137 ns/op	 2550335 B/op	   28160 allocs/op

latency -> 2.47 ms

N = 4000 and events = 1 million events * 3
12	  86983101 ns/op	101725765 B/op	 1104593 allocs/op

latency -> 86.98 ms

N = 7000 and events = 1 million events * 3
7	 152827661 ns/op	177867984 B/op	 1932595 allocs/op

latency -> 152.82 ms

As we can see, latency goes up as N increases. Ideally we want a sweet spot where N is neither too high nor too low and latency stays within an upper bound of roughly 20-50 ms.

Note: the current average latency of getEvents is around 4-6 ms, but with the new DB design it will go up.

@2opremio (Contributor) commented:

Uhm, I am sure the allocations can be reduced.

@psheth9 (Contributor, Author) commented Jul 18, 2024

@2opremio What would be an acceptable latency for getEvents? Right now it is super fast (~6 ms) because of the in-memory design. Wanted to get some ballpark numbers. Does 50-70 ms sound okay?

@2opremio (Contributor) commented:

I think we should try to get it down as much as possible (within reason, by timeboxing it). My guess is that the allocations can be reduced by quite a bit, which should help a lot.

@psheth9 (Contributor, Author) commented Jul 19, 2024

Thanks for the input. I need to do some detailed memory profiling to analyze the allocs here.
Also, I am removing the ingestion part from the benchmark test so that we get accurate numbers just for fetching events.

N = 2000 and events = 1 million events * 3

Benchmark result: --> 45 ms (not too bad)
 24	  45592375 ns/op	50731471 B/op	  552590 allocs/op

FYI, we will not go lower than a 2000-ledger limit.
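For context, a memory profile like the one below can be generated with the stock Go tooling along these lines (standard go test and pprof commands, not copied from the issue):

	go test -bench=BenchmarkGetEvents -benchmem -memprofile mem.prof
	go tool pprof mem.prof
	(pprof) top
	(pprof) list getEvents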

pprof mem.prof:

Time: Jul 19, 2024 at 11:06am (PDT)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 30510.93MB, 75.86% of 40221.98MB total
Dropped 144 nodes (cum <= 201.11MB)
Showing top 10 nodes out of 142
      flat  flat%   sum%        cum   cum%
12592.35MB 31.31% 31.31% 12592.35MB 31.31%  bytes.growSlice
 6149.86MB 15.29% 46.60% 10586.80MB 26.32%  github.com/stellar/soroban-rpc/cmd/soroban-rpc/internal/methods.ledgerCloseMetaWithEvents
 4742.51MB 11.79% 58.39%  4742.51MB 11.79%  github.com/lann/ps.(*tree).clone (inline)
 1728.20MB  4.30% 62.68%  1890.73MB  4.70%  github.com/Masterminds/squirrel.(*insertData).appendValuesToSQL
 1154.13MB  2.87% 65.55%  4624.07MB 11.50%  github.com/stellar/go/ingest.(*LedgerTransactionReader).storeTransactions
 1093.94MB  2.72% 68.27%  9223.11MB 22.93%  database/sql.driverArgsConnLocked
  878.64MB  2.18% 70.46%   878.64MB  2.18%  reflect.packEface
  793.13MB  1.97% 72.43% 14250.60MB 35.43%  github.com/Masterminds/squirrel.InsertBuilder.Exec
  715.13MB  1.78% 74.21%   715.13MB  1.78%  reflect.New
  663.03MB  1.65% 75.86% 15872.01MB 39.46%  github.com/stellar/soroban-rpc/cmd/soroban-rpc/internal/db.(*eventHandler).InsertEvents

Profile of getEvents: list getEvents

ROUTINE ======================== github.com/stellar/soroban-rpc/cmd/soroban-rpc/internal/methods.eventsRPCHandler.getEvents in /Users/pritsheth/sdf/soroban-rpc/cmd/soroban-rpc/internal/methods/get_events.go
         0     1.24GB (flat, cum)  3.16% of Total
         .          .    339:func (h eventsRPCHandler) getEvents(ctx context.Context, request GetEventsRequest) (GetEventsResponse, error) {
         .          .    340:	if err := request.Valid(h.maxLimit); err != nil {
         .          .    341:		return GetEventsResponse{}, &jrpc2.Error{
         .          .    342:			Code:    jrpc2.InvalidParams,
         .          .    343:			Message: err.Error(),
.
.
.
         .          .    416:		}
         .          .    417:		return uint(len(found)) < limit
         .          .    418:	}
         .          .    419:
         .     1.24GB    420:	err = h.dbReader.GetEvents(ctx, cursorRange, contractIDs, f) // DB call -- This is a lot
         .          .    421:	if err != nil {
         .          .    422:		return GetEventsResponse{}, &jrpc2.Error{
         .          .    423:			Code:    jrpc2.InvalidRequest,

@psheth9 (Contributor, Author) commented Jul 19, 2024

Much faster with contract_ids in the filter (indexed): reduction in allocs as well.

Case: requested contract_id and topics not present in the DB.

Also removed the ingestion part from the benchmark test and got the following --> 0.17 ms

N = 2000 and events = 1 million events * 3
9820 178685 ns/op 35591 B/op 626 allocs/op

@psheth9 (Contributor, Author) commented Jul 22, 2024

Updated Benchmark doc: https://docs.google.com/document/d/14PTIC1dTNZBbuAtI91baFSPv9se4rO9sqcphV-81Yo8/edit

IMO it is safe to start with a 4000-ledger restriction limit. Just FYI, we also have the following things in place to protect us from intensive fetch requests:

  • MaxGetEventsExecutionDuration - 10s
  • Maximum number of events per response - 10K
  • Default number of events per response - 100

Allocs are still high for the worst-case query; one possible reason for that is we load all the rows from the (x to x + N) range. I don't know if there is a more efficient way of doing it to reduce allocs. cc @2opremio

@2opremio (Contributor) commented:

one possible reason for that is we load all the rows from (x to x + N) range

Maybe you can stream the rows into a smaller slice to build the reply instead of allocating all of them at once?
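A rough sketch of that streaming idea, assuming database/sql and an illustrative schema (the names events, ledger_sequence and event_xdr are made up for the example, not the actual soroban-rpc schema):

	import (
		"context"
		"database/sql"
	)

	// streamEvents scans one row at a time and hands it to the callback f, so memory
	// stays bounded by a single row instead of the whole (x to x + N) ledger range.
	func streamEvents(ctx context.Context, db *sql.DB, start, end uint32, f func(eventXDR []byte) bool) error {
		rows, err := db.QueryContext(ctx,
			"SELECT event_xdr FROM events WHERE ledger_sequence BETWEEN ? AND ?", start, end)
		if err != nil {
			return err
		}
		defer rows.Close()

		for rows.Next() {
			var eventXDR []byte
			if err := rows.Scan(&eventXDR); err != nil {
				return err
			}
			if !f(eventXDR) { // callback returns false once the response limit is reached
				break
			}
		}
		return rows.Err()
	}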

@2opremio (Contributor) commented:

You may even be able to use a sync.Pool, since the concurrency and maximums are known.
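A minimal sketch of the sync.Pool idea, assuming a known per-request cap on rows (maxEventsPerResponse is a hypothetical constant mirroring the 10K maximum mentioned above):

	import "sync"

	const maxEventsPerResponse = 10000 // hypothetical cap; mirrors the 10K max above

	// Pool of reusable row buffers, so each getEvents call does not reallocate a large slice.
	var rowBufPool = sync.Pool{
		New: func() any {
			buf := make([][]byte, 0, maxEventsPerResponse)
			return &buf
		},
	}

	func withRowBuffer(process func(buf [][]byte) [][]byte) {
		bufPtr := rowBufPool.Get().(*[][]byte)
		*bufPtr = process((*bufPtr)[:0]) // reuse the backing array; keep any growth for next time
		rowBufPool.Put(bufPtr)
	}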
