EPIC: Optimizing transaction processing #14929

Closed
elias-orijtech opened this issue Feb 6, 2023 · 12 comments
Labels
T:Epic Epics

Comments

@elias-orijtech
Contributor

Summary

As brought up in a recent team meeting, optimizing the transaction processing of Cosmos is a top priority. As a point of comparison, Cosmos is described as an order (or two?) of magnitude slower than Tendermint itself.

Problem Definition

Performance is important for keeping the resource requirements of Cosmos chains in check, and to alleviate the effect of denial-of-service attacks.

Work Breakdown

As usual for performance optimization:

  • Create or run benchmarks with realistic loads.
  • Optimize hotspots.
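
As a starting point for the first step, here is a minimal sketch of a Go micro-benchmark that runs outside `go test` via `testing.Benchmark`. The function `processTx` is a hypothetical stand-in (it only hashes the payload) and is not part of the SDK; a real benchmark would call into the code path under test.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"testing"
)

// processTx is a hypothetical stand-in for real transaction processing.
// It only hashes the payload so the benchmark has some work to measure.
func processTx(tx []byte) [32]byte {
	return sha256.Sum256(tx)
}

// benchmarkProcessTx measures processTx for a payload of the given size.
func benchmarkProcessTx(size int) testing.BenchmarkResult {
	// Init registers the testing flags; it is needed when Benchmark is
	// called outside "go test" and has no effect if already called.
	testing.Init()
	tx := make([]byte, size)
	return testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			processTx(tx)
		}
	})
}

func main() {
	// Payload sizes are arbitrary assumptions, not realistic loads.
	for _, size := range []int{256, 4096} {
		res := benchmarkProcessTx(size)
		fmt.Printf("size=%d: %s %s\n", size, res.String(), res.MemString())
	}
}
```

The output reports ns/op, B/op and allocs/op, the same columns as the `go test -bench` output, which makes the numbers comparable across runs.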

CC @odeke-em for reference.

CC @tac0turtle to get the ball rolling. What are the most realistic benchmarks to focus on? Are there other issues relevant to this work?

I've played around with the benchmarks and tests in order to find something relevant. make test-sim-benchmark seems relevant, but it fails:

$ make test-sim-benchmark
Running application benchmark for numBlocks=500, blockSize=200. This may take awhile!
main module (github.com/cosmos/cosmos-sdk) does not contain package github.com/cosmos/cosmos-sdk/simapp
make: *** [test-sim-benchmark] Error 1

Running

$ cd simapp
$ go test -mod=readonly -benchmem -run=^$ -bench ^BenchmarkFullAppSimulation$ -Enabled=true -NumBlocks=500 -BlockSize=200 -Commit=true -timeout=24h
goos: darwin
goarch: arm64
pkg: cosmossdk.io/simapp
BenchmarkFullAppSimulation-8   	       1	126039193916 ns/op	86157807584 B/op	1111032767 allocs/op
PASS
ok  	cosmossdk.io/simapp	126.989s

gives me some result, but is it a realistic load?
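
To put the totals above in per-block terms, here is a rough sketch of the arithmetic. It assumes the run simulated exactly the 500 blocks requested with -NumBlocks and that setup cost is negligible; neither is confirmed by the output itself.

```go
package main

import "fmt"

// Figures copied from the benchmark output above.
const (
	totalNs     = 126039193916 // ns/op for the single iteration
	totalBytes  = 86157807584  // B/op
	totalAllocs = 1111032767   // allocs/op
	numBlocks   = 500          // value passed as -NumBlocks (assumed to be reached)
)

// perBlock divides a whole-run total by the number of simulated blocks.
func perBlock(total, blocks int64) float64 {
	return float64(total) / float64(blocks)
}

func main() {
	fmt.Printf("time per block:   %.1f ms\n", perBlock(totalNs, numBlocks)/1e6)
	fmt.Printf("memory per block: %.1f MB\n", perBlock(totalBytes, numBlocks)/1e6)
	fmt.Printf("allocs per block: %.0f\n", perBlock(totalAllocs, numBlocks))
}
```

Under those assumptions, each block costs roughly a quarter of a second and a couple of million allocations, which suggests allocation pressure is worth profiling alongside CPU time.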

@github-actions github-actions bot added the needs-triage Issue that needs to be triaged label Feb 6, 2023
@alexanderbez
Contributor

This is going to have a very large variance depending on what the application does with transactions and how it does it. My suggestion is to consider using the simulations for benchmarking.

@elias-orijtech
Contributor Author

How would I utilize the simulations? Isn't simapp a simulator?

@tac0turtle
Member

I think this is a fairly large scope that may make sense to break down into phases.

There is the ante handler, which checks transactions, followed by the execution, storage, and commitment phases. It might make sense to start with those instead of benchmarking a chain.

Most of these items can be done without running or benchmarking a chain, by benchmarking the components that take part in the execution path instead. Tx processing is also up to applications, so benchmarking modules may not need to be part of the first phase here.
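
One way to measure the phases separately, without a running chain, is to time each step over the same set of transactions. The sketch below is only an illustration: the `phase` type, `timePhases`, and the placeholder phase bodies are all hypothetical and do not correspond to SDK APIs.

```go
package main

import (
	"fmt"
	"time"
)

// phase is a hypothetical named step of the transaction path.
type phase struct {
	name string
	run  func(tx []byte)
}

// timePhases runs every phase over every transaction and returns the
// cumulative wall-clock time spent in each phase, keyed by phase name.
func timePhases(phases []phase, txs [][]byte) map[string]time.Duration {
	spent := make(map[string]time.Duration, len(phases))
	for _, tx := range txs {
		for _, p := range phases {
			start := time.Now()
			p.run(tx)
			spent[p.name] += time.Since(start)
		}
	}
	return spent
}

func main() {
	// Placeholder phases (assumptions): real ones would call into the ante
	// handler, message execution, and the store commit respectively.
	phases := []phase{
		{"ante", func(tx []byte) { _ = len(tx) }},
		{"execute", func(tx []byte) { time.Sleep(time.Millisecond) }},
		{"commit", func(tx []byte) { _ = append([]byte(nil), tx...) }},
	}
	txs := make([][]byte, 10)
	for name, d := range timePhases(phases, txs) {
		fmt.Printf("%-8s %v\n", name, d)
	}
}
```

Wall-clock timing like this is coarse; for short phases, a `testing.B` benchmark per phase would give more stable numbers.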

@tac0turtle tac0turtle changed the title Optimizing transaction processing EPIC: Optimizing transaction processing Feb 7, 2023
@tac0turtle tac0turtle added T:Epic Epics and removed needs-triage Issue that needs to be triaged labels Feb 7, 2023
@elias-orijtech
Contributor Author

Sounds good to me, in particular the part about leaving out application specific processing for now.

How would I go about running the ante handler?

@alexanderbez
Contributor

> How would I utilize the simulations? Isn't simapp a simulator?

No, SimApp is a basic reference application implementation. The simulation framework in the SDK uses it for simulations.

@yihuang
Collaborator

yihuang commented Feb 8, 2023

With some changes like this, I'm able to profile block delivery on production data using Tendermint block replay:

  • Reset application.db to an old version.
  • Run cronosd start --home /chain/.cronosd --cpu-profile /tmp/cpu.profile; it starts replaying blocks at full speed.
  • Wait at least 5 seconds, then interrupt the process whenever you want.

In my test run, most of the blocks were empty. The profile result looks like this:

      flat  flat%   sum%        cum   cum%
     6.21s 28.66% 28.66%      6.29s 29.03%  runtime.cgocall
     5.76s 26.58% 55.24%      5.76s 26.58%  [librocksdb.so.7.9.2]
     1.99s  9.18% 64.42%      1.99s  9.18%  [libc.so.6]
     0.79s  3.65% 68.07%      0.79s  3.65%  [liblz4.so.1.9.3]
     0.69s  3.18% 71.25%      0.69s  3.18%  github.com/tendermint/tendermint/types.(*Validator).CompareProposerPriority
     0.59s  2.72% 73.97%      0.59s  2.72%  [libstdc++.so.6.0.28]
     0.33s  1.52% 75.50%      0.92s  4.25%  runtime.mallocgc
     0.32s  1.48% 76.97%      0.32s  1.48%  github.com/tendermint/tendermint/types.safeAdd (inline)
     0.30s  1.38% 78.36%      1.66s  7.66%  github.com/tendermint/tendermint/types.(*ValidatorSet).incrementProposerPriority
     0.24s  1.11% 79.46%      0.93s  4.29%  github.com/tendermint/tendermint/types.(*ValidatorSet).getValWithMostPriority (inline)
     0.24s  1.11% 80.57%      0.79s  3.65%  runtime.scanobject
     0.18s  0.83% 81.40%      0.23s  1.06%  runtime.findObject

What's interesting is that Tendermint's IncrementProposerPriority/CompareProposerPriority pop up as hotspots. There's O(n) processing there; I'm not sure if it's a concern.
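
To illustrate where the O(n) cost comes from, here is a deliberately simplified sketch of one proposer-selection round. It is not Tendermint's actual code: it omits priority rescaling and centering, overflow-safe addition, and address-based tie-breaking, and the `validator` type and `nextProposer` function are hypothetical names.

```go
package main

import "fmt"

// validator is a simplified, hypothetical view of a validator entry.
type validator struct {
	name     string
	power    int64
	priority int64
}

// nextProposer performs one simplified round over a non-empty set: every
// validator's priority grows by its voting power, the validator with the
// highest priority is chosen, and its priority drops by the total power.
// Both loops visit every validator, so each round is O(n).
func nextProposer(vals []*validator) *validator {
	var total int64
	for _, v := range vals { // first O(n) pass: bump priorities
		v.priority += v.power
		total += v.power
	}
	best := vals[0]
	for _, v := range vals[1:] { // second O(n) pass: find the maximum
		if v.priority > best.priority {
			best = v
		}
	}
	best.priority -= total
	return best
}

func main() {
	vals := []*validator{{"a", 3, 0}, {"b", 1, 0}}
	for round := 0; round < 4; round++ {
		fmt.Print(nextProposer(vals).name, " ")
	}
	fmt.Println()
}
```

With voting powers 3 and 1, validator a is picked in three of every four rounds. Since this work happens once per block, its share of the profile is likely inflated when most replayed blocks are empty, as in the run above.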

@lasarojc

lasarojc commented Feb 8, 2023

Even though the applications built on Cosmos may be very different from "regular" applications, it may be worth looking into classical benchmarks to gather extra data points, such as in https://arxiv.org/pdf/2210.04484.pdf

@elias-orijtech
Contributor Author

@yihuang that sounds like exactly what I want. Can you please explain to me how I acquire a snapshot application.db and a body of block data to replay?

My only concern is that any snapshot may not have any outlier transactions: unusual transactions taking a disproportionate amount of processing. They're juicy targets for DoS attacks, yet presumably rarely seen in normal transaction logs.

@yihuang
Collaborator

yihuang commented Feb 8, 2023

> @yihuang that sounds like exactly what I want. Can you please explain to me how I acquire a snapshot application.db and a body of block data to replay?

On startup, if Tendermint has newer blocks than application.db, it'll replay those blocks automatically, so you just need to roll back your application.db to an earlier version. There are a few options:

  • Restore application.db from a statesync snapshot. There's a PR for restoring from a local snapshot (Enable local statesync/snapshot restore #13521); not sure of its status.
  • Restore from a db backup, or create a backup now, wait for the node to sync for a while, then restore.

It was convenient for me because I'm developing this "versiondb" feature, for which I have built a set of tools that can replay the change set to any target version, dump the IAVL snapshot, and restore application.db from those snapshots. So basically I can restore application.db to any version in several minutes. You can find more about them here; they should work for any cosmos-sdk chain, since we have the same db structure.

> My only concern is that any snapshot may not have any outlier transactions: unusual transactions taking a disproportionate amount of processing. They're juicy targets for DoS attacks, yet presumably rarely seen in normal transaction logs.

Yeah, that's hard to detect in benchmarks; you can't cover all the cases. We probably need to monitor each block's processing time for abnormal numbers.

@elias-orijtech
Contributor Author

How do you do it without having an existing node running? I don't have one locally, but more importantly I think it's crucial to be able to run benchmarks continuously. Otherwise, performance will surely backslide in time.

@yihuang
Collaborator

yihuang commented Feb 8, 2023

> How do you do it without having an existing node running? I don't have one locally, but more importantly I think it's crucial to be able to run benchmarks continuously. Otherwise, performance will surely backslide in time.

I was just trying to get a feel for the production behavior. For benchmarks that need to run continuously, we'll need a more isolated environment.

@tac0turtle
Member

Closing this for now, as the work is part of a simulator rewrite that is getting started.

5 participants