RPC performance regression since 22.4.0-RC2 #3959

Closed
garyschulte opened this issue Jun 10, 2022 · 13 comments · Fixed by #4099
Labels
bug Something isn't working TeamChupa GH issues worked on by Chupacabara Team

Comments

@garyschulte
Contributor

garyschulte commented Jun 10, 2022

Description

We have had a report of an RPC performance regression in Besu after upgrading from 22.1.x to the latest quarterly release versions:

[attachment: image (1).png]

After some version bisecting, the performance regression appears to have been introduced in 22.4.0-RC2:

[attachment: image.png]

As noted in the chart, versions since 22.4.0-RC2 show significantly higher RPC response latency under light RPC activity (1-2 requests per second).

Acceptance Criteria

  • Identify and address the source of the RPC latency

Steps to Reproduce (Bug)

To observe the request latency spike on releases since 22.4.0-RC2:

  1. use the dev network resources attached to join as a non-validating node
  2. at an average rate of 2 requests per second, issue a mix of RPC requests:
    eth_getBlockByNumber Full and eth_getLogs Full (example requests sketched below)
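
For illustration, a sketch of two such requests (assuming the node's HTTP RPC is exposed on localhost:8545; the `true` flag asks for full transaction objects, and the filter range is a placeholder):

curl -X POST --data '{"jsonrpc":"2.0","method":"eth_getBlockByNumber","params":["latest", true],"id":1}' localhost:8545
curl -X POST --data '{"jsonrpc":"2.0","method":"eth_getLogs","params":[{"fromBlock":"latest","toBlock":"latest"}],"id":2}' localhost:8545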

Expected behavior: no regression in RPC performance

Actual behavior: request latency is unacceptable

Frequency: 100% reproducible with the aforementioned traffic

Versions (Add all that apply)

  • Software version: 22.4.0-RC2 +
  • Java version: jdk 11
  • OS Name & Version: [cat /etc/*release]
  • Kernel Version: [uname -a]
  • Virtual Machine software & version: [vmware -v]
  • Docker Version: [docker version]
  • Cloud VM, type, size: [Amazon Web Services I3-large]

Smart contract information (If you're reporting an issue arising from deploying or calling a smart contract, please supply related information)

  • Solidity version [solc --version]
  • Repo with minimal set of deployable/reproducible contract code - please provide a link
  • Please include specifics on how you are deploying/calling the contract
  • Have you reproduced the issue on other eth clients

Additional Information (Add any of the following or anything else that may be relevant)

  • Besu setup info - genesis file, config options
    cloud k8s instance pod config:
- env:
    - name: JAVA_OPTS
      value: -Xmx16g
    - name: BESU_DATA_PATH
      value: /data
    - name: BESU_SYNC_MODE
      value: FULL
    - name: BESU_BOOTNODES
      value: enode://9cccbaae702d477c5fd4d704a2d6f92a90005f62de980b11b0d042877bf759774cf7d68d358c59427622e87538bc46afa1195d6ac12cb153d6771461c1830d1b@54.243.108.56:30303,enode://d6518f4f318a172158cf73c3e615c4eb488efb14c20b4a2f13570bf01092573222cd6935599a80017512457fb7f229cf6562f9d038b5d0dc98db95074d4a98b3@18.235.247.31:30303
    - name: BESU_RPC_HTTP_ENABLED
      value: "true"
    - name: BESU_RPC_HOST
      value: 0.0.0.0
    - name: BESU_RPC_HTTP_API
      value: ETH,NET,WEB3
    - name: BESU_P2P_PORT
      value: "30303"
    - name: BESU_HOST_ALLOWLIST
      value: '*'
    - name: BESU_NAT_METHOD
      value: NONE
    - name: BESU_GENESIS_FILE
      value: /genesis/genesis.json
    - name: BESU_P2P_HOST
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.annotations['dedicated.infura.io/pod-ip']
  • System info - i3xl, with resource config:
    resources:
      requests:
        cpu: 3500m
        memory: 24Gi
@garyschulte garyschulte changed the title RPC regression since 22.4.0-RC2 RPC performance regression since 22.4.0-RC2 Jun 10, 2022
@garyschulte
Contributor Author

garyschulte commented Jun 10, 2022

see this gist for details on how to connect to the palm dev network where this can be reproduced reliably

@iamhsk iamhsk added the TeamGroot GH issues worked on by Groot Team label Jun 15, 2022
@usmansaleem usmansaleem self-assigned this Jun 20, 2022
@mbaxter
Contributor

mbaxter commented Jul 7, 2022

@usmansaleem @garyschulte - has there been any progress on this issue?

@garyschulte
Contributor Author

garyschulte commented Jul 8, 2022

> @usmansaleem @garyschulte - has there been any progress on this issue?

Yes, @ahamlat has identified that the regression was introduced in #3695 and it appears to be related to netty deferring to a native transport library. Ameziane is out ill currently, but we should be able to have a mitigation for this soon.

@ahamlat
Contributor

ahamlat commented Jul 8, 2022

@mbaxter: I compared performance before and after #3695, and we can see a real degradation. I used a benchmark created by @usmansaleem with the same throughput over 5 minutes. Here are the results:

Before PR 3695

================================================================================
---- Global Information --------------------------------------------------------
> request count                                       1501 (OK=1501   KO=0     )
> min response time                                      1 (OK=1      KO=-     )
> max response time                                    358 (OK=358    KO=-     )
> mean response time                                    39 (OK=39     KO=-     )
> std deviation                                         51 (OK=51     KO=-     )
> response time 50th percentile                         12 (OK=12     KO=-     )
> response time 75th percentile                         58 (OK=58     KO=-     )
> response time 95th percentile                        144 (OK=144    KO=-     )
> response time 99th percentile                        221 (OK=221    KO=-     )
> mean requests/sec                                  4.987 (OK=4.987  KO=-     )
---- Response Time Distribution ------------------------------------------------
> t < 800 ms                                          1501 (100%)
> 800 ms < t < 1200 ms                                   0 (  0%)
> t > 1200 ms                                            0 (  0%)
> failed                                                 0 (  0%)
================================================================================

After PR 3695

================================================================================
---- Global Information --------------------------------------------------------
> request count                                       1501 (OK=1501   KO=0     )
> min response time                                      1 (OK=1      KO=-     )
> max response time                                  20834 (OK=20834  KO=-     )
> mean response time                                   484 (OK=484    KO=-     )
> std deviation                                       1830 (OK=1830   KO=-     )
> response time 50th percentile                        227 (OK=227    KO=-     )
> response time 75th percentile                        274 (OK=274    KO=-     )
> response time 95th percentile                        539 (OK=539    KO=-     )
> response time 99th percentile                      12279 (OK=12279  KO=-     )
> mean requests/sec                                  4.618 (OK=4.618  KO=-     )
---- Response Time Distribution ------------------------------------------------
> t < 800 ms                                          1444 ( 96%)
> 800 ms < t < 1200 ms                                   9 (  1%)
> t > 1200 ms                                           48 (  3%)
> failed                                                 0 (  0%)
================================================================================

Since Besu consumes a lot of CPU during these tests, I did some CPU profiling to identify the hot spot in each test.

Before PR 3695

[attachment: CPU profile image]

CPU hot spot: RocksDB.get

After PR 3695

[attachment: CPU profile image]

CPU hot spot: calls to native libraries

@diega: Do you think we can improve the performance of PR #3695? I was thinking, for example, about testing Java implementations instead of the native libraries (see the sketch below).
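
For illustration only, a minimal sketch (not the actual Besu wiring) of steering Vert.x back onto the pure-Java NIO transport instead of the native one, which is one way to test that hypothesis:

import io.vertx.core.Vertx;
import io.vertx.core.VertxOptions;

public class NioTransportCheck {
  public static void main(String[] args) {
    // Ask Vert.x not to prefer the native (epoll/kqueue) transport, falling back
    // to the pure-Java NIO transport, so the benchmark can be re-run without the
    // native library in the hot path.
    VertxOptions options = new VertxOptions().setPreferNativeTransport(false);
    Vertx vertx = Vertx.vertx(options);
    System.out.println("Native transport in use: " + vertx.isNativeTransportEnabled());
  }
}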

@mbaxter
Contributor

mbaxter commented Jul 8, 2022

This is a major regression - it is blocking upgrades to RPC nodes on the Palm network. Have you considered reverting the problematic PR while a fix is worked out?

@mbaxter mbaxter added the bug Something isn't working label Jul 8, 2022
@garyschulte
Contributor Author

It is a pretty major refactor from a couple of months back; it might be more difficult to revert than to mitigate or fix. We will prioritize a fix one way or the other.

@mbaxter
Contributor

mbaxter commented Jul 8, 2022

I took a quick look at the revert conflicts - it doesn't look too bad to me. It really depends on timing: if there's an imminent fix, it probably makes sense to just wait on the fix. But if there's weeks of exploratory work, it's probably better to just revert it.

@diega
Contributor

diega commented Jul 12, 2022

@mbaxter if you have the environment for running the test, would you try just reverting 70a01962? This commit was lost in the squash, but it's the one enabling native transport usage. If the regression is still present after reverting it, then the problem is somewhere else in the refactor.
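
For anyone trying this, a sketch of the revert-and-rebuild steps (assuming a local checkout of the besu repo and the usual Gradle build):

git checkout main
git revert 70a01962       # undo only the commit that enables native transport
./gradlew installDist     # rebuild Besu and point the benchmark at this build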

@iamhsk iamhsk removed the TeamGroot GH issues worked on by Groot Team label Jul 12, 2022
@non-fungible-nelson
Contributor

@diega suggested testing the impact of adding ordered=false in the declaration of the blockingHandler for the HTTP endpoint (ref: https://vertx.io/docs/vertx-web/java/#_using_blocking_handlers)

Add false as second argument here: https://github.com/hyperledger/besu/blob/main/ethereum/api/src/main/java/org/hyperledger/besu/ethereum/api/jsonrpc/JsonRpcHttpService.java#L333 and here https://github.com/hyperledger/besu/blob/main/ethereum/api/src/main/java/org/hyperledger/besu/ethereum/api/jsonrpc/JsonRpcHttpService.java#L344
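
For illustration, a minimal self-contained sketch of a Vert.x route using an unordered blocking handler (the route path, port, and handleJsonRpcRequest placeholder are illustrative, not the actual Besu code):

import io.vertx.core.Vertx;
import io.vertx.core.http.HttpMethod;
import io.vertx.ext.web.Router;
import io.vertx.ext.web.RoutingContext;

public class UnorderedBlockingHandlerExample {

  public static void main(String[] args) {
    Vertx vertx = Vertx.vertx();
    Router router = Router.router(vertx);

    // Passing `false` as the second argument makes blocking executions unordered,
    // so slow requests no longer queue up behind each other on the same context.
    router.route("/")
        .method(HttpMethod.POST)
        .blockingHandler(UnorderedBlockingHandlerExample::handleJsonRpcRequest, false);

    vertx.createHttpServer().requestHandler(router).listen(8545);
  }

  private static void handleJsonRpcRequest(RoutingContext ctx) {
    // Placeholder for the real (blocking) JSON-RPC dispatch.
    ctx.response()
        .putHeader("Content-Type", "application/json")
        .end("{\"jsonrpc\":\"2.0\",\"id\":1,\"result\":null}");
  }
}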

I am paraphrasing his suggestion, but anyone looking at the defect could try the above. @mbaxter

@mbaxter
Contributor

mbaxter commented Jul 14, 2022

I ran @usmansaleem's performance tool against the Palm Testnet at version 22.4.0-RC1 (before the regression was released) and compared that to main, but it doesn't look like the issue is reproducing for me, unless I'm reading the output wrong.

At 22.4.0-RC1 (before the regression):

================================================================================
---- Global Information --------------------------------------------------------
> request count                                       1501 (OK=1501   KO=0     )
> min response time                                      1 (OK=1      KO=-     )
> max response time                                    982 (OK=982    KO=-     )
> mean response time                                     6 (OK=6      KO=-     )
> std deviation                                         26 (OK=26     KO=-     )
> response time 50th percentile                          4 (OK=4      KO=-     )
> response time 75th percentile                          6 (OK=6      KO=-     )
> response time 95th percentile                         11 (OK=11     KO=-     )
> response time 99th percentile                         35 (OK=35     KO=-     )
> mean requests/sec                                   4.97 (OK=4.97   KO=-     )
---- Response Time Distribution ------------------------------------------------
> t < 800 ms                                          1500 (100%)
> 800 ms < t < 1200 ms                                   1 (  0%)
> t > 1200 ms                                            0 (  0%)
> failed                                                 0 (  0%)
================================================================================

And on main:

================================================================================
---- Global Information --------------------------------------------------------
> request count                                       1501 (OK=1501   KO=0     )
> min response time                                      1 (OK=1      KO=-     )
> max response time                                    601 (OK=601    KO=-     )
> mean response time                                    11 (OK=11     KO=-     )
> std deviation                                         18 (OK=18     KO=-     )
> response time 50th percentile                         10 (OK=10     KO=-     )
> response time 75th percentile                         14 (OK=14     KO=-     )
> response time 95th percentile                         20 (OK=20     KO=-     )
> response time 99th percentile                         57 (OK=57     KO=-     )
> mean requests/sec                                  4.987 (OK=4.987  KO=-     )
---- Response Time Distribution ------------------------------------------------
> t < 800 ms                                          1501 (100%)
> 800 ms < t < 1200 ms                                   0 (  0%)
> t > 1200 ms                                            0 (  0%)
> failed                                                 0 (  0%)
================================================================================

@usmansaleem - any suggestions on the gatling tests?

@usmansaleem
Member

@mbaxter I used https://github.com/ConsenSys-Palm/palm-node to create the palm instance and updated the bootnodes etc. as mentioned in this ticket. I set up Besu 22.4.3, not the RC1 that was reported in the ticket. However, I was not able to see the regression in my tests (I even increased the load to 50 total users/sec). @ahamlat, on the other hand, was able to reproduce the regression in the environment that he set up (I believe he set up mainnet, not the palm dev env).

@diega
Contributor

diega commented Jul 14, 2022

I ran the besu-gatling default test suite against a fully synced palm dev node (HEAD at 18_438_825) with the following outputs:

Output for main (b85b39d30)

================================================================================
---- Global Information --------------------------------------------------------
> request count                                       1501 (OK=1501   KO=0     )
> min response time                                      1 (OK=1      KO=-     )
> max response time                                   3493 (OK=3493   KO=-     )
> mean response time                                     9 (OK=9      KO=-     )
> std deviation                                         90 (OK=90     KO=-     )
> response time 50th percentile                          7 (OK=7      KO=-     )
> response time 75th percentile                          9 (OK=9      KO=-     )
> response time 95th percentile                         10 (OK=10     KO=-     )
> response time 99th percentile                         14 (OK=14     KO=-     )
> mean requests/sec                                  4.938 (OK=4.938  KO=-     )
---- Response Time Distribution ------------------------------------------------
> t < 800 ms                                          1500 (100%)
> 800 ms < t < 1200 ms                                   0 (  0%)
> t > 1200 ms                                            1 (  0%)
> failed                                                 0 (  0%)
================================================================================

Output for 22.4.4 (eefa93fd6)

================================================================================
---- Global Information --------------------------------------------------------
> request count                                       1501 (OK=1501   KO=0     )
> min response time                                      1 (OK=1      KO=-     )
> max response time                                   4033 (OK=4033   KO=-     )
> mean response time                                     9 (OK=9      KO=-     )
> std deviation                                        104 (OK=104    KO=-     )
> response time 50th percentile                          7 (OK=7      KO=-     )
> response time 75th percentile                          8 (OK=8      KO=-     )
> response time 95th percentile                         12 (OK=12     KO=-     )
> response time 99th percentile                         16 (OK=16     KO=-     )
> mean requests/sec                                  4.921 (OK=4.921  KO=-     )
---- Response Time Distribution ------------------------------------------------
> t < 800 ms                                          1500 (100%)
> 800 ms < t < 1200 ms                                   0 (  0%)
> t > 1200 ms                                            1 (  0%)
> failed                                                 0 (  0%)
================================================================================

Output for diega/3959_fix (4801d9e70)

================================================================================
---- Global Information --------------------------------------------------------
> request count                                       1501 (OK=1501   KO=0     )
> min response time                                      2 (OK=2      KO=-     )
> max response time                                   4029 (OK=4029   KO=-     )
> mean response time                                     9 (OK=9      KO=-     )
> std deviation                                        104 (OK=104    KO=-     )
> response time 50th percentile                          5 (OK=5      KO=-     )
> response time 75th percentile                          9 (OK=9      KO=-     )
> response time 95th percentile                         11 (OK=11     KO=-     )
> response time 99th percentile                         22 (OK=22     KO=-     )
> mean requests/sec                                  4.921 (OK=4.921  KO=-     )
---- Response Time Distribution ------------------------------------------------
> t < 800 ms                                          1500 (100%)
> 800 ms < t < 1200 ms                                   0 (  0%)
> t > 1200 ms                                            1 (  0%)
> failed                                                 0 (  0%)
================================================================================

Based on these numbers, I cannot reproduce the reported performance degradation.

@iamhsk iamhsk added the TeamChupa GH issues worked on by Chupacabara Team label Jul 14, 2022
@ahamlat
Contributor

ahamlat commented Jul 19, 2022

I'm not able to reproduce anymore; here are the results:

  1. Version 22.4.3

curl -X POST --data '{"jsonrpc":"2.0","method":"web3_clientVersion","params":[],"id":1}' localhost:8545
{
"jsonrpc" : "2.0",
"id" : 1,
"result" : "besu/v22.4.3/linux-x86_64/corretto-java-17"
}

Benchmark output

================================================================================
---- Global Information --------------------------------------------------------
> request count                                       1501 (OK=1501   KO=0     )
> min response time                                      1 (OK=1      KO=-     )
> max response time                                    767 (OK=767    KO=-     )
> mean response time                                    91 (OK=91     KO=-     )
> std deviation                                         76 (OK=76     KO=-     )
> response time 50th percentile                         69 (OK=69     KO=-     )
> response time 75th percentile                        134 (OK=134    KO=-     )
> response time 95th percentile                        219 (OK=219    KO=-     )
> response time 99th percentile                        339 (OK=339    KO=-     )
> mean requests/sec                                  4.987 (OK=4.987  KO=-     )
---- Response Time Distribution ------------------------------------------------
> t < 800 ms                                          1501 (100%)
> 800 ms < t < 1200 ms                                   0 (  0%)
> t > 1200 ms                                            0 (  0%)
> failed                                                 0 (  0%)
================================================================================

  2. Diego's fix

curl -X POST --data '{"jsonrpc":"2.0","method":"web3_clientVersion","params":[],"id":1}' localhost:8545
{
"jsonrpc" : "2.0",
"id" : 1,
"result" : "besu/v22.7.0-RC2-dev-4801d9e7/linux-x86_64/corretto-java-17"
}

Benchmark output

================================================================================
---- Global Information --------------------------------------------------------
> request count                                       1501 (OK=1501   KO=0     )
> min response time                                      1 (OK=1      KO=-     )
> max response time                                    928 (OK=928    KO=-     )
> mean response time                                    64 (OK=64     KO=-     )
> std deviation                                         60 (OK=60     KO=-     )
> response time 50th percentile                         55 (OK=55     KO=-     )
> response time 75th percentile                         84 (OK=84     KO=-     )
> response time 95th percentile                        159 (OK=159    KO=-     )
> response time 99th percentile                        238 (OK=238    KO=-     )
> mean requests/sec                                  4.987 (OK=4.987  KO=-     )
---- Response Time Distribution ------------------------------------------------
> t < 800 ms                                          1500 (100%)
> 800 ms < t < 1200 ms                                   1 (  0%)
> t > 1200 ms                                            0 (  0%)
> failed                                                 0 (  0%)
================================================================================

  3. PR 3695

curl -X POST --data '{"jsonrpc":"2.0","method":"web3_clientVersion","params":[],"id":1}' localhost:8545
{
"jsonrpc" : "2.0",
"id" : 1,
"result" : "besu/v22.4.0-RC2-dev-70a01962/linux-x86_64/corretto-java-17"
}

Benchmark output

================================================================================
---- Global Information --------------------------------------------------------
> request count                                       1501 (OK=1501   KO=0     )
> min response time                                      1 (OK=1      KO=-     )
> max response time                                    404 (OK=404    KO=-     )
> mean response time                                    52 (OK=52     KO=-     )
> std deviation                                         53 (OK=53     KO=-     )
> response time 50th percentile                         35 (OK=35     KO=-     )
> response time 75th percentile                         71 (OK=71     KO=-     )
> response time 95th percentile                        150 (OK=150    KO=-     )
> response time 99th percentile                        248 (OK=248    KO=-     )
> mean requests/sec                                  4.987 (OK=4.987  KO=-     )
---- Response Time Distribution ------------------------------------------------
> t < 800 ms                                          1501 (100%)
> 800 ms < t < 1200 ms                                   0 (  0%)
> t > 1200 ms                                            0 (  0%)
> failed                                                 0 (  0%)
================================================================================

I had the same results with Java 11.
