
Replay slow down around block number 32M running against Mainnet data #6909

Closed
nksanthosh opened this issue Mar 9, 2019 · 9 comments

@nksanthosh
Contributor

Nodeos replay with Mainnet data slows down as shown in the attached graph.

  1. Restart nodeos @ 32-33 million blocks:

Safely restarting nodeos while the replay is between 32 and 33 million blocks seems to minimize total replay time, consistently across all 3 tests (see the restart sketch after this list). Restarting later than this means the replay stalls significantly; restarting earlier means a second restart is needed, or the replay may stall before completion.

  2. Restart nodeos after replay completes:

Post-replay sync performance (catching up with blocks generated after the replay started) begins at around 500-1000 blocks per minute. A 2nd restart during this sync increases performance by a factor of about 5 (again, consistent across all 3 tests) and decreases total sync time from ~24 hours to ~4-5 hours.
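As referenced above, here is a minimal sketch of the clean restart-and-resume step. It assumes a single nodeos process run directly rather than under a service manager, and the data/config paths are placeholders; this illustrates the procedure described, not the exact commands used in the tests.

```bash
# Ask nodeos to shut down cleanly so the chainbase state stays valid on disk.
# (Assumes a single nodeos process on the host.)
kill -SIGTERM "$(pgrep -x nodeos)"

# Wait for the process to exit before restarting.
while pgrep -x nodeos > /dev/null; do sleep 5; done

# Restart and let the node resume from its saved state. After a clean
# shutdown the replay flag is normally not passed again (an assumption
# based on common operator practice, not stated in this issue).
nodeos --data-dir /data/eos --config-dir /etc/eos &
```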

@nksanthosh
Contributor Author

[graph: replayissues]

@wanderingbort
Contributor

@nksanthosh @dconry Please specify the configuration and hardware used to derive this graph so that future readers of this issue can properly interpret it.

My assumption is that this graph looks very different with wabt as a WebAssembly back-end, even if wabt is slower in an absolute sense.

@dconry

dconry commented Mar 9, 2019

@wanderingbort @nksanthosh Here are some hardware/runtime details for the tests above, numbered left to right in the graph (an illustrative command line follows the list):

Test 1 (aws m4.4xlarge: 8x2.5GHz CPU / 64GB RAM): --replay --wasm-runtime wavm on nodeos 1.6.0rc2
Test 2 (aws m4.4xlarge: 8x2.5GHz CPU / 64GB RAM): --replay --wasm-runtime wavm --disable-replay-opts --trace-history --chain-state-history --plugin eosio::state_history_plugin on nodeos 1.6.1
Test 3 (aws z1d.6xlarge: 12x3.4-4.0GHz CPU / 192GB RAM): --replay --wasm-runtime wavm on nodeos 1.6.1
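For readers reconstructing these runs, a hedged sketch of roughly what the Test 2 invocation could look like. The directory paths are placeholders, and the replay flag is spelled out here as --replay-blockchain (abbreviated as --replay in the list above):

```bash
# Illustrative reconstruction of Test 2 (nodeos 1.6.1, state history enabled);
# the data/config paths are placeholders, not the paths used in the tests.
nodeos \
  --data-dir /data/eos \
  --config-dir /etc/eos \
  --replay-blockchain \
  --wasm-runtime wavm \
  --disable-replay-opts \
  --plugin eosio::state_history_plugin \
  --trace-history \
  --chain-state-history
```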

Also, tests 1 & 2 were allowed to process slowly for several days before a nodeos restart unstalled them. Test 3 was restarted promptly (prior to the stall) in an attempt to avoid the stall period. This (plus the faster CPU frequency) contributed to the faster completion seen in the graph for Test 3.

Please let me know if further details would be useful...

@matthewdarwin

Might be related to #6533?

@taokayan taokayan added the bug label Mar 12, 2019
@taokayan
Contributor

Looks like it is a bug. Do you have the memory usage logs for the instances?
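For anyone trying to reproduce this, a minimal sketch of how such a log could be captured by sampling the nodeos resident set size once a minute (an illustrative addition, not the monitoring setup used in these tests):

```bash
# Append a timestamped RSS sample (in KB) for the nodeos process every minute.
# (Assumes a single nodeos process on the host.)
while pgrep -x nodeos > /dev/null; do
  echo "$(date -u +%FT%TZ),$(ps -o rss= -p "$(pgrep -x nodeos)")" >> nodeos_rss.csv
  sleep 60
done
```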

@dconry

dconry commented Mar 18, 2019

@taokayan We do have memory statistics for each system over these intervals. I'll look into extracting the raw data from the monitoring system's API and presenting it here.

@dconry

dconry commented Mar 19, 2019

Here are summary graphs of RAM usage; the files are named to match the items/lines in the graph above. The nodeos process accounts for 90%+ of the usage for all three. I'm not sure the raw data is as useful, but it is also available upon request.

The first two graphs had monitoring enabled mid-run, so the first few hours are missing. I'm relatively certain the overall shape of the increase was consistent with the existing data.

orig hardware, no history: [graph: nohistory]

orig hardware, with state history: [graph: statehistory]

new hardware, no history: [graph: newhardware]

One takeaway: the slowness problem seems to intensify starting around 55-60 GB of RAM usage, even on the 3rd test, where the machine has significantly more than 64 GB available and chain-state-db-size-mb was also raised (see env doc for details).
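For context, a hedged sketch of how the chainbase size limit is raised on the command line; the 131072 MB value is an illustrative assumption, not the setting used in these tests, and the same option can equivalently be set in config.ini as chain-state-db-size-mb:

```bash
# Illustrative only: give chainbase a larger shared-memory budget (128 GB here).
# The value is an assumption for illustration, not the tests' actual setting.
nodeos --data-dir /data/eos --config-dir /etc/eos \
  --chain-state-db-size-mb 131072 \
  --replay-blockchain --wasm-runtime wavm
```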

@spoonincode spoonincode removed their assignment Apr 23, 2019
@spoonincode
Contributor

Mostly fixed by #7047.
