
Replay slow down around block number 32M running against Mainnet data #6909

Closed
nksanthosh opened this issue Mar 9, 2019 · 9 comments

@nksanthosh
Contributor

Nodeos replay with Mainnet data slows down as shown in the attached graph.

  1. Restart nodeos @ 32-33 million blocks:

Safely restarting nodeos while the replay is between 32 and 33 million blocks seems to minimize total replay time, consistently across all 3 tests (see the restart sketch after this list). Restarting later than this means the replay stalls significantly; restarting earlier means a second restart is needed, or the replay may stall before completion.

  2. Restart nodeos after replay completes:

Post-replay sync performance (catching up with blocks generated after the replay started) begins at around 500-1000 blocks per minute. A 2nd restart during this sync increases performance by a factor of about 5 (again, consistent across all 3 tests) and decreases total sync time from ~24 hours to ~4-5 hours.
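As referenced above, here is a minimal sketch of the clean restart-and-resume step. It assumes a single nodeos process run directly rather than under a service manager, and the data/config paths are placeholders; this illustrates the procedure described, not the exact commands used in the tests.

```bash
# Ask nodeos to shut down cleanly so the chainbase state stays valid on disk.
# (Assumes a single nodeos process on the host.)
kill -SIGTERM "$(pgrep -x nodeos)"

# Wait for the process to exit before restarting.
while pgrep -x nodeos > /dev/null; do sleep 5; done

# Restart and let the node resume from its saved state. After a clean
# shutdown the replay flag is normally not passed again (an assumption
# based on common operator practice, not stated in this issue).
nodeos --data-dir /data/eos --config-dir /etc/eos &
```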

@nksanthosh
Contributor Author

[graph: replayissues]

@wanderingbort
Contributor

@nksanthosh @dconry Please specify the configuration and hardware used to derive this graph so that future readers of this issue can properly interpret it.

My assumption is that this graph looks very different with wabt as a WebAssembly back-end, even if wabt is slower in an absolute sense.

@dconry

dconry commented Mar 9, 2019

@wanderingbort @nksanthosh Here are some hardware/runtime details for the tests above, numbered left to right in the graph (an illustrative command line follows the list):

Test 1 (aws m4.4xlarge: 8x2.5GHz CPU / 64GB RAM): --replay --wasm-runtime wavm on nodeos 1.6.0rc2
Test 2 (aws m4.4xlarge: 8x2.5GHz CPU / 64GB RAM): --replay --wasm-runtime wavm --disable-replay-opts --trace-history --chain-state-history --plugin eosio::state_history_plugin on nodeos 1.6.1
Test 3 (aws z1d.6xlarge: 12x3.4-4.0GHz CPU / 192GB RAM): --replay --wasm-runtime wavm on nodeos 1.6.1
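For readers reconstructing these runs, a hedged sketch of roughly what the Test 2 invocation could look like. The directory paths are placeholders, and the replay flag is spelled out here as --replay-blockchain (abbreviated as --replay in the list above):

```bash
# Illustrative reconstruction of Test 2 (nodeos 1.6.1, state history enabled);
# the data/config paths are placeholders, not the paths used in the tests.
nodeos \
  --data-dir /data/eos \
  --config-dir /etc/eos \
  --replay-blockchain \
  --wasm-runtime wavm \
  --disable-replay-opts \
  --plugin eosio::state_history_plugin \
  --trace-history \
  --chain-state-history
```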

Also, tests 1 & 2 were allowed to process slowly for several days before a nodeos restart unstalled them. Test 3 was restarted promptly (prior to the stall) in an attempt to avoid the stall period. This (plus the faster CPU frequency) contributed to the faster completion seen in the graph for Test 3.

Please let me know if further details would be useful...

@matthewdarwin

Might be related to #6533?

@taokayan taokayan added the bug label Mar 12, 2019
@taokayan
Contributor

Looks like it is a bug. Do you have the memory usage logs for the instances?
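For anyone trying to reproduce this, a minimal sketch of how such a log could be captured by sampling the nodeos resident set size once a minute (an illustrative addition, not the monitoring setup used in these tests):

```bash
# Append a timestamped RSS sample (in KB) for the nodeos process every minute.
# (Assumes a single nodeos process on the host.)
while pgrep -x nodeos > /dev/null; do
  echo "$(date -u +%FT%TZ),$(ps -o rss= -p "$(pgrep -x nodeos)")" >> nodeos_rss.csv
  sleep 60
done
```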

@dconry

dconry commented Mar 18, 2019

@taokayan We do have memory statistics for each system over these intervals. I'll look into extracting the raw data from the monitoring system's API and presenting it here.

@dconry

dconry commented Mar 19, 2019

Here are summary graphs of RAM usage; the files are named to match the items/lines in the graph above. The nodeos process accounts for 90%+ of the usage for all three. I'm not sure the raw data is as useful, but it is also available upon request.

The first two graphs had monitoring enabled mid-run, so the first few hours are missing. I'm relatively certain the overall shape of the increase was consistent with the existing data.

orig hardware, no history: [graph: nohistory]

orig hardware, with state history: [graph: statehistory]

new hardware, no history: [graph: newhardware]

One takeaway: the slowness problem seems to intensify starting around 55-60 GB of RAM usage, even on the 3rd test, where the machine has significantly more than 64 GB available and chain-state-db-size-mb was also raised (see env doc for details).
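For context, a hedged sketch of how the chainbase size limit is raised on the command line; the 131072 MB value is an illustrative assumption, not the setting used in these tests, and the same option can equivalently be set in config.ini as chain-state-db-size-mb:

```bash
# Illustrative only: give chainbase a larger shared-memory budget (128 GB here).
# The value is an assumption for illustration, not the tests' actual setting.
nodeos --data-dir /data/eos --config-dir /etc/eos \
  --chain-state-db-size-mb 131072 \
  --replay-blockchain --wasm-runtime wavm
```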

@spoonincode spoonincode removed their assignment Apr 23, 2019
@spoonincode
Contributor

Mostly fixed by #7047.
