Skip to content
This repository has been archived by the owner on Nov 6, 2020. It is now read-only.

Avoid long state queries when serving GetNodeData requests #11444

Merged
merged 12 commits into from
Feb 4, 2020

Conversation

dvdplm
Copy link
Collaborator

@dvdplm dvdplm commented Feb 4, 2020

As networking and db access happen on the same thread, queries for state data can take a long while. To mitigate this, this PR makes the read lock more granular and sets a timeout of 2 seconds to acquire the lock. It also aborts GetNodeData requests early if a single query takes more than 50ms or the whole batch has taken more than 5 seconds to complete.

GetNodeData requests are mainly coming from FastSync nodes (only geth clients?) and when the requested hash is not present in the DB they can take a very long time (up to 220 seconds) and cause stalls and peering issues. It is unclear exactly why we are slow in these cases and this PR does not address the root cause; in my tests it does seem to improve responsiveness significantly though. It's worth noting that the protocol specs mandate nodes to implement GetNodeData but implementors are free to return incomplete responses.

It is unclear to me why geth nodes are asking for inexistent data but it could simply be reorgs?

* master:
  update kvdb-rocksdb to 0.4 (#11442)
  Rough architecutre diagram. (#11396)
  ethjson: impl Copy for hash type wrapper (#11423)
  Remove dead bootnodes, add new geth bootnodes (#11441)
  goerli: replace foundation bootnode (#11433)
Finish GetDataNode requests early if queries take too long
* master:
  chore: remove unused dependencies (#11432)
@dvdplm dvdplm self-assigned this Feb 4, 2020
@dvdplm dvdplm added the A3-inprogress ⏳ Pull request is in progress. No review needed at this stage. label Feb 4, 2020
@dvdplm dvdplm added A0-pleasereview 🤓 Pull request needs code review. and removed A3-inprogress ⏳ Pull request is in progress. No review needed at this stage. labels Feb 4, 2020
@dvdplm dvdplm marked this pull request as ready for review February 4, 2020 12:58
@dvdplm dvdplm added the B0-patch-stable 🕷 Pull request should also be back-ported to the stable branch. label Feb 4, 2020
Copy link
Collaborator

@ordian ordian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced we need this, also extracted the actual fix to #11447 for easier backporting

ethcore/sync/Cargo.toml Outdated Show resolved Hide resolved
ethcore/sync/src/chain/supplier.rs Outdated Show resolved Hide resolved
@dvdplm dvdplm mentioned this pull request Feb 4, 2020
Co-Authored-By: Andronik Ordian <write@reusable.software>
@dvdplm
Copy link
Collaborator Author

dvdplm commented Feb 4, 2020

I'm not convinced we need this

Can you elaborate on your doubts? The real fix here is your kvdb-rocksdb fix of course, but I think the locking changes here are worth keeping and I think we can and should do more than checking the payload size.

@ordian
Copy link
Collaborator

ordian commented Feb 4, 2020

I'm not convinced we need this

Can you elaborate on your doubts? The real fix here is your kvdb-rocksdb fix of course, but I think the locking changes here are worth keeping and I think we can and should do more than checking the payload size.

I mean if the request takes a long time, it's definitely a bug we shouldn't silence it, but having a warning instead of taking a freezing a client is definitely an improvement, so I changed my mind :)

@dvdplm
Copy link
Collaborator Author

dvdplm commented Feb 4, 2020

After ~4h of running this (including @ordian's kvdb-rocksdb fix) with ~150 peers, the slowest query I've seen is 105ms.

@niklasad1
Copy link
Collaborator

niklasad1 commented Feb 4, 2020

After ~4h of running this (including @ordian's kvdb-rocksdb fix) with ~150 peers, the slowest query I've seen is 105ms.

Before (2.7.1), you saw queries in the magnitude of seconds in the worst case, right?

…thereum into dp/debug/sync-stalls

* 'dp/debug/sync-stalls' of github.com:paritytech/parity-ethereum:
  Update ethcore/sync/src/chain/supplier.rs
@dvdplm
Copy link
Collaborator Author

dvdplm commented Feb 4, 2020

Before (2.7.1), you saw queries in the magnitude of seconds in the worst case, right?

Oh no, much much worse. The worst I saw was, I think, 221 seconds, but I suspect there were even worse lockups than that before I added logging.

@niklasad1 niklasad1 added A8-looksgood 🦄 Pull request is reviewed well. and removed A0-pleasereview 🤓 Pull request needs code review. labels Feb 4, 2020
@ordian ordian merged commit cf6fa63 into master Feb 4, 2020
@ordian ordian deleted the dp/debug/sync-stalls branch February 4, 2020 19:59
dvdplm added a commit that referenced this pull request Feb 4, 2020
…pstream

* master:
  Avoid long state queries when serving GetNodeData requests (#11444)
  Cargo.lock: cargo update -p kvdb-rocksdb (#11447)
  rlp_derive: cleanup (#11446)
  chore: remove unused dependencies (#11432)
  update kvdb-rocksdb to 0.4 (#11442)
  Rough architecutre diagram. (#11396)
  ethjson: impl Copy for hash type wrapper (#11423)
  Remove dead bootnodes, add new geth bootnodes (#11441)
  goerli: replace foundation bootnode (#11433)
  Update publish-docker.sh (#11428)
  Revert "[Trace] Distinguish between `create` and `create2` (#11311)" (#11427)
s3krit pushed a commit that referenced this pull request Feb 5, 2020
* Remove dead bootnodes, add new geth bootnodes

* More granular locking when fetching state
Finish GetDataNode requests early if queries take too long

* typo

* Use latest kvdb-rocksdb

* Cleanup

* Update ethcore/sync/src/chain/supplier.rs

Co-Authored-By: Andronik Ordian <write@reusable.software>

* Address review grumbles

* Fix compilation

* Address review grumbles

Co-authored-by: Andronik Ordian <write@reusable.software>
@s3krit s3krit mentioned this pull request Feb 5, 2020
ordian added a commit that referenced this pull request Feb 5, 2020
* master:
  gcc to clang (#11453)
  Avoid copies (#11451)
  Cargo.lock: cargo lock translate (#11448)
  Avoid long state queries when serving GetNodeData requests (#11444)
  Cargo.lock: cargo update -p kvdb-rocksdb (#11447)
s3krit added a commit that referenced this pull request Feb 5, 2020
* update classic testnet bootnodes (#11398)

* update classic testnet bootnodes

* Update kotti.json

* Update kotti.json

* Update kotti.json

* Update mordor.json

* verification: fix race same block + misc (#11400)

* ethcore: fix race in verification

* verification: fix some nits

* verification: refactor err type `Kind::create`

* fix: tests

* address grumbles

* address grumbles: don't panic

* fix: export hardcoded sync format (#11416)

* fix: export hardcoded sync format

* address grumbles

* make tests compile with rustc_hex 2.0

* fix grumbles: impl LowerHex for encoded Header

* goerli: replace foundation bootnode (#11433)

* Remove dead bootnodes, add new geth bootnodes (#11441)

* update kvdb-rocksdb to 0.4 (#11442)

* Avoid long state queries when serving GetNodeData requests (#11444)

* Remove dead bootnodes, add new geth bootnodes

* More granular locking when fetching state
Finish GetDataNode requests early if queries take too long

* typo

* Use latest kvdb-rocksdb

* Cleanup

* Update ethcore/sync/src/chain/supplier.rs

Co-Authored-By: Andronik Ordian <write@reusable.software>

* Address review grumbles

* Fix compilation

* Address review grumbles

Co-authored-by: Andronik Ordian <write@reusable.software>

* rlp_derive: cleanup (#11446)

* rlp_derive: update syn & co

* rlp_derive: remove dummy_const

* rlp_derive: remove unused attirubutes

* rlp-derive: change authors

* Cargo.lock: cargo update -p kvdb-rocksdb (#11447)

* Cargo.lock: new lockfile format

Manual backport of #11448

* gcc to clang (#11453)

* gcc to clang

test
```
export CC="sccache "$CC
export CXX="sccache "$CXX
```
darwin build
```
CC=clang
CXX=clang
```

* darwin - > default clang

* Bump version and CHANGELOG.md

* chore: remove unused dependencies (#11432)

* fix: compiler warnings

* chore: remove unused dependencies

* Update CHANGELOG.md

* update Cargo.lock

* update CHANGELOG.md

* backwards compatible call_type creation_method (#11450)

* rlp_derive: update syn & co

* rlp_derive: remove dummy_const

* rlp_derive: remove unused attirubutes

* rlp-derive: change authors

* rlp_derive: add rlp(default) attribute

* Revert "Revert "[Trace] Distinguish between `create` and `create2` (#11311)" (#11427)"

This reverts commit 5d4993b.

* trace: backwards compatible call_type and creation_method

* trace: add rlp backward compatibility tests

* cleanup

* i know, i hate backwards compatibility too

* address review grumbles

* update CHANGELOG.md

* just to make sure we don't screw up this again (#11455)

* Update CHANGELOG.md

Co-authored-by: Talha Cross <47772477+soc1c@users.noreply.github.com>
Co-authored-by: Niklas Adolfsson <niklasadolfsson1@gmail.com>
Co-authored-by: Martin Holst Swende <martin@swende.se>
Co-authored-by: David <dvdplm@gmail.com>
Co-authored-by: Andronik Ordian <write@reusable.software>
Co-authored-by: Denis S. Soldatov aka General-Beck <general.beck@gmail.com>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A8-looksgood 🦄 Pull request is reviewed well. B0-patch-stable 🕷 Pull request should also be back-ported to the stable branch.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants