FN/BN CPU BBQ #3630
Labels
bug
Something isn't working
priority:high
High priority issue to be prioritized for the current/following sprint
v0.17.0
Intended for v0.17.0 release
Overview
Recent spikes in network activity have pushed resource consumption for FN/BN node operators to unacceptable levels.
Problem Description
The flamegraph shown below is from the profile captured at the exact moment where such a spike happened.
The profile suggests that most of the time is spent in the GetSize operation on the Blockstore. FN/BN uses GetSize to determine whether it has the data or not. This check is triggered by remote Bitswap requests from LNs. Before requesting the data itself, LNs check which peers have it by broadcasting these requests to all connected peers (in the worst case). Once they know who has the data, they request it from a particular peer.
As it turns out, checking for the existence of data is as expensive as getting the data itself. The cost comes from the inverted_index, the main bottleneck of celestia-node (at the time of writing). The index is maintained in an enormous KVStore, and simple lookup operations on it are the source of the CPU overhead we are seeing.
Overall, we have LNs broadcasting data checks for every sample operation. Each sample in fact consists of multiple requests, and each of those requests is duplicated to every connected peer in the worst case, creating massive load on the network.
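The worst-case amplification described above is a simple product, sketched below in Python. The function names and all numbers are hypothetical, chosen only to show how the load scales. They are not measurements from the network.

```python
def checks_per_sample(requests_per_sample: int, connected_peers: int) -> int:
    """Worst case: every request in a sample is broadcast to every connected peer."""
    return requests_per_sample * connected_peers


def checks_received_by_one_node(light_nodes: int, samples_per_ln: int,
                                requests_per_sample: int) -> int:
    """Existence checks one FN/BN receives if every LN is connected to it."""
    return light_nodes * samples_per_ln * requests_per_sample


# Hypothetical inputs: 8 requests per sample, 50 connected peers.
print(checks_per_sample(requests_per_sample=8, connected_peers=50))  # 400

# Hypothetical inputs: 1000 LNs, 16 samples each, 8 requests per sample.
print(checks_received_by_one_node(1000, 16, 8))  # 128000
```

Since each check costs about as much as a read on the FN/BN side, the load grows linearly with every one of these factors.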
Potential Solutions
DataRoot/Hash
routing FN to respective EDS CAR file.
Decision
After internal discussion, we decided to go with the "Dummy Solution". It solves the problem in a simple and fast way, and the tradeoffs it brings are negligible.
We may still additionally implement the solution that removes the inverted index entirely, to buy more time if Shwap gets delayed or if there is more pressure to achieve bigger blocks, which the inverted index is blocking as well.