-
Notifications
You must be signed in to change notification settings - Fork 52
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: impl bootstrap cache #2487
Conversation
a6576cd
to
772ef1d
Compare
772ef1d
to
0ce16f5
Compare
0ce16f5
to
3f4b9d6
Compare
3f4b9d6
to
f949029
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
devskim found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.
d53f094
to
a073526
Compare
Add persistent bootstrap cache to maintain a list of previously known peers, improving network bootstrapping efficiency and reducing cold-start times. Enhance the bootstrap cache implementation with robust corruption detection and recovery mechanisms. This change ensures system resilience when the cache file becomes corrupted or invalid. Key changes: * Add explicit cache corruption detection and error reporting * Implement cache rebuilding from in-memory peers or endpoints * Use atomic file operations to prevent corruption during writes * Improve error handling with specific error variants * Add comprehensive test suite for corruption scenarios The system now handles corruption by: 1. Detecting invalid/corrupted JSON data during cache reads 2. Attempting recovery using in-memory peers if available 3. Falling back to endpoint discovery if needed 4. Using atomic operations for safe cache updates Testing: * Add tests for various corruption scenarios * Add concurrent access tests * Add file operation tests * Verify endpoint fallback behavior - Add smarter JSON format detection by checking content structure - Improve error handling with specific InvalidResponse variant - Reduce unnecessary warnings by only logging invalid multiaddrs - Simplify parsing logic to handle both JSON and plain text formats - Add better error context for failed parsing attempts All tests passing, including JSON endpoint and plain text format tests. feat(bootstrap_cache): implement circuit breaker with exponential backoff - Add CircuitBreakerConfig with configurable parameters for failures and timeouts - Implement circuit breaker states (closed, open, half-open) with state transitions - Add exponential backoff for failed request retries - Update InitialPeerDiscovery to support custom circuit breaker configuration - Add comprehensive test suite with shorter timeouts for faster testing This change improves system resilience by preventing cascading failures and reducing load on failing endpoints through intelligent retry mechanisms.
…ork isolation * Refactor CacheStore::from_args to handle peer sources more consistently * Ensure test network mode is properly isolated from cache system * Fix default behavior to use URL endpoint when no peers provided * Add proper handling for local and first node modes * Prevent cache operations when in test network mode This change ensures that: - Test network peers are isolated from cache operations - Default behavior (no args) correctly uses URL endpoints - Local and first node modes return empty stores - Explicit peers take precedence over default behavior - Cache operations only occur in non-test network mode The changes make the peer source handling more predictable and maintain proper isolation between different network modes (test, local, default).
* Fix test_safe_peers_env to verify env var peer inclusion - Assert presence of env var peer in total peer set - Remove incorrect assertion of exact peer count * Fix test_network_contacts_fallback isolation - Enable test_network mode to prevent interference from cache/endpoints - Verify exact peer count from mock server * Improve from_args implementation - Add environment variable peer handling before other sources - Use empty cache path in test network mode - Prevent cache file operations in test network mode These changes ensure proper test isolation and correct handling of peers from different sources (env vars, args, cache, endpoints) across different modes (normal, test network, local).
- prep the cache_store to write to disk on periodic interval rather than on every op - use the default config dir that is being used through out the codebase - use simple retries for network GETs rather than using complex backoff
- This also removes the `network-contact` feature flag. - The flag was used to indicate if we should connect to the mainnet or the testnet, which can easily be done with PeersArgs::testnet flag
637cbd1
to
b880cb7
Compare
05e627d
to
9af0540
Compare
9af0540
to
f2dc37f
Compare
a0e00f2
to
2f90166
Compare
the memcheck CI run has a step of
and relies on the It will be the best to refactor it to use |
37a4ae9
to
dfeac3b
Compare
5719508
to
56865c6
Compare
let config = bootstrap_cache.config().clone(); | ||
let mut old_cache = bootstrap_cache.clone(); | ||
|
||
let new = match BootstrapCacheStore::new(config) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesn't need to have a new one ?
just cotinue with the existing one shall be ok?
anyway, this depends on how the BootstrapCache handles merge
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The existing one will contain the tracked peers and will only be cleared after calling flush
. That is spawned as a separate task, so we cannot clone it as well. So need a new one here.
ant-node/src/node.rs
Outdated
@@ -160,11 +177,25 @@ impl NodeBuilder { | |||
None | |||
}; | |||
|
|||
if !self.initial_peers.is_empty() && self.bootstrap_cache.is_some() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this block (180 - 191) of code can be down in :
let initial_peers = match (self.initial_peers.is_empty(), self.bootstrap_cache) {
(true, Some(_)) => return Err(Error::InitialPeersAndBootstrapCacheSet),
(true, None) => self.initial_peers.clone(),
(false, Some(cache)) => cache.get_sorted_addrs().cloned().collect(),
(false, None) => vec![],
};
Which will be more readable ?
also, shall initial_peers
always be preferred, i.e. it shall be (true, _) => self.initial_peers.clone(),
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh.. we cannot get the initial bootstrap peers from the cache store anylonger as it won't be initialized anymore.
The loading of cache happens at a higher layer and it is fed into initial_peers
. I will modify according to this, thanks for the catch!
self.success_count = self.success_count.saturating_add(other.success_count); | ||
self.failure_count = self.failure_count.saturating_add(other.failure_count); | ||
|
||
// if at max value, reset to 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall be capped
, not reset
?
i.e. no need to take this reset step
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This won't be an issue in the real world, but for some reason if we were to reached the max for success
, then only the failures
will be tracked until it catches up with the success
, making this peer useless as it will not be reliable (success = failure = max)
So just wrap around to 0.
ant-bootstrap
crate which replaces theant-peers-acquisition
crate.bootstrap_cache_<network_id>.json
file in the data dir. It contains a map of PeerId to their addresses. We track the success/failure of each address.sync
the values.network-contacts
feature flag is now removed, and now we use a--testnet
flag to denote if we should connect to the mainnet or to a testnet.