feat: impl bootstrap cache #2487

RolandSherwin · 2024-12-03T23:47:18Z

Introduces the ant-bootstrap crate, which replaces the ant-peers-acquisition crate.
Every node now writes by default to a single bootstrap_cache_<network_id>.json file in the data dir. The file contains a map from PeerId to that peer's addresses, and we track the success/failure of each address.
These files are synced at a periodic interval. Nodes don't overwrite the file; they sync (merge) their values into it.
The network-contacts feature flag is now removed; a --testnet flag denotes whether we should connect to the mainnet or to a testnet.
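
To make the description concrete, here is a minimal sketch of a peer-to-addresses map with per-address success/failure counts and a merge-style sync. All names (`CacheData`, `AddrStats`, `record`, `sync`) and the use of plain strings and `u32` counters are assumptions for illustration, not the ant-bootstrap API.

```rust
use std::collections::HashMap;

/// Success/failure tally for one address (hypothetical field layout).
#[derive(Debug, Clone, Default, PartialEq)]
struct AddrStats {
    success_count: u32,
    failure_count: u32,
}

/// Cache contents: peer id -> (address -> stats).
/// Plain strings stand in for the real PeerId and multiaddr types.
#[derive(Debug, Clone, Default, PartialEq)]
struct CacheData {
    peers: HashMap<String, HashMap<String, AddrStats>>,
}

impl CacheData {
    /// Record the outcome of one connection attempt to `addr` of `peer`.
    fn record(&mut self, peer: &str, addr: &str, success: bool) {
        let stats = self
            .peers
            .entry(peer.to_string())
            .or_default()
            .entry(addr.to_string())
            .or_default();
        if success {
            stats.success_count = stats.success_count.saturating_add(1);
        } else {
            stats.failure_count = stats.failure_count.saturating_add(1);
        }
    }

    /// Merge another snapshot into this one instead of overwriting it.
    fn sync(&mut self, other: &CacheData) {
        for (peer, addrs) in &other.peers {
            let mine = self.peers.entry(peer.clone()).or_default();
            for (addr, stats) in addrs {
                let entry = mine.entry(addr.clone()).or_default();
                entry.success_count = entry.success_count.saturating_add(stats.success_count);
                entry.failure_count = entry.failure_count.saturating_add(stats.failure_count);
            }
        }
    }
}
```

In this sketch a node would load the on-disk data, call `sync` with its own in-memory data, and write the merged result back, so entries written by other nodes survive.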

github-advanced-security

devskim found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

Add persistent bootstrap cache to maintain a list of previously known peers, improving network bootstrapping efficiency and reducing cold-start times.

Enhance the bootstrap cache implementation with robust corruption detection and recovery mechanisms. This change ensures system resilience when the cache file becomes corrupted or invalid.

Key changes:
* Add explicit cache corruption detection and error reporting
* Implement cache rebuilding from in-memory peers or endpoints
* Use atomic file operations to prevent corruption during writes
* Improve error handling with specific error variants
* Add comprehensive test suite for corruption scenarios

The system now handles corruption by:
1. Detecting invalid/corrupted JSON data during cache reads
2. Attempting recovery using in-memory peers if available
3. Falling back to endpoint discovery if needed
4. Using atomic operations for safe cache updates

Testing:
* Add tests for various corruption scenarios
* Add concurrent access tests
* Add file operation tests
* Verify endpoint fallback behavior

- Add smarter JSON format detection by checking content structure
- Improve error handling with specific InvalidResponse variant
- Reduce unnecessary warnings by only logging invalid multiaddrs
- Simplify parsing logic to handle both JSON and plain text formats
- Add better error context for failed parsing attempts

All tests passing, including JSON endpoint and plain text format tests.
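
The corruption handling described in this commit message can be sketched roughly as follows, using only the standard library. This is an illustration under assumptions: the real cache is JSON, whereas this sketch uses a one-address-per-line stand-in format so that it needs no JSON crate, and every name here (`write_atomically`, `parse_cache`, `load_peers`, `Source`) is hypothetical.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Write to a sibling temp file, then rename it over the target, so readers
/// never observe a half-written cache file.
fn write_atomically(path: &Path, contents: &str) -> io::Result<()> {
    let tmp = path.with_extension("tmp");
    fs::write(&tmp, contents)?;
    fs::rename(&tmp, path)
}

/// Where the returned peer list came from.
#[derive(Debug, PartialEq)]
enum Source {
    CacheFile,
    InMemory,
    Endpoints,
}

/// Stand-in parser: one address per line, each starting with '/'.
/// Returns None for empty or malformed content (treated as corruption).
fn parse_cache(raw: &str) -> Option<Vec<String>> {
    let addrs: Vec<String> = raw
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(String::from)
        .collect();
    if addrs.is_empty() || addrs.iter().any(|a| !a.starts_with('/')) {
        None
    } else {
        Some(addrs)
    }
}

/// Recovery order: cache file, then in-memory peers, then endpoint discovery.
fn load_peers(
    path: &Path,
    in_memory: &[String],
    fetch_endpoints: impl Fn() -> Vec<String>,
) -> (Vec<String>, Source) {
    if let Some(addrs) = fs::read_to_string(path).ok().and_then(|raw| parse_cache(&raw)) {
        return (addrs, Source::CacheFile);
    }
    if !in_memory.is_empty() {
        return (in_memory.to_vec(), Source::InMemory);
    }
    (fetch_endpoints(), Source::Endpoints)
}
```

A later commit in this PR switches to an atomic write crate; the temp-file-plus-rename approach above is just the simplest way to show the idea.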
feat(bootstrap_cache): implement circuit breaker with exponential backoff

- Add CircuitBreakerConfig with configurable parameters for failures and timeouts
- Implement circuit breaker states (closed, open, half-open) with state transitions
- Add exponential backoff for failed request retries
- Update InitialPeerDiscovery to support custom circuit breaker configuration
- Add comprehensive test suite with shorter timeouts for faster testing

This change improves system resilience by preventing cascading failures and reducing load on failing endpoints through intelligent retry mechanisms.
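
As an illustration of the closed / open / half-open states and exponential backoff named in this commit message, here is a small standard-library sketch. The field names in `CircuitBreakerConfig` and the exact transition rules are assumptions; only the type name and the three states come from the commit message. Time is passed in explicitly so the logic is easy to test.

```rust
use std::time::{Duration, Instant};

/// Hypothetical configuration fields.
#[derive(Debug, Clone)]
struct CircuitBreakerConfig {
    max_failures: u32,
    base_delay: Duration,
    max_delay: Duration,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    Closed,
    Open { until: Instant },
    HalfOpen,
}

struct CircuitBreaker {
    config: CircuitBreakerConfig,
    state: State,
    failures: u32, // consecutive failures while closed
    trips: u32,    // consecutive times the breaker has opened
}

impl CircuitBreaker {
    fn new(config: CircuitBreakerConfig) -> Self {
        Self { config, state: State::Closed, failures: 0, trips: 0 }
    }

    /// Backoff after the current trip: base * 2^(trips - 1), capped at max_delay.
    fn backoff(&self) -> Duration {
        let exp = self.trips.saturating_sub(1).min(16);
        self.config.base_delay.saturating_mul(1u32 << exp).min(self.config.max_delay)
    }

    /// Whether a request may be attempted at `now`.
    fn allow(&mut self, now: Instant) -> bool {
        if let State::Open { until } = self.state {
            if now < until {
                return false;
            }
            // Backoff elapsed: let one probe request through.
            self.state = State::HalfOpen;
        }
        true
    }

    fn record_success(&mut self) {
        self.state = State::Closed;
        self.failures = 0;
        self.trips = 0;
    }

    fn record_failure(&mut self, now: Instant) {
        self.failures = self.failures.saturating_add(1);
        if self.state == State::HalfOpen || self.failures >= self.config.max_failures {
            self.trips = self.trips.saturating_add(1);
            self.failures = 0;
            self.state = State::Open { until: now + self.backoff() };
        }
    }
}
```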

…ork isolation

* Refactor CacheStore::from_args to handle peer sources more consistently
* Ensure test network mode is properly isolated from cache system
* Fix default behavior to use URL endpoint when no peers provided
* Add proper handling for local and first node modes
* Prevent cache operations when in test network mode

This change ensures that:
- Test network peers are isolated from cache operations
- Default behavior (no args) correctly uses URL endpoints
- Local and first node modes return empty stores
- Explicit peers take precedence over default behavior
- Cache operations only occur in non-test network mode

The changes make the peer source handling more predictable and maintain proper isolation between different network modes (test, local, default).

* Fix test_safe_peers_env to verify env var peer inclusion
  - Assert presence of env var peer in total peer set
  - Remove incorrect assertion of exact peer count
* Fix test_network_contacts_fallback isolation
  - Enable test_network mode to prevent interference from cache/endpoints
  - Verify exact peer count from mock server
* Improve from_args implementation
  - Add environment variable peer handling before other sources
  - Use empty cache path in test network mode
  - Prevent cache file operations in test network mode

These changes ensure proper test isolation and correct handling of peers from different sources (env vars, args, cache, endpoints) across different modes (normal, test network, local).
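
One way to read the peer-source rules from these two commit messages is as a small decision function. The sketch below is an interpretation, not the `from_args` implementation: the struct fields, the `PeerPlan` type and `plan_sources` are hypothetical, and explicitly configured endpoint URLs are left out for brevity.

```rust
/// Hypothetical subset of the CLI arguments.
#[derive(Debug, Clone, Default)]
struct PeersArgs {
    first: bool,
    local: bool,
    test_network: bool,
    peers: Vec<String>,
}

/// What the caller should do to obtain bootstrap peers.
#[derive(Debug, Default, PartialEq)]
struct PeerPlan {
    peers: Vec<String>,
    use_cache: bool,
    use_default_endpoint: bool,
}

fn plan_sources(args: &PeersArgs, env_peers: &[String]) -> PeerPlan {
    // Local and first-node modes start from an empty store.
    if args.local || args.first {
        return PeerPlan::default();
    }
    // Environment variable peers are handled before other sources,
    // followed by peers passed explicitly as arguments.
    let mut peers = env_peers.to_vec();
    peers.extend(args.peers.iter().cloned());
    // Test network mode is isolated from the cache and the default endpoint.
    if args.test_network {
        return PeerPlan { peers, use_cache: false, use_default_endpoint: false };
    }
    // With no peers given at all, fall back to the default URL endpoint.
    let use_default_endpoint = peers.is_empty();
    PeerPlan { peers, use_cache: true, use_default_endpoint }
}
```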

- prep the cache_store to write to disk on a periodic interval rather than on every op
- use the default config dir that is being used throughout the codebase
- use simple retries for network GETs rather than using complex backoff
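
For comparison with a backoff scheme, "simple retries" can be as small as the helper below. It is a generic sketch with a hypothetical name (`retry`); the real code performs network GETs, which are replaced here by an arbitrary fallible closure.

```rust
/// Call `op` up to `max_attempts` times (at least once), returning the first
/// success or the last error. `op` receives the 1-based attempt number.
fn retry<T, E>(max_attempts: u32, mut op: impl FnMut(u32) -> Result<T, E>) -> Result<T, E> {
    let mut attempt = 1;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            Err(err) if attempt >= max_attempts => return Err(err),
            Err(_) => attempt += 1,
        }
    }
}
```

The helper has no delay and no state between calls, which is the trade-off the commit message points at: less protection for a failing endpoint, but far less code to reason about.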

- This also removes the `network-contact` feature flag.
- The flag was used to indicate if we should connect to the mainnet or the testnet, which can easily be done with PeersArgs::testnet flag

ant-networking/src/driver.rs

maqi · 2024-12-05T16:21:27Z

The memcheck CI run has a step, "Start a node instance to be restarted", which currently uses:

./target/release/antnode --root-dir $RESTART_TEST_NODE_DATA_PATH --log-output-dest $RESTART_TEST_NODE_DATA_PATH --local --rewards-address "0x03B770D9cD32077cC0bF330c13C114a87643B124" &

and relies on the ANT_PEERS env var being set.

It would be best to refactor it to start up from the bootstrap cache shared_file, so that this work can be verified by CI directly.

maqi · 2024-12-06T11:38:11Z

ant-networking/src/driver.rs

+                    let config = bootstrap_cache.config().clone();
+                    let mut old_cache = bootstrap_cache.clone();
+
+                    let new = match BootstrapCacheStore::new(config) {


Does this need a new one?
Just continuing with the existing one should be OK?

Anyway, this depends on how the BootstrapCache handles merge.

The existing one will contain the tracked peers and will only be cleared after calling flush. That is spawned as a separate task, so we cannot clone it as well. So we need a new one here.
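
A rough sketch of why a fresh store is needed: the old store, with its tracked peers, is handed to a background flush while the caller keeps working with an empty one built from the same config. Everything here is a stand-in (`Store`, `Config`, `rotate`, a thread instead of an async task), and it uses `std::mem::replace` rather than the clone shown in the diff.

```rust
use std::collections::HashMap;
use std::thread;

#[derive(Debug, Clone, Default)]
struct Config {
    max_peers: usize,
}

/// Stand-in for the bootstrap cache store.
#[derive(Debug, Clone, Default)]
struct Store {
    config: Config,
    peers: HashMap<String, Vec<String>>,
}

impl Store {
    fn new(config: Config) -> Self {
        Store { config, peers: HashMap::new() }
    }

    fn config(&self) -> &Config {
        &self.config
    }

    /// Stand-in for syncing to disk; returns how many peers were written.
    fn flush(self) -> usize {
        self.peers.len()
    }
}

/// Swap in an empty store and flush the old one on a separate thread.
fn rotate(live: &mut Store) -> thread::JoinHandle<usize> {
    let fresh = Store::new(live.config().clone());
    let old = std::mem::replace(live, fresh);
    thread::spawn(move || old.flush())
}
```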

ant-networking/src/event/swarm.rs

maqi · 2024-12-06T11:56:41Z

ant-node/src/node.rs

@@ -160,11 +177,25 @@ impl NodeBuilder {
            None
        };

+        if !self.initial_peers.is_empty() && self.bootstrap_cache.is_some() {


This block of code (180 - 191) can be written as:

    let initial_peers = match (self.initial_peers.is_empty(), self.bootstrap_cache) {
        (true, Some(_)) => return Err(Error::InitialPeersAndBootstrapCacheSet),
        (true, None) => self.initial_peers.clone(),
        (false, Some(cache)) => cache.get_sorted_addrs().cloned().collect(),
        (false, None) => vec![],
    };

Which would be more readable?

Also, should initial_peers always be preferred, i.e. should it be (true, _) => self.initial_peers.clone(), ?

Ahh.. we cannot get the initial bootstrap peers from the cache store any longer, as it won't be initialized anymore.
The loading of the cache happens at a higher layer and is fed into initial_peers. I will modify according to this, thanks for the catch!
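
For readers following the tuple-match suggestion, here is a self-contained sketch of that shape. It assumes the first tuple element is meant as "initial peers are present" (the negation of `is_empty()`), so that the error arm lines up with the `!self.initial_peers.is_empty() && self.bootstrap_cache.is_some()` check in the diff. `Cache`, `Error` and `resolve_initial_peers` are stand-ins, and since the thread concludes that the cache is loaded at a higher layer, this only illustrates the pattern, not the merged code.

```rust
#[derive(Debug, PartialEq)]
enum Error {
    InitialPeersAndBootstrapCacheSet,
}

/// Stand-in for the bootstrap cache; only holds already-sorted addresses.
struct Cache {
    addrs: Vec<String>,
}

fn resolve_initial_peers(
    initial_peers: &[String],
    cache: Option<&Cache>,
) -> Result<Vec<String>, Error> {
    let has_peers = !initial_peers.is_empty();
    match (has_peers, cache) {
        // Both sources set: ambiguous, so reject.
        (true, Some(_)) => Err(Error::InitialPeersAndBootstrapCacheSet),
        (true, None) => Ok(initial_peers.to_vec()),
        (false, Some(cache)) => Ok(cache.addrs.clone()),
        (false, None) => Ok(Vec::new()),
    }
}
```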

maqi · 2024-12-06T12:39:49Z

ant-bootstrap/src/lib.rs

+        self.success_count = self.success_count.saturating_add(other.success_count);
+        self.failure_count = self.failure_count.saturating_add(other.failure_count);
+
+        // if at max value, reset to 0


Should this be capped, not reset?
i.e. no need to take this reset step

This won't be an issue in the real world, but if for some reason we were to reach the max for success, then only the failures would be tracked until they catch up with the successes, making this peer's record useless, as it would no longer look reliable (success = failure = max).

So we just wrap around to 0.
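
A minimal sketch of the merge-then-reset behaviour discussed in this thread. The `u32` counter type, the `AddrCounts` name, and the choice to reset both counters together when either one hits the maximum are assumptions; the diff only shows the two `saturating_add` lines and the "reset to 0" comment.

```rust
#[derive(Debug, Clone, Default, PartialEq)]
struct AddrCounts {
    success_count: u32,
    failure_count: u32,
}

impl AddrCounts {
    /// Merge counts from another copy of the same address record.
    fn sync(&mut self, other: &AddrCounts) {
        self.success_count = self.success_count.saturating_add(other.success_count);
        self.failure_count = self.failure_count.saturating_add(other.failure_count);

        // If either counter is pinned at the max, start over so the
        // success/failure ratio keeps carrying information.
        if self.success_count == u32::MAX || self.failure_count == u32::MAX {
            self.success_count = 0;
            self.failure_count = 0;
        }
    }
}
```

With a plain cap instead of a reset, a record stuck at `u32::MAX` successes could only drift toward looking unreliable, which is the scenario described in the reply above.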

RolandSherwin force-pushed the bootstrap_cache_3 branch 2 times, most recently from a6576cd to 772ef1d December 4, 2024 12:17

b-zee reviewed Dec 4, 2024

ant-peers-acquisition/Cargo.toml (outdated, resolved)

Justfile (outdated, resolved)

b-zee reviewed Dec 4, 2024

.gitignore (resolved)

RolandSherwin force-pushed the bootstrap_cache_3 branch from 772ef1d to 0ce16f5 December 4, 2024 14:06

jacderida reviewed Dec 4, 2024

ant-peers-acquisition/Cargo.toml (outdated, resolved)

RolandSherwin force-pushed the bootstrap_cache_3 branch from 0ce16f5 to 3f4b9d6 December 4, 2024 18:32

RolandSherwin marked this pull request as ready for review December 4, 2024 18:32

RolandSherwin force-pushed the bootstrap_cache_3 branch from 3f4b9d6 to f949029 December 4, 2024 18:32

github-advanced-security bot found potential problems Dec 4, 2024

RolandSherwin force-pushed the bootstrap_cache_3 branch 5 times, most recently from d53f094 to a073526 December 4, 2024 21:32

dirvine and others added 13 commits December 5, 2024 03:02

chore: update readme

af2c35f

fix(bootstrap): remove rwlock from the store

45c26ff

feat(bootstrap): wrap the counts when reaching the max bounds

f3f7220

fix(bootstrap): couple more tiny fixes

f5af65e

feat(bootstrap): store multiple multiaddr per peer

62fe748

feat(bootstrap): isolate code into their own modules based on their purpose

1ce7f63

feat(bootstrap): impl bootstrap cache into the codebase

460bc67

feat: remove ant-peers-acquisition and use ant-bootstrap instead

65e2170

- This also removes the `network-contact` feature flag.
- The flag was used to indicate if we should connect to the mainnet or the testnet, which can easily be done with PeersArgs::testnet flag

fix(bootstrap): use env tempdir for atomic write

f8bb46f

RolandSherwin force-pushed the bootstrap_cache_3 branch 2 times, most recently from 637cbd1 to b880cb7 December 4, 2024 21:49

RolandSherwin force-pushed the bootstrap_cache_3 branch 4 times, most recently from 05e627d to 9af0540 December 4, 2024 22:01

RolandSherwin mentioned this pull request Dec 4, 2024

feat(networking): add bootstrap cache for peer discovery #2459

Closed

fix(bootstrap): make it wasm compatible

d8f3ac7

RolandSherwin force-pushed the bootstrap_cache_3 branch from 9af0540 to f2dc37f December 5, 2024 10:03

maqi requested changes Dec 5, 2024

ant-networking/src/driver.rs (outdated, resolved)

RolandSherwin force-pushed the bootstrap_cache_3 branch 2 times, most recently from a0e00f2 to 2f90166 December 5, 2024 10:18

RolandSherwin added 2 commits December 5, 2024 15:54

fix(bootstrap): use atomic write crate and remove locks

cd44b1b

feat(ci): enable bootstrap tests

d89d6d2

RolandSherwin added 2 commits December 5, 2024 23:11

feat(bootstrap): rework the api to not hold persistant state

c70ee45

chore(bootstrap): remove components related to serving the json

dfeac3b

RolandSherwin force-pushed the bootstrap_cache_3 branch from 37a4ae9 to dfeac3b December 5, 2024 17:58

github-advanced-security bot found potential problems Dec 5, 2024

RolandSherwin added the DoNotMerge label Dec 6, 2024

feat(bootstrap): write bootstrap cache from the clients

56865c6

RolandSherwin force-pushed the bootstrap_cache_3 branch from 5719508 to 56865c6 December 6, 2024 10:40

chore(antctl): use PeersArg::local instead of a separate arg

469e496

maqi reviewed Dec 6, 2024

maqi approved these changes Dec 6, 2024

chore: update based on comments

7bbccf5

RolandSherwin removed the DoNotMerge label Dec 6, 2024

RolandSherwin added this pull request to the merge queue Dec 6, 2024

Merged via the queue into maidsafe:main with commit 9c2433b Dec 6, 2024
27 of 28 checks passed

RolandSherwin deleted the bootstrap_cache_3 branch December 6, 2024 21:07
