Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: impl bootstrap cache #2487

Merged
merged 21 commits into from
Dec 6, 2024

Conversation

RolandSherwin
Copy link
Member

@RolandSherwin RolandSherwin commented Dec 3, 2024

  • Introduces ant-bootstrap crate which replaces the ant-peers-acquisition crate.
  • Every node now by default writes to a single bootstrap_cache_<network_id>.json file in the data dir. It contains a map of PeerId to their addresses. We track the success/failure of each address.
  • These files are synced on periodic interval. Nodes don't overwrite, but instead sync the values.
  • network-contacts feature flag is now removed, and now we use a --testnet flag to denote if we should connect to the mainnet or to a testnet.

@RolandSherwin RolandSherwin force-pushed the bootstrap_cache_3 branch 2 times, most recently from a6576cd to 772ef1d Compare December 4, 2024 12:17
ant-peers-acquisition/Cargo.toml Outdated Show resolved Hide resolved
Justfile Outdated Show resolved Hide resolved
.gitignore Show resolved Hide resolved
@RolandSherwin RolandSherwin marked this pull request as ready for review December 4, 2024 18:32
Copy link

@github-advanced-security github-advanced-security bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

devskim found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

@RolandSherwin RolandSherwin force-pushed the bootstrap_cache_3 branch 5 times, most recently from d53f094 to a073526 Compare December 4, 2024 21:32
dirvine and others added 13 commits December 5, 2024 03:02
Add persistent bootstrap cache to maintain a list of previously known peers,
improving network bootstrapping efficiency and reducing cold-start times.

Enhance the bootstrap cache implementation with robust corruption detection
and recovery mechanisms. This change ensures system resilience when the
cache file becomes corrupted or invalid.

Key changes:
* Add explicit cache corruption detection and error reporting
* Implement cache rebuilding from in-memory peers or endpoints
* Use atomic file operations to prevent corruption during writes
* Improve error handling with specific error variants
* Add comprehensive test suite for corruption scenarios

The system now handles corruption by:
1. Detecting invalid/corrupted JSON data during cache reads
2. Attempting recovery using in-memory peers if available
3. Falling back to endpoint discovery if needed
4. Using atomic operations for safe cache updates

Testing:
* Add tests for various corruption scenarios
* Add concurrent access tests
* Add file operation tests
* Verify endpoint fallback behavior

- Add smarter JSON format detection by checking content structure
- Improve error handling with specific InvalidResponse variant
- Reduce unnecessary warnings by only logging invalid multiaddrs
- Simplify parsing logic to handle both JSON and plain text formats
- Add better error context for failed parsing attempts

All tests passing, including JSON endpoint and plain text format tests.

feat(bootstrap_cache): implement circuit breaker with exponential backoff

- Add CircuitBreakerConfig with configurable parameters for failures and timeouts
- Implement circuit breaker states (closed, open, half-open) with state transitions
- Add exponential backoff for failed request retries
- Update InitialPeerDiscovery to support custom circuit breaker configuration
- Add comprehensive test suite with shorter timeouts for faster testing

This change improves system resilience by preventing cascading failures and
reducing load on failing endpoints through intelligent retry mechanisms.
…ork isolation

* Refactor CacheStore::from_args to handle peer sources more consistently
* Ensure test network mode is properly isolated from cache system
* Fix default behavior to use URL endpoint when no peers provided
* Add proper handling for local and first node modes
* Prevent cache operations when in test network mode

This change ensures that:
- Test network peers are isolated from cache operations
- Default behavior (no args) correctly uses URL endpoints
- Local and first node modes return empty stores
- Explicit peers take precedence over default behavior
- Cache operations only occur in non-test network mode

The changes make the peer source handling more predictable and maintain
proper isolation between different network modes (test, local, default).
* Fix test_safe_peers_env to verify env var peer inclusion
  - Assert presence of env var peer in total peer set
  - Remove incorrect assertion of exact peer count

* Fix test_network_contacts_fallback isolation
  - Enable test_network mode to prevent interference from cache/endpoints
  - Verify exact peer count from mock server

* Improve from_args implementation
  - Add environment variable peer handling before other sources
  - Use empty cache path in test network mode
  - Prevent cache file operations in test network mode

These changes ensure proper test isolation and correct handling of peers
from different sources (env vars, args, cache, endpoints) across different
modes (normal, test network, local).
- prep the cache_store to write to disk on periodic interval rather than
  on every op
- use the default config dir that is being used through out the codebase
- use simple retries for network GETs rather than using complex backoff
- This also removes the `network-contact` feature flag.
- The flag was used to indicate if we should connect to the mainnet or
  the testnet, which can easily be done with PeersArgs::testnet flag
@RolandSherwin RolandSherwin force-pushed the bootstrap_cache_3 branch 2 times, most recently from 637cbd1 to b880cb7 Compare December 4, 2024 21:49
ant-networking/src/driver.rs Outdated Show resolved Hide resolved
@RolandSherwin RolandSherwin force-pushed the bootstrap_cache_3 branch 2 times, most recently from a0e00f2 to 2f90166 Compare December 5, 2024 10:18
@maqi
Copy link
Member

maqi commented Dec 5, 2024

the memcheck CI run has a step of Start a node instance to be restarted , which currently using :

./target/release/antnode --root-dir $RESTART_TEST_NODE_DATA_PATH --log-output-dest $RESTART_TEST_NODE_DATA_PATH --local --rewards-address "0x03B770D9cD32077cC0bF330c13C114a87643B124" &

and relies on the ANT_PEERS env to be set.

It will be the best to refactor it to use bootstrap cache shared_file to startup with, so that the work can be verified by CI directly.

ant-bootstrap/src/contacts.rs Dismissed Show dismissed Hide dismissed
ant-bootstrap/src/contacts.rs Dismissed Show dismissed Hide dismissed
ant-bootstrap/src/contacts.rs Dismissed Show dismissed Hide dismissed
ant-bootstrap/src/contacts.rs Dismissed Show dismissed Hide dismissed
ant-bootstrap/src/contacts.rs Dismissed Show dismissed Hide dismissed
ant-bootstrap/tests/address_format_tests.rs Dismissed Show dismissed Hide dismissed
ant-bootstrap/tests/address_format_tests.rs Dismissed Show dismissed Hide dismissed
ant-bootstrap/tests/cache_tests.rs Dismissed Show dismissed Hide dismissed
ant-bootstrap/tests/cli_integration_tests.rs Dismissed Show dismissed Hide dismissed
ant-bootstrap/tests/cli_integration_tests.rs Dismissed Show dismissed Hide dismissed
ant-networking/src/driver.rs Outdated Show resolved Hide resolved
let config = bootstrap_cache.config().clone();
let mut old_cache = bootstrap_cache.clone();

let new = match BootstrapCacheStore::new(config) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't need to have a new one ?
just cotinue with the existing one shall be ok?

anyway, this depends on how the BootstrapCache handles merge.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The existing one will contain the tracked peers and will only be cleared after calling flush. That is spawned as a separate task, so we cannot clone it as well. So need a new one here.

ant-networking/src/event/swarm.rs Outdated Show resolved Hide resolved
@@ -160,11 +177,25 @@ impl NodeBuilder {
None
};

if !self.initial_peers.is_empty() && self.bootstrap_cache.is_some() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this block (180 - 191) of code can be down in :

let initial_peers = match (self.initial_peers.is_empty(), self.bootstrap_cache) {
    (true, Some(_)) => return Err(Error::InitialPeersAndBootstrapCacheSet),
    (true, None) => self.initial_peers.clone(),
    (false, Some(cache)) => cache.get_sorted_addrs().cloned().collect(),
    (false, None) => vec![],
};

Which will be more readable ?

also, shall initial_peers always be preferred, i.e. it shall be (true, _) => self.initial_peers.clone(), ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh.. we cannot get the initial bootstrap peers from the cache store anylonger as it won't be initialized anymore.
The loading of cache happens at a higher layer and it is fed into initial_peers. I will modify according to this, thanks for the catch!

self.success_count = self.success_count.saturating_add(other.success_count);
self.failure_count = self.failure_count.saturating_add(other.failure_count);

// if at max value, reset to 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall be capped, not reset ?
i.e. no need to take this reset step

Copy link
Member Author

@RolandSherwin RolandSherwin Dec 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't be an issue in the real world, but for some reason if we were to reached the max for success, then only the failures will be tracked until it catches up with the success, making this peer useless as it will not be reliable (success = failure = max)

So just wrap around to 0.

@RolandSherwin RolandSherwin added this pull request to the merge queue Dec 6, 2024
Merged via the queue into maidsafe:main with commit 9c2433b Dec 6, 2024
27 of 28 checks passed
@RolandSherwin RolandSherwin deleted the bootstrap_cache_3 branch December 6, 2024 21:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants