Add ready time and priority to mempool messages for better metric calculation #14037

Merged
merged 25 commits into from
Jul 24, 2024

vusirikala
Contributor

Description

Type of Change

  • New feature
  • Bug fix
  • Breaking change
  • Performance improvement
  • Refactoring
  • Dependency update
  • Documentation update
  • Tests

Which Components or Systems Does This Change Impact?

  • Validator Node
  • Full Node (API, Indexer, etc.)
  • Move/Aptos Virtual Machine
  • Aptos Framework
  • Aptos CLI/SDK
  • Developer Infrastructure
  • Other (specify)

How Has This Been Tested?

Key Areas to Review

Checklist

  • I have read and followed the CONTRIBUTING doc
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I identified and added all stakeholders and component owners affected by this change as reviewers
  • I tested both happy and unhappy path of the functionality
  • I have made corresponding changes to the documentation

trunk-io bot commented Jul 18, 2024

⏱️ 33h 37m total CI duration on this PR
Job Cumulative Duration Recent Runs
execution-performance / single-node-performance 7h 44m 🟩🟩🟩🟩🟩 (+17 more)
test-fuzzers 6h 7m 🟩🟩🟩🟥🟩 (+3 more)
forge-e2e-test / forge 5h 14m 🟥🟥🟥🟩🟩 (+13 more)
forge-compat-test / forge 4h 14m 🟩🟩🟩🟩🟩 (+12 more)
execution-performance / test-target-determinator 1h 40m 🟩🟩🟩🟩🟩 (+17 more)
test-target-determinator 1h 31m 🟩🟩🟩🟩🟩 (+17 more)
check 1h 8m 🟩🟩🟩🟩🟩 (+16 more)
general-lints 40m 🟩🟩🟩🟩🟩 (+18 more)
rust-cargo-deny 40m 🟩🟩🟩🟩🟩 (+18 more)
check-dynamic-deps 31m 🟩🟩🟩🟩🟩 (+20 more)
indexer-grpc-e2e-tests / test-indexer-grpc-docker-compose 27m 🟩🟩🟩🟩🟩 (+12 more)
forge-framework-upgrade-test / forge 18m 🟩
rust-doc-tests 15m 🟩
rust-doc-tests 14m 🟩
semgrep/ci 10m 🟩🟩🟩🟩🟩 (+20 more)
rust-doc-tests 6m 🟩
rust-move-tests 6m 🟩
rust-move-tests 6m 🟩
rust-doc-tests 6m 🟩
rust-doc-tests 6m 🟩
rust-move-tests 6m 🟩
permission-check 6m 🟩🟩🟩🟩🟩 (+21 more)
rust-doc-tests 6m 🟩
rust-doc-tests 6m 🟩
rust-doc-tests 6m 🟩
rust-doc-tests 6m 🟩
rust-move-tests 6m 🟩
rust-doc-tests 6m 🟩
file_change_determinator 5m 🟩🟩🟩🟩🟩 (+20 more)
rust-doc-tests 5m
rust-doc-tests 5m 🟩
file_change_determinator 5m 🟩🟩🟩🟩🟩 (+19 more)
file_change_determinator 4m 🟩🟩🟩🟩🟩 (+17 more)
rust-move-tests 3m 🟩
rust-doc-tests 3m 🟥
rust-move-tests 3m 🟩
rust-move-tests 3m 🟩
rust-move-tests 3m 🟩
rust-move-tests 3m 🟩
rust-move-tests 3m 🟩
rust-move-tests 3m 🟩
rust-move-tests 3m 🟩
rust-move-tests 3m 🟩
rust-move-tests 3m 🟩
rust-move-tests 3m 🟩
rust-move-tests 3m 🟩
rust-move-tests 3m 🟩
rust-move-tests 3m
rust-move-tests 3m 🟩
rust-doc-tests 2m
rust-move-tests 2m 🟩
rust-move-tests 2m 🟩
rust-move-tests 2m 🟩
permission-check 2m 🟩🟩🟩🟩🟩 (+20 more)
adhoc-forge-test / forge 1m 🟥
permission-check 1m 🟩🟩🟩🟩🟩 (+21 more)
permission-check 1m 🟩🟩🟩🟩🟩 (+20 more)
permission-check 1m 🟩🟩🟩🟩🟩 (+17 more)
determine-docker-build-metadata 49s 🟩🟩🟩🟩🟩 (+17 more)
rust-move-tests 14s
rust-doc-tests 8s
Backport PR 6s 🟥🟥
permission-check 4s 🟩🟩
determine-forge-run-metadata 2s 🟩
rust-move-tests 1s

🚨 3 jobs on the last run were significantly faster/slower than expected

Job Duration vs 7d avg Delta
forge-framework-upgrade-test / forge 18m 12m +51%
execution-performance / single-node-performance 22m 18m +25%
test-fuzzers 46m 38m +21%

@vusirikala vusirikala requested review from sitalkedia and removed request for gregnazario and JoshLind July 18, 2024 01:29
@vusirikala vusirikala requested a review from igor-aptos July 18, 2024 01:36
@vusirikala vusirikala added the CICD:run-e2e-tests when this label is present github actions will run all land-blocking e2e tests from the PR label Jul 18, 2024

@vusirikala vusirikala marked this pull request as draft July 18, 2024 02:32

/// Uses the MempoolSyncMessageV2 instead of MempoolSyncMessage when sending mempool transactions
/// to upstream nodes.
pub use_mempool_sync_message_v2: bool,

Contributor

How about renaming the new message type from BroadcastTransactionsRequestV2 to BroadcastTransactionsRequestWithReadyTime, or something similarly descriptive (since we're not going to deprecate "v1")? Maybe this flag can be include_ready_time_in_broadcast?

Contributor Author

Done.
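
A minimal sketch of the idea discussed above: keep the original broadcast message and add a second variant that carries ready times, selected by a config flag. The names BroadcastTransactionsRequestWithReadyTime and include_ready_time_in_broadcast come from the review comment; the MempoolMessage enum, the Txn stand-in type, the field names, and build_broadcast are hypothetical and are not taken from the PR's code.

```rust
// Hypothetical stand-in for SignedTransaction, so the sketch is self-contained.
#[derive(Clone, Debug, PartialEq)]
struct Txn(Vec<u8>);

// Hypothetical message enum: the original variant stays untouched, and the new
// variant pairs each transaction with its ready time.
#[derive(Debug, PartialEq)]
enum MempoolMessage {
    BroadcastTransactionsRequest {
        transactions: Vec<Txn>,
    },
    BroadcastTransactionsRequestWithReadyTime {
        // Ready time is expressed in milliseconds since the Unix epoch.
        transactions: Vec<(Txn, u64)>,
    },
}

// Builds the message variant selected by the (assumed) config flag.
fn build_broadcast(
    batch: Vec<(Txn, u64)>,
    include_ready_time_in_broadcast: bool,
) -> MempoolMessage {
    if include_ready_time_in_broadcast {
        MempoolMessage::BroadcastTransactionsRequestWithReadyTime { transactions: batch }
    } else {
        // Peers that only understand the original message never see the ready times.
        MempoolMessage::BroadcastTransactionsRequest {
            transactions: batch.into_iter().map(|(txn, _)| txn).collect(),
        }
    }
}

fn main() {
    let batch = vec![(Txn(vec![1, 2, 3]), 1_721_000_000_000)];
    println!("{:?}", build_broadcast(batch.clone(), true));
    println!("{:?}", build_broadcast(batch, false));
}
```

Keeping both variants side by side, rather than versioning one message, matches the reviewer's point that the original message is not being deprecated.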

@@ -435,7 +435,7 @@ impl Default for AptosDataClientConfig {
data_multi_fetch_config: AptosDataMultiFetchConfig::default(),
ignore_low_score_peers: true,
latency_filtering_config: AptosLatencyFilteringConfig::default(),
- latency_monitor_loop_interval_ms: 100,
+ latency_monitor_loop_interval_ms: 10,

Contributor

What is this change? Is this required in this PR?

Contributor Author

This was meant to improve the accuracy of state sync metric collection. It is not relevant to this PR, so I'm removing it.

let status = self.transactions.insert(txn_info);
let now = now

Contributor

Use aptos_infallible::duration_since_epoch().as_millis() to be consistent with the rest of the file?

Contributor Author

Done.
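
For readers unfamiliar with the helper mentioned above, here is a std-only sketch of what a "milliseconds since the Unix epoch" helper can look like. It is a stand-in written for illustration, not the aptos_infallible implementation; the function names millis_since_epoch_at and now_millis are hypothetical.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

// Converts an absolute timestamp to milliseconds since the Unix epoch.
// Panics if the timestamp is before the epoch, which is treated as a broken clock.
fn millis_since_epoch_at(t: SystemTime) -> u64 {
    t.duration_since(UNIX_EPOCH)
        .expect("system time is before the Unix epoch")
        .as_millis() as u64
}

// Convenience wrapper for the current wall-clock time.
fn now_millis() -> u64 {
    millis_since_epoch_at(SystemTime::now())
}

fn main() {
    println!("now: {} ms since epoch", now_millis());
}
```

Centralising the conversion in one helper keeps the panic message and the millisecond unit consistent across call sites, which is the consistency the reviewer asks for.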

.as_millis() as u64;

// TODO: Remove this before landing
info!("txn added to mempool: {} {} status {}, priority {:?}, client_submitted {}, now: {:?}, inserted_at_sender {:?}, time_since: {:?}", txn.sender(), txn.sequence_number(), status, priority.clone(), client_submitted, now, ready_time_at_sender, Duration::from_millis(now.saturating_sub(ready_time_at_sender.unwrap_or(0))));

Contributor

Let's make sure to remove this

Contributor Author

Done.
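
The temporary log line above derives a "time since ready at sender" value with saturating_sub. A minimal sketch of that calculation follows, assuming both timestamps are milliseconds since the Unix epoch; the function name time_since_ready is hypothetical. Unlike the log line, which falls back to 0 when the sender's ready time is missing, this sketch returns None in that case so that a missing value is not reported as a very large latency.

```rust
use std::time::Duration;

// Returns how long ago the transaction became ready at the sender, if known.
// saturating_sub clamps to zero when the sender's clock is ahead of ours.
fn time_since_ready(now_ms: u64, ready_time_at_sender_ms: Option<u64>) -> Option<Duration> {
    ready_time_at_sender_ms.map(|ready_ms| Duration::from_millis(now_ms.saturating_sub(ready_ms)))
}

fn main() {
    println!("{:?}", time_since_ready(1_000, Some(400)));
    println!("{:?}", time_since_ready(400, Some(1_000)));
    println!("{:?}", time_since_ready(1_000, None));
}
```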

- transactions: Vec<SignedTransaction>,
+ // For each transaction, we include the ready time in millis since epoch
+ transactions: Vec<(SignedTransaction, u64)>,
+ use_mempool_sync_message_v2: bool,

Contributor

Do we need to pass this? We should be able to read from self.mempool_config

Contributor Author

Done.

@@ -592,13 +592,14 @@ impl TransactionStore {
if batch_total_bytes.saturating_add(transaction_bytes) > self.max_batch_bytes {
break; // The batch is full
} else {
- batch.push((txn.txn.clone(), txn.insertion_info.insertion_time.duration_since(UNIX_EPOCH).expect("Failed to determine absolute unix time based on given duration")
+ batch.push((txn.txn.clone(), txn.insertion_info.ready_time.duration_since(UNIX_EPOCH).expect("Failed to determine absolute unix time based on given duration")

Contributor

Use aptos_infallible::duration_since_epoch_at(txn.insertion_info.ready_time)

Contributor Author

Done.
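
The hunk above fills a broadcast batch until a byte limit is reached and attaches each transaction's ready time. A simplified, self-contained sketch of that loop follows. MempoolTxn, its fields, and build_batch are hypothetical stand-ins, raw bytes replace SignedTransaction, and the epoch conversion uses std directly instead of the helper suggested in the review.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

// Hypothetical stand-in for a mempool entry: raw bytes plus the ready time.
struct MempoolTxn {
    bytes: Vec<u8>,
    ready_time: SystemTime,
}

// Collects (transaction, ready time in millis since epoch) pairs until adding
// the next transaction would exceed max_batch_bytes.
fn build_batch(txns: &[MempoolTxn], max_batch_bytes: u64) -> Vec<(Vec<u8>, u64)> {
    let mut batch = Vec::new();
    let mut batch_total_bytes: u64 = 0;
    for txn in txns {
        let transaction_bytes = txn.bytes.len() as u64;
        if batch_total_bytes.saturating_add(transaction_bytes) > max_batch_bytes {
            break; // The batch is full
        }
        let ready_ms = txn
            .ready_time
            .duration_since(UNIX_EPOCH)
            .expect("ready time is before the Unix epoch")
            .as_millis() as u64;
        batch.push((txn.bytes.clone(), ready_ms));
        batch_total_bytes = batch_total_bytes.saturating_add(transaction_bytes);
    }
    batch
}

fn main() {
    let txn = |len: usize, ms: u64| MempoolTxn {
        bytes: vec![0; len],
        ready_time: UNIX_EPOCH + Duration::from_millis(ms),
    };
    // Three 4-byte transactions with an 8-byte limit: only the first two fit.
    let batch = build_batch(&[txn(4, 10), txn(4, 20), txn(4, 30)], 8);
    println!("batched {} transactions", batch.len());
}
```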

@@ -640,7 +641,7 @@ impl TransactionStore {
self.transactions
.get(account)
.and_then(|txns| txns.get(sequence_number))
- .map(|txn| (txn.txn.clone(), txn.insertion_info.insertion_time.duration_since(UNIX_EPOCH).expect("Failed to determine absolute unix time based on given duration")
+ .map(|txn| (txn.txn.clone(), txn.insertion_info.ready_time.duration_since(UNIX_EPOCH).expect("Failed to determine absolute unix time based on given duration")

Contributor

same here

Contributor Author

Done.

@@ -60,9 +60,9 @@ pub struct MempoolConfig {
pub broadcast_buckets: Vec<u64>,
pub eager_expire_threshold_ms: Option<u64>,
pub eager_expire_time_ms: u64,
- /// Uses the MempoolSyncMessageV2 instead of MempoolSyncMessage when sending mempool transactions
+ /// Uses the BroadcastTransactionsRequestWithReadyTime instead of MempoolSyncMessage when sending mempool transactions

Contributor

nit: remove MempoolSyncMessage here. The comment should read "BroadcastTransactionsRequestWithReadyTime instead of BroadcastTransactionsRequest".

@@ -603,11 +603,12 @@ fn k8s_test_suite() -> ForgeConfig {
fn get_land_blocking_test(
test_name: &str,
duration: Duration,
- test_cmd: &TestCommand,
+ _test_cmd: &TestCommand,

Contributor

Don't forget to revert the forge changes before landing

Contributor

@sitalkedia sitalkedia left a comment

LGTM

@vusirikala vusirikala enabled auto-merge (squash) July 24, 2024 18:52

Contributor

✅ Forge suite realistic_env_max_load success on 71c5af9da3f392b4613bfb72bc954c6523bcc06d

two traffics test: inner traffic : committed: 9448.383201876699 txn/s, submitted: 9542.799152459089 txn/s, failed submission: 25.2484226744738 txn/s, expired: 94.41595058238906 txn/s, latency: 16498.459245852657 ms, (p50: 13900 ms, p90: 29000 ms, p99: 53100 ms), latency samples: 3592481
two traffics test : committed: 100.01331874001175 txn/s, latency: 1988.412 ms, (p50: 2000 ms, p90: 2300 ms, p99: 2900 ms), latency samples: 2000
Latency breakdown for phase 0: ["QsBatchToPos: max: 0.257, avg: 0.214", "QsPosToProposal: max: 1.402, avg: 1.182", "ConsensusProposalToOrdered: max: 0.313, avg: 0.291", "ConsensusOrderedToCommit: max: 0.397, avg: 0.382", "ConsensusProposalToCommit: max: 0.689, avg: 0.673"]
Max round gap was 1 [limit 4] at version 1836820. Max no progress secs was 5.827482 [limit 15] at version 1836820.
Test Ok

Contributor

✅ Forge suite compat success on 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> 71c5af9da3f392b4613bfb72bc954c6523bcc06d

Compatibility test results for 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> 71c5af9da3f392b4613bfb72bc954c6523bcc06d (PR)
1. Check liveness of validators at old version: 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5
compatibility::simple-validator-upgrade::liveness-check : committed: 6539.3324937332445 txn/s, latency: 5006.755946907904 ms, (p50: 4200 ms, p90: 6800 ms, p99: 25600 ms), latency samples: 269720
2. Upgrading first Validator to new version: 71c5af9da3f392b4613bfb72bc954c6523bcc06d
compatibility::simple-validator-upgrade::single-validator-upgrading : committed: 6425.939287638595 txn/s, latency: 4153.918636643099 ms, (p50: 4700 ms, p90: 5400 ms, p99: 5600 ms), latency samples: 130120
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 6817.512016122267 txn/s, latency: 4625.902609172146 ms, (p50: 4800 ms, p90: 6300 ms, p99: 6900 ms), latency samples: 237240
3. Upgrading rest of first batch to new version: 71c5af9da3f392b4613bfb72bc954c6523bcc06d
compatibility::simple-validator-upgrade::half-validator-upgrading : committed: 6969.802306974917 txn/s, latency: 3963.4912078694524 ms, (p50: 4500 ms, p90: 4900 ms, p99: 5000 ms), latency samples: 131140
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 6454.549655518086 txn/s, latency: 4759.502163980727 ms, (p50: 4500 ms, p90: 7600 ms, p99: 8100 ms), latency samples: 240760
4. upgrading second batch to new version: 71c5af9da3f392b4613bfb72bc954c6523bcc06d
compatibility::simple-validator-upgrade::rest-validator-upgrading : committed: 6551.624561339234 txn/s, latency: 4334.36488434056 ms, (p50: 3300 ms, p90: 10300 ms, p99: 14300 ms), latency samples: 165140
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 7618.946405821841 txn/s, latency: 4773.207625969722 ms, (p50: 4000 ms, p90: 8700 ms, p99: 13300 ms), latency samples: 265540
5. check swarm health
Compatibility test for 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> 71c5af9da3f392b4613bfb72bc954c6523bcc06d passed
Test Ok

Contributor

✅ Forge suite framework_upgrade success on 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> 71c5af9da3f392b4613bfb72bc954c6523bcc06d

Compatibility test results for 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> 71c5af9da3f392b4613bfb72bc954c6523bcc06d (PR)
Upgrade the nodes to version: 71c5af9da3f392b4613bfb72bc954c6523bcc06d
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1082.5939301920412 txn/s, submitted: 1083.8601219232598 txn/s, failed submission: 1.2661917312187616 txn/s, expired: 1.2661917312187616 txn/s, latency: 3643.2499883040937 ms, (p50: 2100 ms, p90: 8700 ms, p99: 14200 ms), latency samples: 85500
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1016.2379759258243 txn/s, submitted: 1017.9938946747243 txn/s, failed submission: 1.7559187488999126 txn/s, expired: 1.7559187488999126 txn/s, latency: 3218.7181965442765 ms, (p50: 2100 ms, p90: 6300 ms, p99: 15100 ms), latency samples: 92600
5. check swarm health
Compatibility test for 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> 71c5af9da3f392b4613bfb72bc954c6523bcc06d passed
Upgrade the remaining nodes to version: 71c5af9da3f392b4613bfb72bc954c6523bcc06d
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1123.0363090151068 txn/s, submitted: 1125.323552006991 txn/s, failed submission: 2.2872429918841277 txn/s, expired: 2.2872429918841277 txn/s, latency: 2951.7696537678207 ms, (p50: 2100 ms, p90: 5700 ms, p99: 12900 ms), latency samples: 98200
Test Ok

@vusirikala vusirikala merged commit 681a54d into main Jul 24, 2024
90 of 91 checks passed
@vusirikala vusirikala deleted the satya/add_insertion_time branch July 24, 2024 19:33
Labels
CICD:run-e2e-tests when this label is present github actions will run all land-blocking e2e tests from the PR
4 participants