[forge] rust bindings for indexer/testnet deployer #14547

Merged: 17 commits, Oct 1, 2024

@rustielin (Contributor) commented Sep 6, 2024

Description

Extends the Forge k8s backend with k8s_deployer, which contains rust bindings for invoking Forge Deployers.

  • Primarily, the ForgeDeployerManager uses the k8s API to create k8s jobs that run the deployers: containers that take in customization values and create Forge components in the cluster.
  • For normal K8sSwarm setup before tests, spin up the indexer via its deployer after the testnet is spun up, if specified.
  • The entrypoint is the Python Forge wrapper. A new env var, FORGE_ENABLE_INDEXER, is added.

Also adds a CLI utility for operators. It tests the whole forge-deployers flow e2e without having to run a test, spinning up the testnet and indexer the same way the TestRunner does. Run it against any GCP project that has a K8s cluster set up with Forge and the additional Forge Indexer resources.

$ cargo run -p aptos-forge-cli -- operator create --namespace forge-rustielin-test --num-validators 1 --with-indexer

...

$ kubens forge-rustielin-test
$ kubectl get pods

NAME                                              READY   STATUS      RESTARTS       AGE
aptos-node-0-fullnode-eforge218-0                 1/1     Running     0              10m
aptos-node-0-validator-0                          1/1     Running     0              10m
data-service-7b6c44cd4f-7snqv                     1/1     Running     0              9m57s
default-processor-eforge218-57fbc9f4d6-9jptc      1/1     Running     2 (8m9s ago)   9m49s
deploy-forge-indexer-eforge218-rq7mj              0/1     Completed   0              10m
deploy-forge-testnet-eforge218-jpztp              0/1     Completed   0              10m
events-processor-eforge218-7b878cf554-bzthh       1/1     Running     2 (8m9s ago)   9m49s
events-processor-sdk-eforge218-548df6fbbd-5vjf4   1/1     Running     2 (8m9s ago)   9m49s
file-store-67cfff6748-vk4qw                       1/1     Running     0              9m57s
fullnode-eforge218-0                              1/1     Running     0              9m57s
genesis-aptos-genesis-eforge218-r62bs             0/1     Completed   0              10m
indexer-cache-worker-0-6d6c794f5c-nkpqb           1/1     Running     0              9m57s
postgres-eforge218-0                              1/1     Running     0              9m49s
redis-eforge218-5b544b6f9d-nvdhz                  1/1     Running     0              9m57s

Misc changes:

  • Make the kube_apis more DRY by using generics
  • Improve the MockK8sResourceApi to track all resources of a particular type in memory, to test more complex cases
  • Instead of passing various Forge feature flags as GHA workflow call/dispatch inputs, use a "dot-env" file, forge.env, which testsuite/run_forge.sh sources before running. Folks wanting highly customized adhoc Forge runs may commit their customizations to forge.env on a PR branch and run the workflow from there. NOTE: at the moment, nothing prevents changes to forge.env from landing in main.

Type of Change

  • New feature
  • Bug fix
  • Breaking change
  • Performance improvement
  • Refactoring
  • Dependency update
  • Documentation update
  • Tests

Which Components or Systems Does This Change Impact?

  • Validator Node
  • Full Node (API, Indexer, etc.)
  • Move/Aptos Virtual Machine
  • Aptos Framework
  • Aptos CLI/SDK
  • Developer Infrastructure
  • Other (specify)

How Has This Been Tested?

CI, unit tests, manually via forge operator commands

Canary via adhoc Forge, running the workflow from this branch, rustielin/forge-indexer-canary: #14577

Key Areas to Review

Checklist

  • I have read and followed the CONTRIBUTING doc
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I identified and added all stakeholders and component owners affected by this change as reviewers
  • I tested both happy and unhappy path of the functionality
  • I have made corresponding changes to the documentation

trunk-io bot commented Sep 6, 2024

⏱️ 1h 42m total CI duration on this PR
Slowest 15 Jobs            Cumulative Duration   Recent Runs
rust-cargo-deny            15m                   🟩🟩🟩🟩🟩 (+4 more)
general-lints              13m                   🟥🟥🟥🟥🟥 (+4 more)
rust-move-tests            10m                   🟩
check-dynamic-deps         9m                    🟩🟩🟩🟩🟩 (+4 more)
rust-move-tests            9m                    🟩
rust-move-tests            9m                    🟩
rust-move-tests            9m                    🟩
rust-move-tests            6m
rust-move-tests            6m
rust-move-tests            5m
semgrep/ci                 3m                    🟩🟩🟩🟩🟩 (+4 more)
file_change_determinator   2m                    🟩🟩🟩🟩🟩 (+4 more)
file_change_determinator   2m                    🟩🟩🟩🟩🟩 (+4 more)
rust-move-tests            2m                    🟩
rust-move-tests            2m                    🟩

@rustielin rustielin changed the title [WIP][forge] rust bindings for forge indexer/testnet deployer [WIP][forge] rust bindings for indexer/testnet deployer Sep 10, 2024
@rustielin rustielin force-pushed the rustielin/forge-indexer branch 5 times, most recently from cd07380 to 2261509 Compare September 11, 2024 22:10
@rustielin rustielin changed the title [WIP][forge] rust bindings for indexer/testnet deployer [forge] rust bindings for indexer/testnet deployer Sep 12, 2024
RetryableError(String),
FinalError(String),
}

async fn create_namespace(
/// Same as create_namespace, including handling the 409 (already exists), but for any k8s resource T
pub async fn maybe_create_k8s_resource<T>(
@perryjrandall (Contributor) commented Sep 13, 2024

We can use DynamicObject here, with a small macro like the one below to get the API you want.

That removes the need for all the generic T logic.

macro_rules! kube_api {
    ($client:expr, $t:ty) => {{
        let tm = kube::api::TypeMeta::resource::<$t>();
        let gvk = kube::api::GroupVersionKind::try_from(&tm).unwrap();
        let ar = kube::api::ApiResource::from_gvk(&gvk);

        kube::api::Api::<kube::api::DynamicObject>::all_with($client.clone(), &ar)
    }};
    ($client:expr, $t:ty, $namespace:expr) => {{
        let tm = kube::api::TypeMeta::resource::<$t>();
        let gvk = kube::api::GroupVersionKind::try_from(&tm).unwrap();
        let ar = kube::api::ApiResource::from_gvk(&gvk);

        kube::api::Api::<kube::api::DynamicObject>::namespaced_with(
            $client.clone(),
            &$namespace,
            &ar,
        )
    }};
}

Contributor Author

discussed offline. We'll do this in another PR
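
For readers following this thread, the create-and-tolerate-409 pattern that maybe_create_k8s_resource generalizes can be sketched with the standard library alone. This is only an illustration under stated assumptions: `ResourceApi`, `MockApi`, `ApiError`, and `maybe_create` are hypothetical names, the sketch is synchronous and has no retry logic, and returning the existing object on a 409 is an assumption rather than the PR's confirmed behavior. The in-memory mock mirrors the idea of tracking all resources of one type by name.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a Kubernetes API error; only the HTTP status matters here.
#[derive(Debug, PartialEq)]
struct ApiError {
    code: u16,
}

/// Hypothetical minimal API surface for one resource type `T`.
trait ResourceApi<T> {
    fn create(&mut self, name: &str, resource: T) -> Result<T, ApiError>;
    fn get(&self, name: &str) -> Result<T, ApiError>;
}

/// In-memory mock that tracks every resource of type `T` by name.
struct MockApi<T> {
    resources: BTreeMap<String, T>,
}

impl<T> MockApi<T> {
    fn new() -> Self {
        MockApi { resources: BTreeMap::new() }
    }
}

impl<T: Clone> ResourceApi<T> for MockApi<T> {
    fn create(&mut self, name: &str, resource: T) -> Result<T, ApiError> {
        // Like the real API server, refuse to create a name that already exists.
        if self.resources.contains_key(name) {
            return Err(ApiError { code: 409 });
        }
        self.resources.insert(name.to_string(), resource.clone());
        Ok(resource)
    }

    fn get(&self, name: &str) -> Result<T, ApiError> {
        self.resources.get(name).cloned().ok_or(ApiError { code: 404 })
    }
}

/// Creates the resource; a 409 is treated as success and the existing
/// object is returned instead (assumption made for this sketch).
fn maybe_create<T, A: ResourceApi<T>>(
    api: &mut A,
    name: &str,
    resource: T,
) -> Result<T, ApiError> {
    match api.create(name, resource) {
        Err(ApiError { code: 409 }) => api.get(name),
        other => other,
    }
}
```

Because the helper is generic over `T`, the same function covers namespaces, jobs, or any other resource type, which is the duplication the generics in this PR aim to remove.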

@@ -0,0 +1,6 @@
# This file is source-d when running Forge
Contributor

Beautiful and clear!

Let's be sure to document this so that people know they can use this nice new interface.

NOTE: we need to prevent people from modifying and committing this.

Contributor

Maybe .gitignore this file but force-add it in this PR, so others cannot commit it unintentionally in the future.

Contributor Author

I tried to do this earlier but couldn't figure out an elegant solution. The issue is that we want folks to be able to modify and commit this (for their own tests), but never land it.

run_forge.sh fails the Forge run after all the tests have finished if any env vars are set in this file, so I think that should cover all those cases.
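
The guard described above lives in run_forge.sh (shell), and its exact implementation is not shown in this thread. As a rough illustration of the rule only, here is a Rust sketch that scans dot-env text and reports any assigned variables. The function names `assigned_vars` and `check_forge_env_is_clean` are hypothetical, and the parsing rules (skip blank lines and `#` comments, accept an optional `export ` prefix) are assumptions.

```rust
/// Returns the names of variables assigned in dot-env style text.
/// Blank lines and `#` comments are ignored; an optional `export ` prefix is accepted.
fn assigned_vars(dotenv: &str) -> Vec<String> {
    dotenv
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty() && !line.starts_with('#'))
        .filter_map(|line| {
            let line = line.strip_prefix("export ").unwrap_or(line);
            // Lines without `=` are not assignments and are skipped.
            let (name, _value) = line.split_once('=')?;
            let name = name.trim();
            (!name.is_empty()).then(|| name.to_string())
        })
        .collect()
}

/// Hypothetical guard: Ok when the file sets nothing, Err naming the offenders otherwise.
fn check_forge_env_is_clean(dotenv: &str) -> Result<(), String> {
    let vars = assigned_vars(dotenv);
    if vars.is_empty() {
        Ok(())
    } else {
        Err(format!("forge.env sets: {}", vars.join(", ")))
    }
}
```

A comment-only file passes, while a file containing a line such as `FORGE_ENABLE_INDEXER=true` is rejected, which lets a customized forge.env run on a PR branch without being able to land.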

@perryjrandall (Contributor)

I put some of my thoughts into an implementation here, especially around the dynamic kube API bit. There were some tricky parts, but I think overall it's pretty clean. We can go over it Monday with @aluon and @yzaccc.

https://github.com/aptos-labs/aptos-core/pull/14638/files

@rustielin rustielin marked this pull request as ready for review September 17, 2024 17:31
@perryjrandall (Contributor) left a comment

Approve with comment

Remove the ForgeDeployConfig struct and era from deployermanager since we don't actually need those; let the other side figure things out.

I will make the suggested API changes in a follow-up PR, since I was already noodling on this.

@@ -31,11 +31,10 @@ on:
required: false
type: string
description: The Forge k8s cluster to be used for test
FORGE_ENABLE_HAPROXY:
Contributor

Why remove this option?

@rustielin (Contributor Author) commented Sep 30, 2024

I had to make space for an indexer option: workflow_dispatch has a maximum of 10 inputs, unfortunately. We can continue to test HAProxy in continuous Forge if necessary, but we can probably remove it from adhoc runs.

Contributor

🤦🏽 I see.

} else {
false
/// Check if the stateful set labels match the given labels
fn stateful_set_labels_matches(sts: &StatefulSet, labels: &BTreeMap<String, String>) -> bool {
Contributor

Do we need to handle truncation to 63 characters here, in case we have longer labels?

Contributor Author

We probably won't hit this case: we don't usually autogenerate labels the way we do names, and we generally stick to labels like app/instance. It doesn't hurt to add a check, though.
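
A minimal sketch of what such a check could look like, assuming the expected label values are truncated to the Kubernetes limit of 63 characters before comparison. It works on plain `BTreeMap`s rather than a real `StatefulSet`, and `truncate_label_value` and `labels_match` are hypothetical names. Matching every expected label against the actual set (subset semantics) is also an assumption about the intended behavior.

```rust
use std::collections::BTreeMap;

/// Kubernetes label values may be at most 63 characters long.
const MAX_LABEL_VALUE_LEN: usize = 63;

/// Truncates a label value to the maximum length, respecting char boundaries.
fn truncate_label_value(value: &str) -> String {
    value.chars().take(MAX_LABEL_VALUE_LEN).collect()
}

/// Returns true if every expected label is present in `actual` with a value
/// equal to the (possibly truncated) expected value.
fn labels_match(
    actual: &BTreeMap<String, String>,
    expected: &BTreeMap<String, String>,
) -> bool {
    expected.iter().all(|(key, want)| {
        actual
            .get(key)
            .map_or(false, |have| have == &truncate_label_value(want))
    })
}
```

With this shape, an 80-character expected value still matches a resource whose stored label was cut to 63 characters, while a missing key fails the match.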

@rustielin rustielin enabled auto-merge (squash) October 1, 2024 03:53


github-actions bot commented Oct 1, 2024

✅ Forge suite compat success on 7ef01a26f8d8a38610e3d364b722df517c970749 ==> 7a8c4eddeeb4f7a74d68a9e30e8b7c5977ef6dbd

Compatibility test results for 7ef01a26f8d8a38610e3d364b722df517c970749 ==> 7a8c4eddeeb4f7a74d68a9e30e8b7c5977ef6dbd (PR)
1. Check liveness of validators at old version: 7ef01a26f8d8a38610e3d364b722df517c970749
compatibility::simple-validator-upgrade::liveness-check : committed: 11870.91 txn/s, latency: 2494.28 ms, (p50: 2100 ms, p70: 2600, p90: 3900 ms, p99: 7900 ms), latency samples: 460500
2. Upgrading first Validator to new version: 7a8c4eddeeb4f7a74d68a9e30e8b7c5977ef6dbd
compatibility::simple-validator-upgrade::single-validator-upgrading : committed: 7372.85 txn/s, latency: 3786.40 ms, (p50: 4200 ms, p70: 4400, p90: 4600 ms, p99: 4800 ms), latency samples: 138540
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 7179.69 txn/s, latency: 4490.17 ms, (p50: 4600 ms, p70: 4800, p90: 5200 ms, p99: 5300 ms), latency samples: 241480
3. Upgrading rest of first batch to new version: 7a8c4eddeeb4f7a74d68a9e30e8b7c5977ef6dbd
compatibility::simple-validator-upgrade::half-validator-upgrading : committed: 5963.24 txn/s, latency: 4596.89 ms, (p50: 5000 ms, p70: 5200, p90: 6200 ms, p99: 6600 ms), latency samples: 123680
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 6462.48 txn/s, latency: 4934.77 ms, (p50: 5200 ms, p70: 5500, p90: 6700 ms, p99: 7100 ms), latency samples: 217880
4. upgrading second batch to new version: 7a8c4eddeeb4f7a74d68a9e30e8b7c5977ef6dbd
compatibility::simple-validator-upgrade::rest-validator-upgrading : committed: 11488.57 txn/s, latency: 2366.08 ms, (p50: 2600 ms, p70: 2700, p90: 2900 ms, p99: 3000 ms), latency samples: 201680
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 12126.59 txn/s, latency: 2577.75 ms, (p50: 2500 ms, p70: 2600, p90: 2800 ms, p99: 3200 ms), latency samples: 389420
5. check swarm health
Compatibility test for 7ef01a26f8d8a38610e3d364b722df517c970749 ==> 7a8c4eddeeb4f7a74d68a9e30e8b7c5977ef6dbd passed
Test Ok


github-actions bot commented Oct 1, 2024

✅ Forge suite framework_upgrade success on 7ef01a26f8d8a38610e3d364b722df517c970749 ==> 7a8c4eddeeb4f7a74d68a9e30e8b7c5977ef6dbd

Compatibility test results for 7ef01a26f8d8a38610e3d364b722df517c970749 ==> 7a8c4eddeeb4f7a74d68a9e30e8b7c5977ef6dbd (PR)
Upgrade the nodes to version: 7a8c4eddeeb4f7a74d68a9e30e8b7c5977ef6dbd
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1169.60 txn/s, submitted: 1172.06 txn/s, failed submission: 2.46 txn/s, expired: 2.46 txn/s, latency: 2555.16 ms, (p50: 2100 ms, p70: 2700, p90: 3900 ms, p99: 6800 ms), latency samples: 104640
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1192.53 txn/s, submitted: 1194.97 txn/s, failed submission: 2.44 txn/s, expired: 2.44 txn/s, latency: 2480.13 ms, (p50: 2100 ms, p70: 2400, p90: 4500 ms, p99: 6600 ms), latency samples: 107520
5. check swarm health
Compatibility test for 7ef01a26f8d8a38610e3d364b722df517c970749 ==> 7a8c4eddeeb4f7a74d68a9e30e8b7c5977ef6dbd passed
Upgrade the remaining nodes to version: 7a8c4eddeeb4f7a74d68a9e30e8b7c5977ef6dbd
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1155.21 txn/s, submitted: 1157.20 txn/s, failed submission: 1.99 txn/s, expired: 1.99 txn/s, latency: 2598.55 ms, (p50: 2400 ms, p70: 2700, p90: 4200 ms, p99: 6200 ms), latency samples: 104540
Test Ok


github-actions bot commented Oct 1, 2024

✅ Forge suite realistic_env_max_load success on 7a8c4eddeeb4f7a74d68a9e30e8b7c5977ef6dbd

two traffics test: inner traffic : committed: 12344.84 txn/s, submitted: 12349.15 txn/s, expired: 4.31 txn/s, latency: 3200.22 ms, (p50: 2700 ms, p70: 3000, p90: 3600 ms, p99: 10800 ms), latency samples: 4693800
two traffics test : committed: 100.04 txn/s, latency: 1732.60 ms, (p50: 1500 ms, p70: 1700, p90: 1800 ms, p99: 8000 ms), latency samples: 1980
Latency breakdown for phase 0: ["QsBatchToPos: max: 0.261, avg: 0.234", "QsPosToProposal: max: 1.055, avg: 1.019", "ConsensusProposalToOrdered: max: 0.339, avg: 0.327", "ConsensusOrderedToCommit: max: 0.957, avg: 0.626", "ConsensusProposalToCommit: max: 1.280, avg: 0.953"]
Max non-epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.99s no progress at version 1669421 (avg 0.23s) [limit 15].
Max epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 7.57s no progress at version 1669419 (avg 7.31s) [limit 15].
Test Ok

@rustielin rustielin merged commit a7ef111 into main Oct 1, 2024
61 of 95 checks passed
@rustielin rustielin deleted the rustielin/forge-indexer branch October 1, 2024 04:51
Labels
CICD:run-e2e-tests when this label is present github actions will run all land-blocking e2e tests from the PR
3 participants